TWI874059B - A deep learning model system and method thereof - Google Patents
- Publication number
- TWI874059B TW112149805A
- Authority
- TW
- Taiwan
- Prior art keywords
- data
- drug
- protein
- prediction
- model
Landscapes
- Investigating Or Analysing Biological Materials (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Description
The present invention relates to a deep learning model system and a method thereof, and more specifically to a deep learning model system and method applied to drug-target interaction prediction and ranking. It belongs to the technical field of drug-target interaction prediction and ranking using deep learning.
Bringing a drug from discovery to market typically takes a great deal of time and money. From research and development to registration and marketing, a new drug takes 13 to 15 years on average and costs 2 to 3 billion on average, and statistics show that these costs are still rising. To reduce both, the pharmaceutical industry currently substitutes drug repurposing for part of new drug development. Drug repurposing stems from a cost-benefit analysis: it speeds up drug development by identifying new clinical uses for existing drugs. Its success and growing use can be seen as an extension of multi-target pharmacology (polypharmacology), reflecting a shift in drug discovery from a single-target to a multi-target form. In both drug discovery and drug repurposing, identifying drug-target interactions (DTI) is a key step, so DTI identification has become a crucial research topic in the field. Current identification methods, however, still rely on wet-lab experiments that observe and record the interaction between drugs and target molecules (such as enzymes, cell-surface receptors, G protein-coupled receptors (GPCRs), ligand-gated ion channels, and nuclear receptors) to demonstrate efficacy. The resulting time and financial costs remain a major obstacle in much drug research.
With the rapid advance of intelligent computing, using computational simulation to repurpose existing drugs and explore new interactions has begun to attract attention. In earlier studies, researchers used docking simulation software such as Autodock Vina, GOLD, ZDOCK, RDOCK, and HADDOCK to predict and rank docking sites between drugs and targets. Docking software can also predict binding affinity, but to keep computation time manageable it simplifies many of its formulas and leaves out information that matters for affinity, such as the aqueous solvent and hydrogen-bond strength, so its affinity predictions perform poorly. Docking can be combined with molecular dynamics (MD) simulation or quantum-mechanics (QM) calculation, which model the dynamic behavior of the protein, analyze binding sites, simulate the drug-binding stage, and quantify the energy changes over the whole process. This gives a deeper understanding of DTI affinity and improves prediction, but at a substantial cost in time. Beyond simulation software, some researchers have explored combining the large volume of drug-target interaction data held in databases with computer science techniques, aiming to lower computational cost and improve prediction. Many unpaired small-molecule compounds remain to be explored in drug-target interaction, with potential for new drugs. Many experimental protein databases (such as BindingDB and PCI-DB) already record numerous known interactions, yet these known drug-target pairs have rarely been verified, and although compound databases (such as ChEMBL and PubChem) hold large amounts of compound data, many unknown interactions still await discovery.
Recently, DTI identification has expanded from binary classification to regression. Many early studies focused on binary classification. Bleakley and Yamanishi used bipartite local models (BLMs) to predict interactions between drugs and four classes of human protein targets commonly used in pharmacology (enzymes, GPCRs, ion channels, and nuclear receptors), and released datasets for these four classes for later researchers. van Laarhoven et al. used the Gaussian Interaction Profile kernel to predict drug-target interactions as a binary classification. Hakime Öztürk et al. used the Weighted Nearest Neighbor-Gaussian Interaction Profile (WNN-GIP) with similarity functions based on the one-dimensional drug representation SMILES (simplified molecular input line entry specification) to measure similarity between compounds. More recent work has turned to regression. Binary classification requires a self-chosen threshold to split drug-target pairs into inhibiting and non-inhibiting, and carries no information on how strong the affinity is, whereas directly predicting binding affinity gives researchers a precise inhibitory concentration for follow-up experiments.
Binding affinity describes the strength of the interaction between a drug-target (DT) pair. Depending on the format and formula used, it is expressed as the dissociation constant (Kd), the inhibition constant (Ki), or the half-maximal inhibitory concentration (IC50). IC50 depends on the concentrations of the target and the ligand, and a low IC50 indicates strong binding affinity. Likewise, a low Ki indicates strong binding affinity. Kd and Ki are usually reported as pKd or pKi, the negative logarithm of the dissociation or inhibition constant.
T. Pahikkala et al. used a regression model, the Kronecker regularized least-squares method (KronRLS), examined four different ways of splitting training and validation sets for realistic drug-target pairs, and predicted the affinity constants Kd and Ki. Hakime Öztürk et al. used a convolutional neural network, DeepDTA, on protein and drug sequence character data to predict and rank Kd and Ki, and in 2019 improved it by adding further forms of protein and drug sequence data, calling the result WideDTA. Jiménez, J. et al. applied a convolutional neural network (CNN) to 3D structural data of docked drug-target complexes to predict where the drug and target bind. In 2019, Mostafa Karimi et al. combined a CNN and a deep neural network (DNN) on protein sequences from the PDBbindDB database and drug SMILES strings from the STITCH database to predict the affinity constants IC50, Kd, and Ki.
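The negative-logarithm convention for pKd and pKi described above can be made concrete with a short sketch. It assumes the constant is given in nanomolar and is converted to mol/L before taking the logarithm; the unit choice and the function name `p_affinity` are illustrative assumptions, not something the source specifies.

```python
import math


def p_affinity(value_nm: float) -> float:
    """Return pKd or pKi for a Kd or Ki given in nanomolar.

    Assumption: input is in nM, so it is scaled by 1e-9 to mol/L
    before taking -log10.
    """
    if value_nm <= 0:
        raise ValueError("affinity constant must be positive")
    return -math.log10(value_nm * 1e-9)


# A lower Kd (tighter binding) yields a higher pKd.
for kd_nm in (1.0, 100.0, 10000.0):
    print(f"Kd = {kd_nm:>8.1f} nM -> pKd = {p_affinity(kd_nm):.2f}")
```

Working on the logarithmic scale compresses values that span several orders of magnitude, which is why regression targets are commonly expressed this way.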
In studies to date, preprocessing varies with each study's own considerations (for example protein-sequence preprocessing or drug-string preprocessing), and few studies take structural information into account. In addition, many studies are based on the four pharmacologically relevant protein datasets provided by Yamanishi et al.
The non-patent paper "AttentionDTA: prediction of drug–target binding affinity using attention model" (author Qichang Zhao), published at the 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), has the following technical features. With advances in biotechnology and the discovery and expansion of new drugs in pharmacology, repurposing existing drugs and exploring new interactions has drawn researchers' attention. In contrast to traditional modeling-based methods (such as docking or molecular dynamics simulation), computational interaction prediction is now driven by the rise of big data, and DTI identification has expanded from binary classification to regression. Among early binary-classification work, Bleakley and Yamanishi (2009) used bipartite local models (BLMs) to predict four important classes of drug-target interactions in humans and released the four datasets for later researchers; van Laarhoven et al. (2011) used the Gaussian Interaction Profile kernel to predict drug-target interactions as a binary classification; and Hakime Öztürk et al. (2016) used the Weighted Nearest Neighbor-Gaussian Interaction Profile (WNN-GIP) with similarity functions based on the one-dimensional drug representation SMILES to measure similarity between compounds. With the rise of data, many recent studies have turned to regression. T. Pahikkala et al. (2014) used the Kronecker regularized least-squares method (KronRLS), divided drug-target data into four settings for realistic application cases, and predicted and ranked drug-target affinity. Hakime Öztürk et al. (2018) proposed the convolutional neural network DeepDTA, which predicts and ranks drug-protein affinity from protein and drug sequence character data, and in 2019 improved it by adding further forms of protein and drug sequence data, calling the result WideDTA. Jiménez, J. et al. (2018) applied a convolutional neural network (CNN) to 3D structural data of docked drug-target complexes to predict where the drug and target bind. Mostafa Karimi et al. (2019) combined a CNN and an RNN on protein sequences from the PDBindDB database and drug SMILES strings from the STITCH database to predict drug-target affinity.
Binding affinity describes the strength of the interaction between a drug-target (DT) pair. Depending on the format and formula used, it is expressed as the dissociation constant (Kd), the inhibition constant (Ki), or the half-maximal inhibitory concentration (IC50). IC50 depends on the concentrations of the target and the ligand, and a low IC50 indicates strong binding affinity. Likewise, a low Ki indicates strong binding affinity. Kd and Ki are usually reported as pKd or pKi, the negative logarithm of the dissociation or inhibition constant. Many studies to date relate to, cite, or use the four datasets provided by Yamanishi et al., and this non-patent reference is one example. Most importantly, the non-patent reference does not disclose the technical feature of Word2vec, let alone using Word2vec (a model for producing word vectors), a natural language processing (NLP) method used in deep learning, to learn one-dimensional protein sequences and ligand SMILES strings.
The problem to be solved, therefore, is how to avoid the following shortcomings of the prior art. Many studies focus on binary classification: Bleakley and Yamanishi used bipartite local models to predict interactions between drugs and four classes of human protein targets commonly used in pharmacology (enzymes, GPCRs, ion channels, and nuclear receptors) and released datasets for these four classes for later researchers; van Laarhoven et al. used the Gaussian Interaction Profile kernel to predict drug-target interactions as a binary classification; and Hakime Öztürk et al. used the Weighted Nearest Neighbor-Gaussian Interaction Profile with similarity functions based on the one-dimensional drug representation SMILES (simplified molecular input line entry specification) to measure similarity between compounds. Binary classification requires a self-chosen threshold to split drug-target pairs into inhibiting and non-inhibiting and carries no information on how strong the affinity is, and only more recently has research begun to extend to regression, that is, to directly predicting binding affinity.
The main object of the present invention is to provide a deep learning model system and method applied to drug-target interaction prediction and ranking. When the deep learning system of the invention performs the method, it proceeds as follows. First, a data processing step processes at least one item of drug data (drug string data, i.e. the ligand's simplified molecular input line entry specification, SMILES) and protein data (protein sequence data, e.g. one-dimensional protein sequence features) separately, producing at least one processing result corresponding to each. In this step, Word2vec, one of the natural language processing (NLP) methods used in deep learning, learns the protein sequence data and the ligand SMILES strings. Natural-language methods are used to explore and deepen the understanding of the language of life: following the grammar encoded in protein amino acid sequences and SMILES strings, the model learns low-dimensional feature vectors for these sequences and strings. Next, a training step feeds the processing result into a predicted-affinity model for training; here the learned representations are combined with protein-ligand docking interaction features and fed into the deep model for prediction. Then a prediction step produces a prediction result through the trained predicted-affinity model. Finally, the prediction results are ranked.
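The sequence of steps above (process the data, predict affinity, rank the results) can be sketched as a small pipeline. This is a minimal illustration under stated assumptions: the featurizer and predictor below are toy stand-ins rather than the Word2vec processing or the trained affinity model, training is not shown, and larger scores are assumed to mean stronger binding (as with pKd).

```python
from typing import Callable, Sequence

Pair = tuple[str, str]  # (drug SMILES, protein sequence)


def run_pipeline(
    pairs: Sequence[Pair],
    featurize: Callable[[str, str], list[float]],  # data processing step
    predict: Callable[[list[float]], float],       # already-trained affinity model
) -> list[tuple[Pair, float]]:
    """Featurize each pair, predict its affinity, and rank by prediction."""
    scored = [(pair, predict(featurize(*pair))) for pair in pairs]
    # Assumption: a larger predicted value means stronger binding.
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy stand-ins, purely illustrative (hypothetical, not from the source).
def toy_featurize(smiles: str, sequence: str) -> list[float]:
    return [float(len(smiles)), float(len(sequence))]


def toy_predict(features: list[float]) -> float:
    return features[0] / (1.0 + features[1])


ranked = run_pipeline(
    [("CCO", "MKT"), ("CCCCCC", "MKT")], toy_featurize, toy_predict
)
for pair, score in ranked:
    print(pair, round(score, 3))
```

Passing the featurizer and predictor in as callables keeps the ranking step independent of how the features or the model are actually built.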
A further object of the present invention is to provide a deep learning model system and method, applied to drug-target interaction prediction and ranking, that discloses the technical feature of Word2vec: Word2vec (a model for producing word vectors), an NLP method used in deep learning, learns one-dimensional protein sequences and ligand SMILES strings. In the present invention, beyond the amino acid properties and chemical-formula information carried by the protein primary structure (sequence) and the drug string, the three-dimensional structures of the protein and the drug also hold highly important information that lets the model learn more complete drug-target docking features. The 3D structure of the protein (target) and the automatically generated binding modes of the drug (ligand) in docking simulation software are therefore used to compute structure-based interaction features between the protein and the drug. These features are combined with other data types, namely one-dimensional protein sequence data and drug SMILES strings, and a method is proposed that combines these data types to build a deep learning model predicting the binding affinity of protein-drug interactions. To this end, the Word2vec model is applied to one-dimensional protein sequences and drug SMILES strings: following the grammar encoded in protein amino acid sequences and SMILES strings, it learns a mapping from these sequences and strings to feature vectors, and the resulting vectors are combined with the interaction features and fed into the deep model for prediction. A multi-modal model is also proposed to learn these data representations and interaction features and predict binding affinity, and it is compared with common machine learning models such as random forest and gradient boosting. The results show that combining different data types improves both mean squared error (MSE) and ranking by concordance index (C-index) by 10~15% relative to using a single data type. Going forward, the aim is to combine more data types to predict drug-target interactions more accurately and rank them better, and to build a platform for multiplexed analysis of many protein families and of polypharmacology to explore their unknown properties.
Yet another object of the present invention is to provide a deep learning model system and method, applied to drug-target interaction prediction and ranking, that does not use the datasets used by the prior art described above. Taking multiplexed and superfamily-wide analysis into account, the invention adopts the drug-target dataset studied by Daniel A Bachovchin et al. The targets used are examined for inhibition of enzyme activity, and the enzymes are serine hydrolases. Serine hydrolases are one of the largest and most diverse enzyme classes known. In mammals they make up about 1% of all proteins, and humans have about 240 members. They play key roles in biological processes, including pathophysiological processes such as blood homeostasis, neural signaling, and glucose homeostasis, and they are also important in viruses and bacteria, for example in bacterial and viral infection, life cycle, virulence, and drug resistance. Among the more than 200 human serine hydrolases, the physiological compounds and functions of many are not well characterized, and most lack selective inhibitors. High-throughput, multiplexed analysis of serine hydrolases was therefore undertaken to explore their properties and build an analysis platform, in which 96 serine hydrolase members and 52 drugs were tested experimentally and the inhibitory concentrations and the properties following interaction were recorded, in order to improve inhibitor selectivity for more human serine hydrolases.
The present invention differs from the prior art and literature above. Traditional docking simulation software such as Autodock Vina, GOLD, ZDOCK, RDOCK, and HADDOCK can predict binding affinity in addition to predicting and ranking docking sites, but to keep computation time manageable it simplifies many of its formulas and leaves out information that matters for affinity, such as the aqueous solvent and hydrogen-bond strength, so its affinity predictions perform poorly. Accuracy can be raised by adding molecular dynamics (MD) simulation or quantum-mechanics (QM) calculation, but at a substantial cost in time.
In the present invention, the 3D structure of the protein and the automatically generated binding modes of the ligand in docking simulation software are used to compute structure-based interaction features between the protein and the ligand. These features are combined with other data types, namely one-dimensional protein sequence data and ligand SMILES strings, and a method is proposed that combines these data types in a deep learning model to predict the binding affinity of protein-ligand interactions. To this end, the applicant uses Word2vec (a model for producing word vectors), one of the NLP methods used in deep learning, to learn one-dimensional protein sequences and ligand SMILES strings. Existing natural-language methods are used to explore and deepen the understanding of the language of life: following the grammar encoded in protein amino acid sequences and SMILES strings, the model learns low-dimensional feature vectors for these sequences and strings, and these representations are combined with the interaction features and fed into the deep model for prediction. A multi-modal model is also proposed to learn these data representations and interaction features and predict binding affinity, and it is compared with common machine learning models such as random forest and gradient boosting. The results show that combining different data types improves both mean squared error (MSE) and ranking by concordance index (C-index) by 10~15% relative to using a single data type. Going forward, the aim is to combine more data types to predict drug-target interactions more accurately and rank them better, and to build a platform for multiplexed analysis of many protein families and of polypharmacology to explore their unknown properties.
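The two evaluation measures named above, mean squared error and concordance index, can be computed as in the following sketch. It uses the common pairwise definition of the C-index (comparable pairs are those with different true values, and tied predictions count as half); the source does not state which variant was used, so that choice is an assumption.

```python
from itertools import combinations


def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and predicted affinities."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def concordance_index(y_true, y_pred):
    """Fraction of comparable pairs whose predicted order matches the true order.

    Pairs with equal true values are skipped; tied predictions count 0.5.
    """
    concordant, comparable = 0.0, 0
    for (t1, p1), (t2, p2) in combinations(zip(y_true, y_pred), 2):
        if t1 == t2:
            continue
        comparable += 1
        if p1 == p2:
            concordant += 0.5
        elif (p1 > p2) == (t1 > t2):
            concordant += 1.0
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


y_true = [5.0, 6.0, 7.0]
y_pred = [5.5, 6.5, 6.0]
print(mean_squared_error(y_true, y_pred))  # 0.5
print(concordance_index(y_true, y_pred))   # 2 of 3 pairs ordered correctly
```

MSE measures how close the predicted affinities are to the measured ones, while the C-index measures only whether pairs are ranked in the right order, so a model can do well on one and poorly on the other.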
In accordance with the objects above, the present invention provides a deep learning model system comprising a data processing service module, a predicted-affinity model module, and a ranking model module.
The data processing service module processes at least one item of drug data (drug string data, i.e. the ligand's simplified molecular input line entry specification, SMILES) and protein data (protein sequence data, e.g. one-dimensional protein sequence features) separately, and produces at least one processing result corresponding to each. The drug data and protein data here comprise the drug's SMILES D_drug, the protein sequence D_protein, and the drug-protein structure data D_drug_protein. The processing may be performed by a data processing device that runs the natural language processing (word2vec) model. The processing result is at least one kind of sequence data, which is integrated with the drug-protein structure data D_drug_protein.
Here, Word2vec, one of the natural language processing (NLP) methods used in deep learning, learns the protein sequence data (e.g. one-dimensional protein sequence features) and the drug string data (ligand SMILES strings). Natural-language methods are used to explore and deepen the understanding of the language of life: following the grammar encoded in the amino acid sequences of the protein sequence data and in the SMILES drug strings, the model learns low-dimensional feature vectors for these protein sequences and drug strings.
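Before a Word2vec model can learn from sequences, each sequence has to be split into "words". The source does not say how this is done, so the sketch below makes two labeled assumptions: protein sequences are split into overlapping k-mers, and SMILES strings are split into atom-level tokens with a simple regular expression. Both function names are hypothetical.

```python
import re

# Assumption: bracket atoms, two-letter halogens, and two-digit ring
# closures are single tokens; every other character is its own token.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def protein_words(sequence: str, k: int = 3) -> list[str]:
    """Split an amino acid sequence into overlapping k-mers (assumed k=3)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def smiles_words(smiles: str) -> list[str]:
    """Split a SMILES string into atom-level tokens."""
    return SMILES_TOKEN.findall(smiles)


print(protein_words("MKTAY"))    # ['MKT', 'KTA', 'TAY']
print(smiles_words("CC(=O)Cl"))  # ['C', 'C', '(', '=', 'O', ')', 'Cl']
```

Each token list then plays the role of a sentence for a word-embedding trainer, which assigns every distinct token a low-dimensional vector.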
Each record processed by the data processing service module contains three types of data, each obtained from a different platform: protein-ligand docking interaction features, the protein sequence, and the ligand SMILES (that is, the drug data, or drug string data). The protein-ligand docking interaction features are extracted after simulation with the Autodock Vina software, for example 39 named protein-ligand docking interaction features. The protein sequence is obtained by matching the ProteinID in the data against the UniProt database. UniProt is a comprehensive database of protein sequence and function, much of whose content comes from genome projects, and it contains a large amount of information on the biological function of proteins drawn from the research literature. The ligand SMILES is obtained by matching the CID in the data against the PubChem database. PubChem provides bioactivity data for small organic molecules and is maintained by the National Center for Biotechnology Information (NCBI), part of the US National Institutes of Health (NIH). PubChem belongs to the NIH Molecular Libraries Roadmap Initiative. It can be queried free of charge on the web, and its chemical databases, which hold millions of entries, can be downloaded free of charge over FTP. PubChem covers a wide range of organic molecules, including drugs and chemical raw materials.
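The three-part record described above can be represented as a small data class. This is a sketch with hypothetical field names; the only detail taken from the text is the count of 39 docking interaction features, and the lookups against UniProt and PubChem are not shown.

```python
from dataclasses import dataclass

N_DOCKING_FEATURES = 39  # example count given in the text


@dataclass(frozen=True)
class DrugTargetRecord:
    """One drug-target pair with its three data types (hypothetical layout)."""
    protein_id: str                      # key matched against UniProt
    cid: str                             # key matched against PubChem
    protein_sequence: str                # sequence returned for protein_id
    smiles: str                          # SMILES returned for cid
    docking_features: tuple[float, ...]  # features extracted after docking

    def __post_init__(self) -> None:
        if not self.protein_sequence or not self.smiles:
            raise ValueError("sequence and SMILES must be non-empty")
        if len(self.docking_features) != N_DOCKING_FEATURES:
            raise ValueError(
                f"expected {N_DOCKING_FEATURES} docking features, "
                f"got {len(self.docking_features)}"
            )


# Placeholder identifiers and values, for illustration only.
record = DrugTargetRecord(
    protein_id="EXAMPLE_PROTEIN",
    cid="EXAMPLE_CID",
    protein_sequence="MKTAYIAK",
    smiles="CCO",
    docking_features=tuple(0.0 for _ in range(N_DOCKING_FEATURES)),
)
print(len(record.docking_features))
```

Validating the feature count at construction time catches records whose docking step failed or produced a different feature set before they reach the model.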
The drug data comprise the ligand (drug) SMILES string D_drug; the protein data comprise the protein sequence D_protein; and the protein-ligand docking interaction features comprise the drug-protein structure data D_drug_protein. The data processing service module may produce its processing result by running the Word2vec natural language processing model. The processing result is at least one type of sequence data, which is integrated with the drug-protein structure data D_drug_protein of the docking interaction features.
The data processing service module uses Word2vec, a family of models for producing word vectors and one of the NLP methods used in deep learning, to learn one-dimensional protein sequences and ligand SMILES strings. In the present invention, besides the amino acid properties and chemical formula information carried by the protein primary structure (sequence) and the drug string, the three-dimensional structures of the protein and the drug also carry extremely important information that lets the model learn more comprehensive drug-target docking features. The 3D structure of the protein (target) and the automatically generated binding modes of the drug (ligand) in docking simulation software are therefore used to compute structure-based interaction features between the protein and the drug. These features are combined with the other data types, such as the one-dimensional protein sequence and the drug SMILES string, and the invention proposes a method that uses this combination to build a deep learning model and predict the binding affinity of protein-drug interactions. To that end, Word2vec follows the grammar encoded in the protein amino acid sequence and the SMILES string to map these sequences and strings to feature vectors. The converted vectors are combined with the interaction features and fed into the prediction affinity model module for prediction.
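The combination step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `mean_pool` and `build_input` are hypothetical, and mean pooling is one plausible way to turn per-token Word2vec vectors into a fixed-length vector; the source does not say which pooling is used.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one vector.
    (Assumption: the source does not specify the pooling operation.)"""
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[d] for vec in token_vectors) / count for d in range(dim)]


def build_input(protein_vecs, drug_vecs, docking_features):
    """Concatenate the pooled protein vector, the pooled drug vector and
    the docking interaction features (e.g. 39 values) into one input row."""
    return mean_pool(protein_vecs) + mean_pool(drug_vecs) + list(docking_features)


# Toy example: 2-dimensional embeddings and 39 placeholder docking features.
row = build_input([[1.0, 3.0], [3.0, 5.0]], [[0.0, 2.0]], [0.5] * 39)
```

The resulting row has one slot per embedding dimension for each modality plus one slot per docking feature, which is the kind of fused input a downstream affinity model would consume.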
The present invention differs from the prior art and literature described above. Conventional docking simulation software such as AutoDock Vina, GOLD, ZDOCK, RDOCK, and HADDOCK can predict binding affinity in addition to predicting and ranking docking positions. To limit computation time, however, such software simplifies many of its formulas, so factors that matter for affinity, such as the water solvent or hydrogen bond strength, are not taken into account, and its binding affinity predictions perform poorly. Accuracy can be improved by adding molecular dynamics (MD) simulation or quantum mechanics (QM) calculations, but only at a substantial cost in time.
In the present invention, the protein 3D structure and the automatically generated binding modes of the ligand in docking simulation software are used to compute structure-based interaction features between the protein and the ligand. These features are combined with other data types, such as the one-dimensional protein sequence and the ligand SMILES string, and a deep learning method is proposed that uses this combination to predict the binding affinity of protein-ligand interactions. Word2vec, one of the NLP methods used in deep learning, is used to learn the one-dimensional protein sequences and ligand SMILES strings: following the grammar encoded in the amino acid sequences and SMILES strings, it learns low-dimensional feature vectors for them. These representations are combined with the interaction features and fed into the deep model for prediction.
Prediction affinity model module: the data processing service module uses Word2vec, one of the NLP methods used in deep learning, to learn the protein sequence data (for example, a one-dimensional protein sequence) and the ligand (drug) SMILES string. In the present invention, besides the amino acid properties and chemical formula information carried by the protein primary structure (sequence) and the drug string, the three-dimensional structures of the protein and the drug also carry extremely important information that lets the prediction affinity model module learn more comprehensive drug-target docking features (protein-ligand docking interaction features).
Therefore, using the data processing service module and/or the prediction affinity model module, the 3D structure of the protein (target) and the automatically generated binding modes of the drug (ligand) in docking simulation software can be used to compute structure-based interaction features between the protein and the drug (protein-ligand docking interaction features). These features are combined with other data types, for example the one-dimensional protein sequence and the drug SMILES string, and a method is proposed that uses this combination to build a deep learning model and predict the binding affinity of protein-drug interactions. To that end, the present invention uses Word2vec, a model for producing word vectors and one of the NLP methods used in deep learning, to learn the one-dimensional protein sequences and drug SMILES strings. Following the grammar encoded in the protein amino acid sequence and the SMILES string, the model maps these sequences and strings to feature vectors. The converted vectors are combined with the interaction features and fed into the prediction affinity model module for prediction.
另外,於預測親和力模型模組中,也提出利用Multi-Modal來學習這些資料表示和相互作用特徵並預測結合親和力,並且跟常見機器學習模型像是隨機森林(Random forest)、梯度提升(Gradient Boost)進行比較,以結果來看結合不同型態的資料表現無論是均方誤差(Mean Square Error, MSE)又或者一致性指數(Concordance Index, C-index)排序比單一使用一種型態資料的結果提高至10~15%。往後希望藉由結合更多型態的資料來預測更準確且排序佳的藥物標靶相互作用,並且建立平台去針對許多蛋白質家族以及多藥理進行多重分析,探索其中未知特性。In addition, in the affinity prediction model module, it is also proposed to use Multi-Modal to learn these data representations and interaction characteristics and predict binding affinity, and compare them with common machine learning models such as Random Forest and Gradient Boost. The results show that the performance of combining different types of data, whether it is the Mean Square Error (MSE) or the Concordance Index (C-index), is improved by 10~15% compared with the results of using only one type of data. In the future, we hope to predict more accurate and better ranked drug-target interactions by combining more types of data, and establish a platform to perform multiple analyses on many protein families and polypharmacology to explore unknown characteristics.
本發明之技術特徵中所使用的標靶是針對酶活性抑制效果探討,其酶為絲氨酸水解酶(Serine hydrolase),絲氨酸水解酶是生物中已知最大且多樣化酶種類之一,在哺乳生物中,約佔全部蛋白質的1%,而在人類界當中,就佔有約240 個成員。 絲氨酸水解酶在生物過程中扮演許多關鍵的角色,像是病理生理過程中的血液穩態,神經信號傳導和葡萄糖穩態,也在病毒以及細菌也有重要的功用,像是細菌和病毒感染,生命週期以及毒性和耐藥性。200 多種人類絲氨酸水解酶中生理化合物以及功能當中,許多並未有良好的特性以及多數缺少選擇性的活性抑制劑,嘗試將絲氨酸水解酶做高通量,多重分析,探討其特性,並製作分析平台,其中使用96 種絲氨酸水解酶成員跟52 種藥物做實驗觀察並紀錄其中抑制濃度以及彼此交互作用後的特性,以提高更多人類絲氨酸水解酶的抑制劑選擇性。 The target used in the technical features of the present invention is to explore the inhibitory effect of enzyme activity, and the enzyme is serine hydrolase. Serine hydrolase is one of the largest and most diverse enzyme species known in organisms. In mammals, it accounts for about 1% of all proteins, and in humans, there are about 240 members. Serine hydrolases play many key roles in biological processes, such as blood homeostasis, neural signaling and glucose homeostasis in pathological and physiological processes. They also have important functions in viruses and bacteria, such as bacterial and viral infections, life cycles, toxicity and drug resistance. Among the more than 200 physiological compounds and functions of human serine hydrolases, many do not have good characteristics and most lack selective active inhibitors. We try to conduct high-throughput, multiplex analysis of serine hydrolases, explore their characteristics, and create an analysis platform. We use 96 serine hydrolase members and 52 drugs to conduct experimental observations and record the inhibition concentration and the characteristics after interaction with each other, so as to improve the selectivity of inhibitors for more human serine hydrolases.
預測親和力模型模組,將至少一該處理結果導入該預測親和力模型模組進行訓練,以通過該預測親和力模型模組的預測親和力模型的訓練產生一預測結果,也就是說,該序列資料與結構資料被導入可以是卷積神經模型/深度神經模型的預測親和力模型模組進行一訓練以產生一預測結果,也就是將上述資料處理產生的藥物向量與蛋白質向量進行處理產生該預測結果,利用資料處理服務模組及/或預測親和力模型模組將所產生之該預測結果與藥物蛋白結構資料D_drug_protein整合。The prediction affinity model module imports at least one of the processing results into the prediction affinity model module for training, so as to generate a prediction result through the training of the prediction affinity model of the prediction affinity model module, that is, the sequence data and the structure data are imported into the prediction affinity model module which can be a convolutional neural model/deep neural model for training to generate a prediction result, that is, the drug vector and the protein vector generated by the above data processing are processed to generate the prediction result, and the prediction result generated is integrated with the drug protein structure data D_drug_protein by using the data processing service module and/or the prediction affinity model module.
Sorting model module: this module ranks the prediction result. The result of integrating the prediction result with the drug-protein structure data D_drug_protein is fed into a sub-network of the deep network model (here, for example, the prediction affinity model module is a deep network model). The resulting multi-modal architecture model is then fed into the prediction affinity model module formed by the sub-networks, which processes it to produce the mean squared error D_MSE and passes D_MSE to the sorting model module. The sorting model module then performs the ranking and obtains the concordance index D_CI.
在此,於一實施例中,該資料處理是將該藥物資料與蛋白質資料根據其資料型態進行處理,產生的該處理結果為至少一種序列資料與至少一種結構資料,且該預測親和力模型模組為一深度網路模型。Here, in one embodiment, the data processing is to process the drug data and protein data according to their data types, and the processing results generated are at least one sequence data and at least one structure data, and the predicted affinity model module is a deep network model.
爲使熟悉該項技藝人士瞭解本發明之目的、特徵及功效,茲藉由下述具體實施例,並配合所附之圖式,對本發明詳加說明如後:In order to make those familiar with the art understand the purpose, features and effects of the present invention, the present invention is described in detail by the following specific embodiments and the accompanying drawings:
蛋白質配體對接(Protein-ligand Pair Docking)是分子建模領域中一種預測蛋白質與目標結合為穩定的複合物時一個分子對應於另一個分子的取向的方法,一種常用於結構的藥物設計方法,而取向出來的結果又可以用於評分功能來預測兩個分子之間的親和力強度。而蛋白質配體對接技術早已活用於預測新藥以及藥物重組,常見的模擬軟體有Autodock、ZDOCK、RDOCK、HADDOCK。然而對接會為每一個配體產生多個不同的位置並且根據評分功能與不同化合物進一步比對並排名,然而簡化的計算公式導致預測對接親和力可能會不準,儘管準確的預測對接位置,因此導致這並不是最佳預測親和力的工具。本發明收集Autodock Vina中所採用對接評估方法得到的特徵,作為本發明深度模型預測的資料特徵。我們共有39個對接特徵作為深度模型的輸入。Protein-ligand Pair Docking is a method in the field of molecular modeling that predicts the orientation of one molecule relative to another when a protein binds to a target to form a stable complex. It is a commonly used method for structural drug design, and the orientation results can be used to predict the affinity strength between two molecules through scoring functions. Protein-ligand docking technology has long been used to predict new drugs and drug recombination. Common simulation software includes Autodock, ZDOCK, RDOCK, and HADDOCK. However, docking will generate multiple different positions for each ligand and further compare and rank them with different compounds based on the scoring function. However, the simplified calculation formula may lead to inaccurate predictions of docking affinity, despite accurate predictions of docking positions, so this is not the best tool for predicting affinity. The present invention collects the features obtained by the docking evaluation method used in Autodock Vina as data features for the deep model prediction of the present invention. We have a total of 39 docking features as inputs to the deep model.
The amino acid sequence (or residue sequence) of a peptide or protein is its primary structure: the linear sequence of amino acids in the peptide or protein. Amino acids in the primary structure can interact with one another, with the C=O of an amide bond forming hydrogen bonds with amine groups. The two ends of a polypeptide chain are named after the free group at each end: the carboxyl terminus (C-terminus) and the amino terminus (N-terminus). Residues are always counted from the N-terminus (the NH2 group), the end whose amine group does not take part in a peptide bond. The primary structure of a protein is determined by its corresponding gene: a specific nucleotide sequence in DNA is transcribed into mRNA, which the ribosome reads in the process called translation. Frederick Sanger determined the amino acid sequence of insulin, establishing that proteins have a defined amino acid sequence. A protein's sequence is unique to that protein and defines its structure and function. The sequence can be determined by methods such as Edman degradation or tandem mass spectrometry, but it is more often read directly from the gene sequence using the genetic code. When discussing proteins, the term "amino acid residue" is strongly recommended, because a water molecule is lost when each peptide bond forms, so a protein is made up of amino acid residues. Post-translational modifications such as phosphorylation and glycosylation are usually also regarded as part of the primary structure and cannot be read from the gene. For example, insulin consists of 51 amino acids in two chains: one chain has 21 amino acids and the other has 30.
蛋白質序列通常以一串英文字母表示,列出從胺基末端開始一直到羧基末端的胺基酸,而字母代表著胺基酸,以三個字母代碼或者單一字母代碼表示,而其字母表示有20種天然胺基酸,以及混合物或不常見的胺基酸(類似於核酸符)。Protein sequences are usually represented by a string of letters, listing the amino acids from the amino terminus to the carboxyl terminus, with the letters representing the amino acid, represented by a three-letter code or a single letter code, with the letters representing the 20 naturally occurring amino acids, as well as mixtures or unusual amino acids (similar to nucleic acid symbols).
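The letter-code convention described above can be illustrated with a short Python sketch. The helper names `to_one_letter` and `nonstandard_positions` are hypothetical and are not part of the invention; the table covers only the 20 standard amino acids, so any other letter (ambiguity or uncommon-residue codes) is simply reported by position.

```python
# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
STANDARD = set(THREE_TO_ONE.values())


def to_one_letter(residues):
    """Convert three-letter codes, listed from N- to C-terminus,
    into a one-letter sequence string."""
    return "".join(THREE_TO_ONE[r.capitalize()] for r in residues)


def nonstandard_positions(sequence):
    """Return 1-based positions (counted from the N-terminus) of letters
    that are not one of the 20 standard one-letter codes."""
    return [i for i, aa in enumerate(sequence.upper(), start=1)
            if aa not in STANDARD]


peptide = to_one_letter(["Gly", "Ile", "Val", "Glu", "Gln"])
```

Flagging non-standard letters before embedding is one way to decide whether such residues should be dropped, mapped to a placeholder token, or kept as their own vocabulary entries.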
The Simplified Molecular Input Line Entry Specification (SMILES) is a simple yet comprehensive chemical language. It is a line notation that uses ASCII strings of atom and bond symbols to describe the structure of a chemical substance unambiguously, or to specify molecules and reactions. SMILES contains the same information as an extended connection table but has several advantages over it. In practice, any SMILES string denoting a molecule or reaction gives a chemically correct and readable description, and once canonicalized the string is unique, so it can serve as a universal identifier for a specific chemical structure. Most molecule editors can import SMILES strings and convert them back into two-dimensional drawings and three-dimensional models of the molecule. The original SMILES specification was initiated by David Weininger in the 1980s. It uses the concept of a graph, with atoms as nodes and bonds as edges; parentheses indicate branch points and numeric labels indicate ring closure points. SMILES syntax that also carries isotope information, double-bond configuration, and chirality is called isomeric SMILES. Since its development, SMILES has been modified and extended by many parties, notably Daylight Chemical Information Systems Inc., adding not only new features but also two further chemical languages: SMARTS®, an extension of SMILES that allows molecular patterns and properties to be specified at varying levels of specificity for substructure searching; and SMIRKS®, a restricted version of reaction SMARTS® that defines generic reactions through changes in atom and bond patterns. In 2007, the open-source chemistry community developed an open standard called OpenSMILES. Other line notations include Wiswesser Line Notation (WLN), ROSDAL, and SYBYL Line Notation (SLN).
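Before a SMILES string can be fed to a word-embedding model it has to be split into tokens. The following Python sketch shows one simple, commonly used regular-expression tokenization; it is an assumption for illustration only (the source does not state how SMILES strings are segmented), and `tokenize_smiles` is a hypothetical helper name.

```python
import re

# Order matters: bracket atoms such as [C@H] or [NH4+] first, then the
# two-letter halogens, then two-digit ring labels (%10), then any single
# remaining character (atoms, bonds, parentheses, ring digits).
_SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize_smiles(smiles):
    """Split a SMILES string into atom, bond, branch and ring tokens."""
    return _SMILES_TOKEN.findall(smiles)


aspirin_tokens = tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O")
alanine_tokens = tokenize_smiles("C[C@H](N)C(=O)O")
```

Because every character is consumed by exactly one alternative, joining the tokens reproduces the input string, which is a convenient sanity check when preparing a corpus.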
FIG. 1 is a system schematic diagram illustrating the system architecture and operation of the deep learning model system of the present invention. As shown in FIG. 1, the deep learning model system 1 comprises a data processing service module 2, a prediction affinity model module 3, and a sorting model module 4.
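The data processing service module 2, the prediction affinity model module 3, and the sorting model module 4 are each at least one of hardware, firmware, and software, and operate together with a processor (not shown) of the electronic device on which the deep learning model system 1 resides. The electronic device may be, for example, a personal computer, a server, an AI server, an Android phone, an iPhone, a tablet computer, or an iPad.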
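Data processing service module 2: the data processing service module 2 performs data processing on at least one type of drug data and protein data respectively, producing at least one processing result corresponding to each. The drug data and protein data here comprise the drug's Simplified Molecular Input Line Entry Specification (SMILES) string D_drug, the protein sequence D_protein, and the drug-protein structure data D_drug_protein. The data processing service module 2 may produce the processing result by running the Word2vec natural language processing model. The processing result is at least one type of sequence data, which is integrated with the drug-protein structure data D_drug_protein.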
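Prediction affinity model module 3: the data processing service module 2 feeds the processing result into the prediction affinity model module 3 for training, and the training of the prediction affinity model module 3 produces a prediction result. The processing result is at least one type of sequence data and at least one type of structure data. In other words, the sequence data and structure data are fed into the prediction affinity model module 3, which may be a convolutional neural model or a deep neural model, and are used to train it and produce a prediction result. That is, the drug vectors and protein vectors generated by the data processing above are processed to produce the prediction result, which is integrated with the drug-protein structure data D_drug_protein.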
To rank the prediction result, the result of integrating the prediction result with the drug-protein structure data D_drug_protein is fed into a sub-network of the deep network model. The resulting multi-modal architecture model is then fed into the prediction affinity model module 3 formed by the sub-networks, which processes it to produce the mean squared error D_MSE.
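The sorting model module 4 receives the mean squared error D_MSE from the prediction affinity model module 3, performs the ranking, and obtains the concordance index D_CI. The data processing handles the drug data and protein data according to their data types; the processing result it produces is at least one type of sequence data and at least one type of structure data, and the prediction affinity model module 3 is a deep network model.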
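For reference, the two evaluation quantities named above can be computed as in the following Python sketch. It uses the textbook definitions of mean squared error and of the concordance index for continuous targets (pairs with equal true values are skipped, tied predictions count one half); whether D_MSE and D_CI are computed exactly this way inside the modules is an assumption, and both function names are hypothetical.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between measured and predicted affinities."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def concordance_index(y_true, y_pred):
    """Fraction of comparable pairs whose predicted order matches the
    measured order. Pairs with equal measured values are not comparable;
    a tie in the predictions counts as half a concordant pair."""
    concordant, comparable = 0.0, 0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            if y_true[i] == y_true[j]:
                continue
            comparable += 1
            d_true = y_true[i] - y_true[j]
            d_pred = y_pred[i] - y_pred[j]
            if d_pred == 0:
                concordant += 0.5
            elif (d_true > 0) == (d_pred > 0):
                concordant += 1.0
    return concordant / comparable if comparable else float("nan")
```

A concordance index of 1.0 means every comparable pair is ranked correctly, 0.5 corresponds to random ordering, and the mean squared error complements it by measuring how far the predicted values themselves are from the measurements.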
In the present invention, structural information on protein-drug interaction is considered in addition to protein sequences and drug strings. All three types of information are applied to the model and are preprocessed according to the properties of proteins and the conversion rules for drug strings. Many studies are based on the four protein data sets commonly used in pharmacology that were provided by Yamanishi et al. The present invention, taking into account factors such as multiplexed and superfamily-wide analysis, does not use those data sets and instead adopts the drug-target data set from the study by Daniel A. Bachovchin et al. The targets in that study are examined for inhibition of serine hydrolase activity. Serine hydrolases are one of the largest and most diverse enzyme classes known in living organisms: in mammals they account for about 1% of all proteins, and humans have about 240 members. Serine hydrolases play many key roles in biological processes, including blood homeostasis, neural signaling, and glucose homeostasis in pathophysiological processes, and they are also important in viruses and bacteria, for example in bacterial and viral infection, life cycle, virulence, and drug resistance. In that earlier work, the researchers addressed these more than 200 human serine hydrolases and their physiological compounds by subjecting the enzymes to high-throughput, multiplexed analysis to explore their properties, and by building an analysis platform. Using 96 serine hydrolase members and 52 drugs, they observed and recorded the inhibitory concentrations and the properties resulting from the interactions, with the aim of improving inhibitor selectivity for more human serine hydrolases.
In the present invention, besides the amino acid properties and chemical formula information carried by the protein primary structure (sequence) and the drug string, the three-dimensional structures of the protein and the drug also carry extremely important information that lets the model learn more comprehensive drug-target pair features. The 3D structure of the protein (target) and the automatically generated binding modes of the drug (ligand) in conventional docking simulation software are therefore used to compute structure-based interaction features between the protein and the drug. These features are combined with other data types, such as the one-dimensional protein sequence and the drug SMILES string, and a method is proposed that uses this combination to build a deep learning model and predict the binding affinity of protein-drug interactions. To that end, the present invention uses Word2vec, a model for producing word vectors and one of the NLP methods used in deep learning, to learn the one-dimensional protein sequences and drug SMILES strings. Following the grammar encoded in the protein amino acid sequence and the SMILES string, the model maps these sequences and strings to feature vectors. The converted vectors are combined with the interaction features and fed into the deep model for prediction.
In addition, a multi-modal model is proposed to learn these data representations and interaction features and to predict binding affinity, and it is compared with common machine learning models such as random forest and gradient boosting. The results show that combining different data types improves both the mean squared error (MSE) and the concordance index (C-index) ranking by 10 to 15% compared with using a single data type. Going forward, the aim is to predict drug-target interactions more accurately and with better ranking by combining more data types, and to build a platform for multiplexed analysis across many protein families and polypharmacology, in order to explore their unknown properties.
第2圖為一流程圖,用以顯示說明利用如第1圖中之本發明之深度學習模型系統以進行深度學習模型方法的流程步驟。FIG. 2 is a flow chart illustrating the process steps of performing a deep learning model method using the deep learning model system of the present invention as shown in FIG. 1 .
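In step 101, data processing is performed. The data processing service module 2 performs data processing on at least one type of drug data (drug string data, that is, the ligand SMILES string) and protein data (protein sequence data, for example a one-dimensional protein sequence feature) respectively, producing at least one processing result corresponding to each. Here, Word2vec, one of the NLP methods used in deep learning, is used to learn the at least one protein sequence and the at least one ligand SMILES string: following the grammar encoded in the protein amino acid sequences and SMILES strings, it learns low-dimensional feature vectors for these sequences and strings. The method then proceeds to step 102.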
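Here, in step 101, the data processing service module 2 performs data processing on at least one type of drug data and protein data respectively, producing at least one processing result corresponding to each. The drug data and protein data here comprise the drug's Simplified Molecular Input Line Entry Specification (SMILES) string D_drug, the protein sequence D_protein, and the drug-protein structure data D_drug_protein. The data processing service module 2 may produce the processing result by running the Word2vec natural language processing model. The processing result is at least one type of sequence data, which is integrated with the drug-protein structure data D_drug_protein.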
Regarding Word2Vec, the present invention uses the skip-gram model, so only skip-gram is discussed. In the skip-gram model, words farther from the current word receive smaller weights, while nearby words receive larger weights. Because its unit of analysis is small, it yields more training samples, trains well, and handles low-frequency words well. The architecture used in the present invention adopts the Word2Vec model for protein sequences proposed by Asgari and Mofrad, called ProtVec, and the Word2Vec model for drug Simplified Molecular Input Line Entry Specification (SMILES) sequences proposed by Öztürk, Ozkirimli, et al., called SMILESVec. Both SMILESVec and ProtVec are skip-gram architectures.
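How a sequence becomes skip-gram training data can be sketched in Python as follows. This is an illustrative sketch only: the overlapping 3-character "words", the window size, and the function names `kmer_tokens` and `skipgram_pairs` are assumptions made for the example and are not specified in this description; the published ProtVec and SMILESVec models define their own tokenization.

```python
from typing import List, Tuple


def kmer_tokens(sequence: str, k: int = 3) -> List[str]:
    """Split a sequence into overlapping k-character "words" (assumed tokenization)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Return (center, context) pairs: each word is paired with its neighbors in the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Toy inputs: a short amino acid sequence and the SMILES string of acetic acid.
protein_words = kmer_tokens("MKTAYIAK", k=3)
smiles_words = kmer_tokens("CC(=O)O", k=3)
pairs = skipgram_pairs(protein_words, window=1)
```

A skip-gram model would then be trained to predict the context word from the center word in each pair, and the learned word vectors would serve as the low-dimensional representation of the sequence.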
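In step 102, training and prediction are performed. The processing result is fed into a prediction affinity model module 3 for training; here, these representations are combined with the protein-ligand docking interaction features and fed into the deep model of the prediction affinity model module 3 for prediction. A prediction result is then generated through the training of the prediction affinity model module 3, and the process proceeds to step 103.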
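Here, in step 102, the processing result is fed into the prediction affinity model module 3 for training, and a prediction result is generated through that training. That is, the sequence data and the structure data are fed into the prediction affinity model module 3, which may be a convolutional neural model or a deep neural model, for training to generate a prediction result; in other words, the drug vectors and protein vectors produced by the data processing are processed to generate the prediction result, which is integrated with the drug-protein structure data D_drug_protein.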
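Here, the prediction affinity model module 3 is built as a deep learning model that quickly and accurately predicts the interaction (affinity) between a drug and a target (protein). The 39 features obtained from AutoDock and the representations of the protein and drug sequence data learned by the Word2vec model are used to obtain the strength of the interaction between the drug and the target (protein). The resulting model is compared with other related studies and other machine learning models; cross-validation is used to select the best-performing model parameters, and the test set is used to obtain the reported results.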
Regarding deep neural networks, training a deep network can suffer from overfitting and excessively long training times. Hinton proposed Dropout to address this: during training, some neurons in the hidden layers are temporarily dropped with a given probability, their outputs being multiplied by 0 so that they stop contributing to the computation. The network thereby becomes thinner, which also reduces its time cost. In addition, during training the input of each layer is affected by the parameters of the preceding layers, so the distribution of each layer's inputs keeps changing and every layer must continually adapt to a new distribution, which complicates training. This change in data distribution during training is called internal covariate shift. Coping with it involves not only tuning the model's hyperparameters but also careful parameter initialization and learning-rate adjustment. In 2015, S. Ioffe proposed batch normalization (BN), which makes normalization part of the network architecture: during forward propagation of each mini-batch, each layer's inputs are normalized so that their mean and standard deviation become 0 and 1, and the data are then scaled and shifted.
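The dropout step described above can be illustrated with a small Python sketch. It is a sketch under stated assumptions rather than the implementation used in the invention: the function name `dropout`, the drop rate, and the rescaling of the surviving activations ("inverted dropout") are choices made for the example and are not stated in this description.

```python
import random
from typing import List


def dropout(activations: List[float], rate: float, rng: random.Random) -> List[float]:
    """Zero each activation with probability `rate`; intended for training time only."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    # Assumption (not in the text): survivors are divided by `keep` so that the
    # expected value of each activation is unchanged ("inverted dropout").
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


hidden = [0.5, 1.0, 1.5, 2.0]
dropped = dropout(hidden, rate=0.5, rng=random.Random(0))
unchanged = dropout(hidden, rate=0.0, rng=random.Random(0))
```

With `rate=0.0` every neuron is kept, so the output equals the input; at inference time dropout is simply not applied.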
In the present invention, in every model architecture used, the BN layer is placed before the activation function layer.
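The ordering "BN first, activation second" can be sketched for a single feature over one mini-batch as follows. The names `batch_norm` and `relu`, the choice of ReLU as the activation, and the values of `gamma`, `beta`, and `eps` are assumptions made for illustration; the description does not name the activation function or these parameters.

```python
import math
from typing import List


def batch_norm(values: List[float], gamma: float = 1.0, beta: float = 0.0,
               eps: float = 1e-5) -> List[float]:
    """Normalize one feature over a mini-batch to mean 0 and std 1, then scale and shift."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


def relu(values: List[float]) -> List[float]:
    """Example activation function (assumed; the text does not name one)."""
    return [max(0.0, v) for v in values]


pre_activation = [1.0, 2.0, 3.0, 4.0]    # one feature across a mini-batch of four samples
normalized = batch_norm(pre_activation)  # the BN layer comes first
activated = relu(normalized)             # the activation layer follows it
```

In a trained network `gamma` and `beta` would be learned parameters; here they are fixed at 1 and 0 so that only the normalization is visible.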
Regarding the convolutional neural network (CNN), the convolutional layers in the CNN architecture used in the present invention are one-dimensional. One-dimensional CNNs are suited to sequential data such as time series; in the protein and drug data sets used here, both the protein amino acid sequences and the drug SMILES strings are one-dimensional, and one-dimensional convolutional layers learn and extract features from them. A convolutional layer produces a set of parallel feature maps. Taking images as an example, convolution preserves the relationships between pixels by learning image features from small patches of the input, and it is formed by sliding different convolution kernels over the input image and performing a fixed operation at each position. It is a mathematical operation that takes two inputs, such as an image matrix and a filter (kernel), so every element of a feature map is computed by repeatedly applying the same kernel.
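The sliding-kernel operation for the one-dimensional case can be sketched as follows. This is a minimal illustration, assuming stride 1 and no padding, and it computes the cross-correlation form that deep learning frameworks conventionally call convolution; the function name `conv1d` and the numeric values are made up for the example.

```python
from typing import List


def conv1d(signal: List[float], kernel: List[float]) -> List[float]:
    """Slide `kernel` over `signal` (stride 1, no padding) and return one feature map."""
    n, k = len(signal), len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k)) for i in range(n - k + 1)]


# One channel of an embedded sequence and one kernel of width 3.
feature_map = conv1d([1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 0.0, -1.0])
```

A convolutional layer applies many such kernels in parallel, each producing its own feature map.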
Regarding multi-modal learning, the multi-modal learning used in the present invention is oriented toward multi-modal representation learning. Single-modal representation learning represents data as numerical vectors that a model can process, or abstracts them further into higher-level feature vectors; in many studies and applications, deep learning models have been very successful at supervised learning on single-modality representations. Multi-modal representation learning uses several modalities to learn better feature representations and combines the features of each modality so that the model learns more from the data. In the present invention, besides the feature data derived from protein-drug interaction, there are also protein sequence data and drug string data. Multi-modal learning has been applied successfully to integrating video and audio in deep learning models and to speech synthesis, but few studies have used deep learning to combine interaction features with protein target sequence data to predict binding affinity. Here, the present invention merges a DNN (for the interaction features) with a CNN (for the protein sequences and drug strings): the two models are merged at their final output layers, and several hidden layers are then stacked in series to produce the final prediction. In multi-modal learning, integrating different data types brings benefits from different aspects.
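The merge of the two branches at their output layers can be sketched as a concatenation followed by further dense layers. Everything concrete here is hypothetical: the function names `dense` and `fuse_and_predict`, the single hidden layer, the ReLU activation, and all weights are assumptions made for the example; the description only states that the DNN and CNN outputs are merged and passed through several hidden layers.

```python
from typing import List


def dense(inputs: List[float], weights: List[List[float]], biases: List[float]) -> List[float]:
    """Fully connected layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]


def fuse_and_predict(dnn_out: List[float], cnn_out: List[float],
                     hidden_w: List[List[float]], hidden_b: List[float],
                     out_w: List[float], out_b: float) -> float:
    """Concatenate both branch outputs, apply one hidden layer, and return a scalar affinity."""
    merged = dnn_out + cnn_out  # list concatenation = merging the two output layers
    hidden = [max(0.0, h) for h in dense(merged, hidden_w, hidden_b)]
    return dense(hidden, [out_w], [out_b])[0]


prediction = fuse_and_predict(
    dnn_out=[0.2, 0.4],          # hypothetical output of the DNN branch (docking features)
    cnn_out=[0.1],               # hypothetical output of the CNN branch (sequence/SMILES)
    hidden_w=[[1.0, 1.0, 1.0], [1.0, -1.0, 0.0]],
    hidden_b=[0.0, 0.0],
    out_w=[2.0, 1.0],
    out_b=0.5,
)
```

Merging late keeps each branch specialized for its own data type, while the shared hidden layers learn how the two representations interact.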
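In step 103, ranking is performed on the prediction result. The mean squared error D_MSE produced by the prediction affinity model module is transmitted to the ranking model module, which then performs the ranking and obtains the concordance index D_CI. More specifically, the result of integrating the prediction result with the drug-protein structure data D_drug_protein is fed into the sub-networks of the deep network model; the resulting multi-modal architecture model is then fed into the prediction affinity model module 3 formed by the sub-networks, which produces the mean squared error D_MSE, after which the ranking model module 4 performs the ranking and obtains the concordance index D_CI. In another embodiment, the data processing handles the drug data and protein data according to their data types, the processing result is at least one type of sequence data and at least one type of structure data, and the prediction affinity model module 3 is a deep network model. The above is only an example, and the present invention is not limited to the implementation of that embodiment.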
In the model evaluation of the present invention, two measures are used to assess and compare the predictive ability of different models: the concordance index D_CI (C-index) and the mean squared error D_MSE (MSE).
Regarding the concordance index D_CI: because the network model predicts the binding affinity of a protein-ligand interaction, and affinity is a continuous value, the present invention uses the concordance index (CI) to evaluate prediction accuracy. Mathematically, the CI over a set of paired data is the probability that the predictions for two randomly drawn protein-ligand pairs with different label values are in the correct order.
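The definition above translates directly into a pairwise count, sketched here in Python. The function name `concordance_index` and the toy values are assumptions for illustration, and counting tied predictions as half-correct is a common convention that this description does not specify.

```python
from itertools import combinations
from typing import Sequence


def concordance_index(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Fraction of pairs with different true labels whose predictions are in the same order."""
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(y_true)), 2):
        if y_true[i] == y_true[j]:
            continue  # pairs with equal labels are not comparable
        comparable += 1
        true_diff = y_true[i] - y_true[j]
        pred_diff = y_pred[i] - y_pred[j]
        if pred_diff == 0:
            concordant += 0.5  # assumed convention: a tied prediction counts as half
        elif (true_diff > 0) == (pred_diff > 0):
            concordant += 1.0
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


measured = [5.0, 6.0, 7.0, 8.0]     # toy affinity labels
predicted = [5.1, 6.2, 6.1, 8.3]    # toy predictions; one pair is out of order
ci = concordance_index(measured, predicted)
```

A value of 1.0 means every comparable pair is ordered correctly, 0.5 corresponds to random ordering, and 0.0 means every pair is reversed.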
Regarding the mean squared error D_MSE: the MSE is the expected value of the squared difference between the true label value and the predicted value; the smaller it is, the more accurate the model's predictions.
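Over a finite test set, the expectation is estimated by the sample mean of the squared errors. A minimal Python sketch, in which the function name `mean_squared_error` and the toy values are illustrative assumptions:

```python
from typing import Sequence


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average of the squared differences between true labels and predictions."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


mse = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])  # only the last prediction is off, by 2
```

Unlike the concordance index, the MSE depends on the size of each error and not only on the ordering, which is why the two measures are reported together.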
In the present invention, to evaluate the model's ability to predict and rank drug-target interactions, random forest (RF) and gradient boosting (GB) models are compared with the model built in the present invention.
FIG. 3 is a schematic diagram illustrating the system architecture and operation of an embodiment of the deep learning model system of the present invention.
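The data processing service module 2, the prediction affinity model module 3, and the ranking model module 4 are each at least one of hardware, firmware, and software, and operate in cooperation with the processor (not shown) of the electronic device hosting the deep learning model system 1. The electronic device may be, for example, a personal computer, a server, an AI server, an Android phone, an iPhone, a tablet computer, or an iPad.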
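Prediction affinity model module 3: the data processing service module 2 feeds the processing result into the prediction affinity model module 3 for training, and a prediction result is generated through that training. That is, the sequence data and structure data are fed into the prediction affinity model module 3, which may be a convolutional neural model or a deep neural model, for training; the drug vectors and protein vectors produced by the data processing are processed to generate the prediction result, which is integrated with the drug-protein structure data D_drug_protein.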
To rank the prediction result, the result of integrating the prediction result with the drug-protein structure data D_drug_protein is fed into the sub-networks of the deep network model, and the resulting multi-modal architecture model is then fed into the prediction affinity model module 3 formed by the sub-networks, which produces the mean squared error D_MSE.
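The ranking model module 4 receives the mean squared error D_MSE from the prediction affinity model module 3, performs the ranking, and obtains the concordance index D_CI. The data processing handles the drug data and protein data according to their data types; the processing result is at least one type of sequence data and at least one type of structure data, and the prediction affinity model module 3 is a deep network model.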
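Data processing service module 2: the data processing service module 2 separately processes at least one type of drug data (drug string data, i.e. the ligand Simplified Molecular Input Line Entry Specification (SMILES)) and protein data (protein sequence data, e.g. one-dimensional protein sequence features), and generates at least one processing result corresponding respectively to each. The drug data and protein data here include the drug SMILES D_drug 12, the protein sequence D_protein 11, and the drug-protein structure data D_drug_protein 13. The data processing may be carried out by the data processing service module executing the Word2vec natural language processing model; the processing result is at least one type of sequence data, which is integrated with the drug-protein structure data D_drug_protein 13.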
Here, Word2vec, one of the deep learning methods of natural language processing (NLP), is used to learn the protein sequence data (e.g. one-dimensional protein sequence features) and the drug string data (ligand SMILES strings). Natural language methods are used to explore and deepen the understanding of the language of life: following the grammar encoded in the protein amino acid sequences and in the drug SMILES strings, low-dimensional feature vectors of these protein sequences and drug strings are learned.
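Each record processed by the data processing service module 2 contains three types of data obtained from different platforms: protein-ligand docking interaction features, a protein sequence, and a ligand SMILES string (that is, the drug string data). The protein-ligand docking interaction features, for example 39 named features, are extracted after simulation with the AutoDock Vina software. The protein sequence is obtained by looking up the record's ProteinID in the UniProt database. UniProt is a comprehensive database of protein sequence and function information; much of its content comes from genome projects, and it contains a large amount of information on the biological function of proteins drawn from the research literature. The ligand SMILES string is obtained by looking up the record's CID in the PubChem database. PubChem provides bioactivity data for small organic molecules and is maintained by the National Center for Biotechnology Information (NCBI), part of the U.S. National Institutes of Health (NIH). PubChem is part of the NIH Molecular Libraries Roadmap Initiative; it can be searched free of charge on the web, and its millions of chemical records can also be downloaded free of charge via FTP. PubChem covers a wide range of organic molecules, including drugs and chemical raw materials.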
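The three-part record described above can be pictured as a small data structure. This is a sketch under stated assumptions: the class name `InteractionRecord`, its field names, and the identifier values are hypothetical, and the lookups in UniProt and PubChem are assumed to have been done beforehand; only the count of 39 docking features comes from the description.

```python
from dataclasses import dataclass
from typing import List

N_DOCKING_FEATURES = 39  # number of docking interaction features given in the text


@dataclass
class InteractionRecord:
    protein_id: str                # key used to look up the sequence (UniProt)
    cid: str                       # key used to look up the SMILES string (PubChem)
    protein_sequence: str          # one-dimensional amino acid sequence
    smiles: str                    # ligand SMILES string
    docking_features: List[float]  # features extracted after docking simulation

    def __post_init__(self) -> None:
        # Reject records whose docking feature vector has the wrong length.
        if len(self.docking_features) != N_DOCKING_FEATURES:
            raise ValueError(f"expected {N_DOCKING_FEATURES} docking features")


record = InteractionRecord(
    protein_id="P00000",           # hypothetical identifier
    cid="0000",                    # hypothetical identifier
    protein_sequence="MKTAYIAK",   # toy sequence
    smiles="CC(=O)O",              # acetic acid, used only as a toy ligand
    docking_features=[0.0] * N_DOCKING_FEATURES,
)
```

Keeping the two lookup keys alongside the retrieved data makes each record traceable back to its source databases.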
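The drug data include the ligand (drug) SMILES D_drug 12, the protein sequence data (protein data) include the protein sequence D_protein 11, and the protein-ligand docking interaction features include the drug-protein structure data D_drug_protein 13. The data processing of the data processing service module 2 may be carried out by executing the Word2vec natural language processing model; the processing result is at least one type of sequence data, which is integrated with the drug-protein structure data D_drug_protein 13 of the protein-ligand docking interaction features.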
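The data processing service module 2 uses Word2vec, a model for producing word vectors and one of the deep learning methods of natural language processing (NLP), to learn one-dimensional protein sequences and ligand SMILES strings. In the present invention, besides the amino acid properties and chemical formula information carried by the protein primary structure (sequence) and the drug string, the three-dimensional structures of the protein and the drug also carry extremely important information that lets the model learn more complete drug-target docking features. Therefore, the 3D structure of the protein (target) and the automatically generated binding modes of the drug (ligand) in docking simulation software are used to compute structure-based interaction features between the protein and the drug. These features are combined with other data types, such as one-dimensional protein sequence data and drug SMILES strings, and a deep learning method that combines these data types to predict the binding affinity of protein-drug interactions is proposed. To this end, the Word2vec model is applied to the one-dimensional protein sequences and drug SMILES strings, using existing natural language methods to explore and deepen the understanding of the language of life: following the grammar encoded in protein amino acid sequences and SMILES strings, the sequences and strings are mapped to feature vectors, and the converted vectors are combined with the interaction features and input into the prediction affinity model module 3 for prediction.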
The present invention differs from the prior art and literature described above. Conventional docking simulation software, such as AutoDock Vina, GOLD, ZDOCK, RDOCK, and HADDOCK, can predict binding affinity in addition to predicting and ranking docking sites. However, to limit computation time, such software simplifies many of its calculation formulas, so much information that matters for affinity, such as the water solvent or hydrogen bond strength, is not taken into account. As a result, docking simulation software does not predict binding affinity very well. Accuracy can be improved by combining it with molecular dynamics (MD) simulation or quantum mechanics (QM) calculations, but doing so costs a great deal of time.
In the present invention, the 3D structure of the protein and the automatically generated binding modes of the ligand in docking simulation software are used to compute structure-based interaction features between the protein and the ligand. These features are combined with other data types, such as one-dimensional protein sequence data and ligand SMILES strings, and a deep learning method that combines these data types to predict the binding affinity of protein-ligand interactions is proposed. To this end, Word2vec, a model for producing word vectors and one of the deep learning methods of natural language processing (NLP), is used to learn the one-dimensional protein sequences and ligand SMILES strings. Existing natural language methods are thus used to explore and deepen the understanding of the language of life: following the grammar encoded in protein amino acid sequences and SMILES strings, low-dimensional feature vectors of these sequences and strings are learned, and these representations are combined with the interaction features and fed into the deep model for prediction.
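Prediction affinity model module 3: here, the data processing service module 2 uses Word2vec, a model for producing word vectors and one of the deep learning methods of natural language processing (NLP), to learn the protein sequence data, for example one-dimensional protein sequences, and the ligand (drug) SMILES strings. In the present invention, besides the amino acid properties and chemical formula information carried by the protein primary structure (sequence) and the drug string, the three-dimensional structures of the protein and the drug also carry extremely important information that lets the prediction affinity model module 3 learn more complete drug-target docking features (protein-ligand docking interaction features).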
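In addition, for the prediction affinity model module 3, the data processing service module 2 feeds the processing result into the prediction affinity model module 3 for training, and a prediction result is generated through that training. That is, the sequence data and structure data are fed into the prediction affinity model module 3, which may be a convolutional neural model or a deep neural model, for training; the drug vectors and protein vectors produced by the data processing are processed to generate the prediction result, which is integrated with the drug-protein structure data D_drug_protein 13.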
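Here, the data processing service module 2 and/or the prediction affinity model module 3 can use the 3D structure of the protein (target) and the automatically generated binding modes of the drug (ligand) in docking simulation software to compute the structure-based interaction features between the protein and the drug (the protein-ligand docking interaction features). These features are combined with other data types, for example one-dimensional protein sequence data and drug SMILES strings, and a deep learning method that combines these data types to predict the binding affinity of protein-drug interactions is proposed. To this end, the present invention applies Word2vec, a model for producing word vectors and one of the deep learning methods of natural language processing (NLP), to the one-dimensional protein sequences and drug SMILES strings, using existing natural language methods to explore and deepen the understanding of the language of life: following the grammar encoded in protein amino acid sequences and SMILES strings, the sequences and strings are mapped to feature vectors, and the converted vectors are combined with the interaction features and input into the prediction affinity model module 3 for prediction.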
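In addition, in the prediction affinity model module 3, multi-modal learning is also used to learn these data representations and interaction features and to predict binding affinity. The approach is compared with common machine learning models such as random forest and gradient boosting. The results show that combining different types of data improves both the mean squared error (MSE) and the concordance index (C-index) ranking by 10~15% compared with using a single type of data. Combining more types of data is expected to yield more accurate and better-ranked drug-target interaction predictions, and a platform can be established to perform multiplexed analyses across many protein families and polypharmacology in order to explore their unknown properties.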
The targets used in the present invention are studied for inhibition of enzyme activity, the enzymes being serine hydrolases. Serine hydrolases are one of the largest and most diverse enzyme classes known; in mammals they account for about 1% of all proteins, and humans have about 240 members. Serine hydrolases play key roles in many biological processes, including pathophysiological processes such as blood homeostasis, neural signaling, and glucose homeostasis. They are also important in viruses and bacteria, for example in bacterial and viral infection, life cycles, virulence, and drug resistance. Of the more than 200 human serine hydrolases, many have poorly characterized physiological substrates and functions, and most lack selective inhibitors. Serine hydrolases are therefore subjected to high-throughput, multiplexed analysis to characterize them, and an analysis platform is built in which 96 serine hydrolase members are tested against 52 drugs, recording the inhibitory concentrations and the characteristics of their interactions, in order to improve inhibitor selectivity for more human serine hydrolases.
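Prediction affinity model module 3: at least one processing result is fed into the prediction affinity model module 3 for training, and a prediction result is generated through the training of its prediction affinity model. That is, the sequence data and structure data are fed into the prediction affinity model module 3, which may be a convolutional neural model or a deep neural model, for training; the drug vectors and protein vectors produced by the data processing are processed to generate the prediction result, and the data processing service module 2 and/or the prediction affinity model module 3 integrate the generated prediction result with the drug-protein structure data D_drug_protein 13.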
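Ranking model module 4: the ranking model module 4 ranks the prediction result. The result of integrating the prediction result with the drug-protein structure data D_drug_protein 13 is fed into the sub-networks of a deep network model (here, for example, the prediction affinity model module 3 is the deep network model); the resulting multi-modal architecture model is then fed into the prediction affinity model module 3 formed by the sub-networks, which produces the mean squared error D_MSE and transmits it to the ranking model module 4. The ranking model module 4 then performs the ranking and obtains the concordance index D_CI.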
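One simple reading of the ranking step is to order candidate drug-protein pairs by their predicted affinity. The sketch below is only an interpretation: the function name `rank_by_affinity`, the drug and protein labels, and the assumption that a larger predicted value means stronger binding are all hypothetical and are not stated in this description.

```python
from typing import List, Tuple

Prediction = Tuple[str, str, float]  # (drug, protein, predicted affinity)


def rank_by_affinity(predictions: List[Prediction]) -> List[Prediction]:
    """Sort predictions so that the highest predicted affinity comes first (assumed order)."""
    return sorted(predictions, key=lambda item: item[2], reverse=True)


predicted = [
    ("drug_A", "protein_X", 5.2),  # hypothetical labels and values
    ("drug_B", "protein_X", 7.9),
    ("drug_C", "protein_X", 6.4),
]
ranked = rank_by_affinity(predicted)
```

The quality of such an ordering against measured affinities is what the concordance index D_CI quantifies.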
Here, in one embodiment, the data processing handles the drug data and protein data according to their data types; the processing result is at least one type of sequence data and at least one type of structure data, and the prediction affinity model module 3 is a deep network model.
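The data processing service module 2 uses Word2vec, a model for producing word vectors and one of the deep learning methods of natural language processing (NLP), to learn one-dimensional protein sequences and ligand SMILES strings. In the present invention, besides the amino acid properties and chemical formula information carried by the protein primary structure (sequence) and the drug string, the three-dimensional structures of the protein and the drug also carry extremely important information that lets the model learn more complete drug-target docking features. Therefore, the 3D structure of the protein (target) and the automatically generated binding modes of the drug (ligand) in docking simulation software can be used to compute structure-based interaction features between the protein and the drug.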
In the present invention, the 3D structure of the protein and the automatically generated binding modes of the ligand in docking simulation software are used to compute structure-based interaction features between the protein and the ligand. These features are combined with other data types, such as one-dimensional protein sequence data and ligand SMILES strings, and a deep learning method that combines these data types to predict the binding affinity of protein-ligand interactions is proposed. To this end, Word2vec, a model for producing word vectors and one of the deep learning methods of natural language processing (NLP), is used to learn the one-dimensional protein sequences and ligand SMILES strings. Existing natural language methods are thus used to explore and deepen the understanding of the language of life: following the grammar encoded in protein amino acid sequences and SMILES strings, low-dimensional feature vectors of these sequences and strings are learned.
FIG. 4 is a flow chart illustrating a process of the deep learning model method performed using the embodiment of the deep learning model system of the present invention shown in FIG. 3.
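In step 201, data processing is performed. The data processing service module 2 separately processes at least one type of drug data (drug string data, i.e. the ligand Simplified Molecular Input Line Entry Specification (SMILES)) and protein data (protein sequence data, e.g. one-dimensional protein sequence features), and generates at least one processing result corresponding respectively to the drug data and the protein data. Here, Word2vec, one of the deep learning methods of natural language processing (NLP), is used to learn the protein sequence data and the ligand SMILES strings. Natural language methods are thus used to explore and deepen the understanding of the language of life: following the grammar encoded in protein amino acid sequences and SMILES strings, low-dimensional feature vectors of these sequences and strings are learned. The process then proceeds to step 202.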
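Here, in step 201, the data processing service module 2 separately processes at least one drug data and protein data and generates at least one processing result corresponding respectively to each. The drug data and protein data here include the drug's Simplified Molecular Input Line Entry Specification (SMILES) D_drug 12, the protein sequence D_protein 11, and the drug-protein structure data D_drug_protein 13. The data processing may be carried out by the data processing service module 2 executing a natural language processing (word2vec) model; the processing result is at least one type of sequence data, which is integrated with the drug-protein structure data D_drug_protein 13.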
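In step 202, training and prediction are performed. The processing result is fed into a prediction affinity model module 3 for training; here, these representations are combined with the protein-ligand docking interaction features and fed into the deep model of the prediction affinity model module 3 for prediction. A prediction result is then generated through the training of the prediction affinity model module 3, and the process proceeds to step 203.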
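Here, in step 202, the processing result is fed into the prediction affinity model module 3 for training, and a prediction result is generated through that training. That is, the sequence data and the structure data are fed into the prediction affinity model module 3, which may be a convolutional neural model or a deep neural model, for training to generate a prediction result; in other words, the drug vectors and protein vectors produced by the data processing are processed to generate the prediction result, which is integrated with the drug-protein structure data D_drug_protein 13.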
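In step 203, ranking is performed on the prediction result: the result of integrating the prediction result with the drug-protein structure data D_drug_protein is fed into the sub-networks of the deep network model; the resulting multi-modal architecture model is then fed into the prediction affinity model module 3 formed by the sub-networks, which produces the mean squared error D_MSE, after which the ranking model module 4 performs the ranking and obtains the concordance index D_CI. In another embodiment, the data processing handles the drug data and protein data according to their data types, the processing result is at least one type of sequence data and at least one type of structure data, and the prediction affinity model module 3 is a deep network model. The above is only an example, and the present invention is not limited to the implementation of that embodiment.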
The above describes only preferred embodiments of the present invention and is not intended to limit its scope; any equivalent changes or modifications made without departing from the spirit disclosed by the present invention shall fall within the scope of the following claims.
It will be apparent to those of ordinary skill in the art that the present invention may be embodied in other specific forms without departing from its spirit. The above description should therefore be construed in all respects as illustrative and not restrictive.
The scope of the present invention should be determined by a reasonable interpretation of the appended claims, and all changes within the equivalent scope of the present invention are included in its scope.
1: Deep learning model system
2: Data processing service module
3: Prediction affinity model module
4: Ranking model module
11: D_protein
12: D_drug
13: D_drug_protein
101, 102, 103: Steps
201, 202, 203: Steps
Figure 1 is a system schematic diagram showing the system architecture and operation of the deep learning model system of the present invention;
Figure 2 is a flow chart showing the steps of the deep learning model method carried out with the deep learning model system of Figure 1;
Figure 3 is a schematic diagram showing the system architecture and operation of an embodiment of the deep learning model system of the present invention;
Figure 4 is a flow chart showing the steps of the deep learning model method carried out with the embodiment of the deep learning model system of Figure 3.
Claims (8)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| TW112149805A TWI874059B (en) | 2023-12-20 | 2023-12-20 | A deep learning model system and method thereof |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| TWI874059B (en) | 2025-02-21 |
| TW202526963A (en) | 2025-07-01 |
Family
ID=95557447
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| TW112149805A TWI874059B (en) | 2023-12-20 | 2023-12-20 | A deep learning model system and method thereof |
Country Status (1)
| Country | Link |
|---|---|
| TW (1) | TWI874059B (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118675621A (en) * | 2024-06-07 | 2024-09-20 | 杭州灵思智康科技有限公司 | Nucleic acid sequence generation method based on reinforcement learning and context-free grammar |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115312125A (en) * | 2022-03-03 | 2022-11-08 | 华东师范大学 | A deep learning method for predicting drug-target interactions based on biological substructures |
| TW202309919A (en) * | 2021-07-16 | 2023-03-01 | 國立清華大學 | Drug ranking method and system, comparison method and system for drug ranking method, correction method for drug ranking method and new use of drug selected using the same |
| CN113160894B (en) * | 2021-04-23 | 2023-10-24 | 平安科技(深圳)有限公司 | Method, device, equipment and storage medium for predicting interaction between medicine and target |
| TW202349275A (en) * | 2022-03-07 | 2023-12-16 | 美商應用材料股份有限公司 | Synthetic time series data associated with processing equipment |
2023-12-20: TW application TW112149805A, patent TWI874059B (en), status: active
Also Published As
| Publication number | Publication date |
|---|---|
| TW202526963A (en) | 2025-07-01 |