
CN110189795B - Sub-space learning-based detection method for subgroup-specific driving genes - Google Patents

Sub-space learning-based detection method for subgroup-specific driving genes

Info

Publication number
CN110189795B
Authority
CN
China
Prior art keywords
subspace
subgroup
gene
genes
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910366338.1A
Other languages
Chinese (zh)
Other versions
CN110189795A (en)
Inventor
习佳宁
袁细国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN201910366338.1A
Publication of CN110189795A
Application granted
Publication of CN110189795B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20: Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Analytical Chemistry (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)

Abstract

The invention belongs to the technical field of cancer variant gene detection and discloses a subspace learning-based method for detecting subgroup-specific driver genes. Variation data of mutated genes in cancer samples serve as input; a subspace learning algorithm outputs a low-dimensional vector for each gene, and the magnitudes of that vector's coordinates along the different subspace dimensions reflect the gene's subgroup specificity. The low-dimensional gene vectors are then analyzed by outlier detection, and genes whose vectors are outliers in the subspace are identified as subgroup-specific driver genes. The invention can infer the subgroup affiliation of driver genes when that affiliation is unknown. To address missed detection of driver genes that coexist in multiple subgroups, it applies a subgroup-specific outlier determination to the multimodal distribution in the subspace, which strengthens the outlier significance of driver genes in the learned representation and improves detection performance for subgroup-specific driver genes.

Description

A method for subgroup-specific driver gene detection based on subspace learning

Technical field

The invention belongs to the technical field of computational detection of cancer variant genes, and in particular relates to a subgroup-specific driver gene detection method based on subspace learning.

Background

The closest prior art at present: existing methods detect driver genes mainly by gene mutation frequency or frequency significance, and can therefore identify only driver genes shared across cancer samples. In heterogeneous cancers, some subgroups account for a small proportion of all samples, so the mutation frequency of the corresponding subgroup-specific driver genes is low and they are difficult to detect effectively. Although academic research in China and abroad has achieved some results in detecting shared driver genes, detection of subgroup-specific driver genes is still in its infancy. In recent years, tumor heterogeneity has received wide attention in cancer driver gene analysis. Because heterogeneous cancers contain samples from different subgroups, a subgroup-specific driver gene shows a high mutation frequency only in samples of its own subgroup and a relatively low mutation frequency across all samples, so methods based on mutation frequency or frequency significance struggle to detect it. Existing subgroup-specific driver gene detection methods mainly rely on known subgroup labels, or on sample classes obtained by biclustering, and detect the shared driver genes within each class of samples separately.

For heterogeneous cancers, current research mainly addresses the ideal case in which subgroup labels are known: the shared driver genes detected within each subgroup's samples are taken as that subgroup's specific driver genes. OncodriveCLUST is a commonly used method for detecting shared driver genes. It clusters genes according to the positions of their mutations across samples and selects genes whose clustering results deviate significantly as driver genes. Although the method is effective in the ideal case where sample classes are known, it no longer applies when the subgroup classes of the cancer samples are unknown.

When subgroup labels are missing, a biclustering algorithm can cluster the cancer gene variation data and assign each sample to a cluster. Building on this, existing research has begun to develop biclustering-based detection algorithms that predict both the subgroup to which each heterogeneous cancer sample belongs and the subgroup-specific driver genes. However, existing biclustering algorithms analyze each cluster of cancer samples in isolation. Because the mutation frequency of a multi-subgroup coexisting driver gene is unevenly distributed across subgroups, biclustering methods miss such genes in some of the subgroups.

In summary, the problems with the prior art are:

(1) The subgroup affiliation of driver genes is unknown. Because the subgroup affiliation of heterogeneous cancer samples is unclear, subgroup-specific driver genes are difficult to detect.

(2) Multi-subgroup coexisting driver genes are missed. For driver genes that mutate in several subgroups at once, the mutation frequency is unevenly distributed across subgroups, so even when existing methods take the intersection of the per-subgroup results, they cannot avoid missing such genes in some subgroups.

The points to be improved in solving the above technical problems are as follows. In the automatic analysis of ex vivo cancer sample data, the subgroup affiliation of driver genes is unknown; existing subspace learning methods offer no subgroup-specific guidance and so cannot localize driver genes to subgroups. As for missed detection of multi-subgroup coexisting driver genes, existing methods mainly judge each subgroup in isolation, so these genes, whose mutation frequency is unevenly distributed across subgroups, lose their outlier significance and are hard to detect. The technical solution of the present application improves the effectiveness of existing analysis models; when the method is applied in an automatic screening and analysis system for cancer genes, it can improve the validity of the detection results.

The significance of solving the above technical problems: subgroup-specific driver genes not only indicate how cancer samples diverge at the level of pathogenesis but also reflect differences in drug resistance among cancer lesions in clinical treatment. The technique described here detects subgroup-specific driver genes and thus has scientific value for revealing the pathogenesis of heterogeneous cancers, as well as clinical potential for informing diagnosis and treatment in precision cancer medicine.

Summary of the invention

To address the problems in the prior art, the present invention provides a subgroup-specific driver gene detection method based on subspace learning.

The invention is implemented as follows. The subgroup-specific driver gene detection method based on subspace learning takes the variation data of mutated genes in cancer samples as input and obtains a low-dimensional vector for each gene through a subspace learning algorithm; the magnitudes of each gene vector's coordinates along the different subspace dimensions reflect the gene's subgroup specificity. The low-dimensional gene vectors are analyzed by outlier detection, and genes corresponding to outlier vectors in the subspace are identified as subgroup-specific driver genes.

Further, the method specifically comprises the following steps:

(1) Gene symbol unification and data alignment of cancer samples. Gene variation data for heterogeneous cancers are obtained from public cancer genome databases such as TCGA and ICGC. Because samples from different platforms may use synonymous gene symbols for the same gene, gene symbols are unified through Entrez Gene IDs, and the genetic variations of each sample, such as single-nucleotide mutations and copy number variations, are matched to the corresponding genes to obtain the mutated-gene data of the cancer samples. All sample data are then aligned by gene name to build a cancer variation data matrix describing how each gene's variations are distributed across samples;

(2) Construction of the subgroup-specifically guided subspace mapping. A subspace learning method extracts from the data matrix the distribution pattern of gene variations across samples and, based on this pattern, represents each candidate gene as a vector point in the subspace. In constructing the subspace mapping, stochastic gradient descent minimizes the residual sum of squares between the original data and the data reconstructed by the inverse mapping, so that the inverse mapping approximately reconstructs the original variation data;

Different dimensional directions of the subspace are used to indicate the potential subgroups to which driver genes belong. Once the number of subgroup dimensions of the subspace is fixed, the position of a subspace vector point relative to a coordinate axis or a coordinate plane represents the driver gene's single-subgroup or multi-subgroup affiliation, respectively. According to the distribution density of the subspace vector points, the dimensions of the subspace are rotated by an orthogonal transformation using a space-transformation coordinate guidance method;

(3) Subgroup-specific localization of driver genes on coordinate axes and coordinate planes. A non-negativity constraint is imposed on the subspace coordinates, so that the magnitude of each coordinate of a subspace vector reflects how strongly the driver gene belongs to the corresponding subgroup dimension. The subspace vector of a shared driver gene should belong to all subgroup dimensions simultaneously, that of a multi-subgroup coexisting driver gene to some of the subgroup dimensions, and that of a single-subgroup-specific driver gene to exactly one subgroup dimension;

(4) Subgroup affiliation determination based on the subspace vector coordinates of driver genes. Based on the subgroup-specific localization from the previous step, each subspace vector point is tested for whether it lies on a coordinate axis, on a coordinate plane, or in the first octant. Points on a coordinate axis indicate single-subgroup-specific driver genes, points on a coordinate plane indicate multi-subgroup coexisting driver genes, and points in the first octant far from the origin represent driver genes shared by all samples;

(5) Spatial-coordinate representation of the multimodal distribution of subgroup-specific driver genes. In the dimensionality reduction result, irrelevant genes concentrate near the origin of the subspace, whereas the vector points of driver genes form a multimodal distribution along the coordinate axes and coordinate planes. Using the subgroup affiliations inferred from the gene vector coordinates, the result is represented as a spatially oriented multimodal distribution of subgroup-specific driver genes. This describes how far each gene's vector point lies from the bulk along each dimension and provides the criterion for the subsequent outlier-based detection of subgroup-specific driver genes;

(6) Estimation of the hyperellipsoid semi-principal axes for the outlier boundary of the multimodal distribution. The traditional strategy of an isolated boundary per subgroup misses multi-subgroup coexisting driver genes, so a hyperellipsoid boundary is used to discriminate them. To this end, the covariance matrix of all subspace vectors is computed, and its eigenvalues and eigenvectors are obtained by eigendecomposition. The eigenvectors are used to rotate the subspace by an orthogonal transformation, and the dimensions with the larger eigenvalues in the rotated subspace indicate the subgroup dimensions. The reciprocal of the square root of each eigenvalue is the estimated length of the corresponding semi-principal axis and is used to construct the hyperellipsoid boundary.

(7) Detection of subgroup-specific driver genes by subspace outlier determination. In the subspace, the hyperellipsoid boundary obtained in step (6) is used to identify outliers in the multimodal distribution of gene vector points. Genes whose vector points lie inside the hyperellipsoid are judged to be cancer-irrelevant genes. Genes whose vector points lie outside it are predicted to be single-subgroup-specific or multi-subgroup coexisting driver genes, according to whether the point lies on a coordinate axis or a coordinate plane indicated by the subgroup dimensions.
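The decision rule in steps (4) and (7) can be sketched as follows. This is a minimal illustration, assuming the gene vectors are already non-negative, sparse, and expressed in the rotated subspace, and that the semi-axis lengths are supplied. The function name `classify_genes`, the tolerance `tol`, and the toy values are hypothetical and do not come from the patent.

```python
import numpy as np

def classify_genes(Z, semi_axes, tol=1e-8):
    """Label each gene vector (row of Z) from its position in the subspace.

    Assumes Z is non-negative and already rotated so that each axis
    corresponds to one subgroup dimension (an assumption of this sketch).
    """
    labels = []
    for z in Z:
        # Outside the hyperellipsoid: sum_k (z_k / a_k)^2 > 1
        outside = np.sum((z / semi_axes) ** 2) > 1.0
        # Number of subgroup dimensions with a non-zero coordinate
        n_active = int(np.count_nonzero(z > tol))
        if not outside:
            labels.append("irrelevant")
        elif n_active == 1:
            labels.append("single-subgroup")   # point on a coordinate axis
        elif n_active < len(z):
            labels.append("multi-subgroup")    # point on a coordinate plane
        else:
            labels.append("shared")            # point in the first octant
    return labels

# Toy example with three subgroup dimensions (values are illustrative only)
Z = np.array([[0.05, 0.02, 0.0],
              [2.0, 0.0, 0.0],
              [1.5, 1.2, 0.0],
              [1.0, 1.0, 1.0]])
semi_axes = np.array([0.5, 0.5, 0.5])
labels = classify_genes(Z, semi_axes)
```

Counting non-zero coordinates only works because of the sparsity and non-negativity assumptions; with dense vectors a threshold on relative coordinate size would be needed instead.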

Further, in the third step, the subgroup-specific localization of driver genes on coordinate axes and coordinate planes introduces a sparsification strategy for the gene subspace vectors: sparse regularization is applied to the subspace vectors, giving the following subgroup-specifically guided subspace learning formula for gene subspace vectors:

Figure BDA0002048316570000051

Figure BDA0002048316570000052

Here, x_i is the vector of variation data of the i-th gene across samples, z_i is the subspace vector of the i-th gene, W is the parameter matrix of the subspace mapping and inverse mapping, and λ is the sparse regularization parameter for the subspace vectors z_i.

During dimensionality reduction of the gene variation data, the sparse subspace learning strategy keeps the subspace coordinates of genes with subgroup-specific tendencies sparse, which pins the corresponding vector points to coordinate axes or coordinate planes. In the subspace vector of a driver gene, the dimensions holding the sparse non-zero values indicate the subgroups to which the gene belongs. By retrieving the indices of the non-zero dimensions of the sparse subspace vector, the coordinate axis or coordinate plane on which the vector point lies is located, and the subgroup specificity of the driver gene is inferred.
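The index retrieval described above can be sketched as follows. The shrink-and-clip step is one common way to obtain sparse non-negative coordinates under an L1 penalty; it is an assumption of this sketch and is not taken from the patent's formula, which is not reproduced in the text. The names `soft_threshold_nonneg` and `subgroup_membership` and the toy values are hypothetical.

```python
import numpy as np

def soft_threshold_nonneg(Z, lam):
    """Shrink every coordinate by lam and clip at zero.

    This is the proximal step for lam * ||z||_1 under z >= 0
    (an assumed sparsification scheme, used here only for illustration).
    """
    return np.maximum(Z - lam, 0.0)

def subgroup_membership(Z, tol=1e-8):
    """For each gene, return the indices of its non-zero subspace dimensions."""
    return [np.flatnonzero(row > tol).tolist() for row in Z]

# Toy gene vectors in a 3-dimensional subspace (illustrative values)
Z = np.array([[0.9, 0.05, 0.0],
              [0.6, 0.7, 0.02],
              [0.03, 0.01, 0.04]])
Z_sparse = soft_threshold_nonneg(Z, lam=0.1)
members = subgroup_membership(Z_sparse)
```

In this toy case the first gene keeps a single non-zero dimension (a coordinate axis), the second keeps two (a coordinate plane), and the third collapses to the origin.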

Further, the sixth step implements the hyperellipsoid boundary determination: the semi-principal axes of the hyperellipsoid that bounds the outliers of the multimodal distribution are estimated from the subspace vector points, using the eigenvalues of the covariance matrix to estimate the semi-principal axis length in each dimension. The hyperellipsoid boundary accounts for the outlier significance of single-subgroup-specific and multi-subgroup coexisting driver genes at the same time. Specifically, the covariance matrix of all subspace vectors is computed, and its eigenvalues and eigenvectors are obtained by eigendecomposition. The eigenvectors are used to rotate the subspace by an orthogonal transformation, and the dimensions with the larger eigenvalues in the rotated subspace indicate the subgroup dimensions. The reciprocal of the square root of each eigenvalue is the estimated length of the corresponding semi-principal axis and is used to construct the hyperellipsoid boundary.
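The semi-axis estimate can be sketched with NumPy as follows. The sketch follows the statement above that each semi-principal axis length is the reciprocal of the square root of an eigenvalue. The overall `scale` factor, the choice to center the hyperellipsoid at the origin, and the synthetic data are assumptions of this sketch, as are the function names.

```python
import numpy as np

def estimate_hyperellipsoid(Z, scale=1.0, eps=1e-12):
    """Eigendecompose the covariance of the gene vectors (rows of Z)."""
    cov = np.cov(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]            # put larger eigenvalues first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Semi-axis length = scale / sqrt(eigenvalue); `scale` is a hypothetical
    # tuning factor, and eps guards against zero eigenvalues.
    semi_axes = scale / np.sqrt(np.maximum(eigvals, eps))
    return eigvecs, eigvals, semi_axes

def outside_boundary(Z, eigvecs, semi_axes):
    """True for points outside the origin-centered hyperellipsoid."""
    Y = Z @ eigvecs                              # rotate into the eigenbasis
    return np.sum((Y / semi_axes) ** 2, axis=1) > 1.0

# Synthetic data: many near-origin points plus two far points on the axes
rng = np.random.default_rng(0)
background = np.abs(rng.normal(0.0, 0.01, size=(200, 2)))
drivers = np.array([[3.0, 0.0], [0.0, 2.0]])
Z = np.vstack([background, drivers])
eigvecs, eigvals, semi_axes = estimate_hyperellipsoid(Z, scale=0.1)
mask = outside_boundary(Z, eigvecs, semi_axes)
```

Under this rule, directions with larger variance receive shorter semi-axes, so points spread along those directions are flagged as outliers more readily.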

Another object of the present invention is to provide a cancer variant gene detection system that applies the subspace learning-based subgroup-specific driver gene detection method.

In summary, the advantages and positive effects of the present invention are as follows. The invention rotates the dimensions of the subspace by an orthogonal transformation using the space-transformation coordinate guidance method, so that coordinate axes and coordinate planes reflect the subgroup specificity of driver genes in the representation space. Because the subspace vector points of genes are scattered and would otherwise not fall on the coordinate axes or planes of their subgroups, a sparse representation is used to keep the coordinates of driver gene points sparse, thereby localizing driver genes to subgroup-specific coordinate axes and planes.

By estimating the hyperellipsoid semi-principal axes of the outlier boundary, the invention constructs a hyperellipsoid boundary for single-subgroup-specific and multi-subgroup coexisting driver genes that reflects the different densities of the multimodal distribution along each dimension. Compared with isolated outlier boundaries for each subgroup, the hyperellipsoid boundary strengthens the outlier significance of multi-subgroup coexisting genes and allows single-subgroup-specific and multi-subgroup coexisting driver genes to be predicted simultaneously.

By establishing a subgroup-specifically guided subspace learning algorithm, the invention can infer the subgroup affiliation of driver genes when that affiliation is unknown. To address missed detection of multi-subgroup coexisting driver genes, the subgroup-specific outlier determination over the multimodal subspace distribution strengthens the outlier significance of driver genes in the learned representation, thereby improving detection performance for subgroup-specific driver genes.

Description of the drawings

Fig. 1 is a flow chart of the subgroup-specific driver gene detection method based on subspace learning provided by an embodiment of the present invention.

Fig. 2 is an implementation flow chart of the subgroup-specific driver gene detection method based on subspace learning provided by an embodiment of the present invention.

Fig. 3 is a schematic diagram of the dimensionality reduction mapping based on subgroup-specifically guided subspace learning provided by an embodiment of the present invention.

Fig. 4 is a schematic diagram of the subgroup-specific outlier determination method for the multimodal subspace distribution provided by an embodiment of the present invention (taking three subgroup dimensions as an example);

In the figure: (A) missed detection with isolated boundaries for each subgroup; (B) subgroup-specific driver gene prediction with the hyperellipsoid boundary.

Fig. 5 shows ROC curves of the subgroup-specific driver gene detection performance provided by an embodiment of the present invention.

Fig. 6 is a schematic diagram of the low-dimensional subspace vectors of subgroup-specific driver genes (three of the subgroups are shown) provided by an embodiment of the present invention.

Detailed description

To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.

When the subgroup classes of heterogeneous cancer samples are unknown, the present invention can infer the subgroup affiliation of driver genes, which the prior art cannot. For multi-subgroup coexisting driver genes with unevenly distributed mutation frequencies, the invention addresses the prior art's missed detection of such genes. For the unknown subgroup affiliation of driver genes, the invention uses the subspace dimensions to indicate subgroup specificity and thereby infers the affiliation. For missed multi-subgroup coexisting driver genes, the subgroup-specific outlier determination over the multimodal subspace distribution detects single-subgroup-specific and multi-subgroup coexisting driver genes simultaneously.

The application principle of the present invention is described in detail below with reference to the accompanying drawings.

As shown in Fig. 1, the subgroup-specific driver gene detection method based on subspace learning provided by an embodiment of the present invention comprises the following steps:

S101: Take the variation data of mutated genes in cancer samples as input and obtain a low-dimensional vector for each gene through the subspace learning algorithm; the magnitudes of each gene vector's coordinates along the different subspace dimensions reflect the gene's subgroup specificity;

S102: Analyze the low-dimensional gene vectors by outlier detection and identify the genes corresponding to outlier vectors in the subspace as subgroup-specific driver genes.

The application principle of the present invention is further described below with reference to the accompanying drawings.

The subgroup-specific driver gene detection method based on subspace learning provided by an embodiment of the present invention specifically comprises the following steps:

(1) Gene symbol unification and data alignment of cancer samples

To build a subgroup-specific driver gene detection algorithm for heterogeneous cancers when subgroup classes are unknown, gene variation data for heterogeneous cancers are first obtained from public cancer genome databases such as TCGA and ICGC. Because samples from different platforms may use synonymous gene symbols for the same gene, gene symbols are unified through Entrez Gene IDs, and the genetic variations of each sample, such as single-nucleotide mutations and copy number variations, are matched to the corresponding genes to obtain the mutated-gene data of the cancer samples. All sample data are then aligned by gene name to build a cancer variation data matrix describing how each gene's variations are distributed across samples.
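The alignment step can be sketched as follows. This is a minimal illustration, assuming each variant has already been reduced to a (sample, gene symbol) pair and that a symbol-to-Entrez-ID lookup table is available. The function name, the tiny alias table, and the sample identifiers are hypothetical; a real pipeline would load the mapping from a curated gene annotation resource.

```python
import numpy as np

def build_mutation_matrix(records):
    """Build a binary gene-by-sample matrix.

    records: list of (sample_id, entrez_gene_id) pairs, one per variant.
    Returns the matrix plus the row (gene) and column (sample) orderings.
    """
    samples = sorted({s for s, _ in records})
    genes = sorted({g for _, g in records})
    s_idx = {s: j for j, s in enumerate(samples)}
    g_idx = {g: i for i, g in enumerate(genes)}
    X = np.zeros((len(genes), len(samples)), dtype=np.int8)
    for s, g in records:
        X[g_idx[g], s_idx[s]] = 1   # gene g is mutated in sample s
    return X, genes, samples

# Hypothetical alias table: synonymous symbols map to one Entrez Gene ID
alias_to_entrez = {"TP53": 7157, "P53": 7157, "KRAS": 3845}
raw_variants = [("S1", "TP53"), ("S2", "P53"), ("S2", "KRAS")]
records = [(s, alias_to_entrez[name]) for s, name in raw_variants]
X, genes, samples = build_mutation_matrix(records)
```

Here the two synonymous symbols collapse into a single row, so the resulting matrix has two gene rows and two sample columns.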

(2) Construction of the subgroup-specifically guided subspace mapping

In heterogeneous cancer data, inferring the subgroups to which driver genes belong is the basic prerequisite for detecting subgroup-specific driver genes. To handle unknown subgroup affiliation, the invention uses a subspace learning method to extract from the data matrix the distribution pattern of gene variations across samples and, based on this pattern, represents each candidate gene as a vector point in the subspace. Subspace learning is a dimensionality reduction technique that maps high-dimensional data to a low-dimensional space: for high-dimensional input data it constructs a dimensionality-reducing subspace mapping by unsupervised learning and uses it to compute the corresponding low-dimensional output vectors. In constructing the subspace mapping, stochastic gradient descent minimizes the residual sum of squares between the original data and the data reconstructed by the inverse mapping, so that the inverse mapping approximately reconstructs the original variation data. This ensures that the dimensionality reduction result retains the distribution patterns of the cancer variation data (Fig. 3).
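The reconstruction objective can be sketched as a small stochastic gradient descent loop. This is a minimal illustration that minimizes the squared residual between each gene's variation vector and its reconstruction from a low-dimensional vector. The linear form of the inverse mapping, the function name `learn_subspace`, the learning rate, and the toy block-structured data are assumptions of this sketch; the sparsity and non-negativity terms of the full method are omitted.

```python
import numpy as np

def learn_subspace(X, k, lr=0.05, epochs=200, seed=0):
    """Learn gene vectors Z (n_genes x k) and an inverse mapping W (n_samples x k)
    by SGD on sum_i ||x_i - W z_i||^2, visiting one gene per update."""
    rng = np.random.default_rng(seed)
    n_genes, n_samples = X.shape
    Z = rng.normal(scale=0.1, size=(n_genes, k))
    W = rng.normal(scale=0.1, size=(n_samples, k))
    for _ in range(epochs):
        for i in rng.permutation(n_genes):
            r = X[i] - W @ Z[i]                # reconstruction residual for gene i
            grad_z = -2.0 * W.T @ r            # gradient w.r.t. the gene vector
            grad_W = -2.0 * np.outer(r, Z[i])  # gradient w.r.t. the inverse mapping
            Z[i] -= lr * grad_z
            W -= lr * grad_W
    return Z, W

# Toy data: two sample subgroups, each with its own block of mutated genes
X = np.zeros((6, 6))
X[:3, :3] = 1.0
X[3:, 3:] = 1.0
Z, W = learn_subspace(X, k=2)
reconstruction_error = np.sum((X - Z @ W.T) ** 2)
```

Because both Z and W are learned jointly, the solution is only determined up to an invertible transformation of the subspace, which is one reason a subsequent rotation and sparsity constraint are needed to make the axes interpretable.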

So that the vector coordinates in the subspace indicate the subgroup affiliation of driver genes, the different dimensions of the subspace are used to indicate the potential subgroups to which a driver gene belongs. Specifically, according to the distribution density of the subspace vector points, each dimension of the subspace is rotated by an orthogonal transformation (the spatial-transformation coordinate guidance method) so that the coordinate axes/coordinate planes lie near the majority of vector points and different dimensions can effectively indicate different subgroups. Once the number of subgroup dimensions of the subspace is fixed, the position of a subspace vector point relative to the coordinate axes and coordinate planes represents the single-subgroup or multi-subgroup affiliation of the driver gene, respectively. For the subspace mapping to provide this subgroup-specific guidance, driver genes with similar variation distribution patterns within the same class of samples must fall on the same coordinate axis/coordinate plane. The rotation ensures that the subspace vector points concentrate mainly on the coordinate axes/coordinate planes, which yields subgroup-specific guidance for the driver genes.

(3) Subgroup-specific positioning of driver genes on coordinate axes/coordinate planes

To ensure that the subspace coordinate axes/coordinate planes provide subgroup-specific positioning, a non-negativity constraint is imposed on the subspace coordinate values, and the non-negatively constrained problem is solved by the Lagrange multiplier method, so that the magnitude of each coordinate of a subspace vector reflects how strongly the driver gene belongs to the corresponding subgroup dimension. The subspace vector of a shared driver gene should belong to all subgroup dimensions at once, a multi-subgroup coexisting driver gene belongs to some of the subgroup dimensions, and a single-subgroup-specific driver gene should belong to only one subgroup dimension. To avoid confusion with shared driver genes, the subspace vector of a subgroup-specific driver gene should be zero in some coordinates, while the remaining non-zero coordinates locate the subgroup dimensions to which the gene belongs. The invention therefore introduces a sparsification strategy for the gene subspace vectors, applies sparse regularization to them, and proposes the following subgroup-specificity-guided subspace learning formula for the gene subspace vectors:

Figure BDA0002048316570000091

Figure BDA0002048316570000092

Here, x_i is the variation data vector of the i-th gene across the samples, z_i is the subspace vector of the i-th gene, W is the parameter matrix of the subspace mapping/inverse mapping, and λ is the sparse regularization parameter for the subspace vector z_i.
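The patent's formula is published only as an image, so the following sketch rests on an assumed objective rather than the original one: squared reconstruction error plus an L1 penalty weighted by λ, with z_i constrained to be non-negative. It also swaps in a projected proximal gradient iteration for the Lagrange multiplier solution named in the text. The function name `sparse_nonneg_code` and the toy W are hypothetical.

```python
import numpy as np

def sparse_nonneg_code(x, W, lam=0.1, iters=500):
    """Approximately solve min_z ||x - W z||^2 + lam * sum(z) subject to z >= 0.

    Assumed objective; solved by proximal gradient, not the patent's solver.
    """
    lipschitz = 2.0 * np.linalg.norm(W, 2) ** 2   # gradient Lipschitz constant
    step = 1.0 / lipschitz
    z = np.zeros(W.shape[1])
    for _ in range(iters):
        grad = -2.0 * W.T @ (x - W @ z)           # gradient of the squared error
        # Gradient step, L1 shrinkage, and projection onto z >= 0 in one line.
        z = np.maximum(0.0, z - step * (grad + lam))
    return z

# Column 0 of W covers samples 0-1, column 1 covers samples 2-3.
W = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
x = np.array([1.0, 1.0, 0.0, 0.0])                # gene altered only in samples 0-1
z = sparse_nonneg_code(x, W)
```

In this toy case the second coordinate is driven exactly to zero, so the single non-zero coordinate of `z` points at the one subgroup dimension that explains the gene's variation pattern.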

During dimensionality reduction of the gene variation data, the sparse subspace learning strategy ensures that genes with a subgroup-specific tendency have sparse subspace vector coordinates, which locks the corresponding vector points onto the coordinate axes/coordinate planes. In the subspace vector of a driver gene, the dimensions that hold the sparse non-zero values indicate the subgroups to which the gene belongs. Retrieving the indices of the non-zero dimensions of the sparse subspace vector locates the coordinate axis/coordinate plane on which the vector point lies, from which the subgroup specificity of the driver gene is inferred.

(4) Subgroup affiliation determination based on the subspace vector coordinates of driver genes

Based on the subgroup-specific positioning of driver genes obtained above, subgroup affiliation is determined separately for subspace vector points at different positions in the subspace. Because each dimension of the subspace corresponds to a different potential subgroup, a vector point on a coordinate axis indicates a single-subgroup-specific driver gene, a vector point on a coordinate plane indicates a multi-subgroup coexisting driver gene, and a vector point in the first octant (all coordinates positive) that lies far from the origin represents a driver gene shared by all samples (Fig. 3). By contrast, the vector points of cancer-unrelated genes concentrate near the origin of the subspace.

Given these properties of the subspace, the distance between a subspace vector point and the origin reflects, to some extent, how likely the corresponding gene is to be a driver gene, while the position of the vector point relative to the coordinate axes/coordinate planes allows the subgroup affiliation of the driver gene to be inferred.
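The affiliation rule can be illustrated by reading off the non-zero coordinates of a subspace vector. This is a sketch under assumptions: the function name `subgroup_affiliation`, the tolerance, and the label strings are hypothetical, and the distance-from-origin test that separates drivers from unrelated genes is left to the outlier step.

```python
import numpy as np

def subgroup_affiliation(z, tol=1e-8):
    """Label a non-negative subspace vector by the dimensions it occupies."""
    support = np.flatnonzero(z > tol)        # indices of non-zero coordinates
    if support.size == 0:
        label = "unrelated"                  # point sits at the origin
    elif support.size == 1:
        label = "single-subgroup"            # point lies on a coordinate axis
    elif support.size < z.size:
        label = "multi-subgroup"             # point lies on a coordinate plane
    else:
        label = "shared"                     # point lies inside the first octant
    return label, support.tolist()

on_axis = subgroup_affiliation(np.array([0.0, 0.9, 0.0]))
on_plane = subgroup_affiliation(np.array([0.7, 0.8, 0.0]))
in_octant = subgroup_affiliation(np.array([0.6, 0.7, 0.9]))
```

The returned index list names the subgroup dimensions to which the gene is assigned, for example dimensions 0 and 1 for `on_plane`.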

(5) Multimodal spatial-orientation distribution representation of subgroup-specific driver genes

In the dimensionality reduction result, unrelated genes concentrate near the origin of the subspace, whereas the subspace vector points of driver genes follow a multimodal distribution along the coordinate axes/coordinate planes. Using the subgroup affiliation inferred from the gene subspace vector coordinates, the dimensionality reduction result is therefore represented as a multimodal spatial-orientation distribution of subgroup-specific driver genes. This representation describes how far each gene's subspace vector point deviates as an outlier along each dimension and provides the decision criterion for the subsequent outlier-based detection of subgroup-specific driver genes.

(6) Estimation of the hyperellipsoid semi-principal axes for the outlier decision boundary of the multimodal distribution

To address the missed detection of multi-subgroup coexisting driver genes under the traditional strategy of isolated per-subgroup decision boundaries (Fig. 4A), a hyperellipsoidal decision boundary is used to identify multi-subgroup coexisting driver genes. With the subspace distribution fixed, the hyperellipsoidal outlier boundary further strengthens the outlier significance of multi-subgroup coexisting driver genes (Fig. 4B), which improves the detection performance for such driver genes.

To implement the hyperellipsoidal boundary decision, the semi-principal axes of the hyperellipsoidal outlier boundary of the multimodal distribution are estimated from the subspace vector points. Because the multimodal distribution has a different density along each dimension of the subspace, the eigenvalues of the covariance matrix are used to estimate the semi-principal axis length of the hyperellipsoid in each dimension. By accounting for the different densities of subspace vector points along each subgroup dimension, the hyperellipsoidal boundary preserves the outlier significance of both single-subgroup-specific and multi-subgroup coexisting driver genes, which improves the detection performance for subgroup-specific driver genes.
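A sketch of the eigenvalue-based estimate follows. The scaling used here, semi-axis length equal to a constant times the square root of the eigenvalue, is the usual covariance-ellipsoid convention and is an assumption of this example; the scaling the patent actually applies may differ. The function name `estimate_semi_axes` and the `scale` parameter are hypothetical.

```python
import numpy as np

def estimate_semi_axes(Z, scale=2.0):
    """Estimate ellipsoid directions and semi-axis lengths from subspace vectors.

    Z is genes x k. Returns lengths (ascending) and matching unit directions
    as columns. Assumes length = scale * sqrt(eigenvalue).
    """
    cov = np.cov(Z, rowvar=False)                 # k x k covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)        # symmetric eigendecomposition
    semi_axes = scale * np.sqrt(np.clip(eigvals, 0.0, None))
    return semi_axes, eigvecs

# Points spread twice as far along dimension 0 as along dimension 1.
Z = np.array([[0.0, 0.0], [2.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
semi_axes, directions = estimate_semi_axes(Z)
```

A dimension along which the points spread more widely receives a longer semi-axis, so a gene must lie proportionally farther out along that dimension before it counts as an outlier.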

(7) Detection of subgroup-specific driver genes by subspace outlier determination

In the subspace, vector points farther from the origin have a stronger tendency to be driver genes, so the hyperellipsoidal boundary described above is used to identify outliers among the gene subspace vector points of the multimodal distribution. Genes whose vector points lie inside the hyperellipsoid (on the origin side) are judged to be cancer-unrelated genes. Genes whose vector points lie outside the hyperellipsoid (on the side away from the origin) are predicted to be single-subgroup-specific or multi-subgroup coexisting driver genes according to the coordinate axis or coordinate plane indicated by their subgroup dimensions.
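The decision rule can be sketched as below, assuming an axis-aligned hyperellipsoid centered at the origin with given semi-axis lengths; the function names `outside_hyperellipsoid` and `detect_drivers` are hypothetical. The third toy gene shows why an ellipsoidal boundary helps with multi-subgroup coexisting genes: each of its coordinates is below the per-axis threshold of 1.0, yet the point lies outside the ellipsoid.

```python
import numpy as np

def outside_hyperellipsoid(Z, semi_axes):
    """True for rows of Z with sum_k (z_k / a_k)^2 > 1 (outside the boundary)."""
    return np.sum((Z / semi_axes) ** 2, axis=1) > 1.0

def detect_drivers(Z, semi_axes, tol=1e-8):
    """Call each gene unrelated, single-subgroup, or multi-subgroup."""
    calls = []
    for z, is_out in zip(Z, outside_hyperellipsoid(Z, semi_axes)):
        if not is_out:
            calls.append("unrelated")             # inside: origin side
        elif np.count_nonzero(z > tol) == 1:
            calls.append("single-subgroup")       # outlier on a coordinate axis
        else:
            calls.append("multi-subgroup")        # outlier off the axes
    return calls

semi_axes = np.array([1.0, 1.0, 1.0])
Z = np.array([[0.1, 0.1, 0.0],    # near the origin
              [1.5, 0.0, 0.0],    # far out on axis 0
              [0.8, 0.8, 0.0]])   # on the plane of axes 0 and 1
calls = detect_drivers(Z, semi_axes)
```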

The application effect of the invention is described in detail below with reference to experiments.

Evaluation of the detection results for subgroup-specific driver genes: because the number of known cancer driver genes is small, the invention uses no annotation of known driver genes during subspace learning of the cancer genes. To evaluate the recognition performance of the prediction algorithm, known driver genes collected from databases such as COSMIC and IntOGen are compared with the detected subgroup-specific driver genes, the true positive rate and false positive rate of the detection results are computed, and ROC curves are plotted to assess performance. On breast cancer and non-small cell lung cancer data, the ROC curves show that the proposed subspace learning-based method for detecting subgroup-specific driver genes outperforms the existing OncodriveCLUST method and the biclustering method based on sparse singular value decomposition (Fig. 5). The positions of the genes relative to the coordinate axes/coordinate planes in the subspace also reflect the subgroups of the corresponding driver genes (Fig. 6).
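The evaluation can be sketched as a plain ROC computation. The scores and labels below are made-up toy values, not results from the patent, and the function names `roc_points` and `auc` are hypothetical; a score could be, for example, a gene's distance from the origin in the subspace.

```python
import numpy as np

def roc_points(scores, is_known_driver):
    """Return (fpr, tpr) arrays obtained by sweeping a threshold over scores."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(is_known_driver, dtype=bool)
    order = np.argsort(-scores, kind="stable")    # highest score first
    labels = labels[order]
    tp = np.cumsum(labels)                        # true positives so far
    fp = np.cumsum(~labels)                       # false positives so far
    tpr = np.concatenate([[0.0], tp / labels.sum()])
    fpr = np.concatenate([[0.0], fp / (~labels).sum()])
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Toy example: four genes, two of which appear in a known-driver list.
fpr, tpr = roc_points([0.9, 0.8, 0.3, 0.1], [True, False, True, False])
area = auc(fpr, tpr)
```

Comparing the area under the curve across methods on the same gene list gives a single-number summary of the kind of comparison described above.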

The above are only preferred embodiments of the invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (4)

1. A subspace learning-based method for detecting subgroup-specific driver genes, characterized in that variation data of variant genes in cancer samples are taken as input, the coordinate values of a low-dimensional output vector are obtained for every gene through a subspace learning algorithm, the subgroup specificity of a gene is reflected by its coordinate values in the different dimensions, and the genes whose subspace vectors are outliers in the subspace are finally identified as subgroup-specific driver genes;
the subspace learning algorithm of the subspace learning-based method for detecting subgroup-specific driver genes specifically comprises the following steps:
(1) unification of the gene symbols of the cancer samples and data alignment: gene variation data of heterogeneous cancers are obtained from the TCGA and ICGC public cancer genome databases; because different platforms may refer to the same gene by different symbols, the gene symbols of the cancer samples are unified through Entrez Gene IDs, and the single-nucleotide mutations and copy number variations of each sample are matched to the corresponding genes to obtain the variant gene data of the cancer samples; all cancer sample data are aligned by gene name to construct a cancer variation data matrix that describes the distribution of the variations of each gene across samples;
(2) construction of the subgroup-specificity-guided subspace mapping: a subspace learning method is used to extract from the data matrix the pattern by which gene variations are distributed across samples, and each candidate gene is represented as a subspace vector point according to that pattern; in constructing the subspace mapping, the residual sum of squares between the original data and the data reconstructed by the inverse mapping is minimized by stochastic gradient descent;
the different dimensions of the subspace are used to indicate the potential subgroups to which a driver gene belongs, and each dimension of the subspace is rotated by an orthogonal transformation, through the spatial-transformation coordinate guidance method, according to the distribution density of the subspace vector points;
(3) subgroup-specific positioning of driver genes on the coordinate axes/coordinate planes: a non-negativity constraint is imposed on the subspace coordinate values, so that the coordinate values of a subspace vector reflect how strongly the driver gene belongs to each subgroup dimension;
(4) subgroup affiliation determination based on the subspace vector coordinates of driver genes: according to the subgroup-specific positioning of the driver genes in the above result, it is determined whether the subspace vector points at different positions in the subspace lie on a coordinate axis, on a coordinate plane, or in the first octant;
(5) multimodal spatial-orientation distribution representation of subgroup-specific driver genes, which describes how far each gene's subspace vector point deviates as an outlier along each dimension and provides the decision criterion for the subsequent outlier-based detection of subgroup-specific driver genes;
(6) estimation of the hyperellipsoid semi-principal axes for the outlier decision boundary of the multimodal distribution: to address the missed detection of multi-subgroup coexisting driver genes under the traditional strategy of isolated per-subgroup decision boundaries, a hyperellipsoidal decision boundary is used to identify multi-subgroup coexisting driver genes;
(7) detection of subgroup-specific driver genes by subspace outlier determination: outliers among the gene subspace vector points of the multimodal distribution are identified using the hyperellipsoidal boundary obtained in step (6).
2. The subspace learning-based method for detecting subgroup-specific driver genes according to claim 1, wherein, in step (3), the subgroup-specific positioning of driver genes on the coordinate axes/coordinate planes, a sparsification strategy is introduced for the gene subspace vectors, sparse regularization is applied to the subspace vectors, and a subgroup-specificity-guided subspace learning formula for the gene subspace vectors is proposed:
Figure FDA0004106929370000021
Figure FDA0004106929370000022
wherein x_i is the variation data vector of the i-th gene across the samples, z_i is the subspace vector of the i-th gene, W is the parameter matrix of the subspace mapping/inverse mapping, and λ is the sparse regularization parameter for the subspace vector z_i.
3. The subspace learning-based method for detecting subgroup-specific driver genes according to claim 1, wherein, in step (6), to implement the hyperellipsoidal boundary decision, the semi-principal axes of the hyperellipsoidal outlier boundary of the multimodal distribution are estimated from the subspace vector points, with the eigenvalues of the covariance matrix used to estimate the semi-principal axis length of the hyperellipsoid in each dimension, specifically as follows: a covariance matrix is computed over all subspace vectors, and its eigenvalues and eigenvectors are obtained by eigenvalue decomposition; the eigenvectors are used to rotate the subspace by an orthogonal transformation, and in the rotated subspace the dimensions corresponding to the larger eigenvalues indicate the subgroup dimensions of the subspace; the reciprocal of the square root of the corresponding eigenvalue is the estimated semi-principal axis length of the hyperellipsoid and is used to construct the hyperellipsoidal boundary.
4. A cancer variant gene detection system applying the subspace learning-based method for detecting subgroup-specific driver genes according to any one of claims 1 to 3.
CN201910366338.1A 2019-05-05 2019-05-05 Sub-space learning-based detection method for subgroup-specific driving genes Active CN110189795B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910366338.1A CN110189795B (en) 2019-05-05 2019-05-05 Sub-space learning-based detection method for subgroup-specific driving genes

Publications (2)

Publication Number Publication Date
CN110189795A CN110189795A (en) 2019-08-30
CN110189795B true CN110189795B (en) 2023-06-23

Family

ID=67715492

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910366338.1A Active CN110189795B (en) 2019-05-05 2019-05-05 Sub-space learning-based detection method for subgroup-specific driving genes

Country Status (1)

Country Link
CN (1) CN110189795B (en)

Similar Documents

Publication Publication Date Title
CN106683081A (en) Brain glioma molecular marker nondestructive prediction method and prediction system based on radiomics
CN103761426B (en) A kind of method and system quickly identifying feature combination in high dimensional data
CN110189795B (en) Sub-space learning-based detection method for subgroup-specific driving genes
CN110010204B (en) Identification method of prognostic biomarkers based on fusion network and multi-scoring strategy
CN103093119A (en) Method for recognizing significant biologic pathway through utilization of network structural information
CN111914930A (en) Density peak value clustering method based on self-adaptive micro-cluster fusion
CN113903398B (en) Colorectal cancer early screening marker, detection method, detection device and computer readable medium
Li et al. scPROTEIN: a versatile deep graph contrastive learning framework for single-cell proteomics embedding
CN116612814A (en) Gene sample contamination batch detection method, device, equipment and medium based on regression model
WO2022011855A1 (en) False positive structural variation filtering method, storage medium, and computing device
CN111860591A (en) Cervical cell image classification method based on interval adaptive feature selection fusion
Chen et al. Transrnam: identifying twelve types of rna modifications by an interpretable multi-label deep learning model based on transformer
CN119252350A (en) A method for evaluating the status of environmental microbial communities based on unsupervised clustering
CN111368910B (en) Internet of things equipment cooperative sensing method
CN110797083B (en) Biomarker identification method based on multiple networks
Allen et al. A Bayesian multivariate mixture model for spatial transcriptomics data
CN118866331A (en) Early diagnosis method and system of lung adenocarcinoma based on non-targeted metabolomics
CN115881218B (en) Gene automatic selection method for whole genome association analysis
Kang et al. Benchmarking computational methods for detecting spatial domains and domain-specific spatially variable genes from spatial transcriptomics data
Arenas et al. Identifying extreme observations, outliers and noise in clinical and genetic data
CN113113085B (en) Analysis system and method for tumor detection based on intelligent metagenome sequencing data
Sangeetha et al. Advanced segmentation method for integrating multi-omics data for early cancer detection
Huo et al. Iman: An adaptive network for robust npc mortality prediction with missing modalities
CN116976574A (en) Building load curve dimension reduction method based on two-stage hybrid clustering algorithm
Ghai et al. Proximity measurement technique for gene expression data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant