CN118692572A - A visual genetic data analysis system - Google Patents
- Publication number
- CN118692572A (application CN202410675816.8A)
- Authority
- CN
- China
- Prior art keywords
- data
- gene
- analysis
- genetic variation
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B45/00—ICT specially adapted for bioinformatics-related data visualisation, e.g. displaying of maps or networks
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/30—Detection of binding sites or motifs
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Landscapes
- Life Sciences & Earth Sciences (AREA)
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Biophysics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Theoretical Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Molecular Biology (AREA)
- Chemical & Material Sciences (AREA)
- Analytical Chemistry (AREA)
- Genetics & Genomics (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Bioethics (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
The present invention relates to the technical field of gene data analysis and discloses a visual gene data analysis system. By combining a gene data online analysis module, a gene data sending module, a gene data visualization display module, and a Mendelian randomization analysis database, the system integrates a wide range of GWAS data and xQTL data from different tissues and multiple omics layers. It also supports combined querying and display of the GWAS and xQTL analysis results in a single comprehensive local locus plot. The plot reveals genetic associations and supports closer study of the link between variants and traits, which helps researchers systematically analyze the effect of genetic variation on specific phenotypes and better understand its underlying biological mechanisms.
Description
Technical Field
The present invention relates to the technical field of gene data analysis, and in particular to a visual gene data analysis system.
Background Art
In genetic and genomic analysis, especially when genome-wide association studies (GWAS) and molecular quantitative trait locus (xQTL) analyses are used to explore the genetic basis of complex traits or diseases, researchers often face the challenge of integrating and analyzing large-scale datasets effectively.
At present, some gene data analysis platforms can link genetic variation to molecular phenotypes and to complex traits. They apply statistical genetics methods such as Mendelian randomization (MR), colocalization (COLOC), and transcriptome-wide association studies (TWAS) to identify candidate genes associated with specific diseases or traits. These methods mainly use either GWAS data or xQTL data on their own and cannot visually analyze different types of multi-source gene data at the same time, which limits comprehensive analysis and understanding of complex genetic mechanisms.
Summary of the Invention
In view of this, the present invention provides a visual gene data analysis system to solve the problem that different types of multi-source gene data cannot be visually analyzed at the same time, which limits comprehensive analysis and understanding of complex genetic mechanisms.
According to a first aspect, this embodiment provides a visual gene data analysis system, including:
a gene data online analysis module, configured to use Mendelian randomization, based on multiple molecular quantitative trait locus (xQTL) datasets and genome-wide association study (GWAS) datasets, to analyze whether common genetic variation exists for the single nucleotide polymorphisms (SNPs) in a target gene sequence and, when it does, to use the heterogeneity independent tool (HEIDI) to analyze whether the common genetic variation is caused by a single SNP or by multiple SNPs;
a gene data sending module, configured to send to a target user a target link generated from the Mendelian randomization analysis results and the HEIDI analysis results;
a gene data visualization display module, configured to let the target user view the Mendelian randomization analysis results and the HEIDI analysis results online through the target link, and to generate a local locus data plot of the target gene sequence;
a Mendelian randomization analysis database, configured to store the Mendelian randomization analysis results and the HEIDI analysis results, to update and maintain the xQTL datasets and GWAS datasets in real time, to support target users in querying the genetic variation analysis results of the target gene sequence, and to support target users in logging in to the gene data visualization display module to display the local locus data plot of the target gene sequence.
In an optional implementation, the gene data online analysis module includes:
a multi-source gene data integration submodule, configured to obtain the xQTL datasets and GWAS datasets and to integrate the multi-source data.
In an optional implementation, the gene data online analysis module further includes:
a genetic variation instrumental variable submodule, configured to select, for each probe in the xQTL datasets, genetic variation instrumental variables associated with gene expression;
a common genetic variation determination submodule, configured to extract from the GWAS datasets the gene association values between the genetic variation instrumental variables and the target gene trait, and to analyze those values by Mendelian randomization to determine whether common genetic variation exists between the SNPs in the target gene sequence and the trait genetic data.
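The instrument selection and GWAS lookup performed by these two submodules can be sketched as follows. This is a minimal illustration and not the patented implementation: the record layout, the function names `select_instruments` and `lookup_gwas`, and the significance cutoff `p_max=5e-8` are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class QtlRecord:
    """One xQTL summary statistic (hypothetical layout)."""
    probe: str   # probe identifier
    snp: str     # variant identifier
    beta: float  # effect of the SNP on the molecular phenotype
    se: float    # standard error of beta
    p: float     # association p-value


def select_instruments(xqtl: Iterable[QtlRecord],
                       p_max: float = 5e-8) -> Dict[str, QtlRecord]:
    """For each probe, keep the most significant SNP below p_max (assumed cutoff)."""
    best: Dict[str, QtlRecord] = {}
    for rec in xqtl:
        if rec.p >= p_max:
            continue
        current = best.get(rec.probe)
        if current is None or rec.p < current.p:
            best[rec.probe] = rec
    return best


def lookup_gwas(instruments: Dict[str, QtlRecord],
                gwas: Dict[str, Tuple[float, float]]
                ) -> Dict[str, Tuple[QtlRecord, Tuple[float, float]]]:
    """Pair each instrument with the GWAS (beta, se) of the same SNP.

    Probes whose instrument is absent from the GWAS data are dropped.
    """
    return {probe: (rec, gwas[rec.snp])
            for probe, rec in instruments.items() if rec.snp in gwas}
```

Choosing one top SNP per probe keeps the sketch short; a real pipeline might retain several instruments per probe.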
In an optional implementation, the common genetic variation determination submodule includes:
a first determination unit, configured to determine that common genetic variation exists for the SNPs in the target gene sequence when the gene association value is less than a first preset value;
a second determination unit, configured to determine that no common genetic variation exists for the SNPs in the target gene sequence when the gene association value is greater than or equal to a second preset value.
In an optional implementation, within the gene data online analysis module:
the common genetic variation determination submodule is further configured, when common genetic variation exists for the SNPs in the target gene sequence, to use the heterogeneity independent tool (HEIDI) to evaluate linkage disequilibrium in the region containing the common genetic variation and to determine whether other genetic variants in the target gene sequence interfere with the result.
In an optional implementation, the common genetic variation determination submodule includes:
a first genetic variation determination unit, configured to determine that the common genetic variation is caused by a single SNP when no other genetic variant in the target gene sequence interferes with the result;
a second genetic variation determination unit, configured to determine that the common genetic variation is caused by multiple SNPs when other genetic variants in the target gene sequence interfere with the result.
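The decision flow formed by the determination units above can be sketched as a small classifier. The sketch makes two assumptions that the text does not state: that "interference by other genetic variants" is signalled by a HEIDI p-value below a cutoff (`heidi_alpha`, set to 0.05 here purely for illustration), and that values falling between the two preset values are left undetermined.

```python
from enum import Enum
from typing import Optional


class Verdict(Enum):
    NO_SHARED_VARIATION = "no common genetic variation"
    UNDETERMINED = "value lies between the two preset values"
    SINGLE_SNP = "common variation caused by a single SNP"
    MULTIPLE_SNPS = "common variation caused by multiple SNPs"


def classify(assoc_value: float,
             first_preset: float,
             second_preset: float,
             heidi_p: Optional[float] = None,
             heidi_alpha: float = 0.05) -> Verdict:
    """Apply the two preset thresholds, then the HEIDI-based split."""
    if assoc_value < first_preset:
        # Common genetic variation exists; HEIDI decides single vs multiple.
        if heidi_p is None:
            raise ValueError("a HEIDI p-value is required once variation is found")
        # Assumption: a small HEIDI p-value means other variants interfere.
        if heidi_p < heidi_alpha:
            return Verdict.MULTIPLE_SNPS
        return Verdict.SINGLE_SNP
    if assoc_value >= second_preset:
        return Verdict.NO_SHARED_VARIATION
    return Verdict.UNDETERMINED
```

If the first and second preset values are equal, the undetermined branch is never reached and every input receives a definite verdict.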
In an optional implementation, the gene data visualization display module includes:
a gene probe identification submodule, configured to select a gene related to genetic variation from the Mendelian randomization analysis results and heterogeneity independent tool (HEIDI) analysis results that the target user views online, and to identify the probe name and probe position associated with that gene;
a local locus identification submodule, configured to take a preset region around the probe position as the local locus;
a locus data generation submodule, configured to display, within the local locus, the local locus data plot of the target gene generated from the xQTL datasets and GWAS datasets.
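Taking a preset region around the probe position and collecting the records that fall inside it could look like the sketch below. The flank size of 500 kb, the dictionary keys `chrom` and `pos`, and both function names are assumptions made for illustration; the text only says that the region is preset.

```python
from typing import Dict, List, Tuple


def local_locus(probe_pos: int, flank: int = 500_000) -> Tuple[int, int]:
    """Return (start, end) of the window centred on the probe position.

    The start is clamped at 0 so the window never leaves the chromosome.
    """
    return max(0, probe_pos - flank), probe_pos + flank


def records_in_locus(records: List[Dict], chrom: str,
                     start: int, end: int) -> List[Dict]:
    """Keep summary-statistic records located inside the window (inclusive)."""
    return [r for r in records
            if r["chrom"] == chrom and start <= r["pos"] <= end]
```

The same window can be applied to the GWAS records and to each xQTL dataset so that all tracks of the plot share one coordinate range.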
In an optional implementation, the Mendelian randomization analysis database includes:
a data storage module, configured to store the Mendelian randomization analysis results and heterogeneity independent tool (HEIDI) analysis results obtained by analyzing gene data with the integrated xQTL datasets and GWAS datasets.
In an optional implementation, the Mendelian randomization analysis database further includes:
a data maintenance module, configured to update and maintain the xQTL datasets and GWAS datasets in real time.
In an optional implementation, the Mendelian randomization analysis database further includes:
a data query module, configured to support target users in querying the genetic variation analysis results of the target gene sequence;
a data visualization module, configured to support target users in logging in to the gene data visualization display module to display the local locus data plot of the target gene sequence.
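The storage and query roles of the database can be illustrated with Python's built-in `sqlite3` module. The patent does not name a database engine or schema, so the table `smr_results`, its columns, and the helper functions below are hypothetical.

```python
import sqlite3
from typing import List, Tuple


def open_results_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a results store with one row per probe/trait/dataset."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS smr_results (
               gene TEXT, probe TEXT, trait TEXT, dataset TEXT,
               p_smr REAL, p_heidi REAL,
               PRIMARY KEY (probe, trait, dataset))"""
    )
    return conn


def store_result(conn: sqlite3.Connection, gene: str, probe: str, trait: str,
                 dataset: str, p_smr: float, p_heidi: float) -> None:
    """Insert a result, replacing any earlier row with the same key."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO smr_results VALUES (?, ?, ?, ?, ?, ?)",
            (gene, probe, trait, dataset, p_smr, p_heidi),
        )


def query_gene(conn: sqlite3.Connection, gene: str) -> List[Tuple]:
    """Return all stored results for a gene, most significant first."""
    cur = conn.execute(
        "SELECT probe, trait, dataset, p_smr, p_heidi "
        "FROM smr_results WHERE gene = ? ORDER BY p_smr",
        (gene,),
    )
    return cur.fetchall()
```

Using `INSERT OR REPLACE` on a composite key is one simple way to let re-analysis overwrite stale results, in the spirit of the maintenance module.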
The technical solution of the present invention has the following advantages:
The present invention relates to the technical field of gene data analysis and discloses a visual gene data analysis system. The gene data online analysis module uses Mendelian randomization, based on multiple xQTL datasets and GWAS datasets, to analyze whether common genetic variation exists for the single nucleotide polymorphisms (SNPs) in a target gene sequence and, when it does, uses the heterogeneity independent tool (HEIDI) to analyze whether the common genetic variation is caused by a single SNP or by multiple SNPs. The gene data sending module sends to the target user a target link generated from the Mendelian randomization analysis results and the HEIDI analysis results. The gene data visualization display module lets the target user view both sets of results online through the target link and generates a local locus data plot of the target gene sequence. The Mendelian randomization analysis database stores both sets of results, updates and maintains the xQTL datasets and GWAS datasets in real time, supports target users in querying the genetic variation analysis results of the target gene sequence, and supports target users in logging in to the gene data visualization display module to display the local locus data plot of the target gene sequence.
The present invention can integrate a wide range of GWAS data and xQTL data from different tissues and multiple omics layers at the same time, and can query and display GWAS data together with xQTL data in a single comprehensive local locus plot. The plot reveals genetic associations and supports closer study of the link between variants and traits, which helps researchers systematically analyze the effect of genetic variation on specific phenotypes and better understand its underlying biological mechanisms.
Brief Description of the Drawings
To illustrate the specific embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed for describing them are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic structural diagram of a visual gene data analysis system according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of another visual gene data analysis system according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of yet another visual gene data analysis system according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the interface displayed by the gene data visualization display module according to an embodiment of the present invention.
Reference numerals:
11-gene data online analysis module; 12-gene data sending module;
13-gene data visualization display module; 14-Mendelian randomization analysis database;
111-multi-source gene data integration submodule; 112-genetic variation instrumental variable submodule;
113-common genetic variation determination submodule; 131-gene probe identification submodule;
132-local locus identification submodule; 133-locus data generation submodule;
141-data storage module; 142-data maintenance module; 143-data query module;
144-data visualization module.
Detailed Description
To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on these embodiments without creative effort fall within the scope of protection of the present invention.
According to an embodiment of the present invention, a visual gene data analysis system is provided. As shown in FIG. 1, the system includes a gene data online analysis module 11, a gene data sending module 12, a gene data visualization display module 13, and a Mendelian randomization analysis database 14.
As shown in FIG. 2, the gene data online analysis module 11 is configured to use Mendelian randomization, based on multiple molecular quantitative trait locus datasets and genome-wide association study datasets, to analyze whether common genetic variation exists for the single nucleotide polymorphisms (SNPs) in the target gene sequence and, when it does, to use the heterogeneity independent tool to analyze whether the common genetic variation is caused by a single SNP or by multiple SNPs.
By way of example, the molecular quantitative trait locus datasets (molecular quantitative trait loci) are referred to as xQTL data, and the genome-wide association study datasets (genome-wide association studies) are referred to as GWAS data. The xQTL data include expression quantitative trait loci (eQTL), splicing quantitative trait loci (sQTL), and methylation quantitative trait loci (mQTL).
Further, for example, the gene data online analysis module contains a summary collection of 103 xQTL datasets in total: 51 eQTL datasets from 49 human tissues, 50 sQTL datasets from 49 human tissues, and 2 mQTL datasets from human brain tissue and blood. The module also lets target users upload their own xQTL datasets for other molecular phenotypes, such as protein abundance QTL (pQTL).
Existing tools can analyze only a single data source and lack a unified platform for jointly analyzing xQTL data and GWAS data from different sources. This limits researchers' ability to integrate multi-level biological information and therefore their overall understanding of complex genetic mechanisms.
Therefore, in the embodiments of the present disclosure, the gene data online analysis module obtains xQTL data and GWAS data at the same time and combines the two types of data into multi-source gene data, which improves the ability to integrate multi-level biological information and deepens the understanding of complex genetic mechanisms.
Mendelian randomization is abbreviated MR, and Mendelian randomization based on summary data (summary-data-based Mendelian randomization) is abbreviated SMR. The heterogeneity independent tool referred to above corresponds to the English term heterogeneity in dependent instruments, abbreviated HEIDI.
Specifically, the Mendelian randomization analysis can use β_GWAS and β_eQTL, the SNP effect estimates taken from the GWAS data and the eQTL data, to estimate the joint effect of each SNP in the target gene sequence (target DNA sequence) on gene expression and phenotype. In the embodiments of the present disclosure, HEIDI is used mainly to detect heterogeneity in the Mendelian randomization results.
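How β_GWAS and β_eQTL can be combined is sketched below using the ratio estimate and approximate test statistic from the published SMR method: b_xy = β_GWAS / β_eQTL and T_SMR = z_GWAS² · z_eQTL² / (z_GWAS² + z_eQTL²), compared against a chi-square distribution with one degree of freedom. The text does not spell out these formulas, so treating them as the ones the system uses is an assumption, as are the function and parameter names.

```python
import math
from typing import Tuple


def smr_estimate(beta_gwas: float, se_gwas: float,
                 beta_eqtl: float, se_eqtl: float) -> Tuple[float, float, float]:
    """Return (b_xy, T_SMR, p_SMR) for one SNP instrument.

    b_xy  : estimated effect of expression on the trait (ratio estimate)
    T_SMR : approximate chi-square statistic with 1 degree of freedom
    p_SMR : upper-tail p-value of T_SMR
    """
    if beta_eqtl == 0 or se_gwas <= 0 or se_eqtl <= 0:
        raise ValueError("beta_eqtl must be non-zero and standard errors positive")
    b_xy = beta_gwas / beta_eqtl
    z_gwas = beta_gwas / se_gwas
    z_eqtl = beta_eqtl / se_eqtl
    t_smr = (z_gwas ** 2 * z_eqtl ** 2) / (z_gwas ** 2 + z_eqtl ** 2)
    # Survival function of chi-square(1): P(X > t) = erfc(sqrt(t / 2)).
    p_smr = math.erfc(math.sqrt(t_smr / 2.0))
    return b_xy, t_smr, p_smr
```

For example, with β_GWAS = 0.1 (SE 0.02) and β_eQTL = 0.5 (SE 0.05), the z-scores are 5 and 10, so b_xy = 0.2 and T_SMR = 25 · 100 / 125 = 20.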
虽然当前一些遗传机制测试方式,如孟德尔随机化(Mendelian randomization,MR),共定位分析(colocalization,COLOC),和全转录组关联分析(transcriptome-wideassociation studies,TWAS),能够鉴定与特定疾病或性状相关的候选基因。这些现有技术方案虽为揭示遗传背景提供了关键的支持,但在如何高效收集和整合来自不同源的数据、简化从数据到见解的分析流程、以及提供更加直观可视化显示数据方面,存在显著的改进空间。并且,现有平台没有将最新的统计方法,如同时将孟德尔随机化分析方式和异质性独立工具同时融合。Although some current genetic mechanism testing methods, such as Mendelian randomization (MR), colocalization analysis (COLOC), and transcriptome-wide association studies (TWAS), can identify candidate genes associated with specific diseases or traits. Although these existing technical solutions provide key support for revealing genetic background, there is significant room for improvement in how to efficiently collect and integrate data from different sources, simplify the analysis process from data to insights, and provide more intuitive visualization of data. In addition, the existing platform does not integrate the latest statistical methods, such as the simultaneous integration of Mendelian randomization analysis methods and heterogeneity independent tools.
因此,本公开实施例综合利用孟德尔随机化和异质性独立工具,并整合多种分子表型量性状位点数据集和全基因组关联分析数据集构成的两类多源数据,以高效分析基因序列的复杂调控机制。并且,本公开实施例综合利用孟德尔随机化和异质性独立工具,对识别遗传变异与性状之间的潜在分子表型关联至关重要,显著提升了研究结果的准确性和深度。Therefore, the disclosed embodiments comprehensively utilize Mendelian randomization and heterogeneity independent tools, and integrate two types of multi-source data consisting of multiple molecular phenotypic trait loci data sets and whole genome association analysis data sets to efficiently analyze the complex regulatory mechanisms of gene sequences. In addition, the disclosed embodiments comprehensively utilize Mendelian randomization and heterogeneity independent tools, which is crucial for identifying potential molecular phenotypic associations between genetic variation and traits, and significantly improves the accuracy and depth of research results.
The gene data sending module is used to send to the target user a target link generated from the Mendelian randomization analysis results and the heterogeneity independent tool analysis results.
Specifically, after the gene data online analysis module completes the analysis, the gene data sending module can notify the target user of the analysis results by email, with the target link included in the email.
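The notification step can be sketched as follows. This is only an illustration under stated assumptions: the function name `build_notification`, the job identifier, and the URL layout `<base_url>/results/<job_id>` are hypothetical and do not come from the disclosure. The sketch composes the message with Python's standard `email` package and deliberately stops short of sending it.

```python
from email.message import EmailMessage


def build_notification(recipient: str, job_id: str, base_url: str) -> EmailMessage:
    """Compose (but do not send) a result notification carrying the target link."""
    # Hypothetical URL layout; the disclosure does not specify one.
    link = f"{base_url}/results/{job_id}"
    msg = EmailMessage()
    msg["To"] = recipient
    msg["Subject"] = f"Analysis {job_id} finished"
    msg.set_content(
        "Your Mendelian randomization and HEIDI analysis has finished.\n"
        f"View the results online: {link}\n"
    )
    return msg


message = build_notification("user@example.org", "job-0001", "https://example.org")
```

Handing the composed message to a mail transport (for example `smtplib`) would be a separate step, and the server details would depend on the deployment.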
Through the gene data sending module, the embodiments of the present disclosure can notify the target user in a timely manner.
As shown in FIG. 2, the gene data visualization display module 13 allows the target user to view the Mendelian randomization analysis results and the heterogeneity independent tool analysis results online via the target link, and generates a local locus data map of the target gene sequence.
The gene data visualization display module in the embodiments of the present disclosure mainly provides locus visualization. In FIG. 2, the gene data visualization display module 13 shows a schematic locus visualization of a gene sequence; the target user can explore important loci by selecting the loci and genes of a specific GWAS trait of interest. The module includes locus plots for GWAS, eQTL (49 tissues), sQTL (49 tissues), and mQTL (2 tissues). The chromatin state annotation plot shows 14 chromatin state annotations (indicated by color, on the right) for 127 samples from the Roadmap Epigenomics Mapping Consortium (REMC) across different primary cell and tissue types (rows, on the left). Gene annotation provides exon and gene coordinates, symbols, and orientation from GENCODE v40.
Therefore, the gene data visualization display module in the embodiments of the present disclosure supports powerful locus visualization and can display detailed local locus plots of GWAS data and xQTL data, revealing genetic associations in depth. The module helps researchers better parse and understand complex gene-trait association results.
In FIG. 2, the Mendelian randomization analysis database 14 is used to store the Mendelian randomization analysis results and the heterogeneity independent tool analysis results; to update and maintain the multiple molecular-phenotype quantitative trait locus (xQTL) data sets and genome-wide association study (GWAS) data sets in real time; to support the target user in querying the genetic variation analysis results of the target gene sequence; and to support the target user in logging in to the gene data visualization display module 13 to display the local locus data map of the target gene sequence.
Specifically, the Mendelian randomization analysis database in the embodiments of the present disclosure stores and manages the analysis results for a wide range of GWAS traits and xQTL data. For example, the database may contain 60,255 gene-trait associations covering 213 complex traits and 103 xQTL data sets (spanning 49 tissues and 3 types of omics data). Each record includes the gene name, the trait, the tissue and molecular phenotype of the associated xQTL data set, the p-value indicating the significance of the association, and so on. This large data set provides a valuable resource for the research community and supports multi-level, multi-dimensional exploration of genetic mechanisms.
The visual gene data analysis system in the embodiments of the present disclosure greatly simplifies the complex workflow of genetic research and improves the efficiency and accuracy of research into genetic mechanisms. The system integrates extensive GWAS data and xQTL data from different tissues and multiple omics levels, supports user upload of custom data, and provides a Mendelian randomization database with extensive gene-trait associations, further enhancing the flexibility and breadth of research. The system can help researchers systematically analyze the effects of genetic variation on specific phenotypes and gain an in-depth understanding of the underlying biological mechanisms.
In an optional embodiment, as shown in FIG. 3, the gene data online analysis module 11 includes a multi-source gene data integration submodule 111, which is used to obtain multiple molecular-phenotype quantitative trait locus (xQTL) data sets and genome-wide association study (GWAS) data sets and to integrate the multi-source data.
Specifically, in the gene data online analysis module of the embodiments of the present disclosure, the multiple molecular-phenotype quantitative trait locus data sets are xQTL data and the genome-wide association study data sets are GWAS data. The xQTL data comprise 103 xQTL summary data sets: 51 expression QTL (eQTL) data sets from 49 human tissues, 50 splicing QTL (sQTL) data sets from 49 human tissues, and 2 methylation QTL (mQTL) data sets from human brain tissue and blood.
Through the multi-source gene data integration submodule, the embodiments of the present disclosure can efficiently integrate GWAS data and various xQTL data (such as eQTL, sQTL, and mQTL) from different data sources. This integration not only improves the usability of the data but also deepens the understanding of complex genetic mechanisms.
In another specific embodiment, in the visual gene data analysis system of the embodiments of the present disclosure, as shown in FIG. 3, the gene data online analysis module 11 further includes a genetic variation instrumental variable submodule 112 and a common genetic variation determination submodule 113.
The genetic variation instrumental variable submodule is used to select, for each probe, genetic variation instrumental variables associated with gene expression from the multiple molecular-phenotype quantitative trait locus (xQTL) data sets.
Specifically, to select the genetic variation instrumental variables, the gene data online analysis module selects, for each probe, single nucleotide polymorphisms (SNPs) that are significantly associated with gene expression from the expression quantitative trait locus (eQTL) data and uses them as instrumental variables (IVs). These genetic variants should lie within the cis-regulatory region of the gene, and each probe used in the analysis must be supported by at least one cis-eQTL at the significance level P_eQTL < 5e-8.
In genetics, a cis-expression quantitative trait locus (cis-eQTL) is a genetic variant located in or near the gene whose expression it regulates. Such a variant is usually close to the regulated gene on the genome (typically within 2 Mb). A cis-eQTL usually affects gene expression by altering the DNA sequence of a promoter region, an enhancer, or another regulatory element, thereby directly regulating the expression level of the nearby gene.
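The selection rule described above (cis window, P_eQTL < 5e-8, at least one supporting cis-eQTL per probe) can be sketched as follows. The record layout and the function name `select_instruments` are hypothetical, and keeping only the most significant cis-eQTL per probe is an assumption made for this illustration; the text only requires that at least one qualifying cis-eQTL exists.

```python
from typing import NamedTuple


class EqtlRecord(NamedTuple):
    probe: str      # probe (gene expression measurement) identifier
    snp: str        # SNP identifier
    snp_pos: int    # SNP position in base pairs
    probe_pos: int  # probe position in base pairs (same chromosome assumed)
    beta: float     # eQTL effect size
    se: float       # standard error of beta
    p: float        # eQTL P value


P_EQTL_MAX = 5e-8          # significance level stated in the text
CIS_WINDOW_BP = 2_000_000  # "typically within 2 Mb", as stated in the text


def select_instruments(records):
    """Return {probe: best cis-eQTL}; probes without a significant cis-eQTL are dropped."""
    best = {}
    for r in records:
        in_cis = abs(r.snp_pos - r.probe_pos) <= CIS_WINDOW_BP
        if not in_cis or r.p >= P_EQTL_MAX:
            continue
        # Assumption: keep the most significant qualifying SNP per probe.
        if r.probe not in best or r.p < best[r.probe].p:
            best[r.probe] = r
    return best


# Made-up records for illustration only.
records = [
    EqtlRecord("probeA", "rs1", 1_000_000, 1_200_000, 0.40, 0.05, 1e-15),
    EqtlRecord("probeA", "rs2", 1_100_000, 1_200_000, 0.30, 0.05, 1e-9),
    EqtlRecord("probeA", "rs3", 9_000_000, 1_200_000, 0.50, 0.05, 1e-20),  # outside cis window
    EqtlRecord("probeB", "rs4", 5_000_000, 5_100_000, 0.10, 0.05, 1e-3),   # not significant
]
instruments = select_instruments(records)
```

In this toy input, probeB is excluded because it has no cis-eQTL below the threshold, and rs3 is ignored despite its small P value because it lies outside the cis window.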
The common genetic variation determination submodule is used to extract, from the genome-wide association study (GWAS) data sets, the gene association value between the genetic variation instrumental variables and the target trait, and to analyze the gene association value by Mendelian randomization in order to determine whether the single nucleotide polymorphisms in the target gene sequence and the trait genetic data share a common genetic variation.
Specifically, for the selected genetic variation instrumental variables, the association statistics between these genetic variants and the target trait, including effect size, standard error, and P value, can be extracted from the genome-wide association study data sets (GWAS data).
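As a minimal sketch of this extraction step, assuming the GWAS summary statistics are already loaded into a dictionary keyed by SNP identifier (the layout and the name `extract_gwas_stats` are hypothetical), the lookup amounts to a keyed join that also reports instruments absent from the GWAS:

```python
def extract_gwas_stats(instrument_snps, gwas):
    """Look up effect size, standard error and P value for each instrument SNP.

    Returns (found, missing): statistics for SNPs present in the GWAS data,
    and the list of instrument SNPs that the GWAS data does not contain.
    """
    found = {snp: gwas[snp] for snp in instrument_snps if snp in gwas}
    missing = [snp for snp in instrument_snps if snp not in gwas]
    return found, missing


# Made-up summary statistics for illustration only.
gwas = {
    "rs1": {"beta": 0.05, "se": 0.01, "p": 5.7e-7},
    "rs9": {"beta": -0.02, "se": 0.01, "p": 4.6e-2},
}
found, missing = extract_gwas_stats(["rs1", "rs2"], gwas)
```

Reporting the missing SNPs separately makes it explicit which probes cannot be analyzed for a given trait.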
Specifically, the gene association value in the embodiments of the present disclosure can be obtained through Mendelian randomization analysis, or it can be extracted by the above gene association data extraction submodule. The Mendelian randomization analysis of the gene association value proceeds as follows:
In the embodiments of the present disclosure, linear regression models are used to estimate the effect of a single nucleotide polymorphism (SNP) carrying genetic variation on gene expression and on the trait. Genetically driven variation in gene expression is treated as a potential mediating variable between the genetic variation and the trait. The causal effect is estimated as the ratio of the coefficient of the second-step regression (the effect of the genetic variation on the trait) to the coefficient of the first-step regression (the effect of the genetic variation on gene expression); this ratio is the gene association value referred to above, and it estimates the size of the effect that the genetic variation exerts on the trait by altering gene expression. The statistical significance of the analysis is usually assessed with a P value. If the P value is below a preset significance level (for example, P < 0.05/n), the genetic variation is considered to have a significant causal effect on the trait through its influence on gene expression.
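The ratio estimate described above can be illustrated with a short sketch. The text specifies only the ratio and a P value threshold; the standard error (first-order delta method) and the one-degree-of-freedom chi-square test used here follow the commonly used summary-data approach and are assumptions of this example, as are the function name `smr_ratio` and the reading of n as the number of tests performed.

```python
import math


def smr_ratio(b_zx, se_zx, b_zy, se_zy):
    """Ratio estimate of the effect of gene expression on a trait from one SNP.

    b_zx, se_zx: SNP effect on gene expression (first-step regression) and its SE
    b_zy, se_zy: SNP effect on the trait (second-step regression) and its SE
    Returns (b_xy, se_xy, p_value).
    """
    b_xy = b_zy / b_zx
    # First-order delta-method variance of the ratio (assumed, not from the text).
    var_xy = (b_zy**2 * se_zx**2) / b_zx**4 + se_zy**2 / b_zx**2
    se_xy = math.sqrt(var_xy)
    chi2 = b_xy**2 / var_xy
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return b_xy, se_xy, p_value


# Made-up effect sizes for illustration only.
b_xy, se_xy, p_value = smr_ratio(b_zx=0.40, se_zx=0.05, b_zy=0.05, se_zy=0.01)

n_tests = 1000  # assumption: n in "P < 0.05/n" is the number of tests performed
is_significant = p_value < 0.05 / n_tests
```

With z-scores of 8 (eQTL) and 5 (GWAS), the test statistic equals 8² · 5² / (8² + 5²) = 1600/89, which shows that the weaker of the two associations dominates the result.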
In a specific embodiment, the common genetic variation determination submodule includes:
A first determination unit, used to determine that the single nucleotide polymorphisms in the target gene sequence share a common genetic variation when the gene association value is less than a first preset value.
A second determination unit, used to determine that the single nucleotide polymorphisms in the target gene sequence do not share a common genetic variation when the gene association value is less than a second preset value.
In one example, for the first determination unit, the first preset value (threshold) may be 5e-6 and the gene association value may be expressed as P_SMR. For example, when P_SMR < 5e-6, it is determined that the single nucleotide polymorphisms in the target gene sequence share a common genetic variation.
In another example, for the second determination unit, the second preset value (threshold) may be 0.01 and the gene association value may be expressed as P_HEIDI. For example, when P_HEIDI < 0.01, it is determined that the single nucleotide polymorphisms in the target gene sequence do not share a common genetic variation.
Because the gene association value indicates whether genetic variation, acting through gene expression, has a significant causal effect on the trait, the gene association data extraction submodule extracts the gene association value (P value) from the genome-wide association study data sets. If this P value is below a given threshold, the single nucleotide polymorphisms in the target gene sequence can be considered likely to share a common genetic variation, which may affect the phenotype by regulating gene expression.
In a specific embodiment, in FIG. 3, the common genetic variation determination submodule 113 is further used, when the single nucleotide polymorphisms in the target gene sequence share a common genetic variation, to evaluate the linkage disequilibrium of the region containing the common genetic variation using the heterogeneity independent tool, in order to determine whether other genetic variants in the target gene sequence interfere with the result.
Furthermore, the common genetic variation determination submodule in the embodiments of the present disclosure includes:
A first genetic variation determination unit, used to determine that the common genetic variation is caused by a single SNP when no other genetic variant in the target gene sequence interferes with the result;
A second genetic variation determination unit, used to determine that the common genetic variation is caused by multiple single nucleotide polymorphisms when other genetic variants in the target gene sequence interfere with the result.
Specifically, the heterogeneity independent tool test is applied to every significant association identified in the Mendelian randomization analysis to detect genetic heterogeneity. The genetic variation determination submodule in the embodiments of the present disclosure evaluates whether the GWAS signal and the eQTL signal are likely driven by the same genetic variant or by multiple independent variants. In this embodiment, the heterogeneity independent tool may rely on reference genome data (such as UK Biobank project data) to evaluate the linkage disequilibrium (LD structure) of the region containing the common genetic variation and to check whether other trait-associated variants in the target gene sequence may interfere with the analysis result. If the heterogeneity independent tool analysis shows no significant heterogeneity (P_HEIDI > 0.01), it supports the Mendelian randomization result that changes in gene expression may be a causal factor in trait variation. If heterogeneity is present, the GWAS signal and the eQTL signal at the locus may be caused by different genetic variants.
In this embodiment, the purpose of the heterogeneity independent tool is to distinguish whether the association between gene expression and phenotype is caused by a single SNP or by multiple independent SNPs; the HEIDI (Heterogeneity in Dependent Instruments) test can be used for this purpose. The HEIDI test detects heterogeneity in the Mendelian randomization results. If the HEIDI test is not significant (a large p-value), the result supports the effect of a single SNP; if the HEIDI test is significant, there may be multiple independent genetic variants that affect gene expression and phenotype separately.
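The two thresholds given in the examples (P_SMR < 5e-6 and P_HEIDI of 0.01) combine into a simple decision rule, sketched below. The function name `classify_association` and the wording of the returned labels are hypothetical. The text does not say how a P_HEIDI of exactly 0.01 is handled; this sketch treats it as passing.

```python
P_SMR_MAX = 5e-6    # first preset threshold from the examples in the text
P_HEIDI_MIN = 0.01  # second preset threshold from the examples in the text


def classify_association(p_smr, p_heidi):
    """Interpret one gene-trait result from its SMR and HEIDI P values."""
    if p_smr >= P_SMR_MAX:
        return "not significant"
    if p_heidi < P_HEIDI_MIN:
        # Significant heterogeneity: expression and trait signals are likely
        # driven by different variants in linkage disequilibrium.
        return "significant, multiple independent variants likely"
    return "significant, consistent with a single shared variant"


labels = [
    classify_association(1e-3, 0.50),
    classify_association(1e-8, 0.001),
    classify_association(1e-8, 0.20),
]
```

Checking P_SMR first mirrors the order in the text, where the HEIDI test is run only on associations already found significant.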
Through the gene data online analysis module, the visual gene data analysis system of this embodiment integrates advanced statistical genetics methods, namely Mendelian randomization analysis and the heterogeneity independent tool, to reveal potential molecular-phenotype associations between genetic variation and traits, greatly enhancing the accuracy and depth of research into genetic mechanisms.
In an optional embodiment, in FIG. 3, the gene data visualization display module 13 includes:
A gene probe identification submodule 131, used to select genes associated with genetic variation from the Mendelian randomization analysis results and heterogeneity independent tool analysis results viewed online by the target user, and to identify the probe names and probe positions associated with those genes;
A local locus identification submodule 132, used to take a preset region around the probe position as the local locus;
A locus data generation submodule 133, used to display, within the local locus, a local locus data map of the target gene generated from the multiple molecular-phenotype quantitative trait locus (xQTL) data sets and the genome-wide association study (GWAS) data sets.
Specifically, the locus visualization function provided by the gene data visualization display module in the embodiments of the present disclosure focuses on displaying genes that are significantly associated with traits. Based on the analysis results output by the gene data online analysis module, the function first selects the genes significantly associated with a trait and identifies the names and positions of the probes associated with those genes. It then defines the region within ±1 Mb of each probe position as the local locus. Within this local locus, the visual gene data analysis system queries and displays the GWAS data together with the xQTL data to form a comprehensive local locus plot, which not only reveals genetic associations but also enables in-depth exploration of the links between variants and traits.
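The ±1 Mb window can be sketched as follows. The function names, the dictionary layout of a variant, and the clamping of the window start to position 1 (assuming 1-based coordinates) are assumptions of this illustration.

```python
WINDOW_BP = 1_000_000  # ±1 Mb around the probe position, as stated in the text


def local_locus(chrom, probe_pos, window_bp=WINDOW_BP):
    """Return (chrom, start, end) of the local locus centred on a probe."""
    start = max(1, probe_pos - window_bp)  # assumption: 1-based coordinates
    return chrom, start, probe_pos + window_bp


def variants_in_locus(variants, locus):
    """Keep the variants that fall inside the locus (inclusive bounds)."""
    chrom, start, end = locus
    return [v for v in variants if v["chrom"] == chrom and start <= v["pos"] <= end]


# Made-up variants for illustration only.
variants = [
    {"snp": "rs1", "chrom": "1", "pos": 1_500_000},
    {"snp": "rs2", "chrom": "1", "pos": 3_600_000},
    {"snp": "rs3", "chrom": "2", "pos": 1_500_000},
]
locus = local_locus("1", 2_000_000)
selected = variants_in_locus(variants, locus)
```

The same window can be applied to the GWAS layer and to each xQTL layer so that all tracks in the plot share one coordinate range.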
In an optional embodiment, the gene data visualization display module includes a target link button, through which the target user follows the target link to enter the target interface.
FIG. 4 shows the target interface displayed by the gene data visualization display module. The left area of FIG. 4 displays the data processed by the gene data online analysis module, including gene names, locus names, and the Mendelian randomization results from the different xQTL analyses (such as eSMR, sSMR, mSMR, and xSMR), which correspond to the different types of quantitative trait loci (such as eQTL, sQTL, and mQTL). The right area of FIG. 4 shows the main locus visualization plot, which integrates multiple data layers: GWAS results, eQTL and other xQTL data, epigenetic annotations from the Roadmap Epigenomics Mapping Consortium (such as chromatin accessibility states), and gene annotations from the GENCODE database (including gene names, exon positions, and so on). Together, these layers depict the detailed relationships among genetic variation, traits, expression levels, and epigenetic states. For interaction, the target user can hover over or click a specific genetic variant (SNP) to immediately view its details, such as p-value, effect size, and associated tissue type. The target interface also provides a toolbar and sliders that allow the user to filter the xQTL data by tissue or to zoom in on a specific region of the plot for finer inspection and analysis. The target interface supports interactive clicking to dynamically display the genetic association results for each gene and locus, making the interface more dynamic and improving the user's exploration experience.
The gene data visualization display module in the embodiments of the present disclosure provides powerful locus visualization. It displays detailed local locus plots of GWAS data and xQTL data together with epigenetic and gene annotations, revealing genetic associations in depth and giving researchers strong visual support for parsing and understanding complex gene-trait association results.
In an optional embodiment, the Mendelian randomization analysis database includes a data storage module 141, used to store the Mendelian randomization analysis results and heterogeneity independent tool analysis results obtained from gene data analysis that integrates the multiple molecular-phenotype quantitative trait locus (xQTL) data sets and genome-wide association study (GWAS) data sets.
By way of example, the Mendelian randomization analysis database in the embodiments of the present disclosure stores and manages the analysis results for a wide range of GWAS traits and xQTL data. The database currently contains 60,255 gene-trait associations covering 213 complex traits and 103 xQTL data sets (spanning 49 tissues and 3 types of omics data). Each record includes the gene name, the trait, the tissue and molecular phenotype of the associated xQTL data set, the p-value indicating the significance of the association, and so on. This large data set provides a valuable resource for the research community and supports multi-level, multi-dimensional exploration of genetic mechanisms.
In an optional embodiment, in FIG. 2, the Mendelian randomization analysis database 14 includes a data maintenance module 142, used to update and maintain the multiple molecular-phenotype quantitative trait locus (xQTL) data sets and genome-wide association study (GWAS) data sets in real time.
By way of example, the Mendelian randomization analysis database is updated iteratively to incorporate newly published GWAS data and xQTL data. During an update, the system automatically detects the release of new data sets, evaluates their quality and their relevance to the existing data, and merges them into the database through a carefully designed data integration workflow, keeping the database content current and broadly applicable.
In an optional embodiment, in FIG. 3, the Mendelian randomization analysis database 14 further includes a data query module 143 and a data visualization module 144.
The data query module is used to support the target user in querying the genetic variation analysis results of the target gene sequence.
Through the data query module, the embodiments of the present disclosure provide an efficient search tool that allows users to query by specific gene, trait, or disease. This enables researchers to find the data they need quickly, thereby accelerating the research process.
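A query by gene or trait can be sketched with an in-memory SQLite table. The table name, its columns (modeled on the record fields listed for the database), and all rows are hypothetical placeholders; the disclosure does not specify a storage engine or schema.

```python
import sqlite3


def open_demo_db():
    """Create an in-memory table with made-up gene-trait association rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE association ("
        " gene TEXT, trait TEXT, tissue TEXT,"
        " molecular_phenotype TEXT, p_smr REAL)"
    )
    rows = [
        ("GENE_A", "TRAIT_1", "Whole_Blood", "eQTL", 1.2e-8),
        ("GENE_A", "TRAIT_2", "Brain_Cortex", "sQTL", 3.0e-7),
        ("GENE_B", "TRAIT_1", "Whole_Blood", "mQTL", 4.5e-9),
    ]
    conn.executemany("INSERT INTO association VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def query_associations(conn, gene=None, trait=None):
    """Return associations filtered by gene and/or trait, most significant first."""
    clauses, params = [], []
    if gene is not None:
        clauses.append("gene = ?")
        params.append(gene)
    if trait is not None:
        clauses.append("trait = ?")
        params.append(trait)
    sql = "SELECT gene, trait, tissue, molecular_phenotype, p_smr FROM association"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY p_smr"
    return conn.execute(sql, params).fetchall()


conn = open_demo_db()
hits = query_associations(conn, trait="TRAIT_1")
```

Parameterized placeholders (`?`) keep user-supplied search terms out of the SQL text, which matters for a tool exposed through a web interface.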
The data visualization module is used to support the target user in logging in to the gene data visualization display module to display the local locus data map of the target gene sequence.
Through the data login module in the embodiments of the present disclosure, combined with the powerful locus visualization function, the Mendelian randomization analysis database not only allows users to query specific gene-trait associations but also lets them explore the details of these associations across different tissues and multiple omics levels. Next to each gene-trait association result, a "Locus plot" link and button are provided; clicking it opens the detailed locus visualization interface. With this visualization tool, researchers can intuitively understand the genetic background of gene-trait associations and thus carry out research and data interpretation more effectively.
In FIG. 2, the gene data online analysis module 11 obtains either its built-in xQTL data sets or custom uploaded xQTL data sets and, in combination with GWAS summary statistics, performs Mendelian randomization analysis and HEIDI analysis. The gene data online analysis module 11 supports complex data sets and provides the target user with a flexible and powerful analysis tool. The gene data visualization display module 13 lets the target user select GWAS trait-associated loci and genes of interest for in-depth exploration, and its visualization interface displays the relevant GWAS, eQTL (49 tissues), sQTL (also 49 tissues), and mQTL (2 tissues) data. In addition, the chromatin state annotation plot shows 14 chromatin states (color-coded) for 127 samples from the Roadmap Epigenomics Mapping Consortium, covering a variety of primary cell and tissue types. Gene annotation is taken from the GENCODE database and provides exons, gene coordinates, gene names, orientation, and related information. The Mendelian randomization analysis database 14 contains 60,255 gene-trait associations from Mendelian randomization analysis and HEIDI analysis, covering 213 GWAS traits and 103 xQTL data sets. The locus visualization offered by the database supports further data exploration and strengthens the user's ability to understand genetic associations in depth.
The visual gene data analysis system in the embodiments of the present disclosure greatly simplifies the complex workflow of genetic research and improves research efficiency and accuracy. The system integrates extensive GWAS data and xQTL data from different tissues and multiple omics levels, supports user upload of custom data, and provides a Mendelian randomization analysis database with extensive gene-trait associations, further enhancing the flexibility and breadth of research. With this system, researchers can systematically analyze the effects of genetic variation on specific phenotypes and gain an in-depth understanding of the underlying biological mechanisms. In addition, the dynamic interactive interface and powerful visualization tools of the system allow researchers to intuitively inspect and interpret complex genetic association data, accelerating the path from gene discovery to functional understanding. The system not only advances functional genomics but also provides strong scientific support for personalized medicine research.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the present invention, and all such modifications and variations fall within the scope defined by the appended claims.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410675816.8A CN118692572A (en) | 2024-05-28 | 2024-05-28 | A visual genetic data analysis system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410675816.8A CN118692572A (en) | 2024-05-28 | 2024-05-28 | A visual genetic data analysis system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118692572A (en) | 2024-09-24 |
Family
ID=92763994
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410675816.8A (Pending; CN118692572A (en)) | A visual genetic data analysis system | 2024-05-28 | 2024-05-28 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118692572A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN119446256A (en) * | 2024-11-20 | 2025-02-14 | 华中科技大学 | Social Isolation and Neuroticism Analysis System Based on GWAS |
- 2024-05-28: CN202410675816.8A (CN118692572A), status: Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||