CN108004302A

CN108004302A - Transcriptome-referenced association analysis method and its application

Info

Publication number: CN108004302A
Application number: CN201711319379.2A
Authority: CN
Inventors: 刘头明; 李富; 朱四元
Original assignee: Institute of Bast Fiber Crops of CAAS
Current assignee: Institute of Bast Fiber Crops of CAAS
Priority date: 2017-12-12
Filing date: 2017-12-12
Publication date: 2018-05-08

Abstract

The present invention relates to a transcriptome-referenced association analysis method, comprising: 1). establishing a reference population and fully sequencing the transcriptomes of all individuals in the reference population to obtain reference transcripts; 2). performing high-throughput RNA sequencing on the prediction population, and aligning and assembling the resulting sequencing data against the reference transcripts to obtain population transcriptome sequencing results; 3). deriving population SNP data from the population transcriptome sequencing results, performing association analysis between the population SNPs and the trait of interest to determine the region in which the target gene is located, and then associating gene expression variation in the target region with the trait to obtain candidate genes for the target trait. Compared with GWAS, transcriptome sequencing is cheap, fast, and not limited by genome complexity. In theory, the technique enables trait association analysis in any species, regardless of whether the species has a reference genome or how complex its genome is.

Description

A transcriptome-referenced association analysis method and its application

Technical Field

The present invention relates to the field of biotechnology, and in particular to a transcriptome-referenced association analysis method and its application.

Background Art

With the development of sequencing technology and the spread of high-density single nucleotide polymorphism (SNP) chips, the genome-wide association study (GWAS) has increasingly become a powerful tool for human disease research and animal breeding. GWAS is based on the principle of linkage disequilibrium: it locates the genes for a target trait by analyzing, in natural populations, the association between tens of thousands of SNP markers across the genome and the target trait.

As GWAS has been studied in greater depth, the following shortcomings have gradually come to light:

First, identifying genome-wide SNP markers requires resequencing the samples and aligning the reads to a reference genome. At present, however, only a few species have been fully sequenced and therefore have the reference genome that GWAS requires. A major drawback of GWAS is thus its dependence on a reference genome for the species concerned. Recent GWAS methods based on reduced-representation genomes do not depend on a reference genome, but because the gene information near the genetic markers is unknown, candidate genes for the target trait are difficult to detect, which limits the value of these methods in biological research.

Second, GWAS can only localize the target gene to a genomic region; it is difficult to identify candidate genes for the target trait directly.

Third, to avoid statistical error in gene mapping, GWAS usually requires a large population sample.

Fourth, GWAS has difficulty determining the relationships among candidate genes for the target trait.

In view of this, the present invention is proposed.

Summary of the Invention

The object of the present invention is to provide a transcriptome-referenced association study (TRAS) method to solve the above problems.

The present invention relates to a transcriptome-referenced association analysis method, comprising:

1). establishing a reference population and fully sequencing the transcriptomes of all individuals in the reference population to obtain reference transcripts;

2). performing high-throughput RNA sequencing on the prediction population, and aligning and assembling the resulting sequencing data against the reference transcripts to obtain population transcriptome sequencing results;

3). deriving population SNP data from the population transcriptome sequencing results, performing association analysis between the population SNPs and the trait of interest to determine the region in which the target gene is located, and then associating gene expression variation in the target region with the trait to obtain candidate genes for the target trait.

Unlike genome analysis, transcriptome sequencing is cheap, fast, and not limited by genome complexity. The present invention uses the full-length transcriptome of the species under study as the reference sequence, in place of the genome, for SNP identification and gene expression quantification in the study population. In theory, therefore, the technique enables trait association analysis in any species; neither the absence of a reference genome nor a highly complex genome prevents its use.

Detailed Description of the Embodiments

In the present invention, TRAS is carried out in two steps. First, like GWAS, it performs association analysis between SNPs and the trait to determine the region in which the target gene is located. TRAS then associates the expression of genes in the target region with the trait to identify candidate genes for the target trait.

Preferably, the transcriptome-referenced association analysis method described above further comprises:

determining the eQTLs of the candidate genes, and then checking whether those eQTLs contain any of the other candidate genes for the target trait.

eQTL (expression quantitative trait locus) analysis treats the expression data of each genotype in a segregating population as a quantitative trait and analyzes it with conventional QTL methods. In the present invention, to determine the interactions among candidate genes, TRAS uses the expression level of a candidate gene (call it gene A) as the phenotype and performs association analysis with the SNPs in the population, thereby determining the eQTLs of that candidate gene. These eQTLs are then checked for other candidate genes for the target trait. If one is found (call it gene B), then gene B is both a candidate gene for the target trait and an eQTL of another candidate gene for that trait, from which it can be concluded that genes A and B interact in regulating the target trait.

Preferably, in the transcriptome-referenced association analysis method described above, the gene expression variation includes variation in gene sequence and variation in gene expression level.

GWAS uses only one parameter (sequence variation in the population, i.e. SNPs) for association with the trait, so the analysis needs as large a population sample as possible to reduce error. TRAS performs association analysis not only between SNPs and the trait but also between gene expression variation and the trait. Because it draws on two parameters, sequence variation and gene expression variation, the population sample need not be as large as for GWAS.

Preferably, in the transcriptome-referenced association analysis method described above, step 1) further comprises, after the reference transcripts are obtained, functionally annotating the reference transcripts.

Preferably, in the transcriptome-referenced association analysis method described above, in step 1), the number of varieties in the reference population is ≥ 1.

In the present invention, the reference population serves as the reference for processing the population transcriptome data; for reasons of cost, the reference population should therefore not be made too large.

Preferably, in the transcriptome-referenced association analysis method described above, in step 1), the transcriptomes are fully sequenced by single-molecule real-time sequencing.

Preferably, in the transcriptome-referenced association analysis method described above, in step 3), before the association analysis between the population SNPs and the trait of interest, co-expression analysis is first performed on the population transcriptome sequencing results: transcripts that are co-expressed in the population are assigned to a module, the correlation between each module and the trait of interest is analyzed, and the co-expression modules associated with the trait of interest are identified.

Preferably, in the transcriptome-referenced association analysis method described above, in step 3), the model used for the association analysis between the population SNPs and the trait of interest is a general linear model or a mixed linear model.

Preferably, in the transcriptome-referenced association analysis method described above, in step 3), the SNPs used for the association analysis with the trait of interest have a quality score ≥ 40, a coverage depth ≥ 2, a minor allele frequency ≥ 0.05, and a missing-genotype frequency ≤ 0.5.

Use of the method described above in screening and identifying major-effect genes and in analyzing associations between gene functions.

Embodiments of the present invention are described in detail below with reference to an example. Those skilled in the art will understand that the following example serves only to illustrate the invention and should not be regarded as limiting its scope. Where specific conditions are not stated in the example, conventional conditions or the conditions recommended by the manufacturer were used. Reagents and instruments for which no manufacturer is stated are conventional, commercially available products.

The genome-wide association study (GWAS) is a powerful tool for identifying genes underlying complex traits, but its application is limited by the need for a reference genome sequence of the species under study. The present invention proposes a method, the transcriptome-referenced association study (TRAS), which uses a transcriptome generated by single-molecule real-time sequencing as the reference sequence to score population variation in gene sequence and gene expression. Candidate genes are identified when both scores are associated with the trait, and their potential interactions are determined by expression quantitative trait locus analysis. In the example below, applying this method to garlic bulb traits in 102 landraces, we identified 23 candidate transcripts, most of which showed extensive interactions. Thirteen of the transcripts are lncRNAs; the others encode proteins involved mainly in carbohydrate metabolism and protein degradation. As an effective tool for association studies that is independent of a reference genome, TRAS extends the applicability of association studies to a wide range of species.

Example

I. Methods

1. Plant material, experimental design, and phenotypic measurements

We collected 102 garlic landraces in total, 92 from China and 10 from overseas, and planted them in September 2014 in the experimental field of the Institute of Bast Fiber Crops, Chinese Academy of Agricultural Sciences. The field trial used a randomized complete block design with two replicates per block. In each replicate, 36 cloves of one landrace were planted in three rows, with 10 cm between rows and 20 cm between the two replicates. Mature garlic was harvested in 2015, and 30 plants were taken from the middle of the three rows. The samples were then surface air-dried, and their length, width, and thickness were measured with a vernier caliper.

2. PacBio single-molecule real-time sequencing: library construction and sequencing

From the 102 garlic landraces, the landrace Yangxi was selected for single-molecule sequencing. About 1 μg of total RNA from developing bulbs was reverse-transcribed into full-length cDNA using the Clontech SMARTer cDNA synthesis kit (Clontech Laboratories, Mountain View, CA, USA) and oligo(dT) primers. The reaction was performed in triplicate. PCR products were purified with AMPure PB beads (Pacific Biosciences, Menlo Park, CA, USA). The resulting dsDNA was size-selected on the BluePippin system and then re-amplified. Finally, the products were purified and used for Iso-Seq SMRTbell library preparation (https://pacbio.secure.force.com/SamplePrep) to produce three libraries (1-2 kb, 2-3 kb, and 3-6 kb), which were sequenced on the PacBio RSII platform with the P6-C4 chemistry kit.

3. PacBio sequencing data analysis

Sequencing data were processed with the SMRT Analysis software (http://pacificbiosciences.github.com/DevNet/). Circular consensus sequencing (CCS) reads were generated from the subread files with the following parameters: minimum length, 300; max_drop_fraction, 0.8; min_passes, 1; min_predicted_accuracy, 0.8. The CCS reads were classified as full-length or non-full-length according to two criteria: the presence of a poly(A) tail and a minimum sequence length of 300. Under default settings, full-length and non-full-length CCS reads were clustered together. Sequences that were non-redundant and had no ends were defined as transcripts. Because PacBio reads have a higher nucleotide error rate than the shorter reads of second-generation sequencing, they were corrected with the proovread software using Illumina RNA sequencing data from Yangxi garlic. Redundant sequences were then processed with CD-HIT.

4. Annotation of the transcriptome

We used the Coding Potential Calculator to assess the coding potential of the transcripts, based on their quality and completeness and on alignment of their open reading frames with similar sequences in current databases. Transcripts without protein-coding potential were classified as long non-coding RNAs (lncRNAs); the remainder were annotated against seven public databases as in previous studies. The coding sequence (CDS) of each transcript was predicted by BLAST searches against the NCBI non-redundant protein sequence database and the SwissProt protein database, and with the ESTScan program.

5. Illumina RNA sequencing: library construction and sequencing

To characterize population variation in SNPs (single nucleotide polymorphisms) and GE (gene expression), all 102 landraces were subjected to Illumina RNA sequencing. Total RNA from each individual was used to construct a cDNA fragment library with a fragment length of 250 bp (±25 bp) using the Ultra™ RNA Library Prep Kit (New England BioLabs, Ipswich, MA, USA). Sequencing was then performed on the Illumina HiSeq™ 2500 platform with the HiSeq PE Cluster Kit v4 cBot. Finally, the raw reads were filtered to obtain clean reads for analysis.

6. Expression analysis

To quantify transcript expression in the 102 landraces, all clean Illumina reads from each landrace were aligned with Bowtie 2 to the transcripts obtained by PacBio sequencing. The expression level of each transcript in each sample was estimated with RSEM as the expected number of fragments per kilobase of transcript sequence per million base pairs sequenced (FPKM). Co-expression analysis of the transcripts across the 102 genotypes was performed with the R package for weighted gene co-expression network analysis (WGCNA), and transcripts that were co-expressed in the population were assigned to a module. The minimum module size was set to 30 transcripts, and modules were merged if their similarity was at least 25%. The eigengene value of each module was then estimated, and the correlation between modules and traits was analyzed using Pearson correlation coefficients to identify the co-expression modules associated with each trait.
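As a rough illustration of the length- and depth-normalized expression measure described above, the sketch below computes fragments per kilobase per million (FPKM) from per-transcript fragment counts. It assumes the common definition that normalizes by the total number of mapped fragments; RSEM estimates expected counts with its own statistical model, so this is not its implementation. The function name, the input dictionaries, and the transcript IDs are hypothetical.

```python
from typing import Dict


def fpkm(fragment_counts: Dict[str, float], lengths_bp: Dict[str, int]) -> Dict[str, float]:
    """Return an FPKM value per transcript (illustrative definition only).

    FPKM = fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)
    """
    total = sum(fragment_counts.values())
    if total <= 0:
        raise ValueError("no mapped fragments")
    return {
        tid: count * 1e9 / (lengths_bp[tid] * total)
        for tid, count in fragment_counts.items()
    }


# Hypothetical toy input: two transcripts in one sample.
example = fpkm({"tx1": 500, "tx2": 1500}, {"tx1": 1000, "tx2": 3000})
```

In this toy input the longer transcript has three times the fragments and three times the length, so both transcripts receive the same normalized value.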

7. Population SNP detection and phylogenetic analysis

To identify SNPs in the 102 landraces, all Illumina reads from each sample were aligned to the reference transcriptome using Samtools and Picard. After reads without a unique mapping position were removed, SNP calling for the population was performed with GATK, requiring an SNP quality ≥ 40. To exclude SNP calling errors caused by misalignment, only high-quality SNPs were retained (coverage depth ≥ 2, minor allele frequency ≥ 0.05, missing-genotype frequency ≤ 0.5). An individual-based neighbor-joining tree was constructed with TreeBest (http://treesoft.sourceforge.net/treebest.shtml) to clarify the phylogenetic relationships among the 102 landraces, and was visualized in FigTree (http://tree.bio.ed.ac.uk/software/figtree/).
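The filtering thresholds above can be expressed as a simple predicate. The following is a minimal sketch, not the GATK workflow itself: it assumes biallelic SNPs in a diploid sample set, with each genotype encoded as an alternate-allele count (0, 1, or 2) and `None` for a missing call. The function names and the encoding are illustrative assumptions.

```python
from typing import Optional, Sequence


def minor_allele_frequency(genotypes: Sequence[Optional[int]]) -> float:
    """Minor allele frequency over called genotypes (0/1/2 alt-allele counts)."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1.0 - alt_freq)


def missing_rate(genotypes: Sequence[Optional[int]]) -> float:
    """Fraction of samples with no genotype call."""
    if not genotypes:
        return 1.0
    return sum(g is None for g in genotypes) / len(genotypes)


def passes_filters(
    quality: float,
    depth: int,
    genotypes: Sequence[Optional[int]],
    min_quality: float = 40.0,   # SNP quality >= 40
    min_depth: int = 2,          # coverage depth >= 2
    min_maf: float = 0.05,       # minor allele frequency >= 0.05
    max_missing: float = 0.5,    # missing-genotype frequency <= 0.5
) -> bool:
    """Apply the four thresholds stated in the text to one SNP site."""
    return (
        quality >= min_quality
        and depth >= min_depth
        and minor_allele_frequency(genotypes) >= min_maf
        and missing_rate(genotypes) <= max_missing
    )
```

Keeping the thresholds as keyword arguments makes it easy to see how many SNPs survive under stricter or looser settings.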

8. Identification of candidate transcripts for the traits

To mine candidate transcripts for CL (clove length), CW (clove width), and CT (clove thickness), we performed a transcriptome-wide analysis to detect loci potentially associated with these traits. In an association panel of 102 samples, a mixed linear model was used to test a total of 19,912 high-quality SNPs for association with the three CSTs (clove shape traits). A suggestive P-value threshold of 1/N (N being the number of SNPs tested) was set to control the error rate, giving a threshold of 5×10^-5. To verify the association between the suggestive loci and the traits, the Pearson correlation between the expression of the transcripts carrying the suggestive loci and the trait phenotype was calculated; correlations were considered significant at P<0.05. Transcripts associated with a trait at both the sequence level and the expression level were defined as candidate transcripts for that trait.
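The two-step selection described above can be sketched as follows. This is an illustrative outline only: it takes the SNP association P-values and the expression-trait correlation P-values as given (the mixed linear model and the correlation test are not implemented here), and the function names, data layout, and transcript IDs are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def suggestive_threshold(n_snps: int) -> float:
    """Suggestive P-value threshold 1/N for N tested SNPs."""
    return 1.0 / n_snps


def candidate_transcripts(
    snp_hits: Iterable[Tuple[str, float]],
    expression_p: Dict[str, float],
    n_snps: int,
    expression_alpha: float = 0.05,
) -> List[str]:
    """Select transcripts associated with the trait at both levels.

    snp_hits: (transcript_id, SNP association P-value) pairs.
    expression_p: transcript_id -> P-value of the expression-trait correlation.
    """
    threshold = suggestive_threshold(n_snps)
    # Step 1: transcripts carrying at least one suggestive SNP.
    associated = {tid for tid, p in snp_hits if p < threshold}
    # Step 2: keep those whose expression also correlates with the trait.
    return sorted(
        tid for tid in associated
        if expression_p.get(tid, 1.0) < expression_alpha
    )


# Hypothetical toy data; 19,912 is the SNP count reported in the text.
hits = [("txA", 1e-6), ("txB", 3e-5), ("txC", 1e-3)]
expr_p = {"txA": 0.01, "txB": 0.20, "txC": 0.001}
selected = candidate_transcripts(hits, expr_p, n_snps=19912)
```

With N = 19,912 the threshold evaluates to about 5.02×10^-5, consistent with the 5×10^-5 value stated above. In the toy data only `txA` passes both steps.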

9. eQTL analysis to detect potential interactions

eQTLs, which link variation in gene expression level to genotype, have been shown to be able to detect gene interactions. To determine the relationships among the candidate transcripts, eQTL analysis was performed with a mixed linear model on the 102 landraces of the TRAS (transcriptome-referenced association study) panel. In our eQTL analysis, the 19,912 SNPs were defined as genotypes and the expression of the candidate transcripts as phenotypes. A significance P-value threshold of 0.05/N, i.e. 2.5×10^-6, was set to control the error rate. If one transcript was located in an eQTL of another transcript, the two transcripts were considered likely to interact.
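The interaction rule above can be sketched as a small graph-building step. This is a minimal illustration under stated assumptions: eQTL P-values are supplied as input, each SNP is mapped to the transcript it lies in, and an SNP located in the target transcript itself is skipped because it says nothing about a second transcript. All names and IDs are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def eqtl_threshold(n_snps: int, alpha: float = 0.05) -> float:
    """Bonferroni-style significance threshold alpha/N."""
    return alpha / n_snps


def interaction_edges(
    eqtl_hits: Iterable[Tuple[str, str, float]],
    snp_to_transcript: Dict[str, str],
    candidates: Set[str],
    n_snps: int,
) -> List[Tuple[str, str]]:
    """Return (eQTL-transcript, target transcript) pairs among candidates.

    eqtl_hits: (snp_id, target_transcript_id, P-value) triples, where the
    expression of the target transcript was used as the phenotype.
    """
    threshold = eqtl_threshold(n_snps)
    edges = set()
    for snp_id, target, p_value in eqtl_hits:
        if p_value >= threshold or target not in candidates:
            continue
        source = snp_to_transcript.get(snp_id)
        # Keep only eQTLs that sit in a different candidate transcript.
        if source is None or source == target or source not in candidates:
            continue
        edges.add((source, target))
    return sorted(edges)


# Hypothetical toy data.
snp_map = {"snp1": "txB", "snp2": "txA", "snp3": "txZ"}
hits = [
    ("snp1", "txA", 1e-8),  # SNP in txB affects txA: candidate interaction
    ("snp2", "txA", 1e-9),  # SNP inside txA itself: skipped
    ("snp3", "txA", 1e-9),  # SNP in a non-candidate transcript: skipped
    ("snp1", "txB", 1e-3),  # not significant
]
edges = interaction_edges(hits, snp_map, {"txA", "txB"}, n_snps=19912)
```

For N = 19,912 the threshold is about 2.51×10^-6, matching the 2.5×10^-6 value in the text. Passing the candidate set for a single trait restricts the edges to transcripts associated with the same trait.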

II. Results

1. Transcriptome of developing cloves obtained by third-generation sequencing (Pacific Biosciences)

The landrace Yangxi yielded a total of 36,321 transcripts comprising 54.48 million bases. Transcript lengths ranged from 120 bp to 4,803 bp, with a mean of 1,500 bp. Detection of poly(A) and poly(T) showed that 70% were full-length transcripts. Of the transcripts, 31,125 were functionally annotated, 287 had no annotation, and 4,909 were lncRNAs (long non-coding RNAs).

2. Population variation in traits, sequence, and gene expression

Across the 102 landraces, the clove shape traits varied 2.3-fold for clove length, 3.7-fold for clove width, and 3.6-fold for clove thickness. The three traits were significantly correlated (P<0.01), indicating that they influence one another.

To analyze sequence and expression across the genotypes of the 102 landraces, we sequenced the transcriptomes of developing bulbs in the population and obtained about 6.05 billion clean reads, an average of 59.3 million reads per landrace. These reads were aligned to the reference transcriptome (from third-generation sequencing) with a coverage of 51.3% to 80.2%. The aligned reads were then used to score variation in gene sequence (SNP) and gene expression (GE). A total of 55,012 SNPs were identified in 8,245 transcripts, indicating that 77% of the transcripts were conserved among the 102 landraces. After filtering, 19,912 high-quality SNPs in 5,408 transcripts were obtained; half of them were located in coding sequence regions and the remainder in the 3' and 5' non-coding regions. Phylogenetic analysis based on the SNP genotypes divided the 102 garlic landraces into three distinct groups. Interestingly, the landraces from China showed a clear geographical distribution. We also quantified GE and characterized the genome-wide expression profile of each landrace at the clove expansion stage. Based on weighted gene co-expression network analysis, the 36,321 transcripts were grouped into 46 co-expression modules (CEMs), each containing between 37 and 7,132 genes.

3. Candidate transcripts associated with clove length, clove width, and clove thickness

We identified 20 SNPs in 18 transcripts associated with clove thickness, 40 SNPs in 24 transcripts associated with clove width, and 21 SNPs in 15 transcripts associated with clove length (P<5×10^-5). In total, 42 transcripts were associated with clove shape traits (CSTs); 5 of them were associated with all three traits and 5 with two traits. Because association analysis based on linkage disequilibrium can only identify trait-associated genomic regions that may contain many genes, it remains difficult to pinpoint the candidate genes for a trait. To determine whether these transcripts are involved in the CSTs, we analyzed the correlation between the GE of the 42 transcripts and the corresponding trait phenotypes. The expression of 5, 17, and 5 transcripts was significantly correlated with CT, CW, and CL, respectively (P<0.05). Of these, 2, 16, and 4 transcripts were significantly negatively correlated with CT, CW, and CL, respectively, suggesting that they negatively regulate garlic clove growth. Only 3, 1, and 1 transcripts were found to positively regulate CT, CW, and CL, respectively. Because both gene sequence and GE level were associated with the traits, we concluded that the 5, 17, and 5 transcripts identified above are candidate transcripts for CT, CW, and CL, respectively. Four of them are pleiotropic, which reduces the total number of candidate transcripts to 23.
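The split into negatively and positively correlated transcripts rests on the sign of the Pearson correlation between expression and phenotype. The sketch below computes the coefficient in plain Python and reports its direction; it does not compute the P-value used for the P<0.05 cutoff, and the function names and sample values are illustrative assumptions.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def direction(r: float) -> str:
    """Label the sign of a correlation coefficient."""
    return "negative" if r < 0 else "positive"


# Hypothetical expression values and clove widths for four landraces.
expression = [1.0, 2.0, 3.0, 4.0]
clove_width = [8.0, 6.0, 4.0, 2.0]
r = pearson_r(expression, clove_width)
```

Here expression rises as clove width falls, so the transcript would be counted among the negatively correlated ones.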

4. Potential interactions among candidate transcripts for clove shape

To detect potential interactions among the trait-associated transcripts, we performed eQTL analysis on the 23 trait-associated transcripts. The results showed that a number of SNPs were associated with the expression of the 23 transcripts. No SNPs were associated with the expression of ASTG3334, ASTG34822, ASTG35427, or ASTG9068, suggesting that the expression of these transcripts is rarely regulated by other transcripts and that they may act upstream in the pathways that regulate the traits. The expression of ASTG35908, ASTG1416, ASTG4276, and ASTG4283 was each associated with at least 40 SNPs (located in 29, 26, 26, and 207 transcripts, respectively), suggesting that these four candidate transcripts are regulated by the expression of many genes and may act downstream in the trait regulation pathways.

We define the transcript in which an eQTL is located as the eQTL-transcript, and we assume that if a transcript and its eQTL-transcript are associated with the same trait, the two transcripts control the trait through their interaction. We then determined the eQTL-transcripts of each candidate trait-associated transcript. Among the 5 CT and 5 CL candidate transcripts, only two CT and two CL transcripts showed potential interactions, whereas 14 of the 17 CW transcripts formed a potential interaction network. The CW candidate transcripts fell into three groups. Six transcripts whose expression was not regulated by other transcripts were assigned to the first group and are regarded as upstream transcripts in trait regulation. Six transcripts that regulate the expression of both downstream and upstream transcripts were assigned to the second group; these six showed extensive interactions with one another, forming an interaction sub-network. The remaining two transcripts, which are not involved in regulating the expression of other transcripts, were assigned to the third group. These results therefore provide a basic understanding of the interaction network that controls the CSTs.

Finally, it should be noted that the above embodiments serve only to illustrate the technical solutions of the present invention and do not limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in those embodiments may still be modified, or some or all of their technical features replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for association analysis of transcriptome reference, characterized in that it comprises:

1). Establish a reference population and measure the transcriptome sequences of all individuals in the reference population to obtain reference transcripts;

2). Perform high-throughput RNA sequencing on the predicted population, and align and assemble the resulting sequencing data against the reference transcripts to obtain the population transcriptome sequencing results;

3). Obtain population SNP data from the population transcriptome sequencing results, and perform association analysis between the population SNPs and the trait of interest to determine the region where the target gene is located; then associate the gene expression variation in the target region with the trait to obtain the candidate genes for the target trait.

2. The method for association analysis of transcriptome reference according to claim 1, characterized in that the method further comprises:

determining the eQTL of the candidate gene, and then checking whether the remaining candidate genes for the target trait are contained in the eQTL.

3. The method for association analysis of transcriptome reference according to claim 1, wherein the gene expression variation includes gene sequence variation and gene expression variation.

4. The method for association analysis of transcriptome reference according to claim 1, characterized in that, in step 1), after obtaining the reference transcript, it further comprises: performing functional annotation on the reference transcript.

5. The method for association analysis of transcriptome reference according to claim 1 or 4, characterized in that, in step 1), the number of species in the reference population is ≥ 1.

6. The method for association analysis of transcriptome reference according to claim 1, characterized in that, in step 1), the method for measuring the transcriptome sequence is single-molecule real-time sequencing.

7. The method for association analysis of transcriptome reference according to claim 1, characterized in that, in step 3), before the population SNPs are used for association analysis with the trait of interest, co-expression analysis is performed on the population transcriptome sequencing results to assign transcripts showing co-expression in the population to modules, and the association of the modules with the trait of interest is analyzed to identify co-expression modules associated with the trait of interest.

8. The association analysis method of transcriptome reference according to claim 1, characterized in that, in step 3), when utilizing the population SNP and the trait of interest to carry out association analysis, the association analysis model used is a general linear model or a mixed linear model.

9. The method for association analysis of transcriptome reference according to claim 1, characterized in that, in step 3), when the population SNPs are used for association analysis with the trait of interest, the SNPs have a quality score ≥ 40, a coverage depth ≥ 2, a minor allele frequency ≥ 0.05, and a missing-genotype frequency ≤ 0.5.

10. The application of the method according to any one of claims 1 to 9 in the screening and identification of major genes and the correlation of gene functions.
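
The SNP thresholds recited in claim 9 can be illustrated with a short Python sketch. It is a hedged example only: the function name `snp_passes`, its parameters, and the genotype encoding (0, 1, or 2 copies of the alternate allele, `None` for a missing call) are assumptions made for illustration, and the sketch assumes diploid, biallelic SNPs with a single depth value per SNP. The claim does not specify how these quantities are computed.

```python
def snp_passes(qual, depth, genotypes,
               min_qual=40, min_depth=2, min_maf=0.05, max_missing=0.5):
    """Return True if a SNP meets the four thresholds of claim 9.

    qual      -- SNP quality score
    depth     -- coverage depth for the SNP (assumed to be one value per SNP)
    genotypes -- per-individual alternate-allele counts (0, 1, 2) or None
                 when the genotype is missing (hypothetical encoding)
    """
    total = len(genotypes)
    called = [g for g in genotypes if g is not None]
    if total == 0 or not called:
        return False  # nothing to evaluate

    missing_freq = 1 - len(called) / total
    # Alternate-allele frequency among called diploid genotypes.
    alt_freq = sum(called) / (2 * len(called))
    maf = min(alt_freq, 1 - alt_freq)

    return (qual >= min_qual and depth >= min_depth
            and maf >= min_maf and missing_freq <= max_missing)


# Toy example: four individuals, one missing call.
keep = snp_passes(qual=50, depth=3, genotypes=[0, 1, 2, None])
```

Here one of four genotypes is missing (frequency 0.25) and the minor allele frequency among called genotypes is 0.5, so the SNP is retained.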