CN112164424A

CN112164424A - Population evolution analysis method based on non-reference genome

Info

Publication number: CN112164424A
Application number: CN202010768331.5A
Authority: CN
Inventors: 刘书云; 张海焕; 姜丽荣; 孙子奎
Original assignee: Nanjing Personal Gene Technology Co ltd
Current assignee: Nanjing Personal Gene Technology Co ltd
Priority date: 2020-08-03
Filing date: 2020-08-03
Publication date: 2021-01-01
Anticipated expiration: 2040-08-03
Also published as: CN112164424B

Abstract

The invention discloses a reference-genome-free population evolution analysis method based on 2d-RAD sequencing. The method comprises splitting the sample data, filtering and clustering the data to obtain population SNPs, analyzing population genetic parameters based on the sample grouping and the population SNP information, constructing a phylogenetic tree, determining the optimal K value, and then using a self-written R script to search for SNPs shared by, and specific to, two groups according to the population SNP information and the designated group information. The overall data analysis is more automated, which saves labor cost, improves analysis efficiency, avoids possible human error, and produces more attractive charts.

Description

Population evolution analysis method based on non-reference genome

Technical Field

The invention relates to the technical field of gene sequencing analysis, in particular to a population evolution analysis method based on a non-reference genome.

Background

Population evolution analysis makes it possible to study differences in population structure and gene flow among subpopulations of the same species, as well as the population structure characteristics of different species. However, many species have no published reference genome, so population evolution analysis without a reference genome is required.

Several reference-free library construction methods exist (RAD, GBS, 2d-RAD, SLAF, etc.), and they differ in data splitting, the first step of reference-free analysis. The existing 2d-RAD-based reference-free analysis method has a complex and inefficient data filtering process, especially when there are many projects and each project contains many samples. In practice, one project may be sequenced in several runs, producing data in different batches; the existing method cannot merge and filter these batches through an automated process, so merging and filtering consume a great deal of manual time.

With the continuing development of high-throughput sequencing, the existing analysis pipeline offers only limited analysis content, whereas newer reference-free analyses are more diversified and personalized. The former reference-free pipeline required manual operation at many points; a more automated reference-free analysis method improves server utilization, reduces the workload of analysts, and makes the analysis content easier to control.

Disclosure of Invention

In order to overcome the above-mentioned drawbacks of the prior art, the present invention aims to provide an automated analysis method for population evolution analysis based on a reference-free genome.

In order to realize the purpose of the invention, the adopted technical scheme is as follows:

a population evolution analysis method based on a reference-free genome after 2d-RAD sequencing comprises the following steps:

the first step: splitting the data with a splitting script according to the barcode and the enzyme 1 and enzyme 2 cleavage-site information of the sequenced samples, merging the data from multiple sequencing runs of the same sample, and storing the merged data in a first folder in fastq.gz format;

the second step: performing quality control on the split and merged data from the first step with a filtering script, then filtering the data by base quality value and sequence length (Q ≥ 20 and length ≥ 50 bp) to obtain filtered data, and storing the filtered data in a second folder in fastq.gz format;

the third step: first clustering sequences within each single sample, wherein the paired-end sequencing data of a sample are merged into one file before clustering, clustering is then performed with the ustacks command of the Stacks software to obtain the representative catalog sequences of each sample, and the result files are stored in a third folder in tags.tsv.gz format;

the fourth step: grouping the samples and clustering the catalog sequences of the individual samples to obtain consensus sequences for all samples, the consensus sequences serving as a reference-like genome sequence for all samples;

the fifth step: reading the grouping information specified for every sample, specifying the missing-rate parameter, detecting the population SNP information with the cstacks command of the Stacks software, and storing the population SNP information as a VCF file;

the sixth step: analyzing population genetic parameters with the populations command of Stacks based on the population SNP information from the fifth step, and calculating the population differentiation index Fst, the nucleotide diversity π, the expected heterozygosity, the observed heterozygosity and the haplotype diversity;

the seventh step: converting the format of the population SNP VCF file from the fifth step with the software vcftools and plink, then performing dimensionality reduction on the SNPs with the software GCTA to obtain the three principal components with the greatest influence on the population and calculating the contribution of each principal component, and finally drawing a PCA plot with a self-written R script;

the eighth step: converting the format of the population SNP information and concatenating the SNP information of each single sample with a self-written Python script, and then constructing a phylogenetic tree using different models;

the ninth step:

converting the population SNP format into the format required by the software STRUCTURE with a self-written Perl script, then specifying the number of SNPs and the number of populations used in the analysis, and calculating the percentage of each sample's ancestry assigned to each specified ancestor;

then determining the optimal K value (the number of ancestral populations), the result showing whether the grouping of the samples is consistent with the grouping originally specified;

the tenth step:

searching for SNPs shared by, and unique to, two groups according to the population SNP information and the designated group information with a self-written Perl script.

In a preferred embodiment of the present invention, the filtering script is filter_batch_v2.pl.

In a preferred embodiment of the present invention, the model for constructing the phylogenetic tree includes any one or more of Maximum Parsimony (MP), Neighbor-Joining (NJ), Maximum Likelihood (ML) or the Bayesian method (BI).

In a preferred embodiment of the present invention, the optimal K value is the K value corresponding to the inflection point after the ln likelihood enters the plateau.

The invention has the beneficial effects that:

the whole data analysis is more automatic, so that the labor cost is saved, the analysis efficiency is improved, possible human errors are avoided, and the analyzed data chart is more attractive.

Drawings

FIG. 1 is a flow chart of the present invention.

FIG. 2 is a PCA plot of the present invention.

FIG. 3 is an evolutionary tree of the present invention based on the NJ model.

FIG. 4 is a plot of the population genetic structure at the optimal K of 3.

Detailed Description

The principle of the invention is as follows:

the reference-free automated filtering pipeline based on 2d-RAD can split and filter data in batches and automatically complete all analysis items that follow data filtering. This improves data-processing efficiency and server utilization, saves labor time, reduces human error, and ultimately shortens the overall project analysis cycle, achieving efficient automation of reference-free analysis with rich analysis content.

With reference to fig. 1, the population evolution analysis method based on the reference-free genome after 2d-RAD sequencing of the present invention comprises the following steps:

(1) data splitting step

The data are split automatically with a self-written script according to the barcode and the enzyme 1 and enzyme 2 cleavage-site information of the sequenced samples. In the input file, each line represents one sample, and its tab-separated columns are the sample name, the barcode bases, the enzyme 1 cleavage site and the enzyme 2 cleavage site. If a sample was sequenced in several runs, the analysis pipeline automatically matches and merges them, and the merged data are stored together in the 1_RawData folder in fastq.gz format.

The split script specifically comprises:

a library contains multiple samples; the four columns (sample name, barcode, enzyme 1 cleavage-site sequence and enzyme 2 cleavage-site sequence) serve as input file 1, and the library's raw paired-end fastq.gz data serve as input files 2 and 3;

if the first 7 bp at the 5' end of R1 match the barcode, the next 5 bases match the enzyme 1 cleavage site, and the first 4 bp at the 5' end of the corresponding R2 read match the enzyme 2 cleavage-site sequence, the read pair is assigned to that sample; after all reads have been looped over, the final split data of each sample are output.
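The matching rule above can be illustrated with a minimal Python sketch. It is not the patent's splitting script: the function names, the in-memory read pairs and the example sequences are hypothetical, and the sketch matches on whatever barcode and site lengths the sample sheet supplies rather than hard-coding 7, 5 and 4 bases.

```python
from collections import defaultdict


def load_sample_sheet(lines):
    """Parse tab-separated rows: sample name, barcode, enzyme 1 site, enzyme 2 site."""
    samples = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        name, barcode, site1, site2 = line.split("\t")
        samples.append((name, barcode.upper(), site1.upper(), site2.upper()))
    return samples


def assign_sample(r1_seq, r2_seq, samples):
    """Return the sample whose barcode and cut sites match the read pair, else None."""
    for name, barcode, site1, site2 in samples:
        # R1 must start with barcode + enzyme 1 site; R2 must start with enzyme 2 site.
        if r1_seq.startswith(barcode + site1) and r2_seq.startswith(site2):
            return name
    return None


def split_pairs(read_pairs, samples):
    """Group (r1_seq, r2_seq) pairs by matched sample; unmatched pairs are dropped."""
    by_sample = defaultdict(list)
    for r1_seq, r2_seq in read_pairs:
        name = assign_sample(r1_seq, r2_seq, samples)
        if name is not None:
            by_sample[name].append((r1_seq, r2_seq))
    return dict(by_sample)
```

A real script would stream the paired fastq.gz files record by record and append each matched pair to per-sample output files, which also gives a natural place to merge several sequencing runs of the same sample.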

(2) Data quality control and filtering step

Quality control is performed on the samples with FastQC through the self-written automatic filtering script filter_batch_v2.pl, and the data are filtered at the same time by base quality value (Q ≥ 20) and sequence length (≥ 50 bp). After the run finishes, all high-quality data are stored in 2_HQData in fastq.gz format.

The filter script is filter_batch_v2.pl:

the script first reads the paired-end sequence files ${name}_R1.fastq.gz and ${name}_R2.fastq.gz of the raw sequencer output in 1_RawData as input files, renames them, and runs quality control on them with the software FastQC to obtain a report of information such as the base quality of the raw data;

the software AdapterRemoval then takes the raw fastq.gz files as input and removes the sequencing adapters, and the newly generated result files are saved in fastq format in 2_HQData; these fastq files in turn serve as input to the sequence quality filtering program, which filters by quality with a sliding-window method using a window size of 5 bp and a step of 1 bp;

the window moves forward one base at a time, and the average Q value of the 5 bases in the window is calculated; if the average Q value of a window is less than or equal to 20, the read is truncated at that window, keeping only the base before the window's last base and the bases preceding it;

finally, if either read of a pair is less than or equal to 50 bp long, both reads of the pair are removed. The final results are output as ${name}_HQ_R1.fq and ${name}_HQ_R2.fq.
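The sliding-window filter and the paired length check can be sketched in Python as follows. This is an illustration under stated assumptions, not the contents of filter_batch_v2.pl: the function names are hypothetical, quality values are assumed to be already decoded to integers, and the exact truncation point (here, just before the last base of the first failing window) is one reading of the description.

```python
def sliding_window_trim(quals, window=5, step=1, min_avg_q=20):
    """Return the number of leading bases to keep for one read.

    quals is a list of per-base Phred quality values.
    """
    for start in range(0, len(quals) - window + 1, step):
        win = quals[start:start + window]
        if sum(win) / window <= min_avg_q:
            # Assumption: cut just before the last base of the failing window.
            return start + window - 1
    return len(quals)


def keep_pair(len_r1, len_r2, min_len=50):
    """A pair is discarded if either trimmed read is <= min_len bases long."""
    return len_r1 > min_len and len_r2 > min_len
```

For example, a read with ten bases at Q30 followed by five bases at Q10 first fails at the window starting at index 8 (average 18), so the first 12 bases are kept.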

(3) Single sample internal sequence clustering step

Because reference-free analysis has no reference genome, sequences are first clustered within each single sample. The paired-end sequencing data of a sample are merged into one file before clustering, clustering is then performed with the ustacks command of the Stacks software to obtain the representative catalog sequences of each sample, and the result files are stored in the 3_Stacks folder in tags.tsv.gz format.

(4) Catalog sequence clustering step of all samples

The grouping information of the samples is specified, and the catalog sequences of the individual samples are clustered to obtain consensus sequences for all samples; these consensus sequences serve as the reference genome sequences for all samples.

(5) Step of detecting group SNP

The grouping information specified for every sample is read, the missing-rate parameter is specified, the population SNP information is detected with the cstacks command of the Stacks software, and the population SNP information is stored as a VCF file.

(6) Analysis of population genetic parameters (Fst, π, heterozygosity, haplotype diversity)

Based on the population SNP information, population genetic parameters are analyzed with the populations command of Stacks, and the population differentiation index Fst, the nucleotide diversity π, the expected heterozygosity, the observed heterozygosity and the haplotype diversity are calculated.
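To make two of these statistics concrete, the following Python sketch computes observed and expected heterozygosity for a single biallelic SNP from diploid genotypes. It uses the textbook definitions (observed: fraction of heterozygous calls; expected: 2p(1 - p)) and is not a reimplementation of Stacks, whose estimators may apply additional corrections; the function name and genotype encoding are hypothetical.

```python
def site_heterozygosity(genotypes):
    """Return (observed, expected) heterozygosity for one biallelic SNP.

    genotypes: list of (a1, a2) tuples with alleles coded 0 (ref) or 1 (alt);
    None marks a missing genotype and is ignored.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    n_alleles = 2 * len(called)
    alt_count = sum(a1 + a2 for a1, a2 in called)
    p = alt_count / n_alleles
    expected = 2 * p * (1 - p)
    observed = sum(1 for a1, a2 in called if a1 != a2) / len(called)
    return observed, expected
```

Averaging these per-site values over all SNPs of a group gives a rough group-level summary comparable in spirit to the tables the pipeline reports.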

(7) Step of population PCA analysis

The VCF file of the population SNPs is converted in format with the software vcftools and plink, dimensionality reduction is performed on the SNPs with the software GCTA to obtain the three principal components with the greatest influence on the population, the contribution of each principal component is calculated, and a PCA plot is finally drawn with a self-written R script.

The self-written R script first reads the eigenvector information for PC1 and PC2 output by the GCTA software as its input, calculates the contribution rates of PC1 and PC2, and then draws a scatter plot with the ggplot2 package in R.
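The contribution-rate calculation can be sketched in Python (the patent's own script is written in R and is not reproduced here). The sketch assumes that each component's contribution is its eigenvalue divided by the sum of the supplied eigenvalues, and that the eigenvector rows are whitespace-separated with a family ID, a sample ID and then PC1, PC2 and so on; both the function names and these layout details are assumptions.

```python
def contribution_rates(eigenvalues):
    """Share of the summed eigenvalues carried by each principal component."""
    total = sum(eigenvalues)
    if total <= 0:
        raise ValueError("eigenvalues must sum to a positive number")
    return [value / total for value in eigenvalues]


def parse_eigenvec(lines):
    """Parse rows of 'family_id sample_id PC1 PC2 ...' into (sample_id, pc1, pc2)."""
    points = []
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or incomplete rows
        points.append((fields[1], float(fields[2]), float(fields[3])))
    return points
```

The resulting (sample, PC1, PC2) points, labeled with the two contribution rates on the axes, are what a scatter plot such as FIG. 2 would display.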

(8) Step of group phylogenetic tree analysis

The self-written script converts the format of the population SNP information and concatenates the SNP information of each sample, and a phylogenetic tree is then constructed using different models.
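One common way to perform such a conversion is to concatenate each sample's genotypes across all SNP sites into a single sequence, which tree-building programs can then read as an alignment. The Python sketch below shows this idea under stated assumptions: the function names and input layout are hypothetical, heterozygous genotypes are encoded with IUPAC ambiguity codes, missing genotypes become N, and FASTA is chosen as the output format; the patent does not specify these details.

```python
# IUPAC ambiguity codes for heterozygous biallelic genotypes.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def genotype_to_base(ref, alt, gt):
    """Map a diploid genotype such as (0, 1) to a single base symbol."""
    if gt is None:
        return "N"  # missing genotype
    alleles = {(ref, alt)[i] for i in gt}
    if len(alleles) == 1:
        return next(iter(alleles))
    return IUPAC[frozenset(alleles)]


def snps_to_fasta(sample_names, sites):
    """Concatenate SNP genotypes per sample.

    sites: list of (ref, alt, genotypes) with one genotype per sample,
    in the same order as sample_names.
    """
    seqs = {name: [] for name in sample_names}
    for ref, alt, gts in sites:
        for name, gt in zip(sample_names, gts):
            seqs[name].append(genotype_to_base(ref, alt, gt))
    return "".join(f">{name}\n{''.join(seqs[name])}\n" for name in sample_names)
```

With two sites (A/G and C/T) and three samples, a homozygous-reference, heterozygous and missing genotype at the first site yield A, R and N respectively.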

Common models for constructing evolutionary trees include Maximum Parsimony (MP), Neighbor-Joining (NJ), Maximum Likelihood (ML), and the Bayesian method (BI);

the MP model is suitable for long sequences whose sites show no back mutation or parallel mutation, with high sequence similarity, a large number of nucleotides or amino acids, and a stable substitution rate. The NJ model is suitable for short sequences with small evolutionary distances and few informative sites. When the evolutionary model is known, the ML method is the tree-building method that best fits the evolutionary facts. The BI model retains the basic principle of the maximum likelihood method and introduces the Markov chain Monte Carlo method; it is suitable for inferring a phylogenetic tree, evaluating its uncertainty, detecting selection, comparing phylogenetic trees, calculating divergence times with reference to fossil records, and testing for a molecular clock.

(9) Step of population genetic Structure analysis

The self-written script converts the population SNP format into the format required by the software STRUCTURE; the number of SNPs and the number of populations used in the analysis are then specified, and the percentage of each sample's ancestry assigned to each specified ancestor is calculated. The optimal K value (the number of ancestral populations) is then determined, and the result shows whether the grouping of the samples is consistent with the grouping originally specified.

For each K value, the simulation based on the Bayesian model yields a corresponding maximum likelihood value, which is output after taking the natural logarithm (ln likelihood). The larger the ln likelihood, the closer the K value is to the real situation, but as K increases the ln likelihood generally enters a plateau. The optimal K value is the K value corresponding to the inflection point of the plateau.
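The patent does not state a numerical rule for locating the inflection point, so the following Python sketch shows only one possible heuristic: choose the smallest K after which the gain in ln likelihood drops below a chosen fraction of the largest gain. The function name, the plateau_fraction parameter and the example values are hypothetical.

```python
def pick_k_at_plateau(lnl_by_k, plateau_fraction=0.1):
    """Pick the K at which the ln likelihood curve flattens.

    lnl_by_k maps each tested K to its ln likelihood. The gain attached to K
    is lnL(next K) - lnL(K); the first K whose gain is below
    plateau_fraction * (largest gain) is returned.
    """
    ks = sorted(lnl_by_k)
    gains = [lnl_by_k[b] - lnl_by_k[a] for a, b in zip(ks, ks[1:])]
    if not gains:
        return ks[0]
    largest = max(gains)
    if largest <= 0:
        return ks[0]  # the curve never rises, so the smallest K is kept
    for k, gain in zip(ks, gains):
        if gain < plateau_fraction * largest:
            return k
    return ks[-1]  # no plateau reached within the tested range
```

With illustrative values {1: -5000, 2: -4200, 3: -3900, 4: -3890, 5: -3885}, the gains are 800, 300, 10 and 5, so the curve flattens after K = 3. In practice the choice is usually confirmed by inspecting the curve and the ancestry plot rather than by a threshold alone.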

(10) Step of population-specific SNP analysis

The self-written script searches for SNPs shared by, and unique to, two groups based on the population SNP information and the designated group information.

First, the original SNPs are filtered according to genotype missingness and the sequencing depth of the SNP sites. The population specificity of a SNP is then defined by two thresholds (A and B): the frequency of the SNP in the target population must be higher than threshold A, and its frequency in the non-target population must be lower than threshold B. The threshold is generally set to 0.8.
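The two-threshold rule can be sketched in Python as follows. The patent's own script is not shown, so the function names and the genotype encoding are hypothetical, and "frequency" is taken here to mean the alternate-allele frequency within a group, which is an assumption.

```python
def alt_allele_frequency(genotypes):
    """Alternate-allele frequency among called diploid genotypes (None = missing)."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    return sum(a1 + a2 for a1, a2 in called) / (2 * len(called))


def classify_snp(freq_first, freq_second, threshold_a, threshold_b):
    """Label a SNP as specific to one of two groups, or as not specific.

    A SNP is specific to a group when its frequency there exceeds threshold_a
    while its frequency in the other group is below threshold_b.
    """
    if freq_first > threshold_a and freq_second < threshold_b:
        return "specific_to_first"
    if freq_second > threshold_a and freq_first < threshold_b:
        return "specific_to_second"
    return "not_specific"
```

For instance, classify_snp(0.9, 0.1, threshold_a=0.8, threshold_b=0.2) labels the SNP as specific to the first group; the value 0.8 follows the text, while 0.2 is an illustrative choice for threshold B.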

Based on the steps, the invention has the advantages that:

(1) the whole data analysis is more automatic, so that the labor cost is saved, the analysis efficiency is improved, and possible human errors are avoided.

(2) The analysis content is richer, and the graph of the analysis result is more attractive (as shown in figures 2-4).

Claims

1. A population evolution analysis method based on a reference-free genome after 2d-RAD sequencing is characterized by comprising the following steps:

the eighth step: converting the format of the population SNP information and concatenating the SNP information of each single sample with a self-written Perl script, and then constructing a phylogenetic tree using different models;

the ninth step:

converting the population SNP format into the format required by the software STRUCTURE with a self-written Python script, then specifying the number of SNPs and the number of populations used in the analysis, and calculating the percentage of each sample's ancestry assigned to each specified ancestor;

the tenth step:

2. The method for population evolution analysis based on a reference-free genome after 2d-RAD sequencing according to claim 1, wherein the filter script is filter_batch_v2.pl.

3. The method for population evolution analysis based on a reference-free genome after 2d-RAD sequencing according to claim 1, wherein the model for constructing the phylogenetic tree comprises any one or more of Maximum Parsimony (MP), Neighbor-Joining (NJ), Maximum Likelihood (ML) or the Bayesian method (BI).

4. The method of claim 1, wherein the optimal K value is the K value corresponding to the inflection point after the ln likelihood enters the plateau.