Disclosure of Invention
The invention aims to provide a system and method for detecting variation from high-throughput genome sequencing with a low input starting amount, which effectively improve low-frequency detection capability while avoiding false negatives and false positives caused by experimental or sequencing artifacts.
In view of this, the scheme of the invention is as follows:
In a first aspect of the present invention, a high-throughput genome sequencing variation detection system based on a low input starting amount is provided, comprising:
a data preprocessing module for aligning the sequencing data to a reference genome to obtain an alignment result;
a mutation detection module for obtaining mutation data based on the alignment result, comprising a first detection module and a second detection module, wherein the first detection module obtains first mutation data containing mutation sites, and the second detection module detects structural variation and identifies second mutation data containing single nucleotide mutations and indels;
a mutation integration and filtering module for integrating and filtering the mutation data, wherein the integration takes the union of the first mutation data and the second mutation data to obtain third mutation data, and assigns each mutation in the third mutation data a weighted score according to how much each mutation information parameter contributes to mutation credibility;
and a detection output module for outputting the mutation data obtained after final filtering.
Further, the first detection module uses Mutect2 and Vcftools to identify and process mutation sites, and/or the second detection module uses Manta and Strelka2 to detect structural variation and identify single nucleotide mutations and indels.
Further, the mutation information parameters are divided, according to how they contribute to mutation credibility, into supporting parameters and opposing parameters. The supporting parameters comprise mutation depth, mutation frequency, and the number of read pairs sequenced through the mutation; the opposing parameters comprise sequencing depth, mutation source, alignment quality, unique alignment, and regional complexity. The mutation source records whether a mutation comes from the first mutation data, the second mutation data, or both.
Preferably, the weighted scoring further comprises an additional overall score that depends on how many of the four items (mutation source, alignment quality, unique alignment, regional complexity) incur a penalty.
Further, the system is suited to sequencing data with a sequencing depth of 200X or more and a mutation frequency of 10% or less, and is particularly suited to sequencing with a depth above 10000X and a mutation frequency of 1% or less.
Further, the detection system is suitable for sequencing in which the initial DNA input is 100 ng or less, more preferably 50 ng or less.
Further, low-quality data is removed from the sequencing data before alignment, including but not limited to trimming residual adapter sequences, trimming runs of consecutive low-quality bases, discarding low-quality reads, and discarding reads that are too short.
Further, the filtering process further comprises filtering with a local frequency library, and/or filtering out mutations that have no actual functional effect, and/or filtering according to mutation authenticity as verified by inspecting the BAM file.
In a second aspect of the present invention, a high-throughput genome sequencing variation detection method based on a low input starting amount is provided, comprising:
aligning the sequencing data to a reference genome to obtain an alignment result;
obtaining mutation data based on the alignment result, including obtaining first mutation data containing mutation sites, and detecting structural variation and identifying second mutation data containing single nucleotide mutations and indels;
integrating and filtering the mutation data, wherein the integration takes the union of the first mutation data and the second mutation data to obtain third mutation data, and assigns each mutation in the third mutation data a weighted score according to how much each mutation information parameter contributes to mutation credibility;
and outputting the mutation data obtained after final filtering.
In a third aspect of the present invention, an electronic device is provided, including a processor and a memory, where the memory stores a computer program, and the processor implements the mutation detection method according to the second aspect when executing the computer program.
In a fourth aspect of the present invention, a computer-readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the mutation detection method according to the second aspect.
Compared with the prior art, the invention has the beneficial effects that:
The mutation detection system provided by the invention both identifies mutation sites and detects structural variation, so that low-frequency mutations are detected as completely as possible and the false negative rate is reduced. It then combines the two sets of mutation data and scores each mutation with weights reflecting how much each mutation information parameter contributes to credibility, so that credible mutations are retained and false positives are screened out. Overall, the system improves low-frequency detection capability while avoiding false negatives and false positives caused by experimental or sequencing artifacts.
When used for molecular diagnostics, the mutation detection system maximizes detection precision and recall, and maintains a low limit of detection even when the initial DNA input is low.
Detailed Description
The following provides definitions of some of the terms used in this specification. Unless otherwise defined, all terms used herein are intended to have the meanings commonly understood by those skilled in the art to which the present scheme pertains.
Term interpretation:
Low-input starting amount library: library construction generally uses an initial DNA input of 100 ng or more; a library built from about 50 ng of input DNA is called a low-input starting amount library.
Sequencing depth: also referred to as DP. In a VCF file, DP denotes the total sequence coverage at a position, i.e., the number of times the site was sequenced, expressed in units of X.
Mutation frequency: also called AF, the ratio of the coverage depth of a given mutant sequence at a site to the total sequence coverage depth at that site. Using the VCF fields as an example, DP is the total sequence coverage depth at the position and AD is the coverage depth of the mutant sequence, so AF = AD / DP.
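By way of non-limiting illustration, the AF definition above can be written as a one-line calculation. The function name `allele_frequency` is hypothetical; the input values are those of the chromosome 12 variant in the Examples section.

```python
def allele_frequency(ad: int, dp: int) -> float:
    """Return AF = AD / DP for one site.

    ad: coverage depth of the mutant sequence (AD)
    dp: total sequence coverage depth at the site (DP)
    """
    if dp <= 0:
        raise ValueError("DP must be a positive integer")
    return ad / dp


# Chromosome 12 variant from the Examples section: AD = 103, DP = 276.
af = allele_frequency(103, 276)
print(f"AF = {af:.3f}")
```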
VCF: the VCF (Variant Call Format) is a common format for recording DNA sequence variation, widely used in genomics research, genetic variation analysis, and bioinformatics. It is a text format used primarily to store and describe SNPs (single nucleotide polymorphisms), indels (insertions or deletions), and other types of DNA sequence variation in one or more samples. A VCF file typically consists of three parts: metadata, which includes the file format version, sample information, the reference genome, and so on; a header, a series of annotation lines beginning with "#" that describe the meaning of each column of the file; and the variant records, arranged in columns including chromosome, position, reference sequence, variant sequence, quality score, filter status, and annotation information. The format can record variation for multiple samples simultaneously and supports comparison and statistical analysis of variant sites across samples. It also provides a standard file format for genomics research, which facilitates data sharing and processing.
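As a rough illustration of the layout just described, the sketch below reads the eight fixed VCF columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) from tab-separated text and skips lines starting with "#". The names `VcfRecord` and `parse_vcf` are hypothetical, the example record is constructed for illustration only, and multi-allelic ALT values and per-sample columns are deliberately not handled.

```python
from typing import Dict, List, NamedTuple


class VcfRecord(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str
    filter_status: str
    info: Dict[str, str]


def parse_vcf(lines: List[str]) -> List[VcfRecord]:
    """Parse the fixed columns of VCF data lines, ignoring '#' lines."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # metadata ("##...") and the column header ("#CHROM...")
        chrom, pos, _id, ref, alt, _qual, flt, info = line.split("\t")[:8]
        info_map = {}
        for item in info.split(";"):
            key, _, value = item.partition("=")  # flag entries get value ""
            info_map[key] = value
        records.append(VcfRecord(chrom, int(pos), ref, alt, flt, info_map))
    return records


# Illustrative input only; not taken from a real call set.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "12\t112888189\t.\tG\tA\t.\tPASS\tDP=276",
]
records = parse_vcf(example)
print(records[0].chrom, records[0].pos, records[0].info["DP"])
```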
False positive detection: a variation or mutation that does not actually exist is erroneously reported as present. This may result from experimental errors, data processing errors, sample degradation, or other factors.
False negative detection: a variation or mutation that actually exists is not detected, i.e., is missed.
Variation grading: variations are graded into three classes according to the guideline for the interpretation of tumor gene variation jointly issued by the Association for Molecular Pathology (AMP), the American Society of Clinical Oncology (ASCO), and the College of American Pathologists (CAP). Class 1 refers to variation with clear clinical significance, class 2 to variation with potential clinical significance, and class 3 to variation with unknown clinical significance.
BAM: currently the most common storage format for alignment data in genetic data analysis. It is a binary file format for storing large-scale sequencing data, in particular the alignment of reads to a reference genome.
To ensure accurate high-throughput detection of genetic variation in low-input starting amount libraries (about 50 ng), a method and a system are provided for performing variation detection on such libraries from high-throughput genome sequencing data, which effectively improve low-frequency detection capability while avoiding false negatives and false positives caused by experimental or sequencing artifacts.
The inventors found that, for mutations with a sequencing depth of 200X or more and a mutation frequency of 10% or less, using Mutect2 alone does not ensure good detection and yields a large number of false positives. False positives may mislead diagnosis and treatment, so their influence must be eliminated when analyzing and interpreting NGS results. When the sequencing depth exceeds 10000X and the mutation frequency is below 1%, there are not only many false positives, but false negatives also become more common. False negatives may cause a patient to miss critical diagnostic information or treatment opportunities, so the false negative rate must be reduced to ensure accurate and reliable NGS results.
To address this problem systematically, low-frequency variation must first be detected as completely as possible, thereby reducing the false negative rate. To achieve this, the mutation detection software (Mutect2) must be tuned to be as sensitive as possible. However, in early tests, more than ten parameter combinations were tried and none detected all of the variations in the reaction system. After investigation, we speculate that the default statistical test of Mutect2 cannot accurately distinguish true low-quality variations from false ones, and that under its more conservative behavior the software loses part of the ultra-low-frequency variations (mutation frequency below one in a thousand). We therefore introduced a second tool, Strelka2, into the bioinformatics detection system. Strelka2 is open-source software developed by Illumina, with an average running speed about 17-22 times that of Mutect2. Combined with the results of Manta, it detects low-frequency variation more readily. It can also be trained on a true (false) negative dataset, helping to ensure that mutations in the reaction system are detected as completely as possible.
Note that current bioinformatics software for mutation detection is not limited to Mutect2 and Strelka2; other well-known somatic mutation detection tools include VarScan, VarDict, and DeepVariant. According to the published literature and industry experience to date, however, the other tools are far less efficient (in both accuracy and run time) than Mutect2 and Strelka2, so to keep the system as lean as possible we use only Mutect2 and Strelka2 for mutation detection.
By taking the intersection and union, by locus (chromosomal position), of the mutations detected by the two tools, the mutations can be divided into three sets: Mutect2-only, Strelka2-only, and common. Each mutation is scored with the scoring equation corresponding to its set, and the score is used for mutation filtering.
Once false negatives are largely resolved (in theory they cannot be eliminated entirely and can only be reduced through iterative tuning), the subsequent analysis focuses on false positives. As described above, mutations are scored, filtered in multiple passes against open-source databases and the self-built database, and annotated with tags. Finally, the results are output with a unified header style and uploaded to the reporting system, so that the self-built database and the positive/negative training set for Strelka2 can be maintained and iteratively upgraded over the long term.
The low-input, high-throughput genetic variation detection process can be summarized as four modules: a preprocessing module, a variation result integration module, a variation (mutation) filtering module, and a detection output module, which respectively preprocess the genome sequencing data, integrate the variation results, filter the variations (mutations), and output the filtered credible variation results. The flow chart is shown in figure 1. The specific steps are as follows:
1. Data preprocessing
(1) Genome sequencing data preprocessing
Using the bcl2fastq software and the index sequence assigned to each sample before sequencing, the raw off-instrument data is split by sample into FASTQ format data.
(2) Raw data processing
The raw FASTQ data contains some low-quality data that can interfere with subsequent analysis and therefore needs to be removed. Quality processing is performed with the fastp software: trimming residual adapter sequences, trimming runs of consecutive low-quality bases, discarding low-quality reads, and discarding reads that are too short.
(3) Alignment of FASTQ sequences to the reference genome
Reads are mapped to the human reference genome hg19 according to the agreement of their base sequence with the reference, generating a BAM file of alignment results.
2. Integrating the variant results
(4) Somatic mutation detection using Mutect2 + Vcftools + VEP
Somatic mutation detection using Mutect2 in combination with Vcftools and VEP is a classical procedure. First, Mutect2 compares the sample data against the reference genome to identify potential mutation sites. The VCF file is then processed and filtered with Vcftools to improve the reliability of the calls. Finally, the mutations are functionally annotated and interpreted with the Variant Effect Predictor (VEP), which helps determine which mutations may be biologically significant.
(5) Somatic mutation detection using Manta + Strelka2 + VEP
Manta is used to detect structural variations such as insertions, deletions, and inversions. Strelka2 is then used to identify single nucleotide mutations and small indels. These mutations are then functionally annotated and interpreted with VEP to help determine their potential impact. Integrating these three tools captures somatic mutation information comprehensively, so that variation in the genome can be analyzed and understood effectively.
3. Filtering variation (mutation)
(6) Integration of variants and scoring
① The mutation files from the two sources (VCF format) are read separately to extract the basic information of each mutation (chromosome, position, base before mutation, and base after mutation), giving two sets whose elements are this basic information. The intersection and union of the two sets are then computed, and finally every mutation reported by either tool is labeled with its source.
The labeling method is as follows: every mutation reported by the two tools is assigned one of three sources, namely A if the mutation comes only from Mutect2, B if it comes only from Strelka2, and C if it is detected by both tools.
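The source labeling described above amounts to ordinary set operations. The following is a minimal sketch under that reading; the function name `label_sources` and the tuple layout are assumptions for illustration, and the input variants are those listed in the Examples section.

```python
from typing import Dict, Set, Tuple

# Basic mutation information: (chromosome, position, REF, ALT)
Variant = Tuple[str, int, str, str]


def label_sources(mutect2: Set[Variant], strelka2: Set[Variant]) -> Dict[Variant, str]:
    """Label each variant A (Mutect2 only), B (Strelka2 only) or C (both)."""
    common = mutect2 & strelka2   # intersection: found by both tools
    everything = mutect2 | strelka2  # union: found by at least one tool
    labels = {}
    for variant in everything:
        if variant in common:
            labels[variant] = "C"
        elif variant in mutect2:
            labels[variant] = "A"
        else:
            labels[variant] = "B"
    return labels


mutect2_calls = {
    ("12", 112888189, "G", "A"),
    ("16", 50827573, "C", "A"),
    ("X", 41201995, "T", "C"),
}
strelka2_calls = {
    ("10", 871189, "T", "C"),
    ("12", 112888189, "G", "A"),
    ("X", 41201995, "T", "C"),
}
labels = label_sources(mutect2_calls, strelka2_calls)
for variant in sorted(labels):
    print(variant, labels[variant])
```

Keying on the full (chromosome, position, REF, ALT) tuple means that two different alternate alleles at the same position are treated as distinct mutations.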
② Collecting evidence items from BAM
First, based on the basic mutation information obtained in the previous step, the pandas module of Python is used to obtain the read records (i.e., alignments) corresponding to each mutation in the BAM file (note that all required evidence items are obtained from the BAM file). Each read record is then analyzed in turn to collect the evidence items needed to evaluate mutation reliability. Each evidence item is given a logically consistent rule for adding or deducting points: supporting evidence scores positive and opposing evidence scores negative. The higher the final score, the more reliable the mutation, and the scores are divided into grades.
The evidence items comprise supporting evidence and opposing evidence, as follows:
I. Supporting evidence:
Mutation frequency (AF): the allele fraction of a mutation in the tumor sample, indicating its relative abundance. The larger the share of all reads in which the mutation is detected, the more likely the mutation is real. The score should be positive and monotonically increasing, and the smaller the AF, the smaller the score gained, i.e., the slope is proportional to AF;
Number of read-through read pairs: the more read pairs that are sequenced through the mutation, the more likely the mutation is real. The score should be positive and monotonically increasing, and the higher the count, the smaller the marginal gain, i.e., the slope is inversely proportional to the number of read-through read pairs;
Allele depth (AD): the more reads in which the mutation is detected, the more likely the mutation is real. The score should therefore be positive and monotonically increasing, with a small weight when AD is small (AD <= 8), i.e., the slope is proportional to AD.
II. Opposing evidence:
Sequencing depth (DP): the sequencing depth covering the position. Higher coverage generally means more reliable mutation detection;
Alignment quality (high_MQ_reads): if the average MQ of the reads carrying the mutation is at least 50% of the average MQ of the other reads, and the average MQ of the mutation-carrying reads is > 10, no points are deducted for this item; otherwise points are deducted;
Unique alignment (Uniq): whether the mutation lies in a uniquely aligned region. If 80% of the reads have no XA tag, the tag is True;
Regional complexity (Complex): whether the mutation lies within a complex sequence region (i.e., whether the sequence around the mutation is complex);
Mutation source: a mutation detected by both tools at the same time is more likely to be real, and the tag is True; otherwise a penalty is applied.
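The high_MQ_reads and Uniq rules above can be sketched on plain Python records rather than on a real BAM file. The `Read` class, its fields, and both function names are hypothetical. The sketch also assumes that the 80% threshold is inclusive and applies to all reads overlapping the site, and that a site with no non-mutant reads passes the relative MQ test; the source does not state these details.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class Read:
    mapq: int               # mapping quality (MQ) of the alignment
    supports_variant: bool  # True if the read carries the mutation
    has_xa: bool            # True if the read has an XA (alternative hit) tag


def high_mq_reads(reads: List[Read]) -> bool:
    """True (no deduction) if mutant reads have mean MQ > 10 and at least
    50% of the mean MQ of the other reads."""
    mutant = [r.mapq for r in reads if r.supports_variant]
    other = [r.mapq for r in reads if not r.supports_variant]
    if not mutant:
        return False
    mutant_mean = mean(mutant)
    if mutant_mean <= 10:
        return False
    if not other:
        return True  # assumption: nothing to compare against
    return mutant_mean >= 0.5 * mean(other)


def is_unique(reads: List[Read], min_fraction: float = 0.8) -> bool:
    """True if at least min_fraction of the reads carry no XA tag."""
    if not reads:
        return False
    without_xa = sum(1 for r in reads if not r.has_xa)
    return without_xa / len(reads) >= min_fraction


good_site = [Read(60, True, False)] * 4 + [Read(60, False, False)] * 5 + [Read(20, False, True)]
poor_site = [Read(8, True, True)] * 3 + [Read(60, False, False)] * 2
print(high_mq_reads(good_site), is_unique(good_site))
print(high_mq_reads(poor_site), is_unique(poor_site))
```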
③ The evidence items mentioned in the above steps are scored; the scoring details are as follows:
④ After the evidence items have been scored in turn, the number of the four items high_MQ_reads, Uniq, Complex, and mutation source that are not satisfied is determined, and an additional score is applied:
Evidence items | Additional scoring
None of the four items is satisfied | Additional penalty of 30
Any three items are not satisfied | Additional penalty of 20
Any two items are not satisfied | Additional penalty of 10
None of the four items is penalized | Additional sub-10
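The additional scoring rule can be sketched as a lookup on the number of unsatisfied items. The function name `additional_score` is hypothetical. The last row of the table is ambiguous in the source, so this sketch makes an explicit assumption: when zero or one item is unsatisfied, the additional score is 0, which is what the worked example in the Examples section shows.

```python
def additional_score(source_ok: bool, high_mq_ok: bool,
                     uniq_ok: bool, complex_ok: bool) -> int:
    """Additional score from the number of unsatisfied items.

    Each argument is True when the corresponding item (mutation source,
    high_MQ_reads, Uniq, Complex) is satisfied, i.e. not penalized.
    """
    failed = [source_ok, high_mq_ok, uniq_ok, complex_ok].count(False)
    penalties = {4: -30, 3: -20, 2: -10}
    # Assumption: 0 or 1 unsatisfied items give no additional score.
    return penalties.get(failed, 0)


# Two items unsatisfied (source and Uniq), as for the chromosome 10 variant.
print(additional_score(False, True, False, True))
# Nothing unsatisfied, as for the chromosome 12 variant.
print(additional_score(True, True, True, True))
```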
⑤ Mutations are classified into five grades according to their scores:
Grade E, below 0, false variation;
Grade D, 0-40, possibly false variation;
Grade C, 40-80, variation of uncertain authenticity;
Grade B, 80-120, possibly true variation;
Grade A, 120 or more, credible variation;
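The grading can be sketched as a chain of threshold comparisons. The function name `grade` is hypothetical. Two assumptions are made that the source does not state: grade C covers the gap between grades D and B (40 to 80), and each lower boundary belongs to the higher grade (so a score of exactly 40 is grade C).

```python
def grade(score: float) -> str:
    """Map a total score to a grade from E (false) to A (credible)."""
    if score < 0:
        return "E"
    if score < 40:
        return "D"
    if score < 80:   # assumption: grade C spans 40 up to 80
        return "C"
    if score < 120:
        return "B"
    return "A"


# Total scores from the worked example in the Examples section.
for total in (65, 135, 105):
    print(total, grade(total))
```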
(7) Annotating local frequencies from a local frequency library
Construction of a local frequency library:
In the above steps, each mutation in a sample is described by its position (chromosome number and base position) together with the base before mutation (REF) and the base after mutation (ALT) at that position. For the samples in a large cohort (e.g., 500 samples), all mutations in all samples can be tallied and the number of samples in which each mutation occurs counted (e.g., 250); the local frequency of that mutation in the cohort's local frequency library is then 250/500 = 0.5. That is, for a fixed local frequency library, each mutation corresponds to one local frequency.
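A minimal sketch of this construction is shown below, assuming each sample is represented as a set of (chromosome, position, REF, ALT) tuples. The function name `local_frequencies` and the four-sample toy cohort are illustrative only.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chromosome, position, REF, ALT)


def local_frequencies(samples: List[Set[Variant]]) -> Dict[Variant, float]:
    """Fraction of samples in the cohort that carry each mutation."""
    counts: Counter = Counter()
    for variants in samples:
        counts.update(variants)  # sets: each mutation counts once per sample
    total = len(samples)
    return {variant: n / total for variant, n in counts.items()}


# Toy cohort of four samples (illustrative data only).
recurrent = ("16", 50827573, "C", "A")
private = ("12", 112888189, "G", "A")
cohort = [{recurrent, private}, {recurrent}, set(), set()]
library = local_frequencies(cohort)
print(library[recurrent], library[private])
```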
(8) Filtering variations
① Filtering using a local frequency library
The local frequency library records, for each mutation, the number of local detections as a fraction of the total number of samples analyzed. If a mutation's local frequency is greater than 20% and the mutation is not in the common whitelist or blacklist, it is considered a false mutation introduced by the reaction system and is labeled LocalDB.
② Filtering according to scoring
Variations graded A or B are retained; variations graded C, D, or E are filtered out, and these low-confidence variations are marked LowQ.
③ Filtering according to mutation function
Variations that have no actual effect on function are removed and marked DontAffectFunc;
④ Checking the BAM file to verify the authenticity of the variation
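Filtering steps ① to ③ of step (8) can be sketched as a function that collects tags for one variation; the BAM check in step ④ is an inspection step and is not covered. The function name `filter_tags` and its parameters are hypothetical, and the local frequency of 0.35 used below is an invented illustrative value. The PASS tag for variations with no filter tag follows the tables in the Examples section.

```python
from typing import List


def filter_tags(grade: str, local_freq: float, affects_function: bool,
                in_white_or_black_list: bool = False,
                max_local_freq: float = 0.20) -> List[str]:
    """Return the filter tags for one variation, or ["PASS"] if none apply."""
    tags = []
    # ① local frequency library: frequent locally and not in a curated list
    if local_freq > max_local_freq and not in_white_or_black_list:
        tags.append("LocalDB")
    # ② score-based filter: only grades A and B are retained
    if grade not in ("A", "B"):
        tags.append("LowQ")
    # ③ functional filter: no actual effect on function
    if not affects_function:
        tags.append("DontAffectFunc")
    return tags or ["PASS"]


print(filter_tags("A", 0.01, True))
print(filter_tags("C", 0.01, True))
print(filter_tags("B", 0.35, True))  # 0.35 is an illustrative value
```

Returning all applicable tags, rather than stopping at the first, keeps the reason for each rejection visible in the output.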
4. Outputting the result
The untrustworthy mutations marked in step (8) are filtered out, and the remaining trustworthy mutations are reported and assessed. Each mutation is interpreted according to its reports in the literature; the interpretation includes, but is not limited to, providing a reference for disease diagnosis, prognosis, recurrence, and treatment.
Examples
After DNA sequencing data from one pediatric tumor patient was received, the following operations were performed:
(1) Genome sequencing data preprocessing
The raw data statistics are as follows:
Sample | Raw_reads | Raw_bases
Test1 | 146,875,740 | 22,178,236,740
(2) Raw data processing
After quality control, a high-quality sequence is obtained, and the data statistics are as follows:
Sample | Clean_reads | Clean_bases | Q20(%) | Q30(%) | Clean bases (%)
Test1 | 144,485,354 | 20,945,984,384 | 98.17% | 94.91% | 94.44
(3) Alignment of FASTQ sequences to the reference genome
The alignment of the sequence data with the human reference genome hg19 is as follows:
(4) Somatic mutation detection using Mutect2 + Vcftools + VEP
CHR | POS | REF | ALT | AD | AF | Read-through read pairs | DP
12 | 112888189 | G | A | 103 | 0.373 | 176 | 276
16 | 50827573 | C | A | 87 | 0.026 | 103 | 193
X | 41201995 | T | C | 32 | 0.1 | 221 | 321
(5) Somatic mutation detection using Manta + Strelka2 + VEP
CHR | POS | REF | ALT | AD | AF | Read-through read pairs | DP
10 | 871189 | T | C | 56 | 0.172 | 226 | 326
12 | 112888189 | G | A | 103 | 0.373 | 176 | 276
X | 41201995 | T | C | 32 | 0.1 | 221 | 321
(6) Integration of variants and scoring
CHR | 10 | 12 | 16 | X
POS | 871189 | 112888189 | 50827573 | 41201995
REF | T | G | C | T
ALT | C | A | A | C
AD | 40 | 40 | 40 | 40
AF | 50 | 50 | 50 | 50
Read-through read pairs | 45 | 45 | 45 | 45
DP | 0 | 0 | 0 | 0
Source of variation | -30(B) | 0(C) | -30(A) | 0(C)
High MQ reads | 0 | 0 | 0 | -30
Uniq | -30 | 0 | 0 | -30
Complex | 0 | 0 | 0 | 0
Additional scoring | -10 | 0 | 0 | -10
Total score | 65 | 135 | 105 | 65
Rating | C | A | B | C
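The total scores in the table above are the sums of the per-item scores, which the short check below reproduces. The dictionary layout and key names are illustrative; the numbers are copied from the table.

```python
# Per-item scores for each variation, keyed by chromosome, in table order:
# AD, AF, read-through read pairs, DP, source, High MQ reads, Uniq,
# Complex, additional scoring.
item_scores = {
    "10": [40, 50, 45, 0, -30, 0, -30, 0, -10],
    "12": [40, 50, 45, 0, 0, 0, 0, 0, 0],
    "16": [40, 50, 45, 0, -30, 0, 0, 0, 0],
    "X": [40, 50, 45, 0, 0, -30, -30, 0, -10],
}

totals = {chrom: sum(scores) for chrom, scores in item_scores.items()}
for chrom, total in totals.items():
    print(chrom, total)
```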
(7) Filtering mutations
CHROM | POS | REF | ALT | Total score | Rating | Variant tag
10 | 871189 | T | C | 65 | C | LowQ
12 | 112888189 | G | A | 135 | A | PASS
16 | 50827573 | C | A | 105 | B | LocalDB
X | 41201995 | T | C | 65 | C | LowQ
(8) Outputting the result
CHROM | POS | REF | ALT | Total score | Rating | Variant tag
12 | 112888189 | G | A | 135 | A | PASS
Although the present disclosure is disclosed above, the scope of the present disclosure is not limited thereto. Various changes and modifications may be made by one skilled in the art without departing from the spirit and scope of the disclosure, and these changes and modifications will fall within the scope of the disclosure.