CN119152934B - High-throughput genome sequencing variation detection system and method based on low input starting amount - Google Patents

Info

Publication number
CN119152934B
Authority
CN
China
Prior art keywords
variation
data
mutation
detection
sequencing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411204402.3A
Other languages
Chinese (zh)
Other versions
CN119152934A (en
Inventor
李天一
柳佳琦
罗玉洁
陈莉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Xinnuo Baishi Medical Laboratory Co ltd
Original Assignee
Shanghai Xinnuo Baishi Medical Laboratory Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Xinnuo Baishi Medical Laboratory Co ltd
Priority to CN202411204402.3A
Publication of CN119152934A
Application granted
Publication of CN119152934B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/50Mutagenesis
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30Data warehousing; Computing architectures
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a high-throughput genome sequencing variation detection system and method based on a low input starting amount. The variation detection system both identifies mutation sites and detects structural variation, so that low-frequency variation is detected as completely as possible and the false negative detection rate is reduced. The two resulting sets of variation data are then merged, and each merged variation is given a weighted score according to how much its variation information parameters contribute to its credibility, so that credible variation is screened out and false positive detection is avoided. As a whole, the system effectively improves low-frequency detection capability while avoiding false negatives and false positives caused by experimental or sequencing effects.

Description

High-throughput genome sequencing variation detection system and method based on low input initial quantity
Technical Field
The invention belongs to the technical field of genome variation detection, and particularly relates to a high-throughput genome sequencing variation detection system and method based on low input initial quantity.
Background
In the early stages of product development, application scenarios, technical details and costs are all key factors that need to be considered. For low-input starting amount libraries, second-generation sequencing combined with bioinformatics software is generally chosen to detect low-frequency mutations.
High-throughput genome sequencing technology (also known as second-generation sequencing technology) has made significant progress in the last decade and has also brought major breakthroughs in single nucleotide variation detection:
1. More comprehensive variation detection: high-throughput genome sequencing can sequence the DNA of the whole genome or of a specific region and provides comprehensive variation information (including single nucleotide variation, insertion/deletion variation, copy number variation and the like);
2. Lower cost: current sequencers have a higher throughput per run, and the output of a single run can reach trillions of bases (T level);
3. Data analysis support: the enormous amount of data generated by high-throughput genome sequencing requires powerful data analysis and interpretation capabilities. A large number of bioinformatics tools and databases have been developed for mutation annotation, functional prediction and functional interpretation of high-throughput sequencing data, providing more support and guidance to scientists.
Second-generation sequencing technology was first applied in scientific research, where the sequencing depth is not high, sequencing material is easy to obtain, and the initial DNA input is generally more than 100 ng. Tumor samples in clinical screening pose two additional challenges: tumor material is not readily available, and the extremely high sequencing depth used in clinical screening is often accompanied by a high duplication rate (DUP). The depth is needed because the frequency of somatic mutations in tumors is low (typically 0.5% or less), and reads carrying the mutation are more readily obtained at higher sequencing depths. To reach this extremely high sequencing depth, the library is amplified repeatedly by PCR to amplify the signal, even though redundant reads are introduced in the process. In general, the more rounds of PCR, the more redundant reads are introduced; the larger the initial input of the sequencing library, the fewer PCR rounds are needed.
In general, for higher mutation frequencies (≥20%), a sequencing depth of ≥200X is sufficient to identify 95% of the mutations; for lower mutation frequencies (≤10%), the detection system itself should be improved rather than simply increasing the sequencing depth. This also means that for high-throughput detection of low-input starting amount sequencing libraries, ensuring detection accuracy remains both difficult and necessary.
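The relationship between sequencing depth, mutation frequency and the chance of observing enough mutant reads can be illustrated with a simple binomial model. This is only a sketch under strong assumptions that are not part of the original text: reads are treated as independent draws, sequencing errors and PCR duplicates are ignored, and the threshold of 8 mutant reads is a hypothetical value chosen for illustration.

```python
from math import comb


def prob_at_least(depth: int, af: float, min_alt_reads: int) -> float:
    """P(X >= min_alt_reads) for X ~ Binomial(depth, af).

    depth         : total number of reads covering the site (DP)
    af            : true mutation frequency at the site
    min_alt_reads : hypothetical minimum number of mutant reads needed
    """
    # Sum the probability of seeing fewer than min_alt_reads mutant reads,
    # then take the complement.
    lower_tail = sum(
        comb(depth, k) * af**k * (1.0 - af) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - lower_tail


# Illustrative comparison of three depth / frequency combinations.
for depth, af in [(200, 0.20), (200, 0.005), (10000, 0.005)]:
    print(depth, af, prob_at_least(depth, af, 8))
```

Under this toy model a 20% mutation is almost always covered by enough mutant reads at 200X, whereas a 0.5% mutation almost never is at 200X and needs a depth on the order of 10000X.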
Disclosure of Invention
The invention aims to provide a high-throughput genome sequencing variation detection system and method based on low input initial quantity, which can effectively improve the low-frequency detection capability and avoid false negative/positive caused by experimental or sequencing influence.
In view of this, the scheme of the invention is as follows:
In a first aspect of the present invention, a high throughput genomic sequencing variation detection system based on low input initiation amount is presented, comprising:
the data preprocessing module, which is used for aligning the sequencing data to a reference genome to obtain an alignment result;
the mutation detection module, which is used for obtaining mutation data based on the alignment result and comprises a first detection module and a second detection module, wherein the first detection module is used for obtaining first mutation data containing mutation sites, and the second detection module is used for detecting structural variation and identifying second mutation data containing single nucleotide mutation and indel results;
the mutation integration and filtering module, which is used for integrating and filtering the mutation data, wherein the integration process takes the union of the first mutation data and the second mutation data to obtain third mutation data and gives each mutation in the third mutation data a weighted score according to the contribution of the mutation information parameters to mutation credibility;
and the detection output module, which is used for outputting the mutation data obtained by the final filtering.
Further, the first detection module uses Mutect2 and Vcftools to identify and process mutation sites, and/or the second detection module uses Manta and Strelka2 to detect structural variation and identify single nucleotide mutations and indels.
Further, the mutation information parameters are divided into supporting parameters and objection parameters according to their contribution to mutation credibility. The supporting parameters comprise mutation depth, mutation frequency and the number of read-through read pairs; the objection parameters comprise sequencing depth, mutation source, alignment quality, unique alignment and regional complexity, where the mutation source indicates whether the mutation is derived from the first mutation data, the second mutation data, or both.
Preferably, the weighted scoring further comprises an additional overall score adjustment according to how many of the four items of mutation source, alignment quality, unique alignment and regional complexity incur a penalty.
Further, the sequencing data has a sequencing depth of 200X or more and a mutation frequency of 10% or less; the system is particularly suitable for sequencing with a sequencing depth of more than 10000X and a mutation frequency of 1% or less.
Further, the detection system is suitable for sequencing in which the initial DNA input is 100 ng or less, more preferably 50 ng or less.
Further, low-quality data is removed from the sequencing data before alignment, including but not limited to trimming residual adapter sequences, trimming runs of consecutive low-quality bases, discarding low-quality sequences, and discarding sequences that are too short.
Further, the filtering process further comprises filtering with a local frequency library, and/or filtering out variation with no actual functional effect, and/or checking the BAM file to filter according to the authenticity of the variation.
In a second aspect of the present invention, a method for detecting high throughput genomic sequencing variation based on a low input starting amount is provided, comprising:
aligning the sequencing data to a reference genome to obtain an alignment result;
obtaining mutation data based on the alignment result, including obtaining first mutation data containing mutation sites, and detecting structural variation and identifying second mutation data containing single nucleotide mutation and indel results;
integrating and filtering the mutation data, wherein the integration process takes the union of the first mutation data and the second mutation data to obtain third mutation data and gives each mutation in the third mutation data a weighted score according to the contribution of the mutation information parameters to mutation credibility;
and outputting the mutation data obtained by the final filtering.
In a third aspect of the present invention, an electronic device is provided, including a processor and a memory, where the memory stores a computer program, and the processor implements the mutation detection method according to the second aspect when executing the computer program.
In a fourth aspect of the present invention, a computer-readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the mutation detection method according to the second aspect.
Compared with the prior art, the invention has the beneficial effects that:
The mutation detection system provided by the invention both identifies mutation sites and detects structural variation, which ensures that low-frequency mutation is detected as completely as possible and reduces the false negative detection rate. It then merges the two sets of mutation data and gives each merged mutation a weighted score according to how much its mutation information parameters contribute to its credibility, so that credible mutation is screened out and false positive detection is avoided. As a whole, the low-frequency detection capability is effectively improved while false negatives and false positives caused by experimental or sequencing effects are avoided.
The mutation detection system can improve detection precision and recall rate as much as possible when molecular biological diagnosis is carried out, and can ensure mutation detection with low detection limit under the condition of low input DNA initial quantity.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a high throughput genome sequencing variation detection system based on low input initiation amount according to the present invention.
Detailed Description
The following provides definitions of some of the terms used in this specification. Unless otherwise defined, all terms used herein are intended to have the meanings commonly understood by those skilled in the art to which the present scheme pertains.
Term interpretation:
Low-input starting amount library: the initial DNA input of a library is generally 100 ng or more; a library with a DNA input of about 50 ng is called a low-input starting amount library.
Sequencing depth: also referred to as DP. In a VCF file, DP denotes the "total sequence coverage depth" at a position, i.e. how many times the site was sequenced, expressed in units of X.
Mutation frequency: also called AF, the ratio of the coverage depth of a mutant sequence at a locus to the total sequence coverage depth at that locus. Taking the fields of a VCF file as an example, DP represents the "total sequence coverage depth" at the position and AD represents the "mutant sequence coverage depth", so AF = AD (mutant sequence coverage depth) / DP (total sequence coverage depth).
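As a small illustration of the AF definition, the ratio can be computed as follows. The function name and the input checks are assumptions made for this sketch and are not part of the original text.

```python
def allele_frequency(ad: int, dp: int) -> float:
    """Return AF = AD / DP for one site.

    ad : coverage depth of the mutant sequence (AD)
    dp : total sequence coverage depth at the site (DP)
    """
    if dp <= 0:
        raise ValueError("DP must be positive")
    if not 0 <= ad <= dp:
        raise ValueError("AD must lie between 0 and DP")
    return ad / dp


# 50 mutant reads at a depth of 10000X correspond to a mutation frequency of 0.5%.
print(allele_frequency(50, 10000))
```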
VCF: the VCF (Variant Call Format) format is a common record format for DNA sequence variation, widely used in genomic research and genetic variation analysis. It is a text format used primarily to store and describe SNPs (single nucleotide polymorphisms), indels (insertions or deletions) and other types of DNA sequence variation in one or more samples. A VCF file typically consists of metadata, which includes information such as the file format version, sample information and the reference genome; a header, which consists of annotation lines beginning with "#" and describes the meaning of each column of the file; and variant information, which is arranged in columns and includes chromosome position, reference sequence, mutant sequence, quality score, filtering status, annotation information and so on. The advantage of the VCF format is that it can record DNA sequence variation for multiple samples simultaneously and supports comparison and statistical analysis of variation sites between samples. It also provides a standard file format for genomics research, which facilitates data sharing and processing.
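To make the column layout concrete, the following sketch reads the basic variation information (chromosome, position, REF, ALT) from VCF text lines. It relies only on the standard fixed column order of the format (CHROM, POS, ID, REF, ALT, ...); the names VariantKey and parse_vcf_keys and the example records are hypothetical and serve illustration only.

```python
from typing import Iterable, Iterator, NamedTuple


class VariantKey(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


def parse_vcf_keys(lines: Iterable[str]) -> Iterator[VariantKey]:
    """Yield one VariantKey per ALT allele found in VCF text lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip metadata ("##...") and the column header ("#CHROM...")
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        # A record may list several ALT alleles separated by commas.
        for allele in alt.split(","):
            yield VariantKey(chrom, int(pos), ref, allele)


# Made-up records for illustration only.
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t12345\t.\tA\tG\t.\tPASS\tDP=1000",
    "chr2\t67890\t.\tC\tT,CA\t.\tPASS\tDP=800",
]
for key in parse_vcf_keys(example_vcf):
    print(key)
```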
False positive detection: a variation or mutation that does not actually exist is falsely reported as present. This may be the result of experimental errors, data processing errors, sample degradation, or other factors.
False negative detection: a variation or mutation that actually exists is not detected, i.e. is missed.
Variation grading: variation is graded into 3 classes according to the guideline for the interpretation of tumor gene variation jointly issued by the Association for Molecular Pathology (AMP), the American Society of Clinical Oncology (ASCO) and the College of American Pathologists (CAP): class 1, variation with clear clinical significance; class 2, variation with potential clinical significance; class 3, variation of unknown clinical significance.
BAM: the most common alignment data storage format in current genetic data analysis, a binary file format for storing large-scale sequencing data, in particular the alignment of sequences to a reference genome.
In order to ensure the accuracy of high-throughput genetic variation detection of a low-input initial-quantity library (about 50 ng), a method and a system for performing variation detection on the low-input initial-quantity library based on high-throughput genome sequencing data are provided, so that the low-frequency detection capability can be effectively improved, and meanwhile, false negative/positive caused by experimental or sequencing influence can be avoided.
The inventors of the present invention found that, for mutations with a sequencing depth of 200X or more and a mutation frequency of 10% or less, using Mutect2 alone does not ensure the detection effect and produces a large number of false positives. False positive detection may mislead diagnosis and treatment, so the influence of false positives must be eliminated when analyzing and interpreting NGS detection results. When the sequencing depth is more than 10000X and the mutation frequency is less than 1%, not only are there many false positives, but false negative detection also becomes more common. False negative detection may cause a patient to miss critical diagnostic information or treatment opportunities, so care is required in NGS detection to reduce the false negative detection rate and ensure the accuracy and reliability of the results.
To address this problem systematically, it is first necessary to ensure that low-frequency variation is detected as completely as possible, thereby reducing the false negative detection rate. To achieve this, the mutation detection software (Mutect2) must be tuned to be as sensitive as possible. However, in early tests more than ten parameter combinations were tried, and none of them detected all the variations in the system. After investigation, we speculate that the default statistical test of Mutect2 cannot accurately distinguish whether a low-quality variation is a true or a false variation, and that, taking the more conservative view, the software discards part of the ultra-low-frequency variation (mutation frequency below one in a thousand). We therefore introduced a second piece of software, Strelka2, into the bioinformatics detection system. Strelka2 is open-source software developed by Illumina and is characterized by an average running speed of about 17-22 times that of Mutect2. Combined with the results of the software Manta, it detects low-frequency variation more easily. Feature training can also be performed on a true (false) negative dataset, ensuring that mutations in the reaction system are detected as completely as possible. It should be noted that current bioinformatics software for mutation detection includes not only Mutect2 and Strelka2 but also other well-known somatic mutation detection software such as VarScan, VarDict and DeepVariant. However, according to the literature published so far and industry experience, the other software is far less efficient (in both accuracy and time) than Mutect2 and Strelka2, so to keep the redundancy of the system as low as possible we use only Mutect2 and Strelka2 for mutation detection.
By taking the intersection and union of the mutations detected by the two tools, keyed by locus (chromosomal position), the mutations can be divided into three sets: Mutect2-only, Strelka2-only, and common mutations. Each mutation is then scored with the scoring equation corresponding to the set it belongs to, and the score is used for mutation filtering.
After the false negative problem has been largely solved (in theory it is difficult to eliminate completely and can only be reduced through iterative adjustment), the subsequent analysis focuses on false positive detection. As mentioned above, mutations are scored, filtered in multiple passes against open-source databases and the self-built database, and annotated with tags. Finally, the results are output with a unified header style and uploaded to the reporting system for record, so that the self-built database and the positive/negative training set for Strelka2 can be maintained and iteratively upgraded over the long term.
The low-input, high-throughput genetic variation detection process can be summarized into four modules, namely a preprocessing module, a variation result integration module, a variation (mutation) filtering module and a detection output module, which are respectively used for preprocessing the genome sequencing data, integrating variation results, filtering variation (mutation) and outputting the filtered credible variation results. The flow chart is shown in FIG. 1. The specific steps are as follows:
1. Data preprocessing
(1) Genome sequencing data pretreatment
Index sequence information is provided for each sample before sequencing, and bcl2fastq software is used to split the raw off-machine data into per-sample fastq format data.
(2) Raw data processing
The raw genome fastq data contains a portion of low-quality data that can interfere with subsequent analysis and therefore needs to be removed. Data quality processing is performed with the software fastp: residual adapter sequences are trimmed, runs of consecutive low-quality bases are trimmed, low-quality sequences are discarded, and sequences that are too short are discarded.
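The idea behind two of these filters, discarding reads that are too short and reads of low overall quality, can be sketched in a few lines. This is not fastp's algorithm; it is a simplified illustration that assumes standard Phred+33 quality encoding, and the function names and both thresholds are hypothetical.

```python
def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a FASTQ quality string (Phred+33 encoding assumed)."""
    return sum(ord(char) - offset for char in quality) / len(quality)


def keep_read(
    sequence: str,
    quality: str,
    min_length: int = 50,            # hypothetical length threshold
    min_mean_quality: float = 20.0,  # hypothetical quality threshold
) -> bool:
    """Return True if the read passes the length and mean-quality filters."""
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    if len(sequence) < min_length:
        return False  # discard sequences that are too short
    return mean_phred(quality) >= min_mean_quality  # discard low-quality sequences


# "I" encodes Q40 and "#" encodes Q2 under Phred+33.
print(keep_read("A" * 60, "I" * 60))
print(keep_read("A" * 60, "#" * 60))
```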
(3) Fastq sequence alignment with reference genome
Reads are mapped to the human reference genome hg19 according to the consistency between their base sequence and the reference, and an alignment result BAM file is generated.
2. Integrating the variant results
(4) Somatic mutation detection using Mutect2 + Vcftools + VEP
Somatic mutation detection using Mutect2 in combination with Vcftools and VEP is a classical procedure. First, Mutect2 compares the sample data with the reference genome to identify potential mutation sites. Vcftools is then used to process and filter the VCF file to improve the reliability of the variation. Finally, the Variant Effect Predictor (VEP) functionally annotates and interprets the mutations, which helps determine which mutations may be biologically significant.
(5) Somatic mutation detection using Manta + Strelka2 + VEP
Manta is used to detect structural variations such as insertions, deletions and inversions. Strelka2 is then used to identify single nucleotide mutations and small indels. These mutations are then functionally annotated and interpreted with VEP to help determine their potential impact. Integrating these three tools captures information about somatic mutations comprehensively, so that variation in the genome can be analyzed and understood effectively.
3. Filtering variation (mutation)
(6) Integration of variants and scoring
① The variation files from the two sources (VCF format) are read separately to extract the basic information of each variation (chromosome position, base before mutation and base after mutation), giving two sets whose elements are the basic variation information. The intersection and union of the two sets are then calculated, and finally every variation reported by either software is labeled with its source.
The labeling method is as follows: every variation reported by the two software tools is assigned one of three sources, namely A if the variation comes only from Mutect2, B if it comes only from Strelka2, and C if it is detected by both.
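The source labeling can be sketched with ordinary set operations. The function name and the representation of a variation as a (chromosome, position, REF, ALT) tuple are assumptions for this illustration; the example variations are made up.

```python
def label_sources(mutect2_keys, strelka2_keys):
    """Label each variation with its source: A (Mutect2 only), B (Strelka2 only), C (both).

    Each key is a (chrom, pos, ref, alt) tuple holding the basic variation information.
    """
    mutect2_set = set(mutect2_keys)
    strelka2_set = set(strelka2_keys)
    common = mutect2_set & strelka2_set          # intersection: detected by both tools
    labels = {}
    for key in sorted(mutect2_set | strelka2_set):  # union: every reported variation
        if key in common:
            labels[key] = "C"
        elif key in mutect2_set:
            labels[key] = "A"
        else:
            labels[key] = "B"
    return labels


# Made-up variations for illustration only.
mutect2_calls = [("chr1", 100, "A", "G"), ("chr2", 200, "C", "T")]
strelka2_calls = [("chr2", 200, "C", "T"), ("chr3", 300, "G", "A")]
for key, source in label_sources(mutect2_calls, strelka2_calls).items():
    print(key, source)
```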
② Collecting evidence items from BAM
First, based on the basic variation information obtained in the previous step, the pandas module of Python is used to obtain the read records (i.e. alignments) corresponding to each variation in the BAM file (note that all required evidence items are obtained from the BAM file). Each read record is then analyzed in turn to collect the evidence items needed to evaluate the reliability of the variation, and each evidence item is given a scoring mechanism of additions and deductions that follows logical rules: supporting evidence scores positive and objection evidence scores negative. In the end, the higher the score, the more reliable the variation, and the scores are divided into reasonable grades.
The evidence items comprise supporting evidence and objection evidence, as follows:
I. Supporting evidence:
Mutation frequency (AF): the allele fraction of the mutation in the tumor sample, indicating the relative abundance of the mutation. The larger the proportion of reads in which the mutation is detected, the higher the authenticity of the mutation. The score should be positive and follow a monotonically increasing function, and the smaller the AF, the smaller the score increment, i.e. the slope is proportional to AF;
Number of read-through read pairs: the more read-through read pairs support the variation, the higher its authenticity. The score should be positive and follow a monotonically increasing function, and the larger the number, the smaller the score increment, i.e. the slope is inversely proportional to the number of read-through read pairs;
Allele depth (AD): the more reads in which the variation is detected, the higher its authenticity. The score should therefore be positive and follow a monotonically increasing function, and the score increment should be small when AD is small (AD <= 8), i.e. the slope is proportional to AD.
II. Objection evidence:
Sequencing depth (DP): the sequencing depth covering the position. Higher coverage generally means more reliable mutation detection;
Alignment quality (high_MQ_reads): if the average MQ of the reads carrying the mutation is not less than 50% of the average MQ of the other reads and the average MQ of the reads carrying the mutation is > 10, no points are deducted for this item; otherwise, points are deducted;
Unique alignment (Uniq): whether the variation lies in a uniquely aligned region. If 80% of the reads have no XA tag, the tag is True;
Regional complexity (Complex): whether the variation lies within a complex sequence region (i.e. whether the sequence around the variation is a complex sequence);
Variation source: if the same variation is detected by both software tools, its authenticity is higher and the tag is True; otherwise, a penalty is applied.
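The qualitative rules above can be illustrated with a few small functions. The source does not give the actual scoring equations, so the functional forms below (a quadratic for AF, a logarithm for the read-pair count) and all weights are hypothetical choices that merely satisfy the stated slope conditions. The two flag functions follow the stated thresholds, with "80% of the reads" read as "at least 80%", which is an assumption.

```python
import math


def af_score(af: float, weight: float = 100.0) -> float:
    """Hypothetical AF score: weight * af**2, so the slope (2 * weight * af) is proportional to AF."""
    return weight * af**2


def read_pair_score(n_pairs: int, weight: float = 10.0) -> float:
    """Hypothetical read-pair score: weight * ln(1 + n), so the slope shrinks as n grows."""
    return weight * math.log1p(n_pairs)


def high_mq_reads(variant_mqs, other_mqs) -> bool:
    """True if mean MQ of variant reads is > 10 and at least 50% of the mean MQ of other reads."""
    if not variant_mqs:
        return False
    mean_variant = sum(variant_mqs) / len(variant_mqs)
    if mean_variant <= 10:
        return False
    if not other_mqs:
        return True  # assumption: nothing to compare against, only the > 10 rule applies
    mean_other = sum(other_mqs) / len(other_mqs)
    return mean_variant >= 0.5 * mean_other


def is_unique(has_xa_flags, min_fraction: float = 0.8) -> bool:
    """True if at least min_fraction of the reads carry no XA tag."""
    if not has_xa_flags:
        return False
    without_xa = sum(1 for has_xa in has_xa_flags if not has_xa)
    return without_xa / len(has_xa_flags) >= min_fraction


print(high_mq_reads([60, 58, 60], [60, 60, 59]))
print(is_unique([False] * 9 + [True]))
```

Any other pair of monotonically increasing functions with the same slope behavior would fit the description equally well.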
③ The evidence items mentioned in the above steps are scored; the scoring details are as follows:
④ After the evidence items mentioned in the above steps have been scored in turn, the four items high_MQ_reads, Uniq, Complex and variation source are checked for how many are not satisfied, and an additional score adjustment is applied:
Evidence item                              Additional scoring
None of the four items is satisfied        Additional penalty of 30
Any three items are not satisfied          Additional penalty of 20
Any two items are not satisfied            Additional penalty of 10
None of the four items is penalized        Additional sub-10
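The three penalty rows of this table can be expressed as a lookup on the number of unsatisfied items. This is a sketch only: the function name is hypothetical, and the cases of zero or one unsatisfied item return 0 here because the source does not state an unambiguous value for them.

```python
# Extra penalty keyed by how many of the four items are NOT satisfied.
EXTRA_PENALTY = {4: 30, 3: 20, 2: 10}


def extra_penalty(satisfied) -> int:
    """Return the additional penalty for the four flags
    (high_MQ_reads, Uniq, Complex, variation source), given as booleans
    that are True when the item is satisfied.
    """
    flags = list(satisfied)
    if len(flags) != 4:
        raise ValueError("exactly four flags are expected")
    unsatisfied = flags.count(False)
    # Assumption: 0 or 1 unsatisfied item is not modeled and yields no extra penalty.
    return EXTRA_PENALTY.get(unsatisfied, 0)


print(extra_penalty([True, False, False, True]))
```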
⑤ Mutations are classified into five grades according to their scores:
Grade E: below 0, false variation;
Grade D: 0-40, possibly false variation;
Grade C: 40-80, variation of uncertain authenticity;
Grade B: 80-120, possibly true variation;
Grade A: 120 or more, credible variation;
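The grading can be written as a simple threshold function. Two points are assumptions of this sketch rather than statements of the source: the range 40-80 for grade C, which is inferred from the neighboring grades, and the treatment of each boundary value as belonging to the higher grade.

```python
def grade(score: float) -> str:
    """Map a total score to a grade from E (false) to A (credible)."""
    if score < 0:
        return "E"  # false variation
    if score < 40:
        return "D"  # possibly false variation
    if score < 80:
        return "C"  # authenticity uncertain (range inferred, see lead-in)
    if score < 120:
        return "B"  # possibly true variation
    return "A"      # credible variation


for value in (-5, 10, 55, 100, 150):
    print(value, grade(value))
```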
(7) Annotating local frequencies from a local frequency library
Construction of a local frequency library:
In the above steps we described, for one sample, the position of each mutation occurring in it (chromosome number and base position) together with the base before mutation (REF) and the base after mutation (ALT) at that position. For the samples of a large cohort (e.g. 500 samples), we can then collect all mutations across all samples and count, for each mutation, the number of samples in which it occurs (e.g. 250). The local frequency of that mutation in the cohort's local frequency library is then 250/500 = 0.5; that is, for a fixed local frequency library, each mutation corresponds to one local frequency.
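The construction of the local frequency library can be sketched as a per-sample count. The function name and the tuple representation of a mutation are assumptions for illustration; the example reproduces the 250/500 case from the text with a made-up mutation.

```python
from collections import Counter


def build_local_frequency(samples):
    """Return {mutation: fraction of samples carrying it}.

    samples : iterable with one collection of mutation keys per sample,
              each key being a (chrom, pos, ref, alt) tuple.
    """
    counts = Counter()
    n_samples = 0
    for mutations in samples:
        n_samples += 1
        counts.update(set(mutations))  # count each mutation at most once per sample
    if n_samples == 0:
        return {}
    return {key: count / n_samples for key, count in counts.items()}


# Made-up cohort: the mutation occurs in 250 of 500 samples.
mutation = ("chr1", 100, "A", "G")
cohort = [[mutation]] * 250 + [[]] * 250
print(build_local_frequency(cohort)[mutation])
```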
(8) Filtering variations
① Filtering using a local frequency library
The local frequency library records, for each mutation, the proportion of all locally analysed samples in which it was detected. If the local frequency of a mutation is greater than 20% and the mutation is not in the common white and black lists, it is considered a false mutation introduced by the reaction system and is labeled LocalDB.
② Filtering according to scoring
Variations rated A or B are retained; variations rated C, D or E are filtered out as low-confidence variations and marked LowQ.
③ Filtering according to mutation function
Variations that have no actual effect on function are removed and marked DontAffectFunc;
④ Checking the BAM file to verify the authenticity of the variation
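The first three filters can be sketched as a tagging function. This is an illustrative sketch under stated assumptions: the function and parameter names are hypothetical, membership in the white and black lists is reduced to a single boolean, all applicable tags are returned because the text does not give an order of precedence, and the PASS label is taken from the worked example. The manual BAM check of step ④ is not modeled.

```python
def filter_tags(rating: str, local_freq: float, affects_function: bool,
                in_curated_list: bool = False, max_local_freq: float = 0.20) -> list:
    """Return the filter tags for one variation, or ["PASS"] if none apply."""
    tags = []
    # Frequent in the local library and not on a curated list: likely artifact.
    if local_freq > max_local_freq and not in_curated_list:
        tags.append("LocalDB")
    # Only ratings A and B are retained.
    if rating in {"C", "D", "E"}:
        tags.append("LowQ")
    # Variations without functional effect are removed.
    if not affects_function:
        tags.append("DontAffectFunc")
    return tags or ["PASS"]
```

For instance, a B-rated variation with a hypothetical local frequency of 0.35 is tagged LocalDB, while an A-rated variation with a low local frequency and a functional effect passes.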
4. Outputting the result
The untrustworthy mutations marked in step (8) are filtered out, and the remaining mutations are reported and assessed. Mutation interpretation is carried out according to the literature reports on each mutation; the interpretation includes, but is not limited to, providing a reference for disease diagnosis, prognosis, recurrence and treatment.
Examples
After receiving the DNA sequencing data of a pediatric tumor patient, the following operations were performed:
(1) Genome sequencing data pretreatment
The raw data statistics are obtained as follows:
Sample  Raw_reads    Raw_bases
Test1   146,875,740  22,178,236,740
(2) Raw data processing
After quality control, a high-quality sequence is obtained, and the data statistics are as follows:
Sample  Clean_reads  Clean_bases     Q20(%)  Q30(%)  Clean_bases(%)
Test1   144,485,354  20,945,984,384  98.17   94.91   94.44
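The statistics in the two tables can be cross-checked with simple arithmetic. The sketch below assumes that the clean-bases percentage is the ratio of clean bases to raw bases; the text does not define the column, but the reported figures are consistent with that reading. Variable names are illustrative.

```python
# Figures copied from the raw-data and clean-data tables above.
raw_reads, raw_bases = 146_875_740, 22_178_236_740
clean_reads, clean_bases = 144_485_354, 20_945_984_384

# Raw bases divide evenly by raw reads, giving the bases per raw read.
assert raw_bases % raw_reads == 0
bases_per_raw_read = raw_bases // raw_reads

# Assumed definition: percentage of raw bases retained after quality control.
clean_bases_pct = round(100 * clean_bases / raw_bases, 2)

print(bases_per_raw_read)  # 151
print(clean_bases_pct)     # 94.44
```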
(3) Alignment of fastq data to the reference genome
The alignment of the sequence data with the human reference genome hg19 is as follows:
(4) Somatic mutation detection using Mutect2 + Vcftools + VEP
CHR  POS        REF  ALT  AD   AF     Read-through read pairs  DP
12   112888189  G    A    103  0.373  176                      276
16   50827573   C    A    87   0.026  103                      193
X    41201995   T    C    32   0.1    221                      321
(5) Somatic mutation detection using Manta + Strelka2 + VEP
CHR  POS        REF  ALT  AD   AF     Read-through read pairs  DP
10   871189     T    C    56   0.172  226                      326
12   112888189  G    A    103  0.373  176                      276
X    41201995   T    C    32   0.1    221                      321
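The integration step takes the union of the two call sets and records where each variation came from. The sketch below uses the variations listed in the two tables above. The A/B/C labels mirror the markers used in the scoring table of this example; reading them as "first pipeline only", "second pipeline only" and "both pipelines" is an inference from the two call sets, not something the text states. Function and variable names are hypothetical.

```python
# Variations as (chrom, pos, ref, alt), copied from the two tables above.
pipeline_1 = {("12", 112888189, "G", "A"), ("16", 50827573, "C", "A"), ("X", 41201995, "T", "C")}
pipeline_2 = {("10", 871189, "T", "C"), ("12", 112888189, "G", "A"), ("X", 41201995, "T", "C")}


def integrate(first: set, second: set) -> dict:
    """Return {variation: source label} for the union of both call sets."""
    merged = {}
    for variation in first | second:
        if variation in first and variation in second:
            merged[variation] = "C"  # reported by both pipelines
        elif variation in first:
            merged[variation] = "A"  # first pipeline only
        else:
            merged[variation] = "B"  # second pipeline only
    return merged


sources = integrate(pipeline_1, pipeline_2)
print(len(sources))  # 4
```

Under this reading, the variations on chromosomes 12 and X are supported by both pipelines, while those on chromosomes 10 and 16 come from a single pipeline and are penalized in the source-of-variation item.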
(6) Integration of variants and scoring
CHR                      10      12         16        X
POS                      871189  112888189  50827573  41201995
REF                      T       G          C         T
ALT                      C       A          A         C
AD                       40      40         40        40
AF                       50      50         50        50
Read-through read pairs  45      45         45        45
DP                       0       0          0         0
Source of variation      -30(B)  0(C)       -30(A)    0(C)
High MQ reads            0       0          0         -30
Uniq                     -30     0          0         -30
Complex                  0       0          0         0
Additional scoring       -10     0          0         -10
Total score              65      135        105       65
Rating                   C       A          B         C
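The total score of each variation in the table above is the sum of its item scores, including the additional scoring. The short sketch below reproduces the totals from the listed item scores; the dictionary layout and key names are illustrative only.

```python
# Item scores per variation, in the row order of the table above:
# AD, AF, read pairs, DP, source, high MQ reads, uniq, complex, additional.
item_scores = {
    "10:871189":    [40, 50, 45, 0, -30,   0, -30, 0, -10],
    "12:112888189": [40, 50, 45, 0,   0,   0,   0, 0,   0],
    "16:50827573":  [40, 50, 45, 0, -30,   0,   0, 0,   0],
    "X:41201995":   [40, 50, 45, 0,   0, -30, -30, 0, -10],
}

# Supporting items add to the total, opposing items subtract from it.
totals = {variation: sum(scores) for variation, scores in item_scores.items()}
print(totals)
# {'10:871189': 65, '12:112888189': 135, '16:50827573': 105, 'X:41201995': 65}
```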
(7) Filtering mutations
CHROM  POS        REF  ALT  Total score  Rating  Variant tag
10     871189     T    C    65           C       LowQ
12     112888189  G    A    135          A       PASS
16     50827573   C    A    105          B       LocalDB
X      41201995   T    C    65           C       LowQ
(8) Outputting the result
CHROM  POS        REF  ALT  Total score  Rating  Variant tag
12     112888189  G    A    135          A       PASS
Although the present disclosure is disclosed above, the scope of the present disclosure is not limited thereto. Various changes and modifications may be made by one skilled in the art without departing from the spirit and scope of the disclosure, and these changes and modifications will fall within the scope of the disclosure.

Claims (9)

1. A high-throughput genome sequencing variation detection system based on a low input starting amount, characterized by comprising:
a data preprocessing module, configured to align sequencing data to a reference genome to obtain an alignment result;
a variation detection module, comprising a first detection module and a second detection module, configured to obtain variation data separately on the basis of the alignment result; the first detection module is configured to obtain first variation data containing mutation sites, and the second detection module is configured to detect structural variation and identify second variation data containing single nucleotide mutation and insertion/deletion results;
a variation integration and filtering module, configured to integrate and filter the variation data; the integration process takes the union of the first variation data and the second variation data to obtain third variation data, and each variation in the third variation data is given a weighted score according to the contribution of the variation information parameters to the credibility of the variation; the filtering process includes filtering out variations with low scores;
a detection output module, configured to output the finally filtered variation data;
wherein the variation information parameters are divided, according to their contribution to the variation, into supporting parameters and opposing parameters, which are assigned positive and negative scores respectively; the supporting parameters include mutation depth, mutation frequency and the number of read-through read pairs; the opposing parameters include sequencing depth, variation source, alignment quality, unique alignment and region complexity; the variation source refers to which of the first and second variation data the variation data is derived from.
2. The detection system according to claim 1, characterized in that the first detection module uses Mutect2 and Vcftools to identify and process mutation sites; and/or the second detection module uses Manta and Strelka2 to detect and identify single nucleotide mutations and insertions/deletions.
3. The detection system according to claim 1, characterized in that the weighted scoring further includes an additional overall score based on the penalties applied for variation source, alignment quality, unique alignment and region complexity.
4. The detection system according to claim 1, characterized in that the sequencing depth of the sequencing data is 200X or more and the mutation frequency is 10% or less; and/or the DNA input starting amount for sequencing is 100 ng or less.
5. The detection system according to claim 1, characterized in that low-quality data is removed from the sequencing data before alignment.
6. The detection system according to claim 1, characterized in that the filtering process further includes: filtering using a local frequency library; and/or filtering out variations that have no actual effect; and/or checking the BAM file and filtering according to the authenticity of the variation.
7. A high-throughput genome sequencing variation detection method based on a low input starting amount, for non-diagnostic purposes, characterized by comprising:
aligning sequencing data to a reference genome to obtain an alignment result;
obtaining variation data on the basis of the alignment result, including obtaining first variation data containing mutation sites, and identifying, by detecting structural variation, second variation data containing single nucleotide mutation and insertion/deletion results;
integrating and filtering the variation data; the integration process takes the union of the first variation data and the second variation data to obtain third variation data, and each variation in the third variation data is given a weighted score according to the contribution of the variation information parameters to the credibility of the variation; the filtering process includes filtering out variations with low scores;
outputting the finally filtered variation data;
wherein the variation information parameters are divided, according to their contribution to the variation, into supporting parameters and opposing parameters, which are assigned positive and negative scores respectively; the supporting parameters include mutation depth, mutation frequency and the number of read-through read pairs; the opposing parameters include sequencing depth, variation source, alignment quality, unique alignment and region complexity; the variation source refers to which of the first and second variation data the variation data is derived from.
8. An electronic device, characterized by comprising a processor and a memory, wherein the memory stores a computer program and the processor implements the method according to claim 7 when executing the computer program.
9. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and a processor executes the computer program to implement the method according to claim 7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411204402.3A CN119152934B (en) 2024-08-29 2024-08-29 High-throughput genome sequencing variation detection system and method based on low input starting amount

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411204402.3A CN119152934B (en) 2024-08-29 2024-08-29 High-throughput genome sequencing variation detection system and method based on low input starting amount

Publications (2)

Publication Number Publication Date
CN119152934A CN119152934A (en) 2024-12-17
CN119152934B true CN119152934B (en) 2025-05-09

Family

ID=93816496

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411204402.3A Active CN119152934B (en) 2024-08-29 2024-08-29 High-throughput genome sequencing variation detection system and method based on low input starting amount

Country Status (1)

Country Link
CN (1) CN119152934B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117711487A (en) * 2024-02-05 2024-03-15 广州嘉检医学检测有限公司 Identification method and system for embryo SNV and InDel variation and readable storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109686439B (en) * 2018-12-04 2020-08-28 东莞博奥木华基因科技有限公司 Data analysis method, system and storage medium for genetic disease gene detection
CN113889187B (en) * 2021-09-24 2022-12-06 上海仁东医学检验所有限公司 Single-sample allele copy number variation detection method, probe set and kit
CN114807333A (en) * 2022-05-31 2022-07-29 华南农业大学 Method for identifying whole genome variation of gene editing animal
CN116469462B (en) * 2023-03-20 2024-09-20 重庆邮电大学 Ultra-low frequency DNA mutation identification method and device based on double sequencing


Also Published As

Publication number Publication date
CN119152934A (en) 2024-12-17


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20241217

Assignee: SHANGHAI SIMPLEGEN MEDICAL INSPECTION AND VERIFICATION INSTITUTE CO.,LTD.

Assignor: Shanghai Xinnuo Baishi medical laboratory Co.,Ltd.

Contract record no.: X2025310000059

Denomination of invention: A high-throughput genomic sequencing variant detection system and method based on a low initial input amount

Granted publication date: 20250509

License type: Common License

Record date: 20250701