CN119152934B

CN119152934B - High-throughput genome sequencing variation detection system and method based on low input starting amount

Info

Publication number: CN119152934B
Application number: CN202411204402.3A
Authority: CN
Inventors: 李天一; 柳佳琦; 罗玉洁; 陈莉
Original assignee: Shanghai Xinnuo Baishi Medical Laboratory Co ltd
Current assignee: Shanghai Xinnuo Baishi Medical Laboratory Co ltd
Priority date: 2024-08-29
Filing date: 2024-08-29
Publication date: 2025-05-09
Anticipated expiration: 2044-08-29
Also published as: CN119152934A

Abstract

The invention discloses a high-throughput genome sequencing variation detection system and method based on a low input starting amount. The variation detection system identifies mutation sites and detects structural variation so that low-frequency variation is detected as completely as possible, which reduces the false negative detection rate. It then merges the two groups of variation data and, for each merged variation, performs weighted scoring of the contribution of the variation information parameters to the credibility of the variation in order to screen credible variations, which avoids false positive detection. Overall, the low-frequency detection capability is effectively improved while false negatives and false positives caused by experimental or sequencing effects are avoided.

Description

High-throughput genome sequencing variation detection system and method based on low input initial quantity

Technical Field

The invention belongs to the technical field of genome variation detection, and particularly relates to a high-throughput genome sequencing variation detection system and method based on low input initial quantity.

Background

In the early stages of product development, application scenarios, technical details and costs are all key factors that need to be considered. For low-input starting amount libraries, second-generation sequencing combined with bioinformatics software is generally chosen to detect low-frequency mutations.

High throughput genomic sequencing technology (also known as second generation sequencing technology) has made significant progress in the last decade, and has also made a major breakthrough in single nucleotide variation detection:

1. Mutation detection is more comprehensive: high-throughput genome sequencing technology can sequence the DNA of the whole genome or of a specific region and provides comprehensive mutation information (including single nucleotide mutations, insertion/deletion mutations, copy number variations and the like);

2. The cost is lower: the single-run throughput of current sequencers is high, and the output of a single run can reach the trillion-base (T) level;

3. Data analysis support: the enormous amount of data generated by high-throughput genome sequencing requires powerful data analysis and interpretation capabilities. A large number of bioinformatics tools and databases have been developed for mutation annotation, functional prediction and functional interpretation of high-throughput sequencing data, providing more support and guidance to scientists.

Second-generation sequencing technology was first applied in scientific research, where the sequencing depth is not high, sequencing raw materials are easy to obtain, and the initial DNA input is generally more than 100 ng. Tumor samples in clinical screening pose two additional challenges: tumor raw material is not readily available, and the extremely high sequencing depth used in clinical screening is often accompanied by a high duplication rate (DUP). The high depth is needed because the frequency of somatic mutations in tumors is low (typically 0.5% or less), and reads carrying the mutation are more readily obtained at higher sequencing depths. To achieve this extremely high sequencing depth, PCR is used to repeatedly amplify the library in order to amplify the signal, even though redundant reads are introduced in the process. In general, the more rounds of PCR, the more redundant reads are introduced; the larger the initial input of the sequencing library, the fewer rounds of PCR are needed.

In general, for higher mutation frequencies (≥20%), a sequencing depth of ≥200X is sufficient to identify 95% of the mutations; for lower mutation frequencies (≤10%), the detection system should be improved rather than simply increasing the sequencing depth. This also means that, for high-throughput detection on low-input starting amount sequencing libraries, ensuring detection accuracy remains both difficult and necessary.

Disclosure of Invention

The invention aims to provide a high-throughput genome sequencing variation detection system and method based on low input initial quantity, which can effectively improve the low-frequency detection capability and avoid false negative/positive caused by experimental or sequencing influence.

In view of this, the scheme of the invention is as follows:

In a first aspect of the present invention, a high throughput genomic sequencing variation detection system based on low input initiation amount is presented, comprising:

the data preprocessing module is used for comparing the sequencing data with a reference gene to obtain a comparison result;

the mutation detection module comprises a first detection module and a second detection module, each of which obtains mutation data based on the comparison result, wherein the first detection module is used for obtaining first mutation data containing mutation sites, and the second detection module is used for detecting structural mutation and identifying second mutation data containing single nucleotide mutation and indel results;

the mutation integration filtering module is used for integrating and filtering mutation data, wherein the integration process is used for obtaining third mutation data by taking a union set of the first mutation data and the second mutation data, and weighting and scoring each mutation in the third mutation data according to the contribution degree of mutation information parameters to mutation credibility;

And the detection output module is used for outputting mutation data obtained by final filtering.

Further, the first detection module uses Mutect2 and Vcftools to identify and process mutation sites, and/or the second detection module uses Manta and Strelka2 to detect and identify single nucleotide mutations and indels.

Further, the mutation information parameters are divided into supporting parameters and opposing parameters according to their contribution to the credibility of the mutation. The supporting parameters comprise mutation depth, mutation frequency and the number of read-through read pairs; the opposing parameters comprise sequencing depth, mutation source, comparison quality, unique comparison and regional complexity. The mutation source describes how the mutation is derived from the first mutation data and the second mutation data.

Preferably, the weighted scoring further comprises an additional overall score applied according to how many of the mutation source, comparison quality, unique comparison and regional complexity items are penalized.

Further, the sequencing data has a sequencing depth of 200X or more and a mutation frequency of 10% or less, and is particularly suitable for sequencing with a sequencing depth of more than 10000X and a mutation frequency of 1% or less.

Further, the present detection system is suitable for use in sequencing, in which the initial amount of DNA input is 100ng or less, and more preferably 50ng or less.

Further, low-quality data are removed from the sequencing data prior to alignment, including but not limited to trimming residual adapter sequences, trimming runs of consecutive low-quality bases, discarding low-quality sequences, and discarding sequences that are too short.

Further, the filtering process further comprises filtering by using a local frequency library, and/or filtering out variations without actual functional influence, and/or checking the BAM file to filter according to the authenticity of the variation.

In a second aspect of the present invention, a method for detecting high throughput genomic sequencing variation based on a low input starting amount is provided, comprising:

Comparing the sequencing data to a reference gene to obtain a comparison result;

Obtaining mutation data based on the alignment results, including obtaining first mutation data comprising mutation sites, and identifying second mutation data comprising single nucleotide mutations and indel results by detecting structural mutation;

the method comprises the steps of integrating and filtering mutation data, wherein the integration process takes a union set of first mutation data and second mutation data to obtain third mutation data, and weights and scores each mutation in the third mutation data according to the contribution degree of mutation information parameters to mutation credibility;

And outputting mutation data obtained by final filtering.

In a third aspect of the present invention, an electronic device is provided, including a processor and a memory, where the memory stores a computer program, and the processor implements the mutation detection method according to the second aspect when executing the computer program.

In a fourth aspect of the present invention, a computer-readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the mutation detection method according to the second aspect.

Compared with the prior art, the invention has the beneficial effects that:

The mutation detection system provided by the invention identifies mutation sites and detects structural mutation so that low-frequency mutation is detected as completely as possible, which reduces the false negative detection rate. By merging the two groups of mutation data and, for each mutation, performing weighted scoring of the contribution of the mutation information parameters to its credibility, credible mutations are screened and false positive detection is avoided. Overall, the low-frequency detection capability is effectively improved while false negatives and false positives caused by experimental or sequencing effects are avoided.

The mutation detection system can improve detection precision and recall rate as much as possible when molecular biological diagnosis is carried out, and can ensure mutation detection with low detection limit under the condition of low input DNA initial quantity.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a high throughput genome sequencing variation detection system based on low input initiation amount according to the present invention.

Detailed Description

The following provides definitions of some of the terms used in this specification. Unless otherwise defined, all terms used herein are intended to have the meanings commonly understood by those skilled in the art to which the present scheme pertains.

Term interpretation:

Low-input starting amount library: the initial DNA input of a library is generally 100 ng or more; a library with a DNA input of about 50 ng is called a low-input starting amount library.

Sequencing depth: also referred to as DP. In a VCF file, DP denotes the "total sequence coverage" at a position, i.e., the number of times the site was sequenced, in units of X.

Mutation frequency: also called AF, the ratio of the coverage depth of a given mutant sequence at a locus to the total sequence coverage depth at that locus. Taking the fields in a VCF file as an example, DP represents the "total sequence coverage depth" at the position, AD represents the "mutation sequence coverage depth", and AF = AD (mutation sequence coverage depth) / DP (total sequence coverage depth).
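The AF definition above can be illustrated with a minimal Python sketch. The function name allele_fraction is hypothetical; the input values are the AD and DP of the chr12:112888189 G>A variant reported in the example of this description.

```python
def allele_fraction(ad: int, dp: int) -> float:
    """Return AF = AD / DP.

    ad: coverage depth of the mutant sequence at the site (AD)
    dp: total sequence coverage depth at the site (DP)
    """
    if dp <= 0:
        raise ValueError("DP must be a positive integer")
    return ad / dp


# AD = 103 and DP = 276 are taken from the example tables of this description.
af = allele_fraction(103, 276)
print(round(af, 3))
```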

VCF: the VCF (Variant Call Format) format is a common record format for DNA sequence variation, widely used in genomic research, genetic variation analysis and bioinformatics. It is a text format used primarily to store and describe SNPs (single nucleotide polymorphisms), indels (insertions or deletions) and other types of DNA sequence variation in one or more samples. A VCF file typically consists of three parts: metadata, which includes information such as the file format version, sample information and the reference genome; a header, made up of annotation lines beginning with "#" that describe the meaning of each column; and variant information, arranged in columns, including chromosome position, reference sequence, mutation sequence, quality score, filtering status and annotation information. The advantage of the VCF format is that it can record DNA sequence variation for multiple samples simultaneously and supports comparison and statistical analysis of variant sites between samples. It also provides a standard file format for genomics research, facilitating data sharing and processing.
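As an illustration of the column layout described above, the following Python sketch extracts the basic variant information (chromosome, position, REF, ALT) from a single VCF data line. It relies only on the fixed leading VCF columns (CHROM, POS, ID, REF, ALT); the names VariantKey and parse_vcf_line are hypothetical, and a production pipeline would more likely use an established VCF parsing library.

```python
from typing import NamedTuple


class VariantKey(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


def parse_vcf_line(line: str) -> list:
    """Return one VariantKey per ALT allele; metadata/header lines give []."""
    if not line.strip() or line.startswith("#"):
        return []
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt = fields[:5]
    # A single record may list several comma-separated ALT alleles.
    return [VariantKey(chrom, int(pos), ref, allele) for allele in alt.split(",")]


# Illustrative record built from a variant in the example of this description.
example = "12\t112888189\t.\tG\tA\t.\tPASS\tDP=276"
print(parse_vcf_line(example))
```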

False positive detection: a situation in which a variation or mutation that does not actually exist is falsely reported as present. This may be the result of experimental errors, data processing errors, sample degradation, or other factors.

False negative detection: a situation in which an actually existing variation or mutation is not detected, i.e., is missed.

Variation classification: variations are classified into three classes according to the guideline for the interpretation of tumor gene variation jointly issued by the Association for Molecular Pathology (AMP), the American Society of Clinical Oncology (ASCO) and the College of American Pathologists (CAP): class 1, variations with clear clinical significance; class 2, variations with potential clinical significance; and class 3, variations of unknown clinical significance.

BAM: the most common alignment data storage format in current genetic data analysis; a binary file format for storing large-scale sequencing data, in particular the alignment of sequences to a reference genome.

In order to ensure the accuracy of high-throughput genetic variation detection of a low-input initial-quantity library (about 50 ng), a method and a system for performing variation detection on the low-input initial-quantity library based on high-throughput genome sequencing data are provided, so that the low-frequency detection capability can be effectively improved, and meanwhile, false negative/positive caused by experimental or sequencing influence can be avoided.

The inventors of the present invention found that, for mutations at a sequencing depth of 200X or more and a mutation frequency of 10% or less, using Mutect2 alone does not ensure the detection effect, and a large number of false positives are detected. False positive detections may mislead diagnosis and treatment, so their influence needs to be eliminated when analyzing and interpreting NGS detection results. When the sequencing depth is more than 10000X and the mutation frequency is less than 1%, not only are there many false positive detections, but false negative detections also become more common. False negative detection may cause a patient to miss critical diagnostic information or treatment opportunities, so care is required in NGS detection to reduce the false negative detection rate and ensure accurate, reliable results.

To address this problem systematically, it is first necessary to ensure that low-frequency variation is detected as completely as possible, thereby reducing the false negative detection rate. To achieve this, the mutation detection software (Mutect2) must be tuned to be as sensitive as possible. However, in early tests, more than ten parameter combinations were tried and none of them detected all the variations in the system. After investigation, we speculate that the default statistical test of Mutect2 cannot accurately distinguish whether a low-quality variation is true or false and that, being conservative, the software discards some ultralow-frequency variations (mutation frequency below one in a thousand). We therefore introduced a second piece of software, Strelka2, into the bioinformatics detection system. Strelka2 is open-source software developed by Illumina, and its average running speed is about 17-22 times that of Mutect2. Combined with the results of Manta, it detects low-frequency variation more easily. It can also be trained on a true (false) negative dataset, thereby ensuring that mutations in the reaction system are detected as completely as possible. It should be noted that current bioinformatics software for mutation detection is not limited to Mutect2 and Strelka2; other well-known somatic mutation detection software includes VarScan, VarDict and DeepVariant. However, according to the literature and industry experience published so far, the other software is far less efficient (in both accuracy and time) than Mutect2 and Strelka2, so we use only Mutect2 and Strelka2 for mutation detection in order to keep redundancy in the system as low as possible.

By taking the intersection and union of the mutations detected by the two pieces of mutation software according to locus (chromosomal location), the mutations can be divided into three sets: Mutect2-only, Strelka2-only, and common mutations. Each mutation is then scored with the scoring equation corresponding to the set it belongs to, and the score is used for mutation filtering.

After the false negative detection problem is substantially solved (in theory it is difficult to eliminate entirely and can only be reduced through iterative adjustment), the subsequent analysis focuses on solving false positive detection. As mentioned above, mutations are scored, filtered in multiple passes against open-source databases and a self-built database, and annotated with tags. Finally, the result is output with a unified header style and uploaded to the reporting system for record, so that the self-built database and the positive/negative training set for Strelka2 can be maintained and iteratively upgraded over the long term.

The detection process for genetic variation based on a low input starting amount and high throughput can be summarized into four modules, namely a preprocessing module, a variation result integration module, a variation (mutation) filtering module and a detection output module, which are respectively used for preprocessing genome sequencing data, integrating variation results, filtering variations (mutations) and outputting the filtered credible variation results. The flow chart is shown in FIG. 1. The specific steps are as follows:

1. Data preprocessing

(1) Genome sequencing data pretreatment

Using bcl2fastq software and the index sequence information provided for each sample prior to sequencing, the original off-machine data are split into per-sample data in fastq format.

(2) Raw data processing

The original genome fastq data contain a portion of low-quality data that can interfere with subsequent analysis and therefore need to be removed. Data quality processing is performed using the software fastp: trimming residual adapter sequences, trimming runs of consecutive low-quality bases, discarding low-quality sequences, and discarding sequences that are too short.
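The kind of filtering described above can be sketched in plain Python as follows. This is not the algorithm implemented by fastp; it is a simplified illustration under stated assumptions: Phred+33 quality encoding, trimming only at the 3' end, no adapter removal, and hypothetical thresholds (min_base_q, min_mean_q, min_len).

```python
def phred_scores(quality, offset=33):
    """Decode a fastq quality string (Phred+33 assumed) into integer scores."""
    return [ord(ch) - offset for ch in quality]


def trim_and_filter(seq, qual, min_base_q=20, min_mean_q=20, min_len=30):
    """Trim the low-quality 3' tail, then discard short or low-quality reads.

    Returns the trimmed (seq, qual) pair, or None when the read is discarded.
    All thresholds are hypothetical placeholders, not values from the patent.
    """
    scores = phred_scores(qual)
    end = len(scores)
    # Remove the run of consecutive low-quality bases at the 3' end.
    while end > 0 and scores[end - 1] < min_base_q:
        end -= 1
    # Discard reads that are too short (or empty) after trimming.
    if end < max(min_len, 1):
        return None
    # Discard reads whose remaining bases are of low quality on average.
    if sum(scores[:end]) / end < min_mean_q:
        return None
    return seq[:end], qual[:end]


# A 40-base read whose last 5 bases have quality 2 ('#') is trimmed to 35 bases.
read_seq = "ACGT" * 10
read_qual = "I" * 35 + "#" * 5
print(trim_and_filter(read_seq, read_qual))
```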

(3) Fastq sequence alignment with reference genome

According to the consistency between the base sequences and the human reference genome hg19, the sequences are mapped to the reference genome, and a BAM file of comparison results is generated.

2. Integrating the variant results

(4) Somatic mutation detection using Mutect2 + Vcftools + VEP

The process of somatic mutation detection using Mutect2 in combination with Vcftools and VEP is a classical procedure. First, Mutect2 is used to compare the sample data with the reference genome data to identify potential mutation sites. Vcftools is then used to process and filter the VCF file to improve the reliability of the variation. Finally, the mutations are functionally annotated and interpreted with Variant Effect Predictor (VEP), which helps determine which mutations may be biologically significant.

(5) Somatic mutation detection using Manta + Strelka2 + VEP

Manta is used to detect structural variations such as insertions, deletions and inversions. Then Strelka2 was used to identify single nucleotide mutations and small indels. These mutations were then functionally annotated and interpreted with VEP to help determine their potential impact. Integration of these three tools allows for comprehensive capture of information about somatic mutations, thus effectively analyzing and understanding the variation in the genome.

3. Filtering variation (mutation)

(6) Integration of variants and scoring

① The mutation files (VCF format) from the two sources are read separately to extract the basic mutation information (chromosome position, base before mutation and base after mutation), giving two sets whose elements are the basic mutation information. The intersection and union of the two sets are then calculated, and finally every mutation reported by the two pieces of software is labeled with its source.

The labeling method is as follows: each mutation reported by the two pieces of software is assigned one of three sources, namely A if the mutation comes only from Mutect2, B if it comes only from Strelka2, and C if it is detected by both.
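The set operations and source labels described above can be sketched with Python sets. The function name label_sources is hypothetical; the variant keys are (chromosome, position, REF, ALT) tuples taken from the two call sets listed in the example of this description.

```python
def label_sources(mutect2_keys, strelka2_keys):
    """Label each variant in the union: A = Mutect2 only, B = Strelka2 only, C = both."""
    mutect2 = set(mutect2_keys)
    strelka2 = set(strelka2_keys)
    common = mutect2 & strelka2          # intersection: detected by both callers
    labels = {}
    for key in mutect2 | strelka2:       # union: every reported variant
        if key in common:
            labels[key] = "C"
        elif key in mutect2:
            labels[key] = "A"
        else:
            labels[key] = "B"
    return labels


mutect2_calls = [("12", 112888189, "G", "A"), ("16", 50827573, "C", "A"), ("X", 41201995, "T", "C")]
strelka2_calls = [("10", 871189, "T", "C"), ("12", 112888189, "G", "A"), ("X", 41201995, "T", "C")]

for key, source in sorted(label_sources(mutect2_calls, strelka2_calls).items()):
    print(key, source)
```

Keying on the full (chromosome, position, REF, ALT) tuple rather than on position alone keeps two different alternate alleles at the same site from being merged.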

② Collecting evidence items from BAM

First, according to the basic mutation information obtained in the previous step, the pandas module of Python is used to obtain the read records (i.e., alignments) corresponding to the mutation in the BAM file (note that all required evidence items are obtained from the BAM file). Each read record is then analyzed in turn to collect the evidence items required to evaluate the reliability of the mutation. Each evidence item is given a logically consistent bonus/penalty rule: supporting evidence scores positive and opposing evidence scores negative. Finally, the higher the score, the higher the reliability of the mutation, and the scores are divided into reasonable grades.

The evidence items comprise supporting evidence and opposing evidence, specifically as follows:

I. Supporting evidence:

Mutation frequency (AF): the allele fraction of a mutation in a tumor sample, indicating the relative abundance of the mutation. The higher the proportion of reads in which the mutation is detected, the higher the authenticity of the mutation. The score should be positive and follow a monotonically increasing function; the smaller the AF, the smaller the weight of the added score, i.e., the slope is proportional to AF;

Number of read-through read pairs: the more read pairs that sequence through the variation, the higher the authenticity of the variation. The score should be positive and follow a monotonically increasing function; the higher the number, the smaller the weight of the added score, i.e., the slope is inversely proportional to the number of read-through read pairs.

Allele depth (AD): the more reads in which the variation is detected, the higher the authenticity of the variation. The score should therefore be positive and follow a monotonically increasing function, and the weight of the added score should be small when AD is small (AD <= 8), i.e., the slope is proportional to AD.
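The text states only the shape of each supporting score, not its formula. One family of functions consistent with those shapes is sketched below: a slope proportional to x integrates to a quadratic in x, and a slope inversely proportional to n integrates to a logarithm of n. The function names, the saturation points and the maximum scores are all hypothetical and are not taken from the patent.

```python
import math


def convex_score(x, x_sat, max_score):
    """Score whose slope is proportional to x (quadratic growth), capped at max_score.

    Suits AF and AD as described: small values add little, larger values add more.
    x_sat is a hypothetical value at which the score saturates.
    """
    fraction = min(max(x, 0.0) / x_sat, 1.0)
    return max_score * fraction ** 2


def concave_score(n, n_sat, max_score):
    """Score whose slope is proportional to 1/n (logarithmic growth), capped at max_score.

    Suits the number of read-through read pairs: each extra pair adds less.
    n_sat (> 1) is a hypothetical count at which the score saturates.
    """
    if n < 1:
        return 0.0
    return max_score * min(math.log(n) / math.log(n_sat), 1.0)


# Hypothetical parameters, chosen only to show the shapes.
for ad in (2, 4, 8):
    print("AD", ad, convex_score(ad, x_sat=8, max_score=40))
for pairs in (10, 50, 100):
    print("pairs", pairs, round(concave_score(pairs, n_sat=100, max_score=45), 2))
```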

II. Opposing evidence:

Sequencing depth (DP): the sequencing depth covering this position. Higher coverage generally means more reliable mutation detection.

Comparison quality (high_MQ_reads): if the average MQ of the reads carrying the mutation is not less than 50% of the average MQ of the other reads, and the average MQ of the reads carrying the mutation is >10, no points are deducted for this item; otherwise, points are deducted;

Unique comparability (Uniq): whether the variation lies in a uniquely aligned region. If 80% of the reads have no XA tag, the tag is True;

Regional complexity (Complex): whether the variation lies within a complex sequence region (i.e., whether the sequence around the variation is a complex sequence).

Mutation source: if the same mutation is detected by both pieces of software, its authenticity is higher and the tag is True; otherwise, a penalty is applied.
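The high_MQ_reads and Uniq checks above can be sketched on simplified read records. The Read class is a hypothetical stand-in for the alignment records read from the BAM file. The sketch assumes that "80% of reads" means at least 80% of all reads covering the site, and that a site with no non-variant reads passes the ratio test; neither point is specified in the text.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Read:
    mapq: int               # mapping quality (MQ) of the alignment
    supports_variant: bool  # True if the read carries the variant allele
    has_xa: bool            # True if the alignment has an XA (alternative hits) tag


def high_mq_reads(reads):
    """True when variant reads have mean MQ > 10 and >= 50% of the mean MQ of other reads."""
    variant_mq = [r.mapq for r in reads if r.supports_variant]
    other_mq = [r.mapq for r in reads if not r.supports_variant]
    if not variant_mq:
        return False
    variant_mean = mean(variant_mq)
    if variant_mean <= 10:
        return False
    if not other_mq:  # assumption: nothing to compare against, so the ratio test passes
        return True
    return variant_mean >= 0.5 * mean(other_mq)


def uniq(reads, min_fraction=0.8):
    """True when at least min_fraction of the reads carry no XA tag (assumed reading)."""
    if not reads:
        return False
    without_xa = sum(1 for r in reads if not r.has_xa)
    return without_xa / len(reads) >= min_fraction


site_reads = [Read(60, True, False)] * 8 + [Read(60, False, True)] * 2
print(high_mq_reads(site_reads), uniq(site_reads))
```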

③ The evidence items mentioned in the steps above are scored; the scoring details are as follows:

④ After the evidence items mentioned above have been scored in turn, the number of the four items high_MQ_reads, Uniq, Complex and mutation source that are not satisfied is determined, and an additional score is applied:

Evidence items	Additional scoring
None of the four items is satisfied	Additional penalty of 30
Any three items are not satisfied	Additional penalty of 20
Any two items are not satisfied	Additional penalty of 10
None of the four items is penalized	Additional sub-10
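The three penalty rows of the table above can be sketched as a small function. The name additional_score is hypothetical. The last row of the table is ambiguous in the text and is not modeled; the sketch returns 0 when fewer than two items fail, which matches the totals shown in the example of this description.

```python
def additional_score(high_mq_ok, uniq_ok, complex_ok, source_ok):
    """Return the additional penalty from the number of unsatisfied items.

    Each argument is True when the corresponding item is satisfied
    (i.e., no points were deducted for it).
    """
    failed = [high_mq_ok, uniq_ok, complex_ok, source_ok].count(False)
    if failed == 4:
        return -30
    if failed == 3:
        return -20
    if failed == 2:
        return -10
    return 0  # assumption: zero or one failed item adds no additional score


# chr10:871189 in the example fails Uniq and mutation source (two items).
print(additional_score(high_mq_ok=True, uniq_ok=False, complex_ok=True, source_ok=False))
```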

⑤ Mutations are classified into five grades according to score:

Grade E: below 0, false variation;

Grade D: 0-40, possibly false variation;

Grade C: 40-80, variation of uncertain authenticity;

Grade B: 80-120, possibly true variation;

Grade A: 120 or more, credible variation;
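The grading can be sketched as a threshold lookup. The function name grade is hypothetical. Two assumptions are made: the Grade C band is taken as the 40-80 gap between Grades D and B, and each band includes its lower bound and excludes its upper bound, since the text does not say how boundary scores are assigned.

```python
def grade(total_score):
    """Map a total score to a grade from A (credible) to E (false)."""
    if total_score < 0:
        return "E"
    if total_score < 40:
        return "D"
    if total_score < 80:
        return "C"
    if total_score < 120:
        return "B"
    return "A"


# Total scores taken from the example of this description.
for score in (65, 105, 135):
    print(score, grade(score))
```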

(7) Annotating local frequencies from a local frequency library

Construction of a local frequency library:

In the above steps, for each sample we have recorded the position of every mutation occurring in it (chromosome number and base position), together with the base before mutation (REF) and the base after mutation (ALT) at that position. For a large cohort (e.g., 500 samples), we can then collect all mutations across all samples and count, for each mutation, the number of samples in which it occurs (e.g., 250). The local frequency of that mutation in the cohort's local frequency library is then 250/500 = 0.5. That is, for a fixed local frequency library, each mutation corresponds to one local frequency.
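The construction of the local frequency library can be sketched as a per-variant sample count divided by the cohort size. The function name build_local_frequency and the toy cohort are hypothetical; each sample is represented by its collection of (chromosome, position, REF, ALT) keys.

```python
from collections import Counter


def build_local_frequency(samples):
    """Return {variant key: fraction of samples in which the variant occurs}.

    samples: an iterable with one collection of variant keys per sample.
    """
    per_sample = [set(keys) for keys in samples]  # count each variant once per sample
    total = len(per_sample)
    if total == 0:
        return {}
    counts = Counter(key for keys in per_sample for key in keys)
    return {key: count / total for key, count in counts.items()}


# Toy cohort of 4 samples; the first variant occurs in 2 of them.
cohort = [
    [("16", 50827573, "C", "A")],
    [("16", 50827573, "C", "A"), ("12", 112888189, "G", "A")],
    [],
    [],
]
print(build_local_frequency(cohort))
```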

(8) Filtering variations

① Filtering using a local frequency library

The local frequency library records, for each mutation, the number of local detections as a fraction of the total number of analyses. If the local frequency is greater than 20% and the mutation is not in the common whitelist or blacklist, the mutation is considered a false mutation introduced by the reaction system and is labeled LocalDB.

② Filtering according to scoring

Variations graded A or B are retained; variations graded C, D or E are filtered out as low-confidence variations and marked LowQ.

③ Filtering according to mutation function

Variations that have no actual influence on function are removed and marked DontAffectFunc;

④ Checking the BAM file to verify the authenticity of the variation
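Filters ① to ③ above can be sketched as a tagging function. The function name filter_tags and its arguments are hypothetical. Two assumptions are made: a variant that fails no filter is tagged PASS, as in the example output of this description, and a variant that fails several filters carries all the corresponding tags, since the text does not state a precedence. The BAM check of ④ is a review step and is not modeled.

```python
def filter_tags(grade, local_freq, affects_function, in_known_list=False,
                max_local_freq=0.20):
    """Return the list of filter tags for one variant, or ["PASS"] if none apply.

    grade:            score grade, "A" to "E"
    local_freq:       frequency of the variant in the local frequency library
    affects_function: False if the variant has no actual functional influence
    in_known_list:    True if the variant is in the common whitelist or blacklist
    """
    tags = []
    if local_freq > max_local_freq and not in_known_list:
        tags.append("LocalDB")
    if grade in {"C", "D", "E"}:
        tags.append("LowQ")
    if not affects_function:
        tags.append("DontAffectFunc")
    return tags or ["PASS"]


# Hypothetical inputs; only the grades are taken from the example.
print(filter_tags("A", local_freq=0.01, affects_function=True))
print(filter_tags("B", local_freq=0.50, affects_function=True))
print(filter_tags("C", local_freq=0.00, affects_function=True))
```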

4. Outputting the result

The untrustworthy mutations marked in step (8) are filtered out, and the remaining trustworthy mutations are reported and assessed. Mutations are interpreted according to how they are reported in the literature; the interpretation includes, but is not limited to, providing reference for disease diagnosis, prognosis, recurrence and treatment.

Examples

After receiving DNA sequencing data from a pediatric tumor patient, the following operations are performed:

(1) Genome sequencing data pretreatment

The raw data statistics are obtained as follows:

Sample	Raw_reads	Raw_bases
Test1	146,875,740	22,178,236,740

(2) Raw data processing

After quality control, a high-quality sequence is obtained, and the data statistics are as follows:

Sample	Clean_reads	Clean_bases	Q20(%)	Q30(%)	Clean_bases(%)
Test1	144,485,354	20,945,984,384	98.17	94.91	94.44

(3) fastq to reference genome alignment

The alignment of the sequence data with the human reference genome hg19 is as follows:

(4) Somatic mutation detection using Mutect2 + Vcftools + VEP

CHR	POS	REF	ALT	AD	AF	Read-through read pairs	DP
12	112888189	G	A	103	0.373	176	276
16	50827573	C	A	87	0.026	103	193
X	41201995	T	C	32	0.1	221	321

(5) Somatic mutation detection using Manta + Strelka2 + VEP

CHR	POS	REF	ALT	AD	AF	Read-through read pairs	DP
10	871189	T	C	56	0.172	226	326
12	112888189	G	A	103	0.373	176	276
X	41201995	T	C	32	0.1	221	321

(6) Integration of variants and scoring

CHR	10	12	16	X
POS	871189	112888189	50827573	41201995
REF	T	G	C	T
ALT	C	A	A	C
AD	40	40	40	40
AF	50	50	50	50
Read-through read pairs	45	45	45	45
DP	0	0	0	0
Source of variation	-30(B)	0(C)	-30(A)	0(C)
High_MQ_reads	0	0	0	-30
Uniq	-30	0	0	-30
Complex	0	0	0	0
Additional scoring	-10	0	0	-10
Total score	65	135	105	65
Grade	C	A	B	C

(7) Filtering mutations

CHROM	POS	REF	ALT	Total score	Grade	Variant tag
10	871189	T	C	65	C	LowQ
12	112888189	G	A	135	A	PASS
16	50827573	C	A	105	B	LocalDB
X	41201995	T	C	65	C	LowQ

(8) Outputting the result

CHROM	POS	REF	ALT	Total score	Grade	Variant tag
12	112888189	G	A	135	A	PASS

Although the present disclosure is disclosed above, the scope of the present disclosure is not limited thereto. Various changes and modifications may be made by one skilled in the art without departing from the spirit and scope of the disclosure, and these changes and modifications will fall within the scope of the disclosure.

Claims

1. A high-throughput genome sequencing variation detection system based on low input starting amount, characterized in that it includes:

Data preprocessing module, used to align sequencing data to reference genes to obtain alignment results;

A variation detection module, comprising a first detection module and a second detection module, for obtaining variation data based on the comparison results respectively; the first detection module is used to obtain first variation data including mutation sites, and the second detection module is used to detect structural variation and identify second variation data including single nucleotide mutations and insertion and deletion results;

The variation integration and filtering module is used to integrate and filter the variation data; in the integration process, the union of the first variation data and the second variation data is taken to obtain the third variation data, and each variation in the third variation data is weighted and scored according to the contribution of the variation information parameter to the variation credibility; the filtering process includes filtering the variation with low score;

The detection output module is used to output the final filtered variant data;

The variation information parameters are divided into supporting parameters and opposing parameters according to their contribution to the variation, and are respectively divided into positive and negative; the supporting parameters include mutation depth, mutation frequency, and number of reads; the opposing parameters include sequencing depth, variation source, alignment quality, unique alignment, and regional complexity; the variation source refers to the situation where the variation data is derived from the first and second variation data.

2. The detection system according to claim 1 is characterized in that the first detection module uses Mutect2 and Vcftools to identify and process mutation sites; and/or the second detection module uses Manta and Strelka2 to detect and identify single nucleotide mutations and insertions and deletions.

3. The detection system according to claim 1 is characterized in that the weighted scoring also includes additional overall scoring for each penalty situation of variation source, alignment quality, unique alignment, and regional complexity.

4. The detection system according to claim 1 is characterized in that the sequencing depth of the sequencing data is greater than 200X, the mutation frequency is less than 10%; and/or the starting amount of DNA input during sequencing is less than 100 ng.

5. The detection system according to claim 1 is characterized in that low-quality data is removed from the sequencing data before alignment.

6. The detection system according to claim 1 is characterized in that the filtering process also includes: using a local frequency library for filtering; and/or filtering variants that have no actual impact; and/or checking bam to filter based on the authenticity of the variant.

7. A method for detecting variation based on high-throughput genome sequencing with low input starting amount for non-diagnostic purposes, characterized in that it includes:

Align the sequencing data to the reference gene to obtain the alignment results;

Obtaining variation data based on the comparison results, including obtaining first variation data including the mutation site, and identifying second variation data including single nucleotide mutations and insertion/deletion results by detecting structural variation;

Integrate and filter the variant data; in the integration process, take the union of the first variant data and the second variant data to obtain the third variant data, and weight and score each variant in the third variant data according to the contribution of the variant information parameter to the variant credibility; the filtering process includes filtering variants with low scores;

Output the final filtered variant data;

8. An electronic device, comprising a processor and a memory, wherein the memory stores a computer program, and the processor implements the method according to claim 7 when executing the computer program.

9. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and a processor executes the computer program to implement the method according to claim 7.