CN119673284A

CN119673284A - Third-generation sequencing read analysis methods, applications and devices

Info

Publication number: CN119673284A
Application number: CN202311228907.9A
Authority: CN
Inventors: 韦锟
Original assignee: Wuhan Huada Gene Technology Service Co ltd; BGI Technology Solutions Co Ltd
Current assignee: Wuhan Huada Gene Technology Service Co ltd; BGI Technology Solutions Co Ltd
Priority date: 2023-09-21
Filing date: 2023-09-21
Publication date: 2025-03-21

Abstract

The application provides a third-generation sequencing read analysis method, its application, and a device. The method comprises: obtaining a read, wherein the read is obtained by sequencing a sequencing library on a third-generation sequencing platform, and molecules of the sequencing library comprise a sequencing primer recognition sequence, a tag sequence and a UMI sequence; determining an identity tag region of the read based on the sequencing primer recognition sequence in the read; aligning the identity tag region with at least one standard tag sequence and determining the tag sequence of the read in the identity tag region based on the alignment result; and determining the UMI sequence of the read in the identity tag region based on the tag sequence. The method takes into account the high insertion and deletion error rate of third-generation full-length transcriptome sequencing results, improves the utilization of third-generation sequencing data through sequence alignment, and improves the accuracy of transcript quantification through K-mer-based UMI sequence correction.

Description

Third-generation sequencing read analysis method, application and device

Technical Field

The application relates to the field of biotechnology, in particular to the field of gene sequencing, and more particularly to a third-generation sequencing read analysis method, a single-cell sequencing method, a third-generation sequencing read analysis device, a single-cell sequencing system, a computer program product, a server and a computer-readable storage medium.

Background

Single-cell transcriptome sequencing (single-cell RNA sequencing, abbreviated as scRNA-seq) has been widely used to analyze cell type heterogeneity in various biological tissues. Traditional single-cell transcriptome sequencing based on second-generation sequencing technology mainly focuses on differences in expression between cell types at the gene level, and cannot accurately measure differences in expression at the transcript level. Third-generation sequencing can overcome this limitation and allows high-throughput analysis of full-length transcript information from single cells. By combining existing single-cell technology with a third-generation sequencing platform, full-length transcriptome sequencing of single cells can be achieved at high throughput.

However, the high frequency of indel errors in existing third generation sequencing data presents challenges for analysis of third generation full length single cell transcriptome sequencing data.

Therefore, there is still a need for improved methods for analysis of third generation full-length single-cell transcriptome sequencing data.

Disclosure of Invention

The present application has been completed based on the following findings:

Current analysis methods for third-generation full-length single-cell transcriptome sequencing data have some limitations, mainly in how they handle cell barcode (Cell barcode) sequences and Unique Molecular Identifier (UMI) sequences in third-generation sequencing data. Specifically:

1) In existing data processing, fixed position information is typically used to locate the Cell barcode and UMI sequences in sequencing reads. However, insertion and deletion errors are common in third-generation sequencing data, so neither the length nor the position of the Cell barcode and UMI sequences in a sequencing read is fixed.

2) To determine the relationship between a Cell barcode in the sequencing data and the standard tag sequences (a pre-established Cell barcode whitelist), a fixed distance is often used as the decision criterion. However, because of insertion and deletion errors in third-generation sequencing data, the distance between a Cell barcode in the sequencing data and its corresponding standard tag sequence may exceed the preset fixed distance.

3) The criterion used to decide whether to merge sequencing reads carrying the same UMI sequence is also typically a fixed distance. However, again because of insertion and deletion errors in third-generation sequencing data, the distance between observed copies of the same UMI sequence may exceed the preset fixed distance.

The present application aims to solve at least one of the technical problems existing in the prior art. To this end, an object of the present application is to propose a means that enables an efficient analysis of third generation sequencing reads.

In a first aspect of the application, the application provides a third-generation sequencing read analysis method. According to an embodiment of the application, the method comprises: obtaining a read, wherein the read is obtained by sequencing a sequencing library on a third-generation sequencing platform, and molecules of the sequencing library comprise a sequencing primer recognition sequence, a tag sequence and a UMI sequence; determining an identity tag region of the read based on the sequencing primer recognition sequence in the read; aligning the identity tag region with at least one standard tag sequence and determining the tag sequence of the read in the identity tag region based on the alignment result; and determining the UMI sequence of the read in the identity tag region based on the tag sequence.

According to the embodiment of the application, the method increases the utilization rate of sequencing data and can obviously improve the accuracy of single-cell transcript expression quantity identification.

In a second aspect of the application, the application provides a single cell sequencing method. According to an embodiment of the application, the method comprises obtaining a plurality of single cells, constructing a sequencing library based on the plurality of single cells, wherein the sequencing library molecules comprise a sequencing primer recognition sequence, a tag sequence comprising at least one of a first tag for distinguishing single cells and a second tag for distinguishing sample sources, and a UMI sequence for distinguishing inserts of the sequencing library molecules, sequencing the sequencing library, and analyzing using the method of the first aspect to determine read sequences of the single cells.

According to the embodiment of the application, the problem of inaccurate sequencing result analysis caused by high-frequency indel errors in the sequencing result can be solved by utilizing the method to perform single-cell sequencing, so as to obtain a more accurate single-cell sequencing data analysis result.

In a third aspect of the application, the application provides a third-generation sequencing read analysis device. According to an embodiment of the application, the device comprises a sequencing unit, an identity tag region determining unit, an aligning unit and a UMI sequence determining unit. The sequencing unit is used for acquiring reads, wherein the reads are obtained by sequencing a sequencing library on a third-generation sequencing platform, and molecules of the sequencing library comprise a sequencing primer recognition sequence, a tag sequence and a UMI sequence. The identity tag region determining unit is used for determining an identity tag region of the reads based on the sequencing primer recognition sequence in the reads. The aligning unit is used for aligning the identity tag region with at least one standard tag sequence and determining the tag sequence of the reads in the identity tag region based on the alignment result. The UMI sequence determining unit is used for determining the UMI sequence of the reads in the identity tag region based on the tag sequence.

According to the embodiment of the application, the device is used for implementing the third-generation sequencing read analysis method of any embodiment, and significantly improves the analysis accuracy of third-generation sequencing reads. The device can analyze sequencing reads automatically, which improves the efficiency of read analysis.

In a fourth aspect of the application, the application provides a single cell sequencing system. According to an embodiment of the application, the system comprises a sequencing library construction module for obtaining a plurality of single cells, constructing a sequencing library based on the plurality of single cells, wherein the sequencing library molecules comprise a sequencing primer recognition sequence, a tag sequence comprising at least one of a first tag for distinguishing single cells and a second tag for distinguishing sample sources, and a UMI sequence for distinguishing inserts of the sequencing library molecules, and a sequencing module for sequencing the sequencing library and analyzing using the apparatus of the third aspect of the application to determine read sequences of the single cells.

According to an embodiment of the application, the system is used to perform the single cell sequencing method of any one of the embodiments, to achieve efficient and accurate analysis.

In a fifth aspect of the application, the application proposes a computer program product. According to an embodiment of the application, the computer program product comprises computer instructions which, when run on a computer, implement the method according to the first or second aspect of the application.

In a sixth aspect of the application, the application provides an electronic device. According to an embodiment of the application, the electronic device comprises at least one processor and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the methods of the first and second aspects of the application.

In a seventh aspect of the present application, the present application proposes a computer-readable storage medium containing a computer program. According to an embodiment of the present application, the computer program, when executed by one or more processors, implements the methods of the first and second aspects of the present application.

It should be noted that, in the present application, the features and advantages described for the methods, devices and systems in any one of the above aspects, implementations or examples apply equally to the other aspects, and are not repeated here.

Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Drawings

The foregoing and/or additional aspects and advantages of the invention will become apparent and may be better understood from the following description of embodiments taken in conjunction with the accompanying drawings in which:

FIG. 1 is a flow chart of a third generation sequencing read analysis method according to one embodiment of the present application;

FIG. 2 is a schematic diagram of a third generation sequencing read analysis device in accordance with one embodiment of the present application;

FIG. 3 is a schematic diagram of a third generation sequencing read analysis device in accordance with one embodiment of the present application;

FIG. 4 is a schematic diagram of a third generation sequencing read analysis device (with calibration unit) according to another embodiment of the present application;

FIG. 5 is a schematic diagram of a third generation sequencing read analysis device (with calibration unit) according to another embodiment of the present application;

FIG. 6 is a schematic diagram of an electronic device according to one embodiment of the application;

FIG. 7 is a diagram verifying the effect of UMI correction according to an embodiment of the present application;

FIG. 8 is a schematic diagram of the positioning step of the third-generation sequencing read analysis method according to one embodiment of the present application.

Detailed Description

Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.

Definitions

As used herein, the singular forms "a," "an," and "the" include plural referents (one or more). "Set" or "plurality" refers to two or more.

In this document, "comprising" or "including" is an open-ended expression, including what is indicated or exemplified hereafter, and also includes what is applicable or consistent with the stated situation, but not specifically recited.

Herein, "sequencing" refers to determining the order of bases in the primary structure of a nucleic acid molecule. Second-generation sequencing can be achieved using sequencing by synthesis (SBS) and/or sequencing by ligation (SBL) principles, and third-generation sequencing can be achieved using the single-molecule real-time (SMRT) principle. Sequencing may include DNA sequencing and/or RNA sequencing, and may include long-fragment sequencing and/or short-fragment sequencing. Long and short fragments are relative terms: for example, nucleic acid molecules longer than 1 kb, 2 kb, 5 kb or 10 kb may be referred to as long fragments, and those shorter than 1 kb or 800 bp may be referred to as short fragments. Sequencing may also include double-ended sequencing, single-ended sequencing and/or paired-end sequencing, where double-ended or paired-end sequencing may refer to reading out any two or more portions of the same nucleic acid molecule that do not completely overlap.

Sequencing may be performed on a sequencing platform. According to embodiments of the present application, available sequencing platforms include, but are not limited to, the Revio platform of PacBio and the PromethION platform of Oxford Nanopore Technologies. The sequencing mode may be single-ended sequencing, double-ended sequencing, or any other mode supported by the selected automated sequencing platform.

The sequence read out by sequencing is called a sequencing sequence, also called a read, and its length is called the read length. "Sequencing read" is used interchangeably with "read" and refers to a nucleic acid sequence obtained by sequencing. Third-generation sequencing reads typically have a read length of several thousand to tens of thousands of bp.

As used herein, "sequencing library" refers to a collection of DNA or RNA fragments to be sequenced that has been prepared in a particular form for a sequencing study of, for example, a genome, transcriptome or proteome. Molecules of a sequencing library typically contain several sequences, including a linker sequence, a sequencing primer recognition sequence, a cell tag sequence, a sample tag sequence, a UMI sequence, a cDNA insert sequence, and the like.

In this context, the terms "read identity tag region" and "cell barcode" have the same meaning and denote a sequence used to identify and distinguish RNA or DNA molecules from different cells. In single-cell transcriptome sequencing, the cell barcode is typically located at the 5' end or 3' end of the RNA molecule to be tested. Each cell has a unique cell barcode sequence; by labeling all RNA molecules from the same cell with the same cell barcode, RNA molecules from different cells can be distinguished. In this way, transcriptome data from different cells can be separated during sequencing analysis, allowing transcriptome analysis at the single-cell level.

Herein, "standard tag sequence" is synonymous with "Cell barcode whitelist sequence" and means one or more known or predefined Cell barcode sequences, that is, unique identification sequences used to identify different cells. Using a predefined whitelist ensures that only predetermined cell identities are identified and separated, which reduces the risk of false identification and improves the quality and accuracy of the sequencing data.

Herein, "tag sequence" means the Cell barcode sequence of any sequencing read in the sequencing data.

Herein, a "UMI (Unique Molecular Identifier)" sequence is a molecular identifier used in single-cell transcriptome sequencing and other high-throughput sequencing experiments. A UMI sequence is a unique short DNA sequence attached to the end of each RNA molecule and used to identify the origin of that molecule. UMIs are used to eliminate bias and errors introduced by PCR amplification, thereby improving the accuracy and reliability of sequencing data.

The reference sequence (ref) and the reference chromosome sequence are predetermined sequences. They may be DNA and/or RNA sequences determined and assembled in advance by the user, or DNA and/or RNA sequences determined and published by others, and may be any reference template, obtained in advance, for the biological class to which the sample source individual or target individual belongs, for example all or at least part of a published genome assembly for that biological class. If the sample source or target individual is a human, the genomic reference sequence (also referred to as a reference genome or reference chromosome) may be selected from human reference genomes provided by the UCSC, NCBI or ENSEMBL databases, such as hg19, hg38, GRCh36, GRCh37 or GRCh38. Those skilled in the art can learn the correspondence between reference genome versions from the database descriptions and select the version to use. Furthermore, a resource library containing more reference sequences can be configured in advance. For example, before alignment, a sequence that is closer to the target individual or has a certain characteristic can be selected or assembled according to factors such as the sex, race and region of the target individual and used as the reference sequence, so that more accurate sequence analysis results can be obtained later.

The reference sequence can be constructed when the target sample is tested, or can be constructed and stored in advance and retrieved when the sample is tested. In certain embodiments, the test sample is from a human, and the reference sequence is a human reference genome or the set of human autosomes.

Herein, the "adjacency matrix" is a matrix structure for representing the relationships between the nodes of a graph. In graph theory, a graph is a data structure made up of a set of nodes and the edges connecting them. The adjacency matrix provides a convenient way to represent the connections between nodes in the graph. In the adjacency matrix, the rows and columns each represent the nodes of the graph, and the elements of the matrix represent the connections between nodes. If there is a connection between node i and node j, the element in row i, column j of the adjacency matrix is set to 1 (or another non-zero value, as the case may be). If there is no connection between node i and node j, the corresponding matrix element is typically 0.

Prior Art

PacBio and Oxford Nanopore Technologies have proposed some schemes for the analysis of third-generation full-length single-cell transcriptome sequencing data. These schemes work as follows:

1) A fixed-position strategy is used to determine the exact positions of the cell barcode (Cell barcode) sequence and the UMI (Unique Molecular Identifier) sequence in sequencing reads;

2) In determining the relationship between the cellular barcode sequence in the sequencing data and the predefined cellular barcode whitelist sequence, a specific fixed distance is employed as a criterion;

3) Based on the fixed distance, it is determined when to merge adjacent UMI sequences for further processing of the data.

First, the location of the cellular barcode sequence and UMI sequence in the sequencing reads is determined based on the fixed location information. This rigid localization strategy may lead to misjudgments in the face of sequencing errors, indels, possible sequence variations, etc., thus affecting the correct extraction of cell identity and UMI.

Second, taking a fixed distance as a criterion for determining the relationship between the cell barcode sequence and the predetermined whitelist sequence may not adequately cope with sample-specific differences, especially where there is a shift or variation between cell barcodes. This determination method may lead to erroneous cell identification, thereby affecting subsequent cell classification and expression level analysis.

Finally, using a fixed distance to decide whether to merge adjacent UMI sequences may fail to capture the true diversity of transcripts, especially in genes rich in splice isoforms. This approach may lead to underestimation or overestimation of transcript expression levels, which hinders an accurate understanding of the dynamics of gene expression.

Third-generation sequencing read analysis method

The technical scheme of the present application will be described in detail with reference to the accompanying drawings.

In a first aspect of the application, the application proposes a third-generation sequencing read analysis method. Referring to FIG. 1, according to an embodiment of the application, the method comprises:

S101, determining the identity tag region of the read based on the sequencing result

A sequencing library is sequenced, and a plurality of sequencing reads are obtained from the sequencing results. According to an embodiment of the application, the sequencing library molecules comprise a plurality of sequencing primer recognition sequences, and may also comprise a plurality of tag sequences. The plurality of sequencing reads are obtained on a third-generation sequencing platform; the manner of obtaining them is not particularly limited.

The identity tag region (cell barcode) in the read is obtained by locating the sequencing primer recognition sequence in a plurality of sequencing reads. According to specific embodiments of the application, the sequencing primer recognition sequence includes, but is not limited to, a P3 primer. The Cell barcode sequence region in the sequencing reads is located based on the sequencing primer recognition sequence.

According to a specific embodiment of the present application, as shown in FIG. 8, the P3 primer sequence (green in FIG. 8) adjacent to the Cell barcode sequence is located in each third-generation sequencing read by a sequence similarity alignment method, and the m bases upstream and the n bases downstream of the P3 primer in each read, m+n bases in total, are taken as the Cell barcode sequence recognition region. Here m is typically 10, and n is greater than the length of the Cell barcode (blue in FIG. 8) but typically no more than 1.5 times that length. The m+n base sequence can thus completely contain the Cell barcode sequence while containing little non-Cell-barcode sequence, which reduces interference from non-Cell-barcode sequence in subsequent analysis.

The sequence similarity alignment method includes, but is not limited to, any existing or future algorithm or software that can solve the pairwise sequence similarity alignment problem, such as the Smith-Waterman algorithm, BLAST, BWA or the Needleman-Wunsch algorithm.

The sequence of the P3 amplification primer is known and is provided by the selected single-cell experimental technology platform. The single-cell experimental technology platform determines the sequence structure of the sequencing library, and under that structure the fixed P3 amplification primer sequence is directly adjacent to the Cell barcode sequence. Locating the P3 amplification primer sequence in a third-generation sequencing read therefore locates the Cell barcode region.
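As an illustration only, the positioning step above could be sketched as follows. The sketch assumes a semi-global edit-distance search as the sequence similarity alignment method, and assumes that the recognition region consists of the m bases before and the n bases after the end of the primer hit. The function names, the `max_edits` tolerance and the choice n = 1.5 x barcode length are hypothetical and are not specified by the application.

```python
def find_primer_end(read, primer, max_edits=3):
    """Return (end_index, edits) of the best approximate primer hit, or None."""
    # Row 0 is all zeros so that a hit may start anywhere in the read.
    prev = [0] * (len(read) + 1)
    for i, p in enumerate(primer, start=1):
        cur = [i]
        for j, r in enumerate(read, start=1):
            cur.append(min(prev[j - 1] + (p != r),  # match or substitution
                           prev[j] + 1,             # primer base missing from read
                           cur[j - 1] + 1))         # extra base in the read
        prev = cur
    # Earliest read position at which the whole primer ends with fewest edits.
    end = min(range(len(prev)), key=prev.__getitem__)
    return (end, prev[end]) if prev[end] <= max_edits else None


def barcode_region(read, primer, barcode_len, m=10, max_edits=3):
    """Cut the m + n base window expected to contain the Cell barcode."""
    hit = find_primer_end(read, primer, max_edits)
    if hit is None:
        return None  # primer not found within tolerance
    end, _ = hit
    n = int(1.5 * barcode_len)  # assumed upper bound for n
    return read[max(0, end - m):end + n]
```

Because the window is anchored on an alignment rather than on a fixed offset, a deletion or insertion inside the primer shifts the window together with the barcode.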

S102, comparing the identity tag region with at least one standard tag sequence to determine the tag sequence of the read

The identity tag region of the read determined in step S101 is aligned with at least one standard tag sequence (Cell barcode whitelist sequence) by a sequence similarity alignment method, and the tag sequence of the read in the identity tag region is determined based on the alignment result.

The sequence similarity alignment method includes, but is not limited to, any existing or future algorithm or software that can solve the pairwise sequence similarity alignment problem.

The Cell barcode whitelist sequences come from the known list, provided by the selected single-cell experimental technology platform, of all Cell barcode sequences that may occur in the sequencing data.

According to an embodiment of the application, the identity tag region is aligned with at least one standard tag sequence to obtain a plurality of candidate tag sequences, the similarity between the identity tag region and each candidate tag is determined, and the candidate tag with the highest similarity is selected as the tag sequence of the read.

The similarity is determined from the edit distance, the similarity score, or the similarity P value. These similarity evaluation indexes are determined by the selected sequence similarity alignment method. In some embodiments, the edit distance is preferred as the evaluation index.

According to an embodiment of the application, the candidate tag having the highest similarity and the similarity exceeding a predetermined threshold is further selected as the tag sequence of the read.

According to an embodiment of the application, the comparison result is filtered based on a predetermined threshold. When the edit distance is used as an evaluation index, the predetermined threshold is not more than 3, preferably not more than 2. In some examples, the predetermined threshold is no more than 5% of the corresponding sequence length.

According to an embodiment of the present application, when the similarity P value is used as the evaluation index, the predetermined threshold is P <0.05.

According to embodiments of the present application, the threshold value of the similarity score is generally related to the algorithm of the similarity score and the length of the sequence. In some examples, the predetermined threshold is greater than 80% of the similarity score obtained when two sequences of corresponding lengths are perfectly matched.

According to an embodiment of the application, when a plurality of candidate tags share the highest similarity, the method further comprises determining the frequency of occurrence of each candidate tag in the same sequencing batch, and selecting the candidate tag with the highest frequency of occurrence as the tag sequence of the read.

The frequency of occurrence refers to the number of times a candidate tag appears in the sequencing data of the same batch.

Methods for determining the frequency of occurrence of each candidate tag in the current sequencing run include, but are not limited to, a counting method, a frequency calculation method, or a frequency comparison method. In the counting method, the entire sequencing dataset is traversed for each candidate tag sequence and its occurrences in all reads are counted. In the frequency calculation method, the number of occurrences of the candidate tag sequence in the whole dataset is divided by the total number of reads to obtain the frequency of occurrence of the candidate tag. In the frequency comparison method, the frequencies of occurrence of all candidate tags are compared. The higher the frequency of occurrence of a candidate tag, the more often it is present in the sequencing data and the more reliable it is considered.
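The selection rule of step S102 can be sketched as follows, purely for illustration. The sketch assumes that edit distance is the evaluation index, that the threshold is 2, and that a table of per-batch occurrence counts is already available. `best_hit_distance`, `pick_barcode`, `batch_counts` and `max_edits` are hypothetical names that do not appear in the application.

```python
def best_hit_distance(region, barcode):
    """Smallest edit distance between barcode and any substring of region."""
    prev = [0] * (len(region) + 1)  # free start anywhere in the region
    for i, b in enumerate(barcode, start=1):
        cur = [i]
        for j, r in enumerate(region, start=1):
            cur.append(min(prev[j - 1] + (b != r), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return min(prev)  # free end anywhere in the region


def pick_barcode(region, whitelist, batch_counts, max_edits=2):
    """Return the whitelist barcode assigned to this region, or None."""
    if not whitelist:
        return None
    scored = [(best_hit_distance(region, bc), bc) for bc in whitelist]
    best = min(d for d, _ in scored)
    if best > max_edits:
        return None  # no candidate is similar enough
    tied = [bc for d, bc in scored if d == best]
    # Several equally similar candidates: prefer the one seen most often
    # in the same sequencing batch.
    return max(tied, key=lambda bc: batch_counts.get(bc, 0))
```

Scoring every whitelist entry is quadratic in practice; it is kept here only because it makes the threshold and the tie-break easy to read.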

S103, determining UMI sequence of read based on tag sequence

Based on the tag sequences determined in step S102, the Cell barcode alignment results of all third-generation sequencing reads are checked. If a read fails to match any whitelist Cell barcode sequence, it cannot be assigned a Cell barcode and should be removed. If a read matches exactly one whitelist Cell barcode sequence, that sequence is assigned as the Cell barcode of the read. If a read matches a plurality of whitelist Cell barcode sequences, the matching Cell barcode with the highest frequency of occurrence is selected as the Cell barcode of the read.

It should be noted that each time a Cell barcode is assigned to a third-generation sequencing read, the frequency of occurrence of that Cell barcode in the actual Cell barcode background distribution is updated. That is, when a third-generation sequencing read is aligned and matches one or more sequences in the Cell barcode whitelist, it is assigned a particular Cell barcode, and the frequency of occurrence of that Cell barcode is then updated.
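The three assignment cases and the running frequency update can be sketched as below. This is a minimal sketch under the assumption that each read arrives with the list of whitelist barcodes it matched; the function name `assign_barcodes` and the input layout are hypothetical.

```python
from collections import Counter


def assign_barcodes(matches_per_read):
    """Assign one Cell barcode per read and keep a running background count.

    matches_per_read: list of lists, one list of matched whitelist barcodes
    per read (hypothetical input layout).
    Returns (assignments, counts); unassignable reads get None.
    """
    counts = Counter()  # running background distribution of barcodes
    assignments = []
    for matches in matches_per_read:
        if not matches:
            assignments.append(None)  # no match: the read is removed
            continue
        if len(matches) == 1:
            chosen = matches[0]  # unique match
        else:
            # Multiple matches: take the barcode seen most often so far.
            # On equal counts, max() keeps the first listed barcode.
            chosen = max(matches, key=lambda bc: counts[bc])
        counts[chosen] += 1  # update the frequency after every assignment
        assignments.append(chosen)
    return assignments, counts
```

Because the counts are updated as reads are processed, the result for ambiguous reads depends on read order. Processing uniquely matched reads first is one way to stabilise the background distribution, although the application does not state which order is used.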

The term "actual Cell barcode background distribution" refers to a distribution of Cell barcode sequences actually existing in a sample.

According to an embodiment of the application, the Cell barcode whitelist is the known list, provided by the selected single-cell experimental technology platform, of all Cell barcode sequences that may occur in the sequencing data.

UMI sequences are split out of the third-generation sequencing reads based on the obtained Cell barcode sequence region and its alignment with the Cell barcode whitelist sequence. The single-cell transcriptome experimental technology platform determines that the Cell barcode sequence is arranged adjacent to the UMI sequence. The UMI sequence extends from the end of the Cell barcode sequence to the position where the PolyA sequence first appears in the third-generation sequencing read.
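A minimal sketch of this splitting rule follows. It assumes that the end position of the Cell barcode in the read is already known from the alignment, and that a PolyA sequence can be recognised as a run of at least `min_poly_a` consecutive A bases. Both the function name and the run-length cutoff of 8 are hypothetical; a real PolyA tail with indel errors may need an approximate search instead.

```python
import re


def extract_umi(read, barcode_end, min_poly_a=8):
    """Return the bases between the barcode end and the first PolyA run."""
    poly_a = re.compile("A{%d,}" % min_poly_a)
    # Search only downstream of the barcode end.
    hit = poly_a.search(read, barcode_end)
    if hit is None:
        return None  # no PolyA found, UMI boundary undefined
    return read[barcode_end:hit.start()]
```

Defining the UMI by its two flanking landmarks instead of a fixed length lets the extracted UMI grow or shrink with insertions and deletions.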

Using the sequence similarity alignment method, the P5 amplification primer sequence is located in each third-generation sequencing read from which the UMI sequence has been split, and the portion of the read containing the P5 amplification primer sequence is trimmed off. Then, in the trimmed read, the PolyA sequence is likewise located by sequence similarity alignment, and the PolyA sequence and the sequence accompanying it are removed, yielding the sequence of the cDNA insert. Each cDNA insert sequence corresponds to a specific assigned Cell barcode sequence and a split UMI sequence. After the complete cDNA insert sequences are obtained, they are aligned to the reference sequence to determine the transcript corresponding to each cDNA insert sequence.

It should be noted that the sequence of the P5 amplification primer is provided by the single cell experimental platform of choice and is known.

In some embodiments, a plurality of UMI sequences for a plurality of the reads are obtained, a plurality of K-mer sequences are generated based on the plurality of UMI sequences, a K-mer sequence corresponding to a given read is corrected based on mutual similarity between the plurality of K-mer sequences, and the UMI sequence for the given read is corrected based on the corrected K-mer sequences.

Specifically, correcting the K-mer sequence corresponding to a given read based on mutual similarity between the plurality of K-mer sequences further comprises:

calculating an edit distance between any two K-mer sequences; obtaining an adjacency matrix of the UMI sequences based on the edit distances; obtaining UMI sequence sets (connected components) based on the adjacency matrix; and selecting any one sequence in a UMI sequence set to replace all sequences of that set.

In some examples, the edit distance threshold is an integer no greater than 2, for example 1 or 2.

In some examples, all UMI sequences are converted to K-mer sequences of length 5. On this basis, the edit distance between the K-mer sequences of different UMIs, i.e. the number of differing characters between the K-mers, is calculated. In other embodiments, the K-mer length can be 3, 4, 6, 7, 8, 9 or 10.
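As an illustration of the two building blocks named here, the following sketch extracts overlapping K-mers from a UMI and computes an edit distance between two sequences. It assumes that K-mers are taken as overlapping windows and that "edit distance" means the Levenshtein distance (substitutions, insertions and deletions); for K-mers of equal length, counting differing characters position by position would instead give a substitution-only distance. Both function names are hypothetical.

```python
def kmers(seq, k=5):
    """Return all overlapping K-mers of length k (empty list if seq is shorter than k)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def edit_distance(a, b):
    """Levenshtein distance between a and b, computed row by row."""
    prev = list(range(len(b) + 1))          # distance from empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance to empty prefix of b
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j - 1] + (ca != cb),  # match or substitution
                            prev[j] + 1,               # deletion from a
                            curr[j - 1] + 1))          # insertion into a
        prev = curr
    return prev[-1]
```

A 10 bp UMI yields six 5-mers under this windowing. Two 5-mers that differ by a one-base shift, such as `"ACGTA"` and `"CGTAC"`, are at Levenshtein distance 2.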

Based on these edit distances, an adjacency matrix describing the relationships between all UMI sequences is established. In the adjacency matrix, if the edit distance between the K-mer sequences of two UMIs does not exceed 3, the two UMIs are marked as connected. In other embodiments, two UMIs are marked as connected in the adjacency matrix if the edit distance between their K-mer sequences does not exceed 2.

Finally, the connected components in the adjacency matrix are re-labeled, so that the UMI sequences in the same connected component are unified into one identifier. This process identifies UMI sequences that share common features and merges them.
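The thresholding and re-labeling steps can be sketched as follows. The sketch assumes that a symmetric matrix of pairwise distances between UMIs has already been computed from their K-mers (how per-K-mer distances are combined into one value per UMI pair is not specified in the text and is left to the caller), and it uses the first UMI encountered in each component as the representative, which is one admissible choice of "any sequence". The function names and the default threshold are illustrative.

```python
def connected_components(distances, max_distance=2):
    """Threshold a pairwise distance matrix and label connected components.

    Returns (adjacency, labels): adjacency[i][j] is 1 when UMIs i and j are
    linked, and labels[i] is the component index of UMI i.
    """
    n = len(distances)
    adjacency = [[int(i != j and distances[i][j] <= max_distance)
                  for j in range(n)] for i in range(n)]
    labels = [-1] * n
    component = 0
    for start in range(n):
        if labels[start] != -1:
            continue                        # already reached from an earlier UMI
        labels[start] = component
        stack = [start]
        while stack:                        # depth-first traversal of one component
            node = stack.pop()
            for other in range(n):
                if adjacency[node][other] and labels[other] == -1:
                    labels[other] = component
                    stack.append(other)
        component += 1
    return adjacency, labels


def unify_umis(umis, labels):
    """Replace every UMI by the first UMI seen in its connected component."""
    representative = {}
    for umi, label in zip(umis, labels):
        representative.setdefault(label, umi)
    return [representative[label] for label in labels]
```

Note that membership is transitive: two UMIs that are not directly linked still end up in the same component when a chain of linked UMIs joins them.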

The third-generation sequencing reading analysis method fully considers the characteristic of high insertion and deletion error rate in the third-generation full-length transcriptome sequencing result, improves the utilization rate of third-generation sequencing data based on a sequence comparison method, and improves the accuracy of transcript quantification based on UMI sequence correction of K-mers.

In addition, the application range of the third generation sequencing read analysis method includes but is not limited to the field of single cell transcriptome science and technology service or the field of tumor and fertility diagnosis by single cell transcriptome and single cell genome technology in the future.

Single cell sequencing method

In a second aspect of the application, the application provides a single cell sequencing method. The method comprises the steps of obtaining a plurality of single cells, constructing a sequencing library based on the single cells, wherein the sequencing library molecules comprise a sequencing primer identification sequence, a tag sequence and a UMI sequence, the tag sequence comprises at least one of a first tag used for distinguishing single cells and a second tag used for distinguishing sample sources, the UMI sequence is used for distinguishing insertion fragments of the sequencing library molecules, sequencing the sequencing library, and analyzing by the three-generation sequencing read method according to any embodiment of the application so as to determine read sequences of the single cells.

In some examples, the plurality of single cells may be derived from different biological samples.

In some examples, a sequencing library is constructed based on the plurality of single cells collected. The sequencing library contains the molecules to be sequenced. Each sequencing library molecule includes a sequencing primer recognition sequence, a tag sequence, and a UMI sequence. The tag sequence may comprise at least one tag, e.g., a first tag for distinguishing individual cells and a second tag for distinguishing the origin of the sample. The UMI sequences are used to distinguish the inserts of different sequencing library molecules. The constructed sequencing library is sequenced, and data analysis is then performed using the third-generation sequencing read analysis method. The read sequences of the individual cells are thereby determined, and transcriptome information about the cells is obtained.

Third-generation sequencing read analysis device

In a third aspect of the application, the application provides a third-generation sequencing read analysis device. Referring to Fig. 2, the device comprises:

A read acquisition unit S200, configured to acquire a read, wherein the read is obtained by sequencing a sequencing library using a third-generation sequencing platform, and the molecules of the sequencing library comprise a sequencing primer recognition sequence, a tag sequence and a UMI sequence.

An identity tag region determining unit S300, connected to the read acquisition unit S200 and configured to determine an identity tag region of the read based on the sequencing primer recognition sequence in the read.

An alignment unit S400, connected to the identity tag region determining unit S300 and configured to align the identity tag region with at least one standard tag sequence and to determine the tag sequence of the read in the identity tag region based on the alignment result.

A UMI sequence determining unit S500, connected to the alignment unit S400 and configured to determine, in the identity tag region, the UMI sequence of the read based on the tag sequence.

In some embodiments of the present application, as shown in Fig. 3, the identity tag region determining unit S300 further includes:

A positioning component S301, configured to locate the sequencing primer recognition sequence in a plurality of sequencing reads so as to obtain the Cell barcode region in each read.

An intercepting component S302, connected to the positioning component S301 and configured to extract, in each third-generation sequencing read, m bases upstream and n bases downstream of the sequencing primer (such as the P3 primer), the m+n bases being taken as the Cell barcode sequence region.

In some embodiments of the present application, as shown in fig. 3, the alignment unit S400 further includes:

A first comparison component S401, configured to compare the identity tag region with at least one standard tag sequence (Cell barcode whitelist sequence) by a sequence similarity alignment method.

And a filtering component S402, where the filtering component S402 is connected to the first comparing component S401, and is configured to select, as a tag sequence of the read, the candidate tag with the highest similarity and the similarity exceeding a predetermined threshold, based on a similarity comparison result.

A first calculating component S403, connected to the filtering component S402 and configured to, when a plurality of candidate tags share the highest similarity, calculate the frequency of occurrence of each candidate tag in the same sequencing batch and select the candidate tag with the highest frequency of occurrence as the tag sequence of the read.

In some embodiments of the present application, as shown in fig. 3, the UMI sequence determining unit S500 further includes:

A checking component S501, configured to check the Cell barcode alignment results of all third-generation sequencing reads: if a read fails to match any whitelist Cell barcode sequence, it cannot be assigned a Cell barcode and is removed; if a read matches exactly one whitelist Cell barcode sequence, that sequence is assigned as the Cell barcode of the read; and if a read matches a plurality of whitelist Cell barcode sequences, the Cell barcode with the highest frequency of occurrence is selected as the Cell barcode of the read.

And the updating component S502 is connected with the checking component S501 and is used for updating the occurrence frequency of the Cell barcode in the actual Cell barcode background distribution when the Cell barcode is allocated to the three-generation sequencing reads.

A separation component S503, connected to the updating component S502 and configured to separate the UMI sequences from the third-generation sequencing reads based on the obtained Cell barcode sequence region and the result of its alignment with the Cell barcode whitelist sequences. The UMI sequence extends from the end of the Cell barcode sequence to the position where the PolyA sequence first appears in the third-generation sequencing read.

A first truncating component S504, wherein the first truncating component S504 is connected with the separating component S503 and is used for positioning the P5 amplification primer sequence in the third-generation sequencing reads of the split UMI sequence by a sequence similarity comparison method, and truncating all sequences including the P5 amplification primer sequence.

A second truncating component S505, connected to the first truncating component S504 and configured to locate, by sequence similarity alignment, the PolyA sequence in the third-generation sequencing read remaining after the P5 amplification primer sequence has been truncated, and to remove the PolyA sequence together with the sequence accompanying it, thereby obtaining the sequence of the cDNA insert. The sequence of each cDNA insert corresponds to a specific assigned Cell barcode sequence and a split UMI sequence.

A second alignment component S506, connected to the second truncating component S505 and configured to perform reference sequence alignment on the complete cDNA insert sequences, once they have been obtained, so as to determine the transcript corresponding to each cDNA insert sequence.

In some embodiments, as shown in fig. 4, the apparatus may further include a correction unit S600, where the correction unit S600 is connected to the UMI sequence determining unit S500, and is configured to obtain a plurality of UMI sequences of a plurality of the reads, generate a plurality of K-mer sequences based on the plurality of UMI sequences, correct a K-mer sequence corresponding to a given read based on mutual similarity between the plurality of K-mer sequences, and correct the UMI sequence of the given read based on the corrected K-mer sequences.

In some embodiments of the present application, as shown in fig. 5, the correction unit S600 further includes:

And the conversion component S601 is used for converting all UMIs into a K-mer sequence with the length of K, wherein the K value comprises 4, 5, 6, 7, 8, 9 or 10 and the like.

And the second calculating component S602 is connected with the converting component S601 and is used for calculating the editing distance between every two K-mers of different UMIs.

A construction component S603, where the construction component S603 is connected to the second calculation component S602, and is configured to construct an adjacency matrix describing all UMI sequence relationships.

A judging component S604, connected to the constructing component S603 and configured to judge whether the edit distance between two K-mers exceeds N; if the edit distance does not exceed N, the two UMIs corresponding to the two K-mers are connected in the adjacency matrix, where N may be 3 or 2.

A labeling component S605, connected to the judging component S604 and configured to re-label each connected component in the adjacency matrix as a unified UMI sequence.

It should be noted that, in the present application, the features and advantages described with respect to the methods, apparatuses and systems in any of the above aspects, any implementation manner or examples are applicable to other aspects as well, and are not repeated herein.

The foregoing explanation of the method embodiment is also applicable to the apparatus of this embodiment, and the principle is the same, and this embodiment is not limited thereto.

Computer program product, electronic device, and computer-readable storage medium

According to embodiments of the present application, the present application also provides a computer program product, an electronic device, and a computer-readable storage medium. The present embodiment will be described in detail with reference to an electronic device as an example.

The term electronic device is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The computing device may also represent various forms of mobile apparatuses, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing apparatuses. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in Fig. 6, the electronic device 500 includes a computing unit 501 that can perform various appropriate actions and processes according to a computer program stored in a ROM (Read-Only Memory) 502 or a computer program loaded from a storage unit 508 into a RAM (Random Access Memory) 503. In the RAM 503, various programs and data required for the operation of the device 500 can also be stored. The computing unit 501, ROM 502, and RAM 503 are connected to each other by a bus 504. An I/O (Input/Output) interface 505 is also connected to bus 504.

The various components in the device 500 are connected to an I/O interface 505, including an input unit 506, e.g., a keyboard, a mouse, etc., an output unit 507, e.g., various types of displays, speakers, etc., a storage unit 508, e.g., a magnetic disk, optical disk, etc., and a communication unit 509, e.g., a network card, modem, wireless communication transceiver, etc. The communication unit 509 allows the device 500 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.

The computing unit 501 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), various specialized AI (Artificial Intelligence) computing chips, various computing units running machine learning model algorithms, a DSP (Digital Signal Processor), and any suitable processor, controller, microcontroller, etc. The computing unit 501 performs the respective methods and processes described above, for example, the third-generation sequencing read analysis method. For example, in some embodiments, the third-generation sequencing read analysis method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into the RAM 503 and executed by the computing unit 501, one or more steps of the method described above may be performed. Alternatively, in other embodiments, the computing unit 501 may be configured to perform the aforementioned third-generation sequencing read analysis method in any other suitable way (e.g. by means of firmware).

Various implementations of the systems and techniques described above can be implemented in digital electronic circuitry, integrated circuit systems, FPGAs (Field Programmable Gate Arrays), ASICs (Application-Specific Integrated Circuits), ASSPs (Application-Specific Standard Products), SOCs (Systems On Chip), CPLDs (Complex Programmable Logic Devices), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special or general purpose programmable processor, operable to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for carrying out the methods disclosed herein may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, RAM, ROM, EPROM (Erasable Programmable Read-Only Memory) or flash memory, an optical fiber, a CD-ROM (Compact Disc Read-Only Memory), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having a display device (e.g., a CRT (Cathode-Ray Tube) or LCD (Liquid Crystal Display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form, including acoustic input, speech input, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a LAN (Local Area Network), a WAN (Wide Area Network), the Internet, and blockchain networks.

The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system that overcomes the high management difficulty and weak service scalability of traditional physical hosts and VPS (Virtual Private Server) services. The server may also be a server of a distributed system or a server that incorporates a blockchain.

It should be noted that artificial intelligence is the discipline of making computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking and planning), and involves technologies at both the hardware and the software level. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing and the like; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, knowledge graph technologies and the like.

It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed aspects are achieved, and are not limited herein.

It should be noted that the features and technical effects described in the different aspects herein may be mutually referred to, and are not described herein again.

The present application is illustrated below by way of examples, but it should not be construed that the scope of the inventive subject matter is limited to the following examples. All techniques implemented based on the above description of the application are within the scope of the application.

Example 1:

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings.

The data in this example come from two single-cell dissociated samples of fresh mouse kidney tissue, designated b1_s1 and b1_s2. To obtain the data, a single-cell full-length transcriptome sequencing library was first constructed using the commercially available MGIDNBelab C V2 single-cell platform technology; the former library was subjected to third-generation sequencing on the ONT PromethION sequencing platform, and the latter library was subjected to second-generation sequencing on the BGISEQ 2000. The third-generation sequencing library data were processed according to the scheme of the embodiment of the invention, and were used to calculate the final Cell barcode resolution efficiency and to compare the final third-generation single-cell gene expression quantification results with the second-generation ones.

The following describes a three-generation sequencing read analysis method of an embodiment of the present disclosure with reference to the accompanying drawings.

As shown in fig. 1, this embodiment includes the steps of:

Step S101, based on the sequencing result, determining the identity tag region of the read

First, the library structure contained in each read of the sequencing result is determined. According to the single-cell technology platform used, the correct library structure in each read is known to consist of the P3 primer sequence (22 bp) + the identity tag sequence (31 bp) + the UMI sequence (10 bp) + the cDNA insert (whose length depends on the specific RNA molecule detected) + the P5 primer sequence (22 bp). These sequences are joined in this order, but their specific positions within each read cannot be determined in advance.

The Cell barcode sequence region in each sequencing read is located based on the known P3 primer sequence. In this specific embodiment, the position of the P3 primer sequence in each read is determined by the BLAST sequence alignment method. Taking the position of the end of the P3 primer sequence as the reference, the 10 bp upstream of that position and the 45 bp downstream of it, 55 bp in total, are taken as the Cell barcode sequence identification region.
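A minimal sketch of this step follows. It does not call BLAST; as a simplified stand-in it locates the primer by semi-global approximate matching (a dynamic-programming search that tolerates substitutions, insertions and deletions), then slices 10 bp upstream and 45 bp downstream of the primer end. The error tolerance `max_errors`, the function names and the primer used in the usage note are assumptions made for illustration.

```python
def find_primer_end(read, primer, max_errors=3):
    """Return the end index (exclusive) of the best approximate occurrence of
    primer in read, or None if the best occurrence needs more than max_errors edits."""
    # Semi-global alignment: the occurrence may start anywhere in the read,
    # so the row for the empty primer prefix is all zeros.
    prev = [0] * (len(read) + 1)
    for i, p in enumerate(primer, start=1):
        curr = [i] + [0] * len(read)
        for j, r in enumerate(read, start=1):
            curr[j] = min(prev[j - 1] + (p != r),  # match or substitution
                          prev[j] + 1,             # primer base missing from read
                          curr[j - 1] + 1)         # extra base in read
        prev = curr
    # prev[j] is now the minimum number of edits for an occurrence ending at j.
    best_end = min(range(len(read) + 1), key=lambda j: prev[j])
    return best_end if prev[best_end] <= max_errors else None


def extract_region(read, primer_end, upstream=10, downstream=45):
    """Return (start, region): the slice spanning `upstream` bases before and
    `downstream` bases after the primer end, clipped to the read boundaries."""
    start = max(0, primer_end - upstream)
    end = min(len(read), primer_end + downstream)
    return start, read[start:end]
```

With the hypothetical primer `"ACGTACGTAC"` and the read `"GGGGG" + "ACGTACGTAC" + "T" * 60`, the primer end is found at index 15 and the extracted region is the 55 bases starting at index 5. Keeping the region start alongside the region makes it possible to map positions inside the region back to the original read later on.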

Step S102, comparing the identity mark area with at least one standard tag sequence to determine the tag sequence of the read

Based on step S101, a 55 bp Cell barcode region is extracted from each read in the raw data and aligned against the known Cell barcode whitelist sequences by the BLAST sequence alignment method. According to the BLAST scoring algorithm and the Cell barcode sequence length, the similarity score threshold is set to 37; that is, for each Cell barcode region, the Cell barcode whitelist sequences with a similarity score higher than 37 are retained as alignment results.

According to the BLAST scoring algorithm and the Cell barcode sequence length, two perfectly matching sequences score 43; the frequency of occurrence of each Cell barcode is therefore calculated from the alignment results with a score of 43. The rationale is that a whitelist sequence that appears in the sequencing result as a perfect match must be a valid Cell barcode sequence in the data.
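The filtering and counting described in this step can be sketched as follows, assuming the alignment results are available as `(read_id, barcode, score)` tuples. The tuple layout and the function names are hypothetical; the two score constants are the values given in the text.

```python
from collections import Counter

SCORE_THRESHOLD = 37   # hits must score higher than this to be retained
PERFECT_SCORE = 43     # score of a perfect match for this barcode length


def filter_hits(hits):
    """Group retained hits by read: {read_id: [(barcode, score), ...]}."""
    kept = {}
    for read_id, barcode, score in hits:
        if score > SCORE_THRESHOLD:
            kept.setdefault(read_id, []).append((barcode, score))
    return kept


def perfect_match_frequencies(hits):
    """Count how often each whitelist barcode occurs as a perfect match."""
    return Counter(barcode for _, barcode, score in hits
                   if score == PERFECT_SCORE)
```

Reads with no hit above the threshold are simply absent from the result of `filter_hits`, which corresponds to reads that cannot be assigned a Cell barcode.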

Step S103, based on the tag sequence, determining UMI sequence of the read

Based on the alignment results of step S102, all sequencing reads fall into three cases: (1) the sequencing read fails to match any whitelist Cell barcode sequence, and is removed directly from the raw data; (2) the sequencing read matches exactly one whitelist Cell barcode, and that Cell barcode sequence is assigned as the Cell barcode of the third-generation sequencing read; (3) the sequencing read matches a plurality of whitelist Cell barcode sequences, and the Cell barcode with the highest frequency of occurrence is selected and assigned as the Cell barcode of the third-generation sequencing read. Note that the Cell barcode frequency of occurrence from step S102 must be updated whenever either of the latter two cases occurs. At this point, each read has a corresponding Cell barcode whitelist sequence.
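The three cases and the frequency update can be sketched as a single pass over the reads. The sketch assumes that each read comes with the list of whitelist barcodes it matched and that initial frequencies are supplied by the caller; it also assumes that ties in frequency are broken by candidate order, which the text does not specify. All names are hypothetical.

```python
from collections import Counter


def assign_barcodes(candidates_per_read, frequencies):
    """Assign one Cell barcode per read and keep the frequencies up to date.

    candidates_per_read -- {read_id: [whitelist barcodes matched by the read]}
    frequencies         -- initial {barcode: count}
    Returns (assigned, freq), where assigned omits reads without any match.
    """
    freq = Counter(frequencies)
    assigned = {}
    for read_id, candidates in candidates_per_read.items():
        if not candidates:
            continue                              # case 1: no match, read is dropped
        if len(candidates) == 1:
            chosen = candidates[0]                # case 2: unique match
        else:
            # case 3: several matches, take the most frequent barcode
            # (the first candidate wins a tie; this tie rule is an assumption)
            chosen = max(candidates, key=lambda bc: freq[bc])
        assigned[read_id] = chosen
        freq[chosen] += 1                         # update after cases 2 and 3
    return assigned, freq
```

Because the frequencies change as reads are processed, the outcome of case 3 can depend on the order in which reads are visited.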

The position of the Cell barcode white list sequence in the Cell barcode sequence region can be determined according to the comparison result of the Cell barcode sequence region and the white list of the Cell barcode, and the position of the Cell barcode sequence region in the original read is known, so that the position of the Cell barcode white list sequence in the original read can be deduced.

Taking the end of the Cell barcode whitelist sequence as the start point and the position at which the PolyA sequence first appears downstream of it as the end point, the sequence between the start point and the end point is separated as the UMI sequence.
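The coordinate mapping and the UMI separation can be sketched as follows. The sketch assumes that the barcode end is known as an offset inside the extracted region, that the region start in the original read was recorded, and that the PolyA sequence is recognized as a run of at least `min_poly_a` adenines (the run length is an assumption, not a value from the text). Function names are hypothetical.

```python
import re


def barcode_end_in_read(region_start, barcode_end_in_region):
    """Map the barcode end from region coordinates to read coordinates."""
    return region_start + barcode_end_in_region


def extract_umi(read, barcode_end, min_poly_a=8):
    """Return (umi, poly_a_start) for the first PolyA run after the barcode,
    or None when no such run exists."""
    pattern = re.compile("A{%d,}" % min_poly_a)
    match = pattern.search(read, barcode_end)     # search only downstream of the barcode
    if match is None:
        return None
    return read[barcode_end:match.start()], match.start()
```

Because the UMI is delimited by landmarks rather than by a fixed length, an insertion or deletion inside the UMI changes the length of the extracted sequence instead of shifting it out of frame. A UMI that itself ends in adenines would have those bases absorbed into the PolyA run under this simple rule.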

The PolyA sequence and the P5 primer sequence in the remaining sequence of the read are then located and removed in turn by the BLAST sequence alignment method, so that each read finally contains only its corresponding cDNA insert sequence.

The resulting cDNA insert sequences are aligned to the mouse reference genome. At this point, each read has a corresponding Cell barcode whitelist sequence and a corresponding transcript molecule.

Finally, the UMIs separated from the reads are corrected. Reads are processed in groups, each group consisting of all reads that correspond to the same transcript molecule and the same Cell barcode. K-mer sequences of length 5 are extracted from the UMIs of the reads in a group, and the edit distance between every two K-mers is calculated. If the edit distance between two K-mers is less than or equal to 2, the entry for the corresponding pair of UMIs in the adjacency matrix is set to 1. The connected components (UMI sequence sets) of the UMI adjacency matrix are then computed, one UMI is randomly selected in each connected component, and all UMIs of that connected component are replaced with the randomly selected UMI sequence. At this point, each read has a corresponding Cell barcode whitelist sequence, a corresponding transcript molecule, and a corrected UMI sequence.
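The grouping described here can be sketched as follows. Reads are assumed to be available as `(barcode, transcript, umi)` tuples, and the correction itself is passed in as a callable so that the sketch stays independent of how UMIs are merged. Counting the distinct corrected UMIs per group as the molecule count is a common convention for UMI-based quantification and is an assumption here, not a statement from the text. All names are hypothetical.

```python
from collections import defaultdict


def count_molecules(reads, correct_group=lambda umis: umis):
    """Group reads by (Cell barcode, transcript) and count distinct UMIs.

    reads         -- iterable of (barcode, transcript, umi) tuples
    correct_group -- callable mapping a list of UMIs to corrected UMIs of the
                     same length (identity by default, i.e. no correction)
    Returns {(barcode, transcript): number of distinct corrected UMIs}.
    """
    groups = defaultdict(list)
    for barcode, transcript, umi in reads:
        groups[(barcode, transcript)].append(umi)
    return {key: len(set(correct_group(umis))) for key, umis in groups.items()}
```

Without correction, a UMI carrying a sequencing error is counted as an extra molecule; merging it with its error-free neighbour lowers the count for that cell and transcript to the number of original molecules.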

Table 1. Cell barcode resolution efficiency results

Sample name	Number of reads	Number of reads with a valid barcode	Proportion of reads with a valid barcode (%)
b1_S1	93451540	54074499	57.86
b2_S2	87414454	52380789	59.92

In Table 1, the first column is the sample name; the second column is the number of raw third-generation sequencing reads for each sample; the third column is the number of reads that, after processing by the method, were assigned a Cell barcode, completed UMI correction, and yielded a separable cDNA insert; and the fourth column is the ratio of the third column to the second column. This ratio reflects the final data utilization rate: the higher the ratio, the higher the data utilization rate.
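As a quick check of the fourth column, the ratio can be recomputed from the read counts reported in Table 1. The function name is illustrative.

```python
def effective_ratio(total_reads, effective_reads):
    """Percentage of reads with a valid barcode, rounded to two decimals."""
    return round(100.0 * effective_reads / total_reads, 2)


# Read counts taken from Table 1
print(effective_ratio(93451540, 54074499))  # 57.86
print(effective_ratio(87414454, 52380789))  # 59.92
```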

The results are shown in Fig. 7. For each sample, the Pearson correlation coefficients between the third-generation and second-generation gene expression levels of the individual cells are displayed as a violin plot. The distribution of these coefficients can be used to verify the final UMI correction effect: with the second-generation single-cell gene expression levels taken as the reference, a higher correlation indicates that the third-generation single-cell gene expression levels identified by the method are more consistent with the reference, and thus that the UMI correction effect is better.
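For reference, the per-cell comparison can be sketched with a plain implementation of the Pearson correlation coefficient. It assumes that, for one cell, the two expression vectors list the same genes in the same order; the function name and that data layout are assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences.

    Returns NaN when either sequence has zero variance.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)
```

Computing this coefficient once per cell, with the third-generation counts as `x` and the second-generation counts as `y`, gives the set of values whose distribution a violin plot would summarize.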

In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the spirit and scope of the invention as defined by the appended claims and their equivalents.

Claims

1. A third-generation sequencing read analysis method, comprising:

Obtaining a read, wherein the read is obtained by sequencing a sequencing library by adopting a three-generation sequencing platform, and the molecules of the sequencing library comprise a sequencing primer identification sequence, a tag sequence and a UMI sequence;

Determining an identity tag region of the read based on the sequencing primer recognition sequence in the read;

Comparing the identity tag region with at least one standard tag sequence and determining the tag sequence of the read in the identity tag region based on the comparison, and

In the identity tag region, a UMI sequence of the read is determined based on the tag sequence.
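For illustration only, the steps of claim 1 might be sketched as below. The sketch assumes a hypothetical molecule layout (primer, then tag, then UMI, read 5' to 3'), hypothetical lengths and thresholds, and uses Sellers-style approximate matching so that insertions and deletions in the read are tolerated; `fuzzy_find` and `parse_read` are names invented for this example and are not taken from the application.

```python
def fuzzy_find(pattern, text, max_dist):
    """Approximate substring search (edit distance, free start in text).

    Returns (end, dist) for the best match of `pattern` ending at text[:end],
    preferring the smallest distance and then the earliest end, or None if
    no match has distance <= max_dist.
    """
    m = len(pattern)
    prev = list(range(m + 1))  # column for the empty text prefix
    best = None
    for j, ch in enumerate(text, start=1):
        cur = [0] * (m + 1)  # cur[0] = 0: a match may start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            cur[i] = min(prev[i - 1] + cost, prev[i] + 1, cur[i - 1] + 1)
        if cur[m] <= max_dist and (best is None or cur[m] < best[1]):
            best = (j, cur[m])
        prev = cur
    return best


def parse_read(read, primer, standard_tags, umi_len, region_len,
               max_primer_dist=2, max_tag_dist=2):
    """Return (tag, umi) for a read, or None if the read cannot be parsed.

    Assumed layout (hypothetical): ...primer | tag | UMI | insert...
    """
    hit = fuzzy_find(primer, read, max_primer_dist)
    if hit is None:
        return None
    primer_end, _ = hit
    # Identity tag region: the bases immediately following the primer.
    region = read[primer_end:primer_end + region_len]
    best = None  # (tag, end position in region, distance)
    for tag in standard_tags:
        match = fuzzy_find(tag, region, max_tag_dist)
        if match is not None and (best is None or match[1] < best[2]):
            best = (tag, match[0], match[1])
    if best is None:
        return None
    tag, tag_end, _ = best
    # The UMI is taken as the bases directly after the located tag.
    umi = region[tag_end:tag_end + umi_len]
    if len(umi) < umi_len:
        return None
    return tag, umi
```

Because the UMI is located relative to where the tag actually ends in the read, an insertion or deletion upstream of the tag does not shift the extracted UMI.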

2. The method of claim 1, wherein comparing the identity tag region with at least one standard tag sequence and determining the tag sequence of the read in the identity tag region based on the comparison result further comprises:

Comparing the identity tag region with at least one standard tag sequence to obtain a plurality of candidate tag sequences;

Determining the similarity between the identity tag region and each candidate tag sequence; and

Selecting the candidate tag sequence with the highest similarity as the tag sequence of the read.

3. The method of claim 2, wherein the candidate tag sequence whose similarity is the highest and exceeds a predetermined threshold is selected as the tag sequence of the read.

4. The method of claim 3, further comprising, when a plurality of the candidate tag sequences share the highest similarity:

Determining the frequency of occurrence of each of said candidate tag sequences in the same sequencing batch; and

Selecting the candidate tag sequence with the highest frequency of occurrence as the tag sequence of the read.

5. The method of claim 2, wherein the similarity is determined by determining an edit distance, a similarity score, or a similarity P value;

Preferably, the similarity is determined by the edit distance.

6. The method of claim 3, wherein the predetermined threshold does not exceed 3 and is preferably 2.
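The selection logic of claims 2 to 6 could look roughly like the following sketch. It assumes that similarity is measured by edit distance, that "meeting the threshold" means an edit distance of at most `max_dist` (default 2), and that per-batch tag frequencies are supplied by the caller; `assign_tag` and its parameters are hypothetical names used only here.

```python
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j - 1] + (ca != cb),  # substitution or match
                           prev[j] + 1,               # deletion from a
                           cur[j - 1] + 1))           # insertion into a
        prev = cur
    return prev[-1]


def assign_tag(observed, standard_tags, batch_counts, max_dist=2):
    """Pick the standard tag closest to `observed`, or None if none is close.

    Ties at the smallest edit distance are broken by how often each
    candidate occurs in the same sequencing batch (`batch_counts`).
    """
    scored = [(edit_distance(observed, tag), tag) for tag in standard_tags]
    if not scored:
        return None
    best = min(d for d, _ in scored)
    if best > max_dist:
        return None  # no candidate is similar enough
    tied = [tag for d, tag in scored if d == best]
    return max(tied, key=lambda tag: batch_counts.get(tag, 0))
```

For example, with `batch_counts = Counter({"AAAAAA": 5, "AAAATT": 50})`, the observed sequence `"AAAAAT"` is one edit away from both tags and is assigned to the more frequent `"AAAATT"`.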

7. The method as recited in claim 1, further comprising:

obtaining a plurality of said UMI sequences for a plurality of said reads;

Generating a plurality of K-mer sequences based on a plurality of the UMI sequences;

Correcting the K-mer sequence corresponding to a given read based on mutual similarity between the plurality of K-mer sequences;

Correcting the UMI sequence for the given read based on the corrected K-mer sequence.

8. The method of claim 7, wherein correcting the K-mer sequence corresponding to a given read based on mutual similarity between the plurality of K-mer sequences further comprises:

Calculating the edit distance between any two K-mer sequences;

Obtaining an adjacency matrix of the UMI sequences based on the edit distances;

Obtaining a set of UMI sequences based on the adjacency matrix; and

Selecting any one sequence in the set of UMI sequences to replace the sequences of that set.
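A simplified sketch of the adjacency-matrix grouping in claim 8 is given below. For brevity it computes edit distances on whole UMI sequences rather than on their K-mers, treats each connected component of the adjacency matrix as one set of UMI sequences, and picks the most frequent member as the representative (the claim allows any member). The threshold `max_dist` and the name `cluster_umis` are assumptions for the example.

```python
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j - 1] + (ca != cb), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]


def cluster_umis(umis, max_dist=1):
    """Map every observed UMI to the representative of its set."""
    counts = Counter(umis)
    uniq = sorted(counts)
    n = len(uniq)
    # Adjacency matrix: two distinct UMIs are linked if they are close enough.
    adj = [[i != j and edit_distance(uniq[i], uniq[j]) <= max_dist
            for j in range(n)] for i in range(n)]
    seen = [False] * n
    mapping = {}
    for start in range(n):
        if seen[start]:
            continue
        # Depth-first search collects one connected component (one UMI set).
        stack, component = [start], []
        seen[start] = True
        while stack:
            i = stack.pop()
            component.append(i)
            for j in range(n):
                if adj[i][j] and not seen[j]:
                    seen[j] = True
                    stack.append(j)
        # Representative: most frequent member (ties broken alphabetically).
        rep = max((uniq[i] for i in component), key=lambda u: (counts[u], u))
        for i in component:
            mapping[uniq[i]] = rep
    return mapping
```

Replacing each read's UMI with `mapping[umi]` collapses UMIs that differ only by sequencing errors, including the insertions and deletions common in third-generation data.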

9. The method of claim 7, wherein the K-mer sequence is 5 in length.
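As a hedged illustration of what K-mers of length 5 look like for a UMI, the sketch below splits each UMI into overlapping 5-mers and uses shared 5-mers to nominate pairs of UMIs that may be related. Using shared K-mers as a candidate filter is an assumption made for this example; this section does not specify how the K-mers are compared, and `kmers` and `candidate_pairs` are hypothetical names.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k=5):
    """All overlapping substrings of length k, in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def candidate_pairs(umis, k=5):
    """Pairs of distinct UMIs that share at least one k-mer.

    Assumption for illustration: such pairs are the only ones worth passing
    to a more expensive similarity computation.
    """
    index = defaultdict(set)  # k-mer -> UMIs containing it
    for umi in set(umis):
        for km in kmers(umi, k):
            index[km].add(umi)
    pairs = set()
    for group in index.values():
        for a, b in combinations(sorted(group), 2):
            pairs.add((a, b))
    return pairs
```

A UMI of length 10 yields six overlapping 5-mers, so two UMIs that differ by a single error usually still share at least one 5-mer.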

10. The method of claim 1, wherein the sequencing library molecule comprises a plurality of sequencing primer recognition sequences;

Optionally, the sequencing library molecule comprises a plurality of tag sequences.

11. A single cell sequencing method, comprising:

obtaining a plurality of single cells;

Constructing a sequencing library based on the plurality of single cells, wherein the sequencing library molecules comprise a sequencing primer recognition sequence, a tag sequence comprising at least one of a first tag for distinguishing single cells and a second tag for distinguishing sample sources, and a UMI sequence for distinguishing inserts of the sequencing library molecules;

sequencing the sequencing library and analyzing the sequencing result by the method of any one of claims 1-10 to determine the reads of each single cell.

12. A third-generation sequencing read analysis device, comprising:

a read acquisition unit for acquiring a read, wherein the read is obtained by sequencing a sequencing library on a third-generation sequencing platform, and the molecules of the sequencing library comprise a sequencing primer recognition sequence, a tag sequence and a UMI sequence;

an identity tag region determining unit for determining an identity tag region of the read based on the sequencing primer recognition sequence in the read;

an alignment unit for comparing the identity tag region with at least one standard tag sequence and determining the tag sequence of the read in the identity tag region based on the comparison result; and

a UMI sequence determining unit for determining, in the identity tag region, the UMI sequence of the read based on the tag sequence.

13. The apparatus as recited in claim 12, further comprising:

a correction unit for obtaining a plurality of the UMI sequences for a plurality of the reads, generating a plurality of K-mer sequences based on the plurality of UMI sequences, correcting the K-mer sequence corresponding to a given read based on mutual similarity between the plurality of K-mer sequences, and correcting the UMI sequence of the given read based on the corrected K-mer sequence.

14. A single cell sequencing system, comprising:

A sequencing library construction module for obtaining a plurality of single cells, constructing a sequencing library based on the plurality of single cells, wherein the sequencing library molecules comprise a sequencing primer recognition sequence, a tag sequence comprising at least one of a first tag for distinguishing single cells and a second tag for distinguishing sample sources, and a UMI sequence for distinguishing inserts of the sequencing library molecules;

a sequencing module for sequencing the sequencing library and analyzing the sequencing result using the device of claim 12 or 13 to determine the reads of each single cell.

15. A computer program product comprising computer instructions which, when run on a computer, implement the method of any one of claims 1 to 11.

16. An electronic device, comprising:

At least one processor, and

A memory communicatively coupled to the at least one processor, wherein,

The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-11.

17. A computer readable storage medium containing a computer program, which when executed by one or more processors performs the method of any of claims 1-11.