US20200051667A1 - Method and systems for the efficient compression of genomic sequence reads - Google Patents
Method and systems for the efficient compression of genomic sequence reads Download PDFInfo
- Publication number
- US20200051667A1 US20200051667A1 US16/485,649 US201716485649A US2020051667A1 US 20200051667 A1 US20200051667 A1 US 20200051667A1 US 201716485649 A US201716485649 A US 201716485649A US 2020051667 A1 US2020051667 A1 US 2020051667A1
- Authority
- US
- United States
- Prior art keywords
- reads
- signaling
- descriptor
- read
- configuration parameters
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
Images
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B45/00—ICT specially adapted for bioinformatics-related data visualisation, e.g. displaying of maps or networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2282—Tablespace storage structures; Management thereof
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/23—Updating
- G06F16/2365—Ensuring data consistency and integrity
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/602—Providing cryptographic facilities or services
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/048—Interaction techniques based on graphical user interfaces [GUI]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/10—Ploidy or copy number detection
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/20—Sequence assembly
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/10—Signal processing, e.g. from mass spectrometry [MS] or from PCR
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
- G16B50/10—Ontologies; Annotations
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
- G16B50/30—Data warehousing; Computing architectures
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
- G16B50/40—Encryption of genetic data
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
- G16B50/50—Compression of genetic data
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B99/00—Subject matter not provided for in other groups of this subclass
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
- H03M7/3084—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method
- H03M7/3086—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method employing a sliding window, e.g. LZ77
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
- H03M7/70—Type of the data to be coded, other than image and sound
Definitions
- This disclosure provides a novel method of representation of genome sequencing data which reduces the utilized storage space and improves access performance by providing new functionality that are not available with known prior art methods of representation.
- Genome sequencing data is fundamental to enable efficient genomic analysis applications such as genome variants calling and all other analysis performed with various purposes by processing the sequencing data and metadata.
- Human genome sequencing has become affordable by the emergence of high-throughput low cost sequencing technologies. Such opportunity opens new perspectives in several fields ranging from the diagnosis and treatment of cancer to the identification of genetic illnesses, from pathogen surveillance for the identification of antibodies to the creation of new vaccines, drugs and the customization of personalized treatments.
- the most used genome information representations of sequencing data are based on zipping FASTQ and SAM formats.
- the objective is to compress the traditionally used file formats (respectively FASTQ and SAM for non-aligned and aligned data).
- Such files are constituted by plain text characters and are compressed, as mentioned above, by using general purpose approaches such as LZ (from Lempel and Ziv, the authors who published the first versions) schemes (the well-known zip, gzip etc).
- LZ from Lempel and Ziv, the authors who published the first versions
- general purpose compressors such as gzip are used, the result of compression is usually a single blob of binary data.
- the information in such monolithic form results quite difficult to archive, transfer and elaborate particularly when like in the case of high throughput sequencing the volume of data are extremely large.
- the BAM format is characterized by poor compression performance due to the focus on compression of the inefficient and redundant SAM format rather than on extracting the actual genomic information conveyed by SAM files and due to the adoption of general purpose text compression algorithms such as gzip rather than exploiting the specific nature of each data source (the genomic data itself).
- CRAM provides more efficient compression for the adoption of differential encoding with respect to a reference (it partially exploits the data source redundancy), but it still lacks features such as incremental updates, support for streaming and selective access to specific classes of compressed data.
- CRAM relies on the concept of the CRAM record. Each CRAM record represents a single mapped or unmapped reads by coding all the elements necessary to reconstruct it.
- CRAM does not support data indexing and random access to data subsets sharing specific features. Data indexing is out of the scope of the specification (see section 12 of CRAM specification v 3.0) and it is implemented as a separate file. Conversely the approach of the invention described in this disclosure employs a data indexing method that is integrated with the encoding process and indexes are embedded in the encoded (i.e. compressed) bit stream.
- CRAM is built by core data blocks that can contain any type of mapped reads (perfectly matching reads, reads with substitutions only, reads with insertions or deletions (also referred to as “indels”)).
- mapped reads perfectly matching reads, reads with substitutions only, reads with insertions or deletions (also referred to as “indels”)).
- Indels reads with insertions or deletions
- CRAM is based on the concept of encapsulating each read into a “CRAM record”. This implies the need to inspect each complete “record” when reads characterized by specific biological features (e.g. reads with substitutions, but without “indels”, or perfectly mapped reads) are searched.
- each record field is associated to a specific flag and each flag must always have the same meaning as there is no notion of context since each CRAM record can contain any different type of data.
- This coding mechanism introduces redundant information and prevents the usage of efficient context based entropy coding.
- flag denoting data because this is intrinsically defined by the information “block” the data belongs to. This implies a largely reduced number of symbols to be used and a consequent reduction of the information source entropy which results into a more efficient compression.
- Such improvement is possible because the use of different “blocks” enables the encoder to reuse the same symbol across each block with different meanings according to the context.
- each flag must always have the same meaning as there is no notion of contexts and each CRAM record can contain any type of data.
- the present invention aims at compressing genomic sequences by classifying and partitioning sequencing data so that the redundant information to be coded is minimized and features such as selective access and support for incremental updates are directly enabled in the compressed domain.
- FIG. 1 shows the definition of read 1 and read 2 in a read pair and the definition of left-most and right-most nucleotide in a mapped read.
- FIG. 2 shows how an Access Unit encapsulates compressed descriptors representing sequence reads mapped in a contiguous interval of the reference sequence. Header information is prepended to the compressed descriptors in order to enable data parsing.
- FIG. 3 shows how an Access Unit of type P is composed by a header and the multiplexing of blocks of descriptors representing the reads mapping positions (pos), the reverse complement information (rcomp), the pairing information in case of paired end reads (pair), the reads length in case of variable reads length (rlen) and mapping flags (flags). It is used to encode reads of class P.
- FIG. 4 shows the calculation of the pos descriptor for a mapped read pair where read pair n is mapped at position P n and read pair n+1 is mapped at position P m .
- FIG. 5 shows how to How to calculate the absolute mapping position of one base on a Reference Sequence.
- FIG. 6 shows the use of the rcomp descriptor for paired end reads.
- FIG. 7 shows an example of how to calculate mismatches positions in a read pair.
- FIG. 8 shows how the Genomic Record Length and the Pairing Distance are calculated.
- FIG. 9 shows multiple alignments without splicing.
- the left-most read has N alignments.
- N is the first value of mmap to be decoded and signals the number of alignments of the first read.
- the following N values of the mmap descriptor are decoded and are used to calculate P which is the number of alignments of the second read.
- FIG. 10 shows how the pos, pair and mmap descriptors are used to encode multiple alignments without splices.
- the left-most read has N alignments.
- FIG. 11 shows multiple alignments with splices.
- FIG. 12 shows multiple alignments with splices. Use of the pos, pair, mmap and msar descriptors to represent multiple alignments with splices.
- FIG. 13 depicts an encoder apparatus comprising the steps of aligning genomic sequences with respect to a reference genome, generating descriptors representing the genomic sequences with respect to the reference genome, compressing each block of descriptors with a dedicated entropy coder.
- FIG. 14 shows the decoding process of a compressed bitstream comprising the steps of demultiplexing the incoming bitstream to extract the entropy coded descriptors, entropy decoding of each type of descriptors, decoding of aligned sequence reads using the reference genome.
- FIG. 15 shows how encoders of data of class N, M and I are configured with vectors of thresholds and generate separate subclasses of N, M and I data classes.
- FIG. 16 shows how half mapped read pairs (class HM) can be used to fill unknown regions of a reference sequence by assembling longer contigs with unmapped reads.
- FIG. 17 shows how reference transformations can be applied to remove mismatches from reads.
- reference transformations may generate new mismatches or change the type of mismatches found when referring to the reference before the transformation has been applied.
- FIG. 18 shows how reference transformations can change the class reads belong to when all or a subset of mismatches are removed (i.e. the read belonging to class M before transformation is assigned to Class P after the transformation of the reference has been applied).
- FIG. 19 shows how all classes of data can use the same transformed reference for re-encoding or a different transformation can be used for each class N, M and I, or any combination thereof.
- said encoding further comprises binarizing and entropy coding genomic descriptors.
- said binarizing and entropy coding genomic descriptors is such that the binarization and the entropy coding is different for the different descriptors.
- said descriptors comprise:
- mapping flags for enabling the aligner to further specify the result of the mapping process.
- the coding method further comprises coding the following descriptors:
- mmtype for signaling the types of mismatches with respect to reference sequences at the associated positions.
- the coding method further comprises coding the clips descriptor for signaling soft or hard clipped nucleotides.
- the coding method further comprises coding the rlen descriptor for signaling the length of each encoded sequence read.
- the coding method further comprises coding the following descriptors:
- mapping procedure for signaling the multiple mapping positions that are associated to a single read or read pair by the mapping procedure
- msar for signaling the identification of the existence of spliced reads (i.e. reads that when split in chunks find mapping positions with higher degrees of matching accuracy than when they are mapped as single contiguous read mapped on a single position on a reference sequence).
- the coding method further comprises coding the mscore descriptor to signal a mapping/alignment score per read as generated by genomic sequence reads aligners.
- the coding method further comprises coding the pair descriptor to signal, in case of paired end reads, how the reads are paired.
- the coding method further comprises coding the ureads descriptor to signal reads which could not be aligned at any position of the reference sequence.
- the coding method further comprises coding the rtype descriptor used to signal the subset of descriptors used to encode sequence reads that cannot be mapped at any position of the reference sequence with specified degrees of matching accuracy.
- the coding method further comprises coding the rgroup descriptor to signal to which read group the read belongs to.
- the coding method further comprises coding the following descriptors:
- rftt for signaling the type of mismatches between a contig and a reference sequence.
- said pos descriptor is binarized using a double truncated unary code or a single double truncated unary code
- said rcomp descriptor is binarized using a truncated unary code
- mapping flags descriptors are binarized using binary coding.
- said mmpos descriptor for signaling the position of mismatches in aligned reads with respect to reference sequences is binarized using a split unit wise truncated unary code
- said mmtype descriptor for signaling the types of mismatches with respect to reference sequences at the associated positions is binarized using a truncated unary code.
- said clips descriptor for signaling soft or hard clipped nucleotides is binarized using a concatenation of Signed Truncated Exponential Golomb, Truncated Unary, Signed Exponential Golomb and Binary Codes.
- said rlen descriptor signaling the length of each encoded sequence read is binarized using a Split Unit-wise Truncated Unary code.
- said mmap descriptor for signaling the multiple mapping positions that are associated to a single read or read pair by the mapping procedure is binarized using a Split Unit-wise Truncated Unary code
- said msar descriptor for signaling the identification of the existence of spliced reads is binarized using a Signed Exponential Golomb code.
- said mscore descriptor to signal a mapping/alignment score per read as generated by genomic sequence reads aligners is binarized using a Truncated Unary code
- said pair descriptor to signal, in case of paired end reads, how the reads are paired is binarized using a concatenation of Binary Coding and Split Unit-wise Truncated Unary code.
- said ureads descriptor to signal reads which could not be aligned at any position of the reference sequence is binarized using a Truncated Unary code.
- said rtype descriptor used to signal the subset of descriptors used to encode sequence reads that cannot be mapped at any position of the reference sequence with specified degrees of matching accuracy is binarized using a Truncated Unary code.
- said rgroup descriptor to signal to which read group the read belongs to is binarized using a Truncated Unary code.
- said rftp descriptor for signaling the position of mismatches between a contig and a reference sequence is binarized using a concatenation of Binary Coding and Split Unit-wise Truncated Unary code
- said rftt descriptor for signaling the type of mismatches between a contig and a reference sequence is binarized using a concatenation of Binary Coding and Truncated Unary code.
- the configuration parameters for the coding of said descriptors are contained in a syntax header.
- said configuration parameters are updated by creating updated syntax headers to be added to the encoded genomic file.
- said configuration parameters comprise a dataset type for signaling the type of data encoded in Access Units referring to this encoding parameters.
- said configuration parameters further comprise a reads length for signaling the length in nucleotides of sequence reads in case of constant reads length.
- said configuration parameters further comprise a quality values depth parameter for signaling the number of Quality Values associated to each coded nucleotide.
- configuration parameters further comprise an alignment score depth for signaling the number of alignments scores associated to each coded alignments.
- said configuration parameters further comprise a terminator size for signaling the size in bytes of the terminator symbol used for the mmpos descriptor.
- said configuration parameters further comprise a terminator value for signaling the value of the terminator symbol used for the mmpos descriptor.
- configuration parameters further comprise the number of classes for signaling the number of data classes encoded in all Access Units referring to said configuration parameters.
- said configuration parameters further comprise class identifiers to signal the identifiers associated to the data class defined in this disclosure (P, N, M, I, HM, U).
- said configuration parameters further comprise the number of descriptors for signaling the total number of descriptors contained in Access Units referring to said configuration parameters.
- said configuration parameters further comprise coding mode identifiers for signaling the coding modes defined in this disclosure.
- said configuration parameters further comprise a number of groups parameter for signaling the number of different values of the rgroup descriptor present in all Access Units referring to the current encoding parameters.
- said configuration parameters further comprise one or more group name parameters for signaling one or more read group identifiers.
- said configuration parameters further comprise a multiple alignments flag for signaling the presence of multiple alignments in the Access Unit.
- said configuration parameters further comprise a spliced reads flag for signaling the presence of spliced reads in the Access Unit. When set to 0 no spliced reads are present.
- said configuration parameters further comprise a multiple signature base flag for signaling the use of multiple signatures in an Access Unit containing unmapped sequence reads (Class U).
- said configuration parameters further comprise a signature size for signaling the size in bits of each integer representing an encoded signature.
- said configuration parameters further comprise a score exponent parameters for signaling the number of bits used to encode the exponent part of the multiple alignments score encoded in the mscore descriptor.
- said configuration parameters further comprise a score fractional parameter for signaling the number of bits used to encode the fractional part of the multiple alignments score encoded in the mscore descriptor.
- the present invention further provides a method for decoding encoded genomic data said genome sequence data comprising reads of sequences of nucleotides, said method comprising the steps of:
- decoding of multiplicity of blocks comprise decoding and de-binarizing genomic descriptors to extract aligned reads according to specific matching rules defining their classification with respect to one or more reference sequences.
- said descriptors comprise:
- a rcomp descriptor for signaling the DNA or RNA strand the reads was mapped on mapping flags for enabling the aligner to further specify the result of the mapping process.
- the decoding method further comprises decoding the following descriptors:
- mmtype for signaling the types of mismatches with respect to reference sequences at the associated positions.
- the decoding method further comprises decoding the clips descriptor for signaling soft or hard clipped nucleotides.
- the decoding method further comprises decoding the rlen descriptor for signaling the length of each encoded sequence read.
- the decoding method further comprises decoding the following descriptors:
- mapping procedure for signaling the multiple mapping positions that are associated to a single read or read pair by the mapping procedure
- msar for signaling the identification of the existence of spliced reads (i.e. reads that when split in chunks find mapping positions with higher degrees of matching accuracy than when they are mapped as single contiguous read mapped on a single position on a reference sequence).
- the decoding method further comprises decoding the mscore descriptor to signal a mapping/alignment score per read as generated by genomic sequence reads aligners.
- the decoding method further comprises decoding the pair descriptor to signal, in case of paired end reads, how the reads are paired.
- the decoding method further comprises decoding the ureads descriptor to signal reads which could not be aligned at any position of the reference sequence.
- the decoding method further comprises decoding the rtype descriptor used to signal the subset of descriptors used to encode sequence reads that cannot be mapped at any position of the reference sequence with specified degrees of matching accuracy.
- the decoding method further comprises decoding the rgroup descriptor to signal to which read group the read belongs to.
- the decoding method further comprises decoding the following descriptors:
- rftp for signaling the position of mismatches between a contig and a reference sequence. Positions of mismatches are terminated by a special terminator character.
- rftt for signaling the type of mismatches between a contig and a reference sequence.
- the configuration parameters for the decoding of said descriptors are extracted from a syntax header.
- said configuration parameters comprise a dataset type for signaling the type of data encoded in Access Units referring to this encoding parameters.
- configuration parameters further comprise a reads length for signaling the length in nucleotides of sequence reads in case of constant reads length.
- said configuration parameters further comprise a quality values depth parameter for signaling the number of Quality Values associated to each coded nucleotide.
- configuration parameters further comprise an alignment score depth for signaling the number of alignments scores associated to each coded alignments.
- said configuration parameters further comprise a terminator size for signaling the size in bytes of the terminator symbol used for the mmpos descriptor.
- said configuration parameters further comprise a terminator value for signaling the value of the terminator symbol used for the mmpos descriptor.
- configuration parameters further comprise the number of classes for signaling the number of data classes encoded in all Access Units referring to said configuration parameters.
- said configuration parameters further comprise class identifiers to signal the identifiers associated to the data class defined in this disclosure (P, N, M, I, HM, U).
- configuration parameters further comprise the number of descriptors for signaling the total number of descriptors contained in Access Units referring to said configuration parameters.
- said configuration parameters further comprise coding mode identifiers for signaling the coding modes defined in this disclosure.
- said configuration parameters further comprise a number of groups parameter for signaling the number of different values of the rgroup descriptor present in all Access Units referring to the current encoding parameters.
- said configuration parameters further comprise one or more group name parameters for signaling one or more read group identifiers.
- said configuration parameters further comprise a multiple alignments flag for signaling the presence of multiple alignments in the Access Unit.
- said configuration parameters further comprise a spliced reads flag for signaling the presence of spliced reads in the Access Unit. When set to 0 no spliced reads are present.
- said configuration parameters further comprise a multiple signature base flag for signaling the use of multiple signatures in an Access Unit containing unmapped sequence reads (Class U).
- configuration parameters further comprise a signature size for signaling the size in bits of each integer representing an encoded signature.
- configuration parameters further comprise a score exponent parameters for signaling the number of bits used to encode the exponent part of the multiple alignments score encoded in the mscore descriptor.
- configuration parameters further comprise a score fractional parameter for signaling the number of bits used to encode the fractional part of the multiple alignments score encoded in the mscore descriptor.
- the present invention further provides an encoding apparatus comprising encoding means for implementing all the aspects of the previously mentioned encoding methods.
- the present invention further provides a decoding apparatus for implementing all the aspects of the previously mentioned decoding methods.
- the present invention further provides a file format comprising the previously defined genomic descriptors
- the present invention further provides a computer-readable medium comprising instructions that when executed cause at least one processor to perform all the aspects of the previously mentioned coding methods.
- the present invention further provides a computer-readable medium comprising instructions that when executed cause at least one processor to perform all the aspects of the previously mentioned decoding methods.
- the present invention further provides a support data storing genomic encoded according perform all the aspects of the previously mentioned coding methods.
- genomic or proteomic sequences referred to in this invention include, for example, and not as a limitation, nucleotide sequences, Deoxyribonucleic acid (DNA) sequences, Ribonucleic acid (RNA), and amino acid sequences.
- DNA Deoxyribonucleic acid
- RNA Ribonucleic acid
- amino acid sequences amino acid sequences.
- Genome sequencing information is generated by High Throughput Sequencing (HTS) machines in the form of sequences of nucleotides (a. k. a. “bases”) represented by strings of letters from a defined vocabulary.
- the smallest vocabulary is represented by five symbols: ⁇ A, C, G, T, N ⁇ representing the 4 types of nucleotides present in DNA namely Adenine, Cytosine, Guanine, and Thymine.
- RNA Thymine is replaced by Uracil (U).
- U indicates that the sequencing machine was not able to call any base and so the real nature of the position is undetermined.
- the alphabet used for the symbols is (A, C, G, T, U, W, S, M, K, R, Y, B, D, H, V, N or -).
- sequence reads can be between a few dozens to several thousand nucleotides long. Some technologies produce sequence reads in “pairs” where one read is from one DNA strand and the second is from the other strand. A read associated to another read in a sequencing process producing pairs is said to be its mate.
- a reference sequence is a sequence of nucleotides associated to a mono-dimensional integer coordinate system for which each integer coordinate is associated to a single nucleotide. Coordinate values can only be equal or larger than zero.
- This coordinate system in the context of this invention is zero-based (i.e. the first nucleotide has coordinate 0 and it is said to be at position 0) and linearly increasing from left to right.
- a reference sequence is any sequence on which the nucleotides sequences produced by sequencing machines are aligned/mapped.
- One example of sequence could actually be a “reference genome”, a sequence assembled by scientists as a representative example of a species' set of genes.
- GRCh37 the Genome Reference Consortium human genome (build 37) is derived from thirteen anonymous volunteers from Buffalo, N.Y.
- a reference sequence could also consist of a synthetic sequence conceived and constructed to merely improve the compressibility of the reads in view of their further processing.
- a common element of efficient approaches to genomic sequence reads compression is the exploitation of the correlation of sequence data with respect to a reference sequence. Even if the somatic profile of the human population is extremely diversified, the actual portion of the number of nucleotides that differs from person to person is about only 0.1% of the total number of nucleotides composing an entire genome. Therefore, the specific genomic information characterizing each individual is very limited with respect to the entire information carried by an entire genome. When a pre-existing reference genome is available, be it for previous sequencing or as a published “average” consensus reference, the most common way as of today to encode the information is to identify and encode only the differences with respect to the reference genome.
- mapping sequence reads on a reference genome In order to do so with raw sequence reads, generally expressed in the form of FASTQ data files, a preliminary pre-processing step the mapping on a reference genome. In case an appropriate reference genome is not available, or if the bias introduced by the usage of a specific reference is not desirable, the construction of a new reference sequence by means of assembling the sequence reads at hand into longer sequences called contigs, is a possible alternative.
- mapping sequence reads on a reference sequence said reference sequence is used as axis of a mono-dimensional coordinate system in which the left-most position is denoted as position 0.
- nucleotide mapped at the reference sequence position identified by the smallest coordinate number is usually referred to as the “left-most” nucleotide, whereas the nucleotide mapped at the reference sequence position identified by the largest coordinate number is referred to as the “right-most” nucleotide. This is illustrated in FIG. 1 .
- a nucleotide is also referred to as a base.
- the coordinate of the left-most mapped base is said to represent the mapping position of the read on the reference sequence.
- a base present in the aligned read and not present in the reference sequence (a.k.a. insertion) and bases preserved by the alignment process but not mapped on the reference sequence (a.k.a. soft clips) do not have mapping positions.
- a reference genome is composed by one or more reference sequences and it is assembled by scientists as a representative example of a species' set of genes.
- GRCh37 the Genome Reference Consortium human genome (build 37) is derived from thirteen anonymous volunteers from Buffalo, N.Y.
- a reference sequence could also consist of a synthetic sequence conceived and merely constructed to improve the compressibility of the reads in view of their further processing.
- the read composing a read pair with a base mapping on the smallest coordinate on a reference sequence is referred to as “Read 1” whereas its mate is referred to as “Read 2”.
- the distance expressed as number of nucleotides (or bases), separating two reads generated as a pair, by a sequencing machine using current state of the art sequencing technology, is unknown, and it is determined by mapping both reads composing the pair (i.e. minimizing appropriate matching functions) to a reference sequence.
- genomic record is a data structure encoding either a single sequence read or a paired sequence read optionally associated with alignment information, read identifier and quality values.
- an Access Unit is defined as a logical data structure containing a coded representation of genomic information or related metadata to facilitate the bit stream access and manipulation. It is the smallest data organization that can be decoded by a decoding device implementing the invention described in this disclosure.
- an AU can be decoded either independently of any other AU or using information contained in other AUs.
- AUs can be classified into a multiplicity of types according to the nature of the coded sequence data.
- An Access Unit contains either a reference sequence, or a portion thereof, or encoded reads or read pairs belonging to a single class of data. Any single AU cannot contain two or more types of sequence data.
- an Access Unit may contain the entire chromosome 1 of GRCh37, the Genome Reference Consortium human genome (build 37).
- Another Access Unit may contain the coded representation of nucleotides of chromosome 1 of GRCh37 that are located between coordinates 50′000 and 150′000.
- Another Access Unit may contain only reads or read pairs that perfectly map on the reference sequence without any mismatch.
- Another Access Unit may contain reads or read pairs that only contain “N” symbols as mismatches with respect to the reference sequence.
- Another Access Unit may contain reads or read pairs that contain any type of substitutions (e.g. one base present in the read or read pair is different from the base at the corresponding mapping position in the reference sequence).
- Another Access Unit may contain reads or read pairs that contain mismatches, insertions, deletions and soft clipped bases.
- Another Access Unit may contain only read or read pairs that do not map on the reference sequence.
- Another Access Unit may contain only read pairs in which one read is mapped and the other is not mapped on the reference sequence.
- Another type of Access Unit may contain only encoded segments of a reference genome composed by one or more reference sequences (for example chromosomes).
- an Access Unit contains in compressed form all elements needed to reconstruct the genomic information (sequence reads or read pairs, reference sequences), associated alignment information and metadata of reads or read pairs it represents. In other words, to fully reconstruct the reads, read pairs or reference sequence and associated information carried by an Access Unit it is only necessary to retrieve the Access Unit itself and, when applicable, the Access Units containing the reference sequence it refers to.
- each Access Unit the descriptors listed in the next section representing the encoded read or read pairs are aggregated in separate data blocks—one per type—in order to exploit their homogeneous statistical properties for achieving high performance entropy coding.
- Each Access Unit contains the compressed sub-set of descriptors representing sequence reads or read pairs belonging to the same class mapped to a genomic region on a reference sequence.
- genomic region on the reference sequence is defined by a start coordinate (or start position) and an end coordinate (or end position).
- Access Units are composed by blocks of encoded genomic descriptors (described in the next section). To enable transport over a network, blocks are further decomposed into packets. When compressing genomic sequence reads, each Access Units contains compressed descriptor representing either sequence reads mapped to a genomic interval on the reference sequence or unmapped sequence reads. Access Units can be used to carry reference genomes or parts thereof. A reference sequence can be either encoded as a single long sequence of nucleotides or split into shorter sequences encoded as unmapped genomic sequence reads.
- access unit start position left-most genomic record position among all genomic records contained in the access unit.
- access unit end position right-most base position among all mapped bases of all genomic records contained in the access unit.
- access unit range the genomic range comprised between the access unit start position and the right-most genomic record position among all genomic records contained in the access unit.
- access unit size number of genomic records contained in an access unit.
- genomic range comprised between the Access Unit start position and the Access Unit end position.
- genomic dataset is a compression unit containing headers and access units.
- the set of access units composing the genomic dataset constitutes the genomic dataset payload.
- a collection of one or more genomic datasets is called dataset group.
- genomic descriptors are syntax elements representing part of the information (and also elements of a syntax structure of a file format and/or a bitstream) necessary to reconstruct (i.e. decode) coded reference sequences, sequence reads and the associated mapping information.
- the genomic descriptors disclosed in this invention are listed in Table 3.
- reference sequences or portion thereof, sequence reads and the associated alignment information are coded using a sub-set of the descriptors listed above which are then entropy coded using a multiplicity of entropy coders according to each descriptor specific statistical properties.
- Blocks of compressed descriptors with homogeneous statistical properties are structured in Access Units which represent the smallest coded representation of one or more genomic sequence that can be manipulated by a device implementing the invention described in this disclosure.
- Genomic descriptors are organized in blocks and streams as defined below.
- a block is defined as a data unit composed by a header and a payload, which is composed by portions of compressed descriptors of the same type.
- a descriptor stream is defined as a sequence of encoded descriptor blocks used to decode a descriptor of a specific Data Class.
- Sequencing devices can introduce errors in the sequence reads such as:
- the average coverage of aligned genome sequencing data is the average number of time each base at each position of the reference genome is present in the aligned data. For example, to reach a coverage of 30 ⁇ on a human genome (3.2 billion bases long) a sequencing machine shall produce a total of 30 ⁇ 3.2 billion bases so that in average each position in the reference is “covered” 30 times.
- the coverage is said to be:
- This invention aims at defining a genomic information representation format in which the relevant information is efficiently accessible and transportable and the weight of the redundant information is reduced.
- Sequence reads are classified and partitioned into data classes according to the results of the alignment with respect to reference sequences. Such classification and partitioning enables the selective access to encoded data according to criteria related to the alignment results and to the matching accuracy.
- the classified sequence reads and the associated metadata are represented by genomic descriptors organized in blocks with homogeneous statistical properties enabling the definition of distinct information sources characterized by a low information entropy.
- sequence reads generated by sequencing machines are classified by the disclosed invention into six different “classes” according to the matching results of the alignment with respect to one or more “pre-existing” reference sequences.
- n type mismatch Such sequences belong to “Class N” reads. Once the read is classified to belong to “Class N” it is useful to limit the degree of matching inaccuracy to a given upper bound and set a boundary between what is considered a valid matching and what it is not. Therefore, the reads assigned to Class N are also constrained by setting a threshold (MAXN) that defines the maximum number of undefined bases (i.e. bases called as “N”) that a read can contain.
- MAXN the maximum number of undefined bases
- N bases called as “N”
- classification implicitly defines the required minimum matching accuracy (or maximum degree of mismatch) that all reads belonging to Class N share when referred to the corresponding reference sequence, which constitutes an useful criterion for applying selective data searches to the compressed data.
- the classification specified in the previous section concerns single sequence reads.
- sequencing technologies that generates read in pairs (i.e. Illumina Inc.) in which two reads are known to be separated by an unknown sequence of variable length, it is appropriate to consider the classification of the entire pair to a single data class.
- a read that is coupled with another is said to be its “mate”.
- the entire pair is assigned to the same class for any class (i.e. P, N, M, I, U). In the case the two reads belong to a different class, but none of them belongs to the “Class U”, then the entire pair is assigned to the class with the highest priority defined according to the following expression:
- the table below summarizes the matching rules applied to reads in order to define the class of data each read belongs to.
- the rules are defined in the first five columns of the table in terms of presence or absence of type of mismatches (n, s, d, i and c type mismatches).
- the sixth column provide rules in terms of maximum threshold for each mismatch type and any function f(n,s) and w(n,s,d,i,c) of the possible mismatch types.
- the data classes of type N, M and I as defined in the previous sections can be further decomposed into an arbitrary number of distinct sub-classes with different degrees of matching accuracy. Such option is an important technical advantage in providing a finer granularity and as consequence a much more efficient selective access to each data class.
- Sub-Class N 1 a number k of subclasses
- Sub-Class N k it is necessary to define a vector with the corresponding components MAXN 1 , MAXN 2 , . . . , MAXN( k-1 ), MAXN( k ), with the condition that MAXN 1 ⁇ MAXN 2 ⁇ .
- a data classification unit 1501 contains Class P, N, M, I, U, HM encoder and encoders for annotations and metadata.
- Class N encoder is configured with a vector of thresholds, MAXN, to MAXN k 1502 which generates k subclasses of N data ( 1506 ).
- the same principle is applied by defining a vector with the same properties for MAXM and MAXTOT respectively and use each vector components as threshold for checking if the functions f(n,s) and w(n,s,d,l,c) satisfy the constraint.
- the assignment is given to the lowest sub-class for which the constraint is satisfied.
- the number of sub-classes for each class type is independent and any combination of subdivisions is admissible. This is shown in FIG. 15 where a Class M encoder 1503 and a Class I encoder 1504 are configured respectively with a vector of thresholds MAXIM, to MAXIM, and MAXTOT 1 to MAXTOT h .
- the two encoders generate respectively j subclasses of M data ( 1507 ) and h subclasses of I data ( 1508 ).
- N has the lowest priority and I has the highest priority.
- the mismatches found for the reads classified in the classes N, M and I can be used to create “transformed” references to be used to compress more efficiently the read representation.
- Reads classified as belonging to the Classes N, M or I (with respect to the “pre-existing” (i.e. “external”) reference sequence denoted as RS 0 ) can be coded with respect to the “transformed” reference sequence RS, according to the occurrence of the actual mismatches with the “transformed” reference.
- read M in belonging to Class M (denoted as the ith read of class M) containing mismatches with respect to the reference sequence RS n
- FIG. 19 shows an example on how reads containing mismatches (belonging to Class M) with respect to reference sequence 1 (RS 1 ) can be transformed into perfectly matching reads with respect to the reference sequence 2 (RS 2 ) obtained from RS, by modifying the bases corresponding to the mismatch positions. They remain classified and they are coded together the other reads in the same data class access unit, but the coding is done using only the descriptors and descriptor values needed for a Class P read. This transformation can be denoted as:
- FIG. 18 shows an example on how a reference transformation is applied to reduce the number of mismatches to be coded on the mapped reads.
- FIG. 19 further shows an example of how reads can change the type of coding from a data class to another by means of the appropriate set of descriptors (e.g. using the descriptors of a Class P to code a read from Class M) after a reference transformation is applied and the read is represented using the “transformed” reference.
- the definition of the set of descriptors used for each class of data is provided in the following sections.
- further processing consists in defining a set of distinct syntax elements which represent the remaining information enabling the reconstruction of the read sequence when represented as being mapped on a given reference sequence.
- the data structure of these syntax elements requires the storage of global parameters and metadata to be used by the decoding engine.
- These data are structured in a Genomic Dataset Header described in the table below.
- a dataset is defined as the ensemble of coding elements needed to reconstruct the genomic information related to a single genomic sequencing run and all the following analysis. If the same genomic sample is sequenced twice in two distinct runs, the obtained data will be encoded in two distinct datasets.
- dataset_header ⁇ dataset_group_ID identifier of Dataset Group containing the Dataset including this dataset_header.
- Block_header_flag if set, all Blocks composing the Dataset are preceded by a Block Header if (block_header_flag) ⁇ mit_flag if set, the indexing tool called Master Index Table is present cc_ mode_flag if set, two Access Units of a class cannot be separated by Access Units of a different class in the storage device.
- a sequence read i.e. a DNA segment
- a given reference sequence can be fully expressed optionally using a subset formed of various combinations of the following descriptors:
- Genomic descriptors and their meaning descriptor_id short name description 1 pos the mapping position of a read on a reference sequence 2 pair in case of paired end reads, it represents how the reads are paired 3 rlen the length of a sequence read 4 rcomp the DNA or RNA strand the reads was mapped on 5 mmpos the position of mismatches (i.e.
- mapping flags to enable the aligner to further specify the result of the mapping process such as if the sequence read is a PCR or optical duplicate 9 mmap the multiple mapping positions that are associated to a single read or read pair by the mapping procedure 10 msar the identification of the existence of spliced reads (i.e.
- the descriptor clips identifies those parts of the reads (typically the edges) that do not match, with a specified set of matching accuracy constraints, with the “internal” references.
- Descriptor ureads is used to encode verbatim the reads that cannot be mapped on any available reference being it a pre-existing (i.e. “external” like an actual reference genome) or an “internal” reference sequence.
- This classification creates groups of descriptors (syntax elements) that can be used to univocally represent genome sequence reads.
- the table below summarizes the syntax elements needed for each class of reads aligned with “external” (i.e. “pre-existing”) or “internal” (i.e. “constructed”) references.
- the asterisk “*” indicates mandatory descriptors always present in all encoded read for each class.
- Reads belonging to class P are characterized and can be perfectly reconstructed by only a position, a reverse complement information and an offset between mates in case they have been obtained by a sequencing technology yielding mated pairs, some flags and a read length.
- the pos descriptor is used to calculate the absolute mapping position on a Reference Sequence of the left-most mapped base of a Genomic Record.
- the value of each pos descriptor represents the difference between the coordinates, on the Reference Sequence, of the left-most mapping base of a Genomic Record and the previous one.
- FIG. 4 shows an example of calculation of the pos descriptor for a mapped read pair.
- the first value of the pos descriptor in each coded Block is always 0 since no differential coding is possible for the first mapped read or read pair coded in an Access Unit.
- the absolute position of the first mapped read or read pair coded in an Access Unit is contained in the Access Unit Header.
- p 0 is the mapping value retrieved from the Access Unit Header for the first Genomic Record in the Access Unit.
- mapping position p for one base on a Reference Sequence is shown in FIG. 5 .
- the rcomp descriptor conveys information about the strandedness of reads.
- Each bit of a decoded rcomp descriptor is a flag indicating if the read is on the forward (bit set to 0) or reverse (bit set to 1) strand.
- FIG. 6 shows values and semantics of the rcomp descriptor for paired end reads. In the figure r1 is read 1 and r2 is read 2. According to the mapping position of each read, the rcomp descriptor can have four different values.
- the flag descriptor is a set of flaas as described in Table 6.
- bit position from LSB Semantics 0 read is PCR or optical duplicate 1 read fails platform/vendor quality checks 2 read mapped in proper pair 3 reserved 4 reserved 5 reserved 6 reserved 7 reserved
- the mmpos descriptor represents the position, within a read or read pairs, of a mismatch with respect to the reference sequence.
- the position is represented as a distance from the position of the previous mismatch in the Genomic Record.
- the position of the first mismatch is represented as the distance from the left-most mapped base in the Genomic Record.
- the positions of mismatches in read 2 are offset by the length of read 1. For example in case of reads with constant length equal to 100, if the first mismatches in the pair is in read 2 at position 44, the first mmpos descriptor decoded for this Genomic Record assumes the value 144.
- the mismatch positions are not offset by the length of read 1. For example in case of fixed read length 100, if the first mutation in read 2 is at position 44, but read 2 is unpaired, the first mmpos descriptor decoded for this Genomic Record assumes the value 44.
- Each mmpos descriptor is associated to an mmtype descriptor representing the type of mismatch occurring in the decoded read or read pair at the position calculated using mmpos.
- FIG. 7 An example of how to calculate mismatches positions in a read pair is provided in FIG. 7 where len 1 is the length of read1.
- the mmtype descriptor specifies the type of mismatch occurring in the decoded read at the position calculated using the associated mmpos descriptor.
- the mmtype descriptor does not have a reserved value for the terminator since each Genomic Record shall contain the same number of mmtype and mmpos descriptors.
- Table 8 lists the values and corresponding semantics of the mmtype descriptor according to the used alphabet.
- the clips descriptor is used to represent clipped bases (a.k.a. soft or hard clips) in mapped reads or read pairs. This descriptor encodes soft clips as sequences of ASCII characters with additional elements to identify the position of the clipped bases in the read or read pair. In case of hard clips only the position and the number of clipped bases is encoded. Each descriptor contains a Genomic Record identifier followed by information related to the clipped bases position in the Genomic Record and the actual clipped bases in case of soft clips.
- record_id is a counter of Genomic Records encoded in the current Access Unit.
- clips_pos represents the position of the next clipped bases in the read or read pair. Values for positions have the following meaning:
- clips_pos value semantics 0x00 soft clips before start of read 1 0x01 soft clips after end of read 1 0x02 soft clips before start of read 2 0x03 soft clips after end of read2 0x04 hard clips before start of read 1 0x05 hard clips after end of read 1 0x06 hard clips before start of read 2 0x07 hard clips after end of read2
- soft_clipped_base is one of the symbols of the alphabet identified by alphabet_id.
- hard_clipped_bases represent the number of hard clipped bases at the position indicated by the corresponding clips_pos;
- the ureads descriptor represents verbatim sequence reads as a sequence of ASCII characters belonging to the currently used alphabet identified by alphabet_id.
- a decoded rlen descriptor represents the length of the current sequence read as number of bases including soft clips.
- the information required to reconstruct paired reads is encoded using the pair descriptor.
- the pairing information associating one genomic segment to another can be expressed in three ways:
- the pairing information is encoded as described in points 2 and 3 above when the first two bytes of a decoded pair descriptor have one of the values listed in Table 11.
- FIG. 8 shows how the Genomic Record Length is defined as the number of genomic positions on a reference sequence separating the left-most mapped base from the right-most mapped base of a read or read pair. In case of read pairs this is the number of genomic position on the reference sequence separating the left-most base of read 1 from the right-most base of its mate read 2 when both reads are mapped on the same reference sequence.
- the “Pairing Distance” is defined as the difference between the left-most position of Read1 and the left-most position of Read2.
- the “Pairing Distance” is represented as a signed integer valued of the pair descriptor.
- read 2 When aligning reads to a reference sequence, read 2 can be mapped to a position that is smaller (e.g. to the left) than the mapping position of read 1; in this case, the pairing distance used in case 1 above will be negative. This implies that the information about reads strandedness is encoded in the sign of the pairing distance descriptor.
- the reads distance is encoded as a 2-bytes signed integer, where:
- the absolute position shall be encoded in the pair descriptor after the special value 0x7ffd or 0x8003 as defined in Table 11 and the two reads are encoded in two separate Genomic Records (i.e. the pair is “split”).
- the mscore descriptor provides a score per alignment. It shall be used to represent mapping/alignment score per read generated by genomic sequence reads aligners. The score shall be expressed using an exponent and fractional part. The number of bits used to represent the exponent and the fractional part are specified in the encoding parameters (see Parameter Set below). Table 12 shows how this is specified in IEEE RFC 754 for a 11-bits exponent and a 52-bits fractional part.
- the score of each alignment shall be represented by:
- the score is calculated as:
- the rgroup descriptor identifies the read group the Genomic Record belongs to. It is an unsigned 8-bit integer with values going from 0 to num_groups ⁇ 1. The presence of read groups in an Access Unit is signaled by num_groups >0 in the Parameter Set as defined in the Parameter Set defined below.
- the msar Multiple Segments Alignment Record
- the msar supports spliced reads and alternative secondary alignments which contain indels or soft clips. msar is intended to convey information on:
- msar can is used to represent mismatches, insertions, deletions, strandedness and clipped bases of secondary alignments of an aligned read
- the mmap descriptor is used to signal at how many positions the read or the leftmost read of a pair has been aligned.
- a Genomic Record containing multiple alignments is associated with one multi-byte mmap descriptor.
- the value of N represents the number of values of the pos descriptor which are coded for the template in the current record.
- N is followed by one or more 8-bit unsigned integers M i as described in this disclosure.
- the rcomp descriptor defined in this disclosure is used to specify the strandedness of each read alignment using the same syntax specified above.
- one mscore as defined in this disclosure is assigned to each alignment.
- spliced_reads_flag is unset.
- the mmap descriptor is composed by a 16-bit unsigned integer N followed by one or more 8-bit unsigned integers M i , with i assuming values from 1 to the number of complete first (here, the left-most) read alignments.
- M i is used to signal how many segments are used to align the second read (in this case, without splices, this is equal to the number of alignments), and then how many values of the pair descriptor are coded for that alignment of the first read.
- FIG. 9 illustrates the meaning of N, P and M i in case of multiple alignments without splices and FIG. 10 shows how the pos, pair and mmap descriptors are used to encode the multiple alignments information.
- the msar descriptor enables representaiton of splices length and strandedness as defined in this disclosure.
- the decoder After having decoded the mmap and the msar descriptors, the decoder knows how many reads or read pairs have been encoded to represent the multiple mappings and how many segments are composing each read or read pair mapping. This is shown in FIG. 11 and FIG. 12 .
- the mscore descriptor allows signaling the mapping score of an alignment. In single-end sequencing it will have N 1 values per template; in paired-end sequencing it will have a value for each alignment of the entire template (number of different alignments of the first read possibly +the number of further second read alignments, i.e. when M i ⁇ 1>0).
- the number of scores associated to each alignment is signalled by the encoding parameter
- N pos descriptors are used These are floating point values as Read pair: defined in this disclosure.
- the optional pairing descriptor is used when alignments are present on different reference sequences than the one currently encoded N + 1 mmap descriptors read (pair) 0 0 uniquely mapped
- Table 14 shows how to calculate the number of descriptors needed to represent multiple alignments in one Genomic Record in case of multiple alignments with splices.
- N pos descriptors are used defined in this disclosure
- a pair descriptor shall be used to represent the absolute read positions when there is for example a chimeric alignment with the mate on another chromosome.
- the pair descriptor shall be used to signal the reference and the position of the next record containing further alignments for the same template.
- the last record e.g. the third if alternative mappings are coded in three different AUs shall contain the reference and position of the first record.
- a reserved value of the pair descriptor shall be used (not the same as the one used for alignments present to another reference in case of unique alignment).
- the reserved value shall be followed by the reference sequence identifier and the position of the leftmost alignment among all those contained in the next AU (i.e. the first decoded value of the pos descriptor for that record).
- msar descriptor shall be used to represent how secondary alignments map on the reference sequence in case they contain indels and/or soft clips. If msar is represented by the special symbol “*” for a secondary alignment, the decoder shall reconstruct the secondary alignment from the primary alignment and the secondary alignment mapping position.
- Raw reads belong to class U only. They are encoded as unmapped reads in aligned datasets.
- descriptor_id Descriptor Semantics Comments 1 pos read mapping reads positions on the reference are 0-based position (signed) delta encoded with respect to the previous read 4 rcomp strand information for N bits reads in a template 8 flags used to cover SAM This is a placeholder to understand the type of flags and possibly flags we need to associate to each template. other flags SAM flags can be a starting point. 5 mmpos mismatch position (unsigned) delta encoded with respect to the previous mismatch 6 mmtype type of edit fixed alphabet operations: substitutions deletions insertions 7 clips string of nucleotides ASCII characters + position in the template with variable length (e.g. soft clips) 11 ureads unmapped reads ASCII characters encoded verbatim 3 rlen read lengths 32-bit integer 12 rtype read type This identifies the subset of descriptors needed to decode the read.
- the ureads descriptor represents verbatim sequence reads as a sequence of ASCII characters belonging to the currently used alphabet.
- the rtype descriptor is used to signal the subset of descriptors used to encode one unmapped read or read pair in a Genomic Record as shown in Table 15.
- the rtype descriptor also enables mixing reference-based and reference-less compression in the same Dataset.
- rtype>0 signals the set of descriptors to be used for reference less compression(in this case descriptors refer to the computed reference. when needed).
- rtype type of encoded reads description not used aligned reference the entire Dataset is encoded using based reference based compression for mapped reads 0 reference based the Dataset contains both read (pairs) encoded using reference based compression and reference less compression.
- the present invention uses context-adaptive binary arithmetic coding (CABAC) for the compression of the genomic descriptors.
- CABAC context-adaptive binary arithmetic coding
- the process of binarization converts a non-binary-valued symbol (e.g. a mapping position, a mapped read length or a mismatch type) into a binary code prior to arithmetic coding.
- Each binarization algorithm used in this disclosure is identified by an identifier as shown in Table 16.
- binarization_id Type of binarization 0 Binary Coding (BI) 1 Truncated Unary (TU) 2 Exponential Golomb (EG) 3 Signed Exponential Golomb (SEG) 4 Truncated Exponential Golomb (TEG) 5 Signed Truncated Exponential Golomb (STEG) 6 Split Unit-wise Truncated Unary (SUTU) 7 Signed Split Unit-wise Truncated Unary (SSUTU) 8 Double Truncated Unary (DTU) 9 Signed Double Truncated Unary (SDTU)
- binValue is the binarized value which can be either 0 or 1.
- the parsing process for genomic descriptors binarized using this technique begins with reading the bits starting at the current location in the bitstream up to and including the first non-zero bit, and counting the number of leading bits that are equal to 0.
- variable sym Val is then assigned as follows:
- read_bits reads a number of bits from a storage medium equal to the parameter passed as input.
- the value returned from read_bits(leadingZeroBits) is interpreted as a binary representation of an unsigned integer with the most significant bit written first.
- Table 18 illustrates the structure of the Exp-Golomb code by separating the bit string into “prefix” and “suffix” bits.
- the “prefix” bits are those bits that are parsed as specified above for the computation of leadingZeroBits, and are shown as either 0 or 1 in the bit string column of Table 18.
- the “suffix” bits are those bits that are parsed in the computation of symVal and are shown as x i in Table 18, with i in the range of 0 to leadingZeroBits ⁇ 1, inclusive. Each x i is equal to either 0 or 1.
- Table 19 illustrates the explicit assignments of bit strings to symVal values.
- a binarized syntax element is decoded using one of the following methods:
- the genomic descriptor is associated to the symVal by ordering the syntax element by its absolute value in increasing order and representing the positive value for a given absolute value with the lower symVal.
- Table 20 shows the assignment rule.
- This binarization process requires the use of an additional input parameter tegParam which defines how the binarization is calculated.
- Output of this process is the TEG binarization of the syntax element.
- the SUTU binary string is a concatenation of repeated TU binarizations, where each TU binarization is applied to portions of symVal which are splitUnitSize bits long.
- bitstream syntax for this binarization process is described below.
- (tmp ⁇ i) ⁇ return output_sym ⁇
- This binarization process requires the use of two input parameters splitUnitSize and outputSymSize.
- the SSUTU binary string is obtained by an extension of the SUTU binarization process with the sign of symVal coded as a separate flag.
- sign_flag represents the cabac decoding of a bit on context variable identified by ctxldx.
- decode_cabac_SUTU( ) represents the cabac decoding process for the SUTU binarization.
- This binarization process requires the use of two input parameters splitUnitSize and outputSymSize.
- the DTU binary string is a concatenation of two binarizations, namely the TU binarization and the SUTU binarization.
- the parameter cMax is used for the TU binarization, and parameters splitUnitSize and outputSymSize are used for the SUTU binarization (where its cMax is derived internally).
- decode_cabac_TU( ) represents the cabac decoding process for TU binarization.
- decode_cabac_SUTU( ) represents the cabac decoding process for SUTU binarization.
- This binarization process requires the use of two additional input parameters splitUnitSize and outputSymSize.
- the SDTU binary string is obtained by an extension of the DTU binarization process with the sign of symVal coded as a flag.
- sign_flag represents the cabac decoding of a bit on context variable identified by ctxIdx.
- decode_cabac_DTU( ) represents the cabac decoding with DTU binarization.
- Each binarization algorithm introduced in the previous sections requires configuration parameters at the encoding and decoding ends.
- said configuration parameters are encapsulated in a data structure described in Table 28.
- Each binarization algorithm is identified by an identifier as listed in Table 16.
- Binarization ID parameters 0 cLength 1 cMax 2 — 3 — 4 tegParam 5 stegParam 6 splitUnitSize, outputSymSize 7 splitUnitSize, outputSymSize 8 cMax, splitUnitSize, outputSymSize 9 cMaxsplitUnitSize, outputSymSize
- cMax represents the largest value to be binarized. Larger values will be truncated.
- cLength represents the numbers of bits with which the value is binarized.
- tegParam represents the tegParam variable defined in this disclosure for TEG binarization.
- stegParam represents the stegParam variable defined in this disclosure for STEG binarization.
- splitUnitSize represents the splitUnitSize variable defined in this disclosure for SUTU, SSUTU and DTU binarizations.
- outputSymSize represents the outputSymSize variable defined in this disclosure for SUTU, SSUTU DTU and SDTU binarizations.
- Table 30 can be obtained.
- the improvement in compression performance of the method described in this disclosure can be appreciated by comparison with the corresponding file sizes of BAM and CRAM approaches and one of the best compressors in literature known as DeeZ (see Numanagic, I., et al “Comparison of high-throughput sequencing data compression tools”, Nature Methods (ISSN: 1548-7091), vol. 13, p. 1005-1008 London: Nature Publishing Group, 2016). It has to be appreciated that the DeeZ, BAM and CRAM compression performance are calculated by adding the size of the compressed reference genome used for alignment to the sizes of the compressed genome sequence data. According to the principles of the present disclosure, the reference genome is embedded in the compressed file.
- compressed reference genome is a FASTA (ASCII text) file compressed using general purpose compressors such as GZIP, LZMA, Bzip2.
- general purpose compressors such as GZIP, LZMA, Bzip2.
- the reference genome hs37d5.fa was compressed using the xz Linux command with the option of maximum compression (-9).
- Table 30 shows the binarization applied to the genomic descriptors defined in this disclosure. When a concatenation of several binarizations is indicated, the different binarizations are applied to the different elements composing each descriptor as defined in this disclosure.
- descriptor_id short name binarization_id 1 pos 8 (aligned reads) 9 (raw reads) 2 pair 0, 6, 6, 6 3 rlen 6 4 rcomp 1 5 mmpos 0, 6 6 mmtype 0, 0, 1, 1 7 clips 5, 1, 3, 0 8 flags 0 9 mmap 6 10 msar 3 11 ureads 1 12 rtype 1 13 rftp 0, 6 14 rftt 0, 0, 1, 1 15 rgroup 1 16 mscore 1
- the binarized values for rftp are calculates as follows:
- Proposed Compressor BAM CRAM Deez method 9827_2#49.bam 4,755,859,110 3,124,448,497 2,592,665,720 2,164,362,407 (ERR317482)- Low coverage Reference 707,712,316 707,712,316 707,712,316 N/A* genome hs37d5.fa Total 5,463,571,426 3,832,160,813 3,300,378,036 2,164,362,407 NA12878_S1.bam- 117,653,446,187 64,565,636,391 64,334,196,408 47,759,141,388 High Coverage Reference 707,712,316 707,712,316 707,712,316 N/A* genome hs37d5.fa Total 118,361,158,503 65,273,348,707 65,041,908,724 47,759,141,388 *
- the parameters needed to encode and decode each Access Unit are encapsulated in a data structure named Parameter Set as defined in Table 31.
- terminator value 1 represents the value of the terminator symbol used for the mmpos descriptor defined in Table 1.
- number of classes 1 number of data classes encoded in all Access Units referring to these encoding parameters.
- class ID number of identifier associated to one of the data class defined in classes this disclosure P, N, M, I, HM, U).
- coding mode ID number of one of the coding modes defined in this disclosure descriptors number of groups 1 number of different values of the rgroup descriptor listed in table 1 present in all Access Units referring to the current encoding parameters.
- group name number of null-terminated string identifier of a read group groups multiple alignments 1 flag signaling the presence of multiple alignments in the flag Access Unit. When set to 0 no multiple alignments are present spliced reads flag 1 flag signaling the presence of spliced reads in the Access Unit. When set to 0 no spliced reads are present.
- the mscore descriptor is defined in table 1.
- the mscore descriptor is defined in table 1.
- FIG. 13 shows an encoding apparatus according to the principles of this invention.
- the encoding apparatus receives as input a reference genome 1302 and unaligned genomic sequences 1300 , for example produced by a genome sequencing apparatus.
- Genome sequencing apparatus are known in the art, like the Illumina HiSeq 2500 , the Thermo-Fisher Ion Torrent devices or the Oxford Nanopore MinION.
- the unaligned sequence data 1300 are fed to a reads alignment unit 1301 , which maps the sequences on a reference genome 1302 .
- the aligned genomic sequences 303 are then fed to a reference based compressor 1305 which generates genomic descriptors 1306 representing both mapped and unmapped genomic sequences.
- the genomic descriptors 1306 generated by the reference based compressor 1305 are first binarized by several binarization units 1307 and then entropy coded by several entropy coders 1308 .
- the entropy coded genomic descriptors are then fed to a multiplexing apparatus 1310 to build one or more Access Unit composing a compressed bitstream 1311 .
- the multiplexed bitstream contains as well coding parameters structures 1304 built by a coding parameters encoder 1309 .
- Each Access Unit contains entropy coded descriptors representing alignment information and sequence reads belonging to one class of data as defined in this disclosure.
- FIG. 14 shows a decoding apparatus according to the principles of this disclosure.
- a demultiplexing unit 1401 receives a multiplexed bitstream 1400 from a network or a storage element and extracts the entropy coded payload of the Access Units composing said bitstream.
- Entropy decoders 1402 receive the extracted payloads and decode the different types of genomic descriptors into their binary representations. Said binary representations are then fed to several binary decoders 1404 which generate genomic descriptors 1405 .
- a coding parameter decoder 1403 receives coding parameters multiplexed with the genomic information and feds them to the decoding unit 1406 .
- Genomic descriptors 1405 representing genomic sequence reads are fed to a sequence reads reconstruction unit 1406 which reconstructs the aligned genomic sequences 1407 using an available reference genome 1408 .
- inventive techniques herewith disclosed may be implemented in hardware, software, firmware or any combination thereof. When implemented in software, these may be stored on a computer medium and executed by a hardware processing unit.
- the hardware processing unit may comprise one or more processors, digital signal processors, general purpose microprocessors, application specific integrated circuits or other discrete logic circuitry.
- the techniques of this disclosure may be implemented in a variety of devices or apparatuses, including mobile phones, desktop computers, servers, tablets and similar devices.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- Biophysics (AREA)
- Databases & Information Systems (AREA)
- Bioethics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Chemical & Material Sciences (AREA)
- Analytical Chemistry (AREA)
- Data Mining & Analysis (AREA)
- Software Systems (AREA)
- Genetics & Genomics (AREA)
- Molecular Biology (AREA)
- Computer Security & Cryptography (AREA)
- Computer Hardware Design (AREA)
- Public Health (AREA)
- Evolutionary Computation (AREA)
- Epidemiology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Human Computer Interaction (AREA)
- Signal Processing (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
- Apparatus Associated With Microorganisms And Enzymes (AREA)
- Detection And Prevention Of Errors In Transmission (AREA)
- Labeling Devices (AREA)
Abstract
Description
- This application claims priority to and the benefit of Patent Applications PCT/US2017/041579 filed on Jul. 11, 2017 and PCT/US17/17842 filed on Feb. 14, 2017.
- This disclosure provides a novel method of representation of genome sequencing data which reduces the utilized storage space and improves access performance by providing new functionality that are not available with known prior art methods of representation.
- An appropriate representation of genome sequencing data is fundamental to enable efficient genomic analysis applications such as genome variants calling and all other analysis performed with various purposes by processing the sequencing data and metadata. Human genome sequencing has become affordable by the emergence of high-throughput low cost sequencing technologies. Such opportunity opens new perspectives in several fields ranging from the diagnosis and treatment of cancer to the identification of genetic illnesses, from pathogen surveillance for the identification of antibodies to the creation of new vaccines, drugs and the customization of personalized treatments.
- Hospitals, genomic analysis providers, bioinformatics and large biological data storage centers are looking for affordable, fast, reliable and interconnected genomic information processing solutions which could enable scaling genomic medicine to a world-wide scale. Since one of the bottleneck in the sequencing process has become data storage, methods for representing genome sequencing data in a compressed form are increasingly investigated.
- The most used genome information representations of sequencing data are based on zipping FASTQ and SAM formats. The objective is to compress the traditionally used file formats (respectively FASTQ and SAM for non-aligned and aligned data). Such files are constituted by plain text characters and are compressed, as mentioned above, by using general purpose approaches such as LZ (from Lempel and Ziv, the authors who published the first versions) schemes (the well-known zip, gzip etc). When general purpose compressors such as gzip are used, the result of compression is usually a single blob of binary data. The information in such monolithic form results quite difficult to archive, transfer and elaborate particularly when like in the case of high throughput sequencing the volume of data are extremely large. The BAM format is characterized by poor compression performance due to the focus on compression of the inefficient and redundant SAM format rather than on extracting the actual genomic information conveyed by SAM files and due to the adoption of general purpose text compression algorithms such as gzip rather than exploiting the specific nature of each data source (the genomic data itself).
- A more sophisticated approach to genomic data compression that is less used, but more efficient than BAM is CRAM. CRAM provides more efficient compression for the adoption of differential encoding with respect to a reference (it partially exploits the data source redundancy), but it still lacks features such as incremental updates, support for streaming and selective access to specific classes of compressed data.
- These approaches generate poor compression ratios and data structures that are difficult to navigate and manipulate once compressed. Downstream analysis can be very slow due to the necessity of handling large and rigid data structures even to perform simple operation or to access selected regions of the genomic dataset. CRAM relies on the concept of the CRAM record. Each CRAM record represents a single mapped or unmapped reads by coding all the elements necessary to reconstruct it.
- CRAM presents the following drawbacks and limitations that are solved and removed by the invention described in this disclosure:
- 1. CRAM does not support data indexing and random access to data subsets sharing specific features. Data indexing is out of the scope of the specification (see section 12 of CRAM specification v 3.0) and it is implemented as a separate file. Conversely the approach of the invention described in this disclosure employs a data indexing method that is integrated with the encoding process and indexes are embedded in the encoded (i.e. compressed) bit stream.
- 2. CRAM is built by core data blocks that can contain any type of mapped reads (perfectly matching reads, reads with substitutions only, reads with insertions or deletions (also referred to as “indels”)). There is no notion of data classification and grouping of reads in classes according to the result of mapping with respect to a reference sequence. This means that all data need to be inspected even if only reads with specific features are searched. Such limitation is solved by the invention by classifying and partitioning data in classes before coding.
- 3. CRAM is based on the concept of encapsulating each read into a “CRAM record”. This implies the need to inspect each complete “record” when reads characterized by specific biological features (e.g. reads with substitutions, but without “indels”, or perfectly mapped reads) are searched.
- Conversely, in the present invention there is the notion of data classes coded separately in separated information blocks and there is no notion of record encapsulating each read. This enables more efficient access to set of reads with specific biological characteristics (e.g. reads with substitutions, but without “indels”, or perfectly mapped reads) without the need of decoding each (block of) read(s) to inspect its features.
- 4. In a CRAM record each record field is associated to a specific flag and each flag must always have the same meaning as there is no notion of context since each CRAM record can contain any different type of data. This coding mechanism introduces redundant information and prevents the usage of efficient context based entropy coding. Instead in the present invention, there is no notion of flag denoting data because this is intrinsically defined by the information “block” the data belongs to. This implies a largely reduced number of symbols to be used and a consequent reduction of the information source entropy which results into a more efficient compression. Such improvement is possible because the use of different “blocks” enables the encoder to reuse the same symbol across each block with different meanings according to the context. In CRAM each flag must always have the same meaning as there is no notion of contexts and each CRAM record can contain any type of data.
- 5. In CRAM substitutions, insertions and deletions are represented by using different syntax elements, option that increases the size of the information source alphabet and yields a higher source entropy. Conversely, the approach of the disclosed invention uses a single alphabet and encoding for substitutions, insertions and deletions. This makes the encoding and decoding process simpler and produces a lower entropy source model which coding yields bitstreams characterized by high compression performance.
- The present invention aims at compressing genomic sequences by classifying and partitioning sequencing data so that the redundant information to be coded is minimized and features such as selective access and support for incremental updates are directly enabled in the compressed domain.
- One of the aspects of the presented approach is the definition of classes of data and metadata structured in different blocks and encoded separately. The more relevant improvements of such approach with respect to existing methods consist in:
- 1. the increase of compression performance due to the reduction of the information source entropy constituted by providing an efficient source model for each class of data or metadata;
- 2. the possibility of performing selective accesses to portions of the compressed data and metadata for any further processing purpose directly in the compressed domain;
- 3. the possibility to incrementally (i.e. without the need of decoding and re-encoding) update compressed data and metadata with new sequencing data and/or metadata and/or new analysis results associated to specific sets of sequencing reads.
-
FIG. 1 shows the definition of read 1 and read 2 in a read pair and the definition of left-most and right-most nucleotide in a mapped read. -
FIG. 2 shows how an Access Unit encapsulates compressed descriptors representing sequence reads mapped in a contiguous interval of the reference sequence. Header information is prepended to the compressed descriptors in order to enable data parsing. -
FIG. 3 shows how an Access Unit of type P is composed by a header and the multiplexing of blocks of descriptors representing the reads mapping positions (pos), the reverse complement information (rcomp), the pairing information in case of paired end reads (pair), the reads length in case of variable reads length (rlen) and mapping flags (flags). It is used to encode reads of class P. -
FIG. 4 shows the calculation of the pos descriptor for a mapped read pair where read pair n is mapped at position Pn and read pair n+1 is mapped at position Pm. -
FIG. 5 shows how to How to calculate the absolute mapping position of one base on a Reference Sequence. -
FIG. 6 shows the use of the rcomp descriptor for paired end reads. -
FIG. 7 shows an example of how to calculate mismatches positions in a read pair. -
FIG. 8 shows how the Genomic Record Length and the Pairing Distance are calculated. -
FIG. 9 shows multiple alignments without splicing. The left-most read has N alignments. N is the first value of mmap to be decoded and signals the number of alignments of the first read. The following N values of the mmap descriptor are decoded and are used to calculate P which is the number of alignments of the second read. -
FIG. 10 shows how the pos, pair and mmap descriptors are used to encode multiple alignments without splices. The left-most read has N alignments. -
FIG. 11 shows multiple alignments with splices. -
FIG. 12 shows multiple alignments with splices. Use of the pos, pair, mmap and msar descriptors to represent multiple alignments with splices. -
FIG. 13 depicts an encoder apparatus comprising the steps of aligning genomic sequences with respect to a reference genome, generating descriptors representing the genomic sequences with respect to the reference genome, compressing each block of descriptors with a dedicated entropy coder. -
FIG. 14 shows the decoding process of a compressed bitstream comprising the steps of demultiplexing the incoming bitstream to extract the entropy coded descriptors, entropy decoding of each type of descriptors, decoding of aligned sequence reads using the reference genome. -
FIG. 15 shows how encoders of data of class N, M and I are configured with vectors of thresholds and generate separate subclasses of N, M and I data classes. -
FIG. 16 shows how half mapped read pairs (class HM) can be used to fill unknown regions of a reference sequence by assembling longer contigs with unmapped reads. -
FIG. 17 shows how reference transformations can be applied to remove mismatches from reads. In some cases reference transformations may generate new mismatches or change the type of mismatches found when referring to the reference before the transformation has been applied. -
FIG. 18 shows how reference transformations can change the class reads belong to when all or a subset of mismatches are removed (i.e. the read belonging to class M before transformation is assigned to Class P after the transformation of the reference has been applied). -
FIG. 19 shows how all classes of data can use the same transformed reference for re-encoding or a different transformation can be used for each class N, M and I, or any combination thereof. - The features of the claims below solve the problem of existing prior art solutions by providing a method for encoding genome sequence data, said genome sequence data comprising reads of sequences of nucleotides, said method comprising the steps of:
- aligning said reads to one or more reference sequences thereby creating aligned reads,
- classifying said aligned reads according to specified matching rules with said one or more reference sequences, thereby creating classes of aligned reads,
- encoding said classified aligned reads as a multiplicity of blocks of syntax elements, structuring said blocks of syntax elements with header information thereby creating successive Access Units,
- wherein said encoding further comprises binarizing and entropy coding genomic descriptors.
- In another aspect of the coding method said binarizing and entropy coding genomic descriptors is such that the binarization and the entropy coding is different for the different descriptors.
- In another aspect of the coding method said descriptors comprise:
- pos for signaling the mapping position of a read on a reference sequence,
- rcomp for signaling the DNA or RNA strand the reads was mapped on mapping flags for enabling the aligner to further specify the result of the mapping process.
- In another aspect the coding method further comprises coding the following descriptors:
- mmpos for signaling the position of mismatches in aligned reads with respect to reference sequences,
- mmtype for signaling the types of mismatches with respect to reference sequences at the associated positions.
- In another aspect the coding method further comprises coding the clips descriptor for signaling soft or hard clipped nucleotides.
- In another aspect the coding method further comprises coding the rlen descriptor for signaling the length of each encoded sequence read.
- In another aspect the coding method further comprises coding the following descriptors:
- mmap for signaling the multiple mapping positions that are associated to a single read or read pair by the mapping procedure,
- msar for signaling the identification of the existence of spliced reads (i.e. reads that when split in chunks find mapping positions with higher degrees of matching accuracy than when they are mapped as single contiguous read mapped on a single position on a reference sequence).
- In another aspect the coding method further comprises coding the mscore descriptor to signal a mapping/alignment score per read as generated by genomic sequence reads aligners.
- In another aspect the coding method further comprises coding the pair descriptor to signal, in case of paired end reads, how the reads are paired.
- In another aspect the coding method further comprises coding the ureads descriptor to signal reads which could not be aligned at any position of the reference sequence.
- In another aspect the coding method further comprises coding the rtype descriptor used to signal the subset of descriptors used to encode sequence reads that cannot be mapped at any position of the reference sequence with specified degrees of matching accuracy.
- In another aspect the coding method further comprises coding the rgroup descriptor to signal to which read group the read belongs to.
- In another aspect the coding method further comprises coding the following descriptors:
- rftp for signaling the position of mismatches between a contig and a reference sequence. Positions of mismatches are terminated by a special terminator character,
- rftt for signaling the type of mismatches between a contig and a reference sequence.
- In another aspect of the coding method
- said pos descriptor is binarized using a double truncated unary code or a single double truncated unary code,
- said rcomp descriptor is binarized using a truncated unary code,
- said mapping flags descriptors are binarized using binary coding.
- In another aspect of the coding method
- said mmpos descriptor for signaling the position of mismatches in aligned reads with respect to reference sequences is binarized using a split unit wise truncated unary code,
- said mmtype descriptor for signaling the types of mismatches with respect to reference sequences at the associated positions is binarized using a truncated unary code.
- In another aspect of the coding method said clips descriptor for signaling soft or hard clipped nucleotides is binarized using a concatenation of Signed Truncated Exponential Golomb, Truncated Unary, Signed Exponential Golomb and Binary Codes.
- In another aspect of the coding method said rlen descriptor signaling the length of each encoded sequence read is binarized using a Split Unit-wise Truncated Unary code.
- In another aspect of the coding method:
- said mmap descriptor for signaling the multiple mapping positions that are associated to a single read or read pair by the mapping procedure is binarized using a Split Unit-wise Truncated Unary code,
- said msar descriptor for signaling the identification of the existence of spliced reads is binarized using a Signed Exponential Golomb code.
- In another aspect of the coding method said mscore descriptor to signal a mapping/alignment score per read as generated by genomic sequence reads aligners is binarized using a Truncated Unary code,
- In another aspect of the coding method said pair descriptor to signal, in case of paired end reads, how the reads are paired is binarized using a concatenation of Binary Coding and Split Unit-wise Truncated Unary code.
- In another aspect of the coding method said ureads descriptor to signal reads which could not be aligned at any position of the reference sequence is binarized using a Truncated Unary code.
- In another aspect of the coding method said rtype descriptor used to signal the subset of descriptors used to encode sequence reads that cannot be mapped at any position of the reference sequence with specified degrees of matching accuracy is binarized using a Truncated Unary code.
- In another aspect of the coding method said rgroup descriptor to signal to which read group the read belongs to is binarized using a Truncated Unary code.
- In another aspect of the coding method:
- said rftp descriptor for signaling the position of mismatches between a contig and a reference sequence is binarized using a concatenation of Binary Coding and Split Unit-wise Truncated Unary code,
- said rftt descriptor for signaling the type of mismatches between a contig and a reference sequence is binarized using a concatenation of Binary Coding and Truncated Unary code.
- In another aspect of the coding method the configuration parameters for the coding of said descriptors are contained in a syntax header.
- In another aspect of the coding method said configuration parameters are updated by creating updated syntax headers to be added to the encoded genomic file.
- In another aspect of the coding method said configuration parameters comprise a dataset type for signaling the type of data encoded in Access Units referring to this encoding parameters.
- In another aspect of the coding method said configuration parameters further comprise a reads length for signaling the length in nucleotides of sequence reads in case of constant reads length.
- In another aspect of the coding method said configuration parameters further comprise a quality values depth parameter for signaling the number of Quality Values associated to each coded nucleotide.
- In another aspect of the coding method said configuration parameters further comprise an alignment score depth for signaling the number of alignments scores associated to each coded alignments.
- In another aspect of the coding method said configuration parameters further comprise a terminator size for signaling the size in bytes of the terminator symbol used for the mmpos descriptor.
- In another aspect of the coding method said configuration parameters further comprise a terminator value for signaling the value of the terminator symbol used for the mmpos descriptor.
- In another aspect of the coding method said configuration parameters further comprise the number of classes for signaling the number of data classes encoded in all Access Units referring to said configuration parameters.
- In another aspect of the coding method said configuration parameters further comprise class identifiers to signal the identifiers associated to the data class defined in this disclosure (P, N, M, I, HM, U).
- In another aspect of the coding method said configuration parameters further comprise the number of descriptors for signaling the total number of descriptors contained in Access Units referring to said configuration parameters.
- In another aspect of the coding method said configuration parameters further comprise coding mode identifiers for signaling the coding modes defined in this disclosure.
- In another aspect of the coding method said configuration parameters further comprise a number of groups parameter for signaling the number of different values of the rgroup descriptor present in all Access Units referring to the current encoding parameters.
- In another aspect of the coding method said configuration parameters further comprise one or more group name parameters for signaling one or more read group identifiers.
- In another aspect of the coding method said configuration parameters further comprise a multiple alignments flag for signaling the presence of multiple alignments in the Access Unit.
- In another aspect of the coding method said configuration parameters further comprise a spliced reads flag for signaling the presence of spliced reads in the Access Unit. When set to 0 no spliced reads are present.
- In another aspect of the coding method said configuration parameters further comprise a multiple signature base flag for signaling the use of multiple signatures in an Access Unit containing unmapped sequence reads (Class U).
- In another aspect of the coding method said configuration parameters further comprise a signature size for signaling the size in bits of each integer representing an encoded signature.
- In another aspect of the coding method said configuration parameters further comprise a score exponent parameters for signaling the number of bits used to encode the exponent part of the multiple alignments score encoded in the mscore descriptor.
- In another aspect of the coding method said configuration parameters further comprise a score fractional parameter for signaling the number of bits used to encode the fractional part of the multiple alignments score encoded in the mscore descriptor.
- The present invention further provides a method for decoding encoded genomic data said genome sequence data comprising reads of sequences of nucleotides, said method comprising the steps of:
- parsing Access Units containing said encoded genomic data to extract multiple blocks of genomic descriptors by employing header information,
- decoding said multiplicity of blocks,
- wherein said decoding of multiplicity of blocks comprise decoding and de-binarizing genomic descriptors to extract aligned reads according to specific matching rules defining their classification with respect to one or more reference sequences.
- In another aspect of the decoding method said descriptors comprise:
- a pos descriptor for signaling the mapping position of a read on a reference sequence,
- a rcomp descriptor for signaling the DNA or RNA strand the reads was mapped on mapping flags for enabling the aligner to further specify the result of the mapping process.
- In another aspect the decoding method further comprises decoding the following descriptors:
- mmpos for signaling the position of mismatches in aligned reads with respect to reference sequences,
- mmtype for signaling the types of mismatches with respect to reference sequences at the associated positions.
- In another aspect the decoding method further comprises decoding the clips descriptor for signaling soft or hard clipped nucleotides.
- In another aspect the decoding method further comprises decoding the rlen descriptor for signaling the length of each encoded sequence read.
- In another aspect the decoding method further comprises decoding the following descriptors:
- mmap for signaling the multiple mapping positions that are associated to a single read or read pair by the mapping procedure,
- msar for signaling the identification of the existence of spliced reads (i.e. reads that when split in chunks find mapping positions with higher degrees of matching accuracy than when they are mapped as single contiguous read mapped on a single position on a reference sequence).
- In another aspect the decoding method further comprises decoding the mscore descriptor to signal a mapping/alignment score per read as generated by genomic sequence reads aligners.
- In another aspect the decoding method further comprises decoding the pair descriptor to signal, in case of paired end reads, how the reads are paired.
- In another aspect the decoding method further comprises decoding the ureads descriptor to signal reads which could not be aligned at any position of the reference sequence.
- In another aspect the decoding method further comprises decoding the rtype descriptor used to signal the subset of descriptors used to encode sequence reads that cannot be mapped at any position of the reference sequence with specified degrees of matching accuracy.
- In another aspect the decoding method further comprises decoding the rgroup descriptor to signal to which read group the read belongs to.
- In another aspect the decoding method further comprises decoding the following descriptors:
- rftp for signaling the position of mismatches between a contig and a reference sequence. Positions of mismatches are terminated by a special terminator character.
- rftt for signaling the type of mismatches between a contig and a reference sequence.
- In another aspect of the decoding method the configuration parameters for the decoding of said descriptors are extracted from a syntax header.
- In another aspect of the decoding method said configuration parameters comprise a dataset type for signaling the type of data encoded in Access Units referring to this encoding parameters.
- In another aspect of the decoding method said configuration parameters further comprise a reads length for signaling the length in nucleotides of sequence reads in case of constant reads length.
- In another aspect of the decoding method said configuration parameters further comprise a quality values depth parameter for signaling the number of Quality Values associated to each coded nucleotide.
- In another aspect of the decoding method said configuration parameters further comprise an alignment score depth for signaling the number of alignments scores associated to each coded alignments.
- In another aspect of the decoding method said configuration parameters further comprise a terminator size for signaling the size in bytes of the terminator symbol used for the mmpos descriptor.
- In another aspect of the decoding method said configuration parameters further comprise a terminator value for signaling the value of the terminator symbol used for the mmpos descriptor.
- In another aspect of the decoding method said configuration parameters further comprise the number of classes for signaling the number of data classes encoded in all Access Units referring to said configuration parameters.
- In another aspect of the decoding method said configuration parameters further comprise class identifiers to signal the identifiers associated to the data class defined in this disclosure (P, N, M, I, HM, U).
- In another aspect of the decoding method said configuration parameters further comprise the number of descriptors for signaling the total number of descriptors contained in Access Units referring to said configuration parameters.
- In another aspect of the decoding method said configuration parameters further comprise coding mode identifiers for signaling the coding modes defined in this disclosure.
- In another aspect of the decoding method said configuration parameters further comprise a number of groups parameter for signaling the number of different values of the rgroup descriptor present in all Access Units referring to the current encoding parameters.
- In another aspect of the decoding method said configuration parameters further comprise one or more group name parameters for signaling one or more read group identifiers.
- In another aspect of the decoding method said configuration parameters further comprise a multiple alignments flag for signaling the presence of multiple alignments in the Access Unit.
- In another aspect of the decoding method said configuration parameters further comprise a spliced reads flag for signaling the presence of spliced reads in the Access Unit. When set to 0 no spliced reads are present.
- In another aspect of the decoding method said configuration parameters further comprise a multiple signature base flag for signaling the use of multiple signatures in an Access Unit containing unmapped sequence reads (Class U).
- In another aspect of the decoding method said configuration parameters further comprise a signature size for signaling the size in bits of each integer representing an encoded signature.
- In another aspect of the decoding method said configuration parameters further comprise a score exponent parameters for signaling the number of bits used to encode the exponent part of the multiple alignments score encoded in the mscore descriptor.
- In another aspect of the decoding method said configuration parameters further comprise a score fractional parameter for signaling the number of bits used to encode the fractional part of the multiple alignments score encoded in the mscore descriptor.
- The present invention further provides an encoding apparatus comprising encoding means for implementing all the aspects of the previously mentioned encoding methods.
- The present invention further provides a decoding apparatus for implementing all the aspects of the previously mentioned decoding methods.
- The present invention further provides a file format comprising the previously defined genomic descriptors
- The present invention further provides a computer-readable medium comprising instructions that when executed cause at least one processor to perform all the aspects of the previously mentioned coding methods.
- The present invention further provides a computer-readable medium comprising instructions that when executed cause at least one processor to perform all the aspects of the previously mentioned decoding methods.
- The present invention further provides a support data storing genomic encoded according perform all the aspects of the previously mentioned coding methods.
- The genomic or proteomic sequences referred to in this invention include, for example, and not as a limitation, nucleotide sequences, Deoxyribonucleic acid (DNA) sequences, Ribonucleic acid (RNA), and amino acid sequences. Although the description herein is in considerable detail with respect to genomic information in the form of a nucleotide sequence, it will be understood that the methods and systems for compression can be implemented for other genomic or proteomic sequences as well, albeit with a few variations, as will be understood by a person skilled in the art.
- Genome sequencing information is generated by High Throughput Sequencing (HTS) machines in the form of sequences of nucleotides (a. k. a. “bases”) represented by strings of letters from a defined vocabulary. The smallest vocabulary is represented by five symbols: {A, C, G, T, N} representing the 4 types of nucleotides present in DNA namely Adenine, Cytosine, Guanine, and Thymine. In RNA Thymine is replaced by Uracil (U). N indicates that the sequencing machine was not able to call any base and so the real nature of the position is undetermined. In case the IUPAC ambiguity codes are adopted by the sequencing machine, the alphabet used for the symbols is (A, C, G, T, U, W, S, M, K, R, Y, B, D, H, V, N or -).
- The nucleotides sequences produced by sequencing machines are called “reads”. Sequence reads can be between a few dozens to several thousand nucleotides long. Some technologies produce sequence reads in “pairs” where one read is from one DNA strand and the second is from the other strand. A read associated to another read in a sequencing process producing pairs is said to be its mate.
- The process of arranging sequence reads to identify regions of similarity with segments of a reference genome according to a set of matching rules is called “alignment” or “mapping”.
- Throughout this disclosure, a reference sequence is a sequence of nucleotides associated to a mono-dimensional integer coordinate system for which each integer coordinate is associated to a single nucleotide. Coordinate values can only be equal or larger than zero. This coordinate system in the context of this invention is zero-based (i.e. the first nucleotide has coordinate 0 and it is said to be at position 0) and linearly increasing from left to right. A reference sequence is any sequence on which the nucleotides sequences produced by sequencing machines are aligned/mapped. One example of sequence could actually be a “reference genome”, a sequence assembled by scientists as a representative example of a species' set of genes. For example GRCh37, the Genome Reference Consortium human genome (build 37) is derived from thirteen anonymous volunteers from Buffalo, N.Y. However, a reference sequence could also consist of a synthetic sequence conceived and constructed to merely improve the compressibility of the reads in view of their further processing.
- A common element of efficient approaches to genomic sequence reads compression is the exploitation of the correlation of sequence data with respect to a reference sequence. Even if the somatic profile of the human population is extremely diversified, the actual portion of the number of nucleotides that differs from person to person is about only 0.1% of the total number of nucleotides composing an entire genome. Therefore, the specific genomic information characterizing each individual is very limited with respect to the entire information carried by an entire genome. When a pre-existing reference genome is available, be it for previous sequencing or as a published “average” consensus reference, the most common way as of today to encode the information is to identify and encode only the differences with respect to the reference genome.
- In order to do so with raw sequence reads, generally expressed in the form of FASTQ data files, a preliminary pre-processing step the mapping on a reference genome. In case an appropriate reference genome is not available, or if the bias introduced by the usage of a specific reference is not desirable, the construction of a new reference sequence by means of assembling the sequence reads at hand into longer sequences called contigs, is a possible alternative. When mapping sequence reads on a reference sequence, said reference sequence is used as axis of a mono-dimensional coordinate system in which the left-most position is denoted as position 0. For each sequence read, mapped to a reference sequence, the nucleotide mapped at the reference sequence position identified by the smallest coordinate number is usually referred to as the “left-most” nucleotide, whereas the nucleotide mapped at the reference sequence position identified by the largest coordinate number is referred to as the “right-most” nucleotide. This is illustrated in
FIG. 1 . Throughout this disclosure, a nucleotide is also referred to as a base. - When a sequence read is mapped to a reference sequence, the coordinate of the left-most mapped base is said to represent the mapping position of the read on the reference sequence.
- A base present in the aligned read and not present in the reference sequence (a.k.a. insertion) and bases preserved by the alignment process but not mapped on the reference sequence (a.k.a. soft clips) do not have mapping positions.
- When a sequence read cannot be mapped to any mapped position of the used reference sequences according to the specified matching rules, it is said to be unmapped.
- The process of building longer genomic sequences by looking for overlapping regions among sequence reads is called assembly.
- A longer genomic sequence built assembling shorter reads is called contig (see https://en.wikipedia.org/wiki/Contig).
- Sequence reads that fail to build any contig during an assembly process are said to be unaligned.
- A reference genome is composed by one or more reference sequences and it is assembled by scientists as a representative example of a species' set of genes. For example GRCh37, the Genome Reference Consortium human genome (build 37) is derived from thirteen anonymous volunteers from Buffalo, N.Y. However, a reference sequence could also consist of a synthetic sequence conceived and merely constructed to improve the compressibility of the reads in view of their further processing.
- In this disclosure the read composing a read pair with a base mapping on the smallest coordinate on a reference sequence is referred to as “
Read 1” whereas its mate is referred to as “Read 2”. - The distance, expressed as number of nucleotides (or bases), separating two reads generated as a pair, by a sequencing machine using current state of the art sequencing technology, is unknown, and it is determined by mapping both reads composing the pair (i.e. minimizing appropriate matching functions) to a reference sequence.
- Throughout this disclosure, a genomic record is a data structure encoding either a single sequence read or a paired sequence read optionally associated with alignment information, read identifier and quality values.
- Throughout this disclosure, an Access Unit (AU) is defined as a logical data structure containing a coded representation of genomic information or related metadata to facilitate the bit stream access and manipulation. It is the smallest data organization that can be decoded by a decoding device implementing the invention described in this disclosure.
- According to the type of coded information, an AU can be decoded either independently of any other AU or using information contained in other AUs.
- AUs can be classified into a multiplicity of types according to the nature of the coded sequence data. An Access Unit contains either a reference sequence, or a portion thereof, or encoded reads or read pairs belonging to a single class of data. Any single AU cannot contain two or more types of sequence data. For example an Access Unit may contain the
entire chromosome 1 of GRCh37, the Genome Reference Consortium human genome (build 37). Another Access Unit may contain the coded representation of nucleotides ofchromosome 1 of GRCh37 that are located between coordinates 50′000 and 150′000. Another Access Unit may contain only reads or read pairs that perfectly map on the reference sequence without any mismatch. Another Access Unit may contain reads or read pairs that only contain “N” symbols as mismatches with respect to the reference sequence. Another Access Unit may contain reads or read pairs that contain any type of substitutions (e.g. one base present in the read or read pair is different from the base at the corresponding mapping position in the reference sequence). Another Access Unit may contain reads or read pairs that contain mismatches, insertions, deletions and soft clipped bases. Another Access Unit may contain only read or read pairs that do not map on the reference sequence. Another Access Unit may contain only read pairs in which one read is mapped and the other is not mapped on the reference sequence. Another type of Access Unit may contain only encoded segments of a reference genome composed by one or more reference sequences (for example chromosomes). - The essential characteristic of an Access Unit is that it contains in compressed form all elements needed to reconstruct the genomic information (sequence reads or read pairs, reference sequences), associated alignment information and metadata of reads or read pairs it represents. In other words, to fully reconstruct the reads, read pairs or reference sequence and associated information carried by an Access Unit it is only necessary to retrieve the Access Unit itself and, when applicable, the Access Units containing the reference sequence it refers to.
- In each Access Unit the descriptors listed in the next section representing the encoded read or read pairs are aggregated in separate data blocks—one per type—in order to exploit their homogeneous statistical properties for achieving high performance entropy coding.
- Each Access Unit contains the compressed sub-set of descriptors representing sequence reads or read pairs belonging to the same class mapped to a genomic region on a reference sequence. Such genomic region on the reference sequence is defined by a start coordinate (or start position) and an end coordinate (or end position).
- An example of Access Unit is illustrated in
FIG. 2 andFIG. 3 . Access Units are composed by blocks of encoded genomic descriptors (described in the next section). To enable transport over a network, blocks are further decomposed into packets. When compressing genomic sequence reads, each Access Units contains compressed descriptor representing either sequence reads mapped to a genomic interval on the reference sequence or unmapped sequence reads. Access Units can be used to carry reference genomes or parts thereof. A reference sequence can be either encoded as a single long sequence of nucleotides or split into shorter sequences encoded as unmapped genomic sequence reads. - In the context of this disclosure the following definitions apply:
- access unit start position: left-most genomic record position among all genomic records contained in the access unit.
- access unit end position: right-most base position among all mapped bases of all genomic records contained in the access unit.
- access unit range: the genomic range comprised between the access unit start position and the right-most genomic record position among all genomic records contained in the access unit.
- access unit size: number of genomic records contained in an access unit.
- access unit covered region: genomic range comprised between the Access Unit start position and the Access Unit end position.
- In the context of this disclosure, one or more access unit are organized in a structure called genomic dataset. A genomic dataset is a compression unit containing headers and access units. The set of access units composing the genomic dataset constitutes the genomic dataset payload.
- A collection of one or more genomic datasets is called dataset group.
- In the context of this disclosure, genomic descriptors are syntax elements representing part of the information (and also elements of a syntax structure of a file format and/or a bitstream) necessary to reconstruct (i.e. decode) coded reference sequences, sequence reads and the associated mapping information. The genomic descriptors disclosed in this invention are listed in Table 3.
- According to the method disclosed in this invention, reference sequences or portion thereof, sequence reads and the associated alignment information are coded using a sub-set of the descriptors listed above which are then entropy coded using a multiplicity of entropy coders according to each descriptor specific statistical properties. Blocks of compressed descriptors with homogeneous statistical properties are structured in Access Units which represent the smallest coded representation of one or more genomic sequence that can be manipulated by a device implementing the invention described in this disclosure.
- Genomic descriptors are organized in blocks and streams as defined below.
- A block is defined as a data unit composed by a header and a payload, which is composed by portions of compressed descriptors of the same type.
- A descriptor stream is defined as a sequence of encoded descriptor blocks used to decode a descriptor of a specific Data Class.
- Sequencing devices can introduce errors in the sequence reads such as:
-
- 1. the decision of skipping a base call due to the lack of confidence in calling any specific base. This is called an unknown base and labeled as “N” (denoted as mismatch of “n type”);
- 2. the use of a wrong symbol (i.e. representing a different nucleic acid) to represent the nucleic acid actually present in the sequenced sample; this is usually called “substitution error” (denoted as mismatch of “s type”);
- 3. the insertion in one sequence read of additional symbols that do not refer to any actually present nucleic acid; this is usually called “insertion error” (denoted as mismatch of “i type”);
- 4. the deletion from one sequence read of symbols that represent nucleic acids that are actually present in the sequenced sample; this is usually called “deletion error” (denoted as mismatch of “d type”);
- 5. the recombination of one or more fragments into a single fragment which does not reflect the reality of the originating sequence; this usually results in aligners decision to clip bases (denoted as mismatch of “c type”).
- In genome sequencing the term “coverage” is used to express the level of redundancy of the sequence data with respect to a “reference sequence”. The average coverage of aligned genome sequencing data is the average number of time each base at each position of the reference genome is present in the aligned data. For example, to reach a coverage of 30× on a human genome (3.2 billion bases long) a sequencing machine shall produce a total of 30×3.2 billion bases so that in average each position in the reference is “covered” 30 times.
- The coverage is said to be:
-
- partial (less than 1×) when some parts of the reference genome are not mapped by any available sequence read;
- single (1×) when all nucleotides of the reference genome are mapped by one and only one symbol present in the sequence reads;
- multiple (2×, 3×, N×) when each nucleotide of the reference genome is mapped multiple times.
- This invention aims at defining a genomic information representation format in which the relevant information is efficiently accessible and transportable and the weight of the redundant information is reduced.
- The main innovative aspects of the disclosed invention are the following.
- 1 Sequence reads are classified and partitioned into data classes according to the results of the alignment with respect to reference sequences. Such classification and partitioning enables the selective access to encoded data according to criteria related to the alignment results and to the matching accuracy.
- 2 The classified sequence reads and the associated metadata are represented by genomic descriptors organized in blocks with homogeneous statistical properties enabling the definition of distinct information sources characterized by a low information entropy.
- 3 The possibility of modeling each separated information source with distinct source model adapted to the statistical characteristics of each class and the possibility of changing the source model within each class of reads and within each descriptor block for each separately accessible data units (Access Units). The adoption of the appropriate transformation, binarization and context adaptive probability models and associated entropy coders according to the statistical properties of each source model.
- 4 The definition of correspondences and dependencies among the descriptors blocks to enable the selective access to the sequencing data and associated metadata without the need to decode all the descriptors blocks if not all information is required.
- 5 The coding of each sequence data class and associated metadata blocks with respect to “pre-existing” (also denoted as “external”) reference sequences or with respect to “transformed” reference sequences obtained by applying appropriate transformations to “pre-existing” reference sequences so as to reduce the entropy of the descriptors blocks information sources. Said descriptors represent the reads partitioned into the different data classes. Following any encoding of reads using the corresponding descriptors with reference to a “pre-existing” reference or “transformed” “pre-existing” reference sequence, the occurrence of the various mismatches can be used to define the appropriate transformations to the reference sequences in order to find a final coded representation with low entropy and achieve higher compression efficiency.
- 6 The construction of one or more reference sequences (also referred to as “internal” references to distinguish from the “pre-existing” also referred here as “external” reference sequences) used to encode the class of reads that present a degree of matching accuracy with respect to the pre-existing reference sequences not satisfying a set of constraints. Such constraints are set with the objective that the coding costs of representing in compressed form the class of reads aligned with respect to the “internal” reference sequences and the cost of representing the “internal” reference sequences themselves, is lower than encoding the unaligned class of reads verbatim, or using the “external” reference sequences without or with transformations.
- 7 The transmission of the configuration parameters governing the process of both encoding and decoding by means of data structures embedded in the compressed genomic data in the form of header information. Such configuration parameters can be updated during the encoding process in order to improve the compression performance. Such updates are conveyed in the compressed content in the form of updated configuration data structures. In the following, each of the above aspects will be further described in details.
- The sequence reads generated by sequencing machines are classified by the disclosed invention into six different “classes” according to the matching results of the alignment with respect to one or more “pre-existing” reference sequences.
- When aligning a DNA sequence of nucleotides with respect to a reference sequence the following cases can be identified:
-
- 1. A region in the reference sequence is found to match the sequence read without any error (i.e. perfect mapping). Such sequence of nucleotides is referenced to as “perfectly matching read” or denoted as “Class P”.
- 2. A region in the reference sequence is found to match the sequence read with a type and a number of mismatches determined only by the number of positions in which the sequencing machine generating the read was not able to call any base (or nucleotide). Such type of mismatches are denoted by an “N”, the letter used to indicate an undefined nucleotide base.
- In this disclosure this type of mismatch are referred to as “n type” mismatch. Such sequences belong to “Class N” reads. Once the read is classified to belong to “Class N” it is useful to limit the degree of matching inaccuracy to a given upper bound and set a boundary between what is considered a valid matching and what it is not. Therefore, the reads assigned to Class N are also constrained by setting a threshold (MAXN) that defines the maximum number of undefined bases (i.e. bases called as “N”) that a read can contain. Such classification implicitly defines the required minimum matching accuracy (or maximum degree of mismatch) that all reads belonging to Class N share when referred to the corresponding reference sequence, which constitutes an useful criterion for applying selective data searches to the compressed data.
-
- 3. A region in the reference sequence is found to match the sequence read with types and number of mismatches determined by the number of positions in which the sequencing machine generating the read was not able to call any nucleotide base, if present (i.e. “n type” mismatches), plus the number of mismatches in which a different base, than the one present in the reference, has been called. Such type of mismatch denoted as “substitution” is also called Single Nucleotide Variation (SNV) or Single Nucleotide Polymorphism (SNP). In this disclosure this type of mismatch is also referred to as “s type” mismatch. The sequence read is then referenced to as “M mismatching reads” and assigned to “Class M”. Like in the case of “Class N”, also for all reads belonging to “Class M” it is useful to limit the degree of matching inaccuracy to a given upper bound, and set a boundary between what is considered a valid matching and what it is not. Therefore, the reads assigned to Class M are also constrained by defining a set of thresholds, one for the number “n” of mismatches of “n type” (MAXN) if present, and another for the number of substitutions “s” (MAXS). A third constraint is a threshold defined by any function of both numbers “n” and “s”, f(n,s). Such third constraint enables to generate classes with an upper bound of matching inaccuracy according to any meaningful selective access criterion. For instance, and not as a limitation, f(n,s) can be (n+s)½ or (n+s) or any linear or non-linear expression that sets a boundary to the maximum matching inaccuracy level that is admitted for a read belonging to “Class M”. Such boundary constitutes a very useful criterion for applying the desired selective data searches to the compressed data when analyzing sequence reads for various purposes because it makes possible to set a further boundary to any possible combination of the number of “n type” mismatches and “s type” mismatches (substitutions) beyond the simple threshold applied to the one type or to the other.
- 4. A fourth class is constituted by sequencing reads presenting at least one mismatch of any type among “insertion”, “deletion” (a.k.a. indels) and “clipped”, plus, if present, any mismatches type belonging to class N or M. Such sequences are referred to as “I mismatching reads” and assigned to “Class I”. Insertions are constituted by an additional sequence of one or more nucleotides not present in the reference, but present in the read sequence. In this disclosure this type of mismatch is referred to as “i type” mismatch. In literature when the inserted sequence is at the edges of the sequence it is also referred to as “soft clipped” (i.e. the nucleotides are not matching the reference but are kept in the aligned reads contrarily to “hard clipped” nucleotides which are discarded). In this disclosure this type of mismatch is referred to as “c type” mismatch. Keeping or discarding nucleotides is a decisions taken by the aligner stage and not by the classifier of reads disclosed in this invention which receives and processes the reads as they are determined by the sequencing machine or by the following alignment stage. Deletion are “holes” (missing nucleotides) in the read with respect to the reference. In this disclosure this type of mismatch is referred to as “d type” mismatch. Like in the case of classes “N” and “M” it is possible and appropriate to define a limit to the matching inaccuracy. The definition of the set of constraints for “Class I” is based on the same principles used for “Class M” and is reported in Table 1 in the last table lines. Beside a threshold for each type of mismatch admissible for Class I data, a further constraint is defined by a threshold determined by any function of the number of the mismatches “n”, “s”, “d”, “i” and “c”, w(n,s,d,i,c). Such additional constraint make possible to generate classes with an upper bound of matching inaccuracy according to any meaningful user defined selective access criterion. For instance, and not as a limitation, w(n,s,d,i,c) can be (n+s+d+i+c)⅕ or (n+s+d+i+c) or any linear or non-linear expression that sets a boundary to the maximum matching inaccuracy level that is admitted for a read belonging to “Class I”. Such boundary constitutes a very useful criterion for applying the desired selective data searches to the compressed data when analyzing sequence reads for various purposes because it enables to set a further boundary to any possible combination of the number of mismatches admissible in “Class I” reads beyond the simple threshold applied to each type of admissible mismatch.
- 5. A fifth class includes all reads that do not find any mapping considered valid (i.e not satisfying the set of matching rules defining an upper bound to the maximum matching inaccuracy as specified in Table 1) for each data class when referring to the reference sequence. Such sequences are said to be “Unmapped” when referring to the reference sequences and are classified as belonging to the “Class U”.
- 6. In case of paired-end reads, a sixth class is defined in which one read of the pair cannot map at any position of the reference genome (i.e. belongs to Class U) and the other read belongs to one of the P, N, M, I classes. Such class is named “Class HM” from Half-Mapped.
- The classification specified in the previous section concerns single sequence reads. In the case of sequencing technologies that generates read in pairs (i.e. Illumina Inc.) in which two reads are known to be separated by an unknown sequence of variable length, it is appropriate to consider the classification of the entire pair to a single data class. A read that is coupled with another is said to be its “mate”.
- If both paired reads belong to the same class the assignment to a class of the entire pair is the following: the entire pair is assigned to the same class for any class (i.e. P, N, M, I, U). In the case the two reads belong to a different class, but none of them belongs to the “Class U”, then the entire pair is assigned to the class with the highest priority defined according to the following expression:
-
P<N<M<I - in which “Class P” has the lowest priority and “Class I” has the highest priority.
- In case only one of the reads belongs to “Class U” and its mate to any of the Classes P, N, M, I a sixth class is defined as “Class HM” which stands for “Half Mapped”.
- The definition of such specific class of reads and the associated assignment rules are motivated by the fact that such data is used for attempting to determine gaps or unknown regions of reference genomes (a.k.a. little known or unknown regions). Such regions are reconstructed by mapping pairs at the edges using the pair read that can be mapped on the known regions. The unmapped mate is then used to build the so called “contigs” of the unknown region as it is shown in
FIG. 16 . Therefore providing a selective access to only such type of read pairs greatly reduces the associated computation burden enabling much efficient processing of such data originated by large amounts of data sets that using the state of the art solutions would require to be entirely inspected. - The table below summarizes the matching rules applied to reads in order to define the class of data each read belongs to. The rules are defined in the first five columns of the table in terms of presence or absence of type of mismatches (n, s, d, i and c type mismatches). The sixth column provide rules in terms of maximum threshold for each mismatch type and any function f(n,s) and w(n,s,d,i,c) of the possible mismatch types.
-
TABLE 1 Type of mismatches and set of constrains that each sequence reads must satisfy to be classified in the data classes defined in this invention disclosure. Number and types of mismatches found when matching a read with a reference sequence Number of Set of unknown Number matching bases Number of Number of Number of of clipped accuracy (“N”) substitutions deletions Insertions bases constraints Class 0 0 0 0 0 0 P n > 0 0 0 0 0 n ≤ MAXN N n > MAXN U n ≥ 0 s > 0 0 0 0 n ≤ MAXN and M s ≤ MAXS and f(n,s) ≤ MAXM n > MAXN or U s > MAXS or f(n,s) > MAXM n ≥ 0 s ≥ 0 d ≥ 0 * i ≥ 0 * c ≥ 0 * n ≤ MAXN and I s ≤ MAXS and d ≤ MAXD and i ≤ MAXI and *At least one mismatch of type d, c ≤ MAXC i, c must be present (i.e. d > 0 or w(n,s,d,i,c) ≤ i > 0 or c > 0) MAXTOT d ≥ 0 i ≥ 0 c ≥ 0 n > MAXN or U s > MAXS or d > MAXD or i > MAXI or c > MAXC w(n,s,d,i,c) > MAXTOT - The data classes of type N, M and I as defined in the previous sections can be further decomposed into an arbitrary number of distinct sub-classes with different degrees of matching accuracy. Such option is an important technical advantage in providing a finer granularity and as consequence a much more efficient selective access to each data class. As an example and not as a limitation, to partition the Class N into a number k of subclasses (Sub-Class N1, . . . , Sub-Class Nk) it is necessary to define a vector with the corresponding components MAXN1, MAXN2, . . . , MAXN(k-1), MAXN(k), with the condition that MAXN1<MAXN2< . . . <MAXN(k-1_<MAXN and assign each read to the lowest ranked sub-class that satisfy the constrains specified in Table 1 when evaluated for each element of the vector. This is shown in
FIG. 15 where adata classification unit 1501 contains Class P, N, M, I, U, HM encoder and encoders for annotations and metadata. Class N encoder is configured with a vector of thresholds, MAXN, toMAXN k 1502 which generates k subclasses of N data (1506). - In the case of the classes of type M and I the same principle is applied by defining a vector with the same properties for MAXM and MAXTOT respectively and use each vector components as threshold for checking if the functions f(n,s) and w(n,s,d,l,c) satisfy the constraint. Like in the case of sub-classes of type N, the assignment is given to the lowest sub-class for which the constraint is satisfied. The number of sub-classes for each class type is independent and any combination of subdivisions is admissible. This is shown in
FIG. 15 where aClass M encoder 1503 and a Class I encoder 1504 are configured respectively with a vector of thresholds MAXIM, to MAXIM, and MAXTOT1 to MAXTOTh . The two encoders generate respectively j subclasses of M data (1507) and h subclasses of I data (1508). - When two reads in a pair are classified in the same sub-class, then the pair belongs to the same sub-class.
- When two reads in a pair are classified into sub-classes of different classes, then the pair belongs to the sub-class of the class of higher priority according to the following expression:
-
N<M<I - where N has the lowest priority and I has the highest priority.
- When two reads belong to different sub-classes of one of classes N or M or I, then the pair belongs to the sub-class with the highest priority according to the following expressions:
-
N1<N2< . . . <Nk -
M1<M2< . . . <Mj -
I1<I2< . . . <Ih - where the highest index has the highest priority.
- The mismatches found for the reads classified in the classes N, M and I can be used to create “transformed” references to be used to compress more efficiently the read representation. Reads classified as belonging to the Classes N, M or I (with respect to the “pre-existing” (i.e. “external”) reference sequence denoted as RS0) can be coded with respect to the “transformed” reference sequence RS, according to the occurrence of the actual mismatches with the “transformed” reference. For example if readM in belonging to Class M (denoted as the ith read of class M) containing mismatches with respect to the reference sequence RSn, then after “transformation” readM in=readP i(n+1) can be obtained with A(Refn)=Refn+1 where A is the transformation from reference sequence RSn to reference sequence RSn+1.
-
FIG. 19 shows an example on how reads containing mismatches (belonging to Class M) with respect to reference sequence 1 (RS1) can be transformed into perfectly matching reads with respect to the reference sequence 2 (RS2) obtained from RS, by modifying the bases corresponding to the mismatch positions. They remain classified and they are coded together the other reads in the same data class access unit, but the coding is done using only the descriptors and descriptor values needed for a Class P read. This transformation can be denoted as: -
RS 2 =A(RS 1) - When the representation of the transformation A which generates RS2when applied to RS1 plus the representation of the reads versus RS2 corresponds to a lower entropy than the representation of the reads of class M versus RS1, it is advantageous to transmit the representation of the transformation A and the corresponding representation of the read versus RS2 because an higher compression of the data representation is achieved.
- The coding of the transformation A for transmission in the compressed bitstream requires the definition of two additional syntax elements as defined in the table below.
-
Syntax elements Semantic Comments rftp Reference transformation position of difference between position reference and contig used for prediction rftt Reference transformation type of difference between type reference and contig used for prediction. Same syntax described for the mmtype descriptor defined below in this disclosure. -
FIG. 18 shows an example on how a reference transformation is applied to reduce the number of mismatches to be coded on the mapped reads. - It has to be observed that, in some cases the transformation applied to the reference:
-
- May introduce mismatches in the representations of the reads that were not present when referring to the reference before applying the transformation.
- May modify the types of mismatches, a read may contain A instead of G while all other reads contain C instead of G), but mismatches remain in the same position.
- Different data classes and subsets of data of each data class may refer to the same “transformed” reference sequences or to reference sequences obtained by applying different transformations to the same pre-existing reference sequence.
-
FIG. 19 further shows an example of how reads can change the type of coding from a data class to another by means of the appropriate set of descriptors (e.g. using the descriptors of a Class P to code a read from Class M) after a reference transformation is applied and the read is represented using the “transformed” reference. This occurs for example when the transformation changes all bases corresponding to the mismatches of a read in the bases actually present in the read, thus virtually transforming a read belonging to Class M (when referring to the original non “transformed” reference sequence) into a virtual read of Class P (when referring to the “transformed” reference). The definition of the set of descriptors used for each class of data is provided in the following sections. -
FIG. 19 shows how the different classes of data can use the same “transformed” reference R1=A0(R0) (1900) to re-encode the reads, or different transformations AN (1901), AM (1902), AI(1903) can be separately applied to each class of data to produce different reference genomes RN, RM, RI - Once the classification of reads is completed with the definition of the Classes, further processing consists in defining a set of distinct syntax elements which represent the remaining information enabling the reconstruction of the read sequence when represented as being mapped on a given reference sequence. The data structure of these syntax elements requires the storage of global parameters and metadata to be used by the decoding engine. These data are structured in a Genomic Dataset Header described in the table below. A dataset is defined as the ensemble of coding elements needed to reconstruct the genomic information related to a single genomic sequencing run and all the following analysis. If the same genomic sample is sequenced twice in two distinct runs, the obtained data will be encoded in two distinct datasets.
-
TABLE 2 Genomic Dataset Header. Element Description dataset_header { dataset_group_ID identifier of Dataset Group containing the Dataset including this dataset_header. dataset_ID identifier of the Dataset. major_brand the brand identifier, identifying the data (compression) format specification the Dataset complies with minor_version minor version number of the data (compression) format specification the Dataset complies with unmapped_indexing_flag flag indicating the presence of indexed Access Units of Class U byte_offset_size_flag unmber of bits used to index the Access Units non_overlapping_AU_range if set, all Access Units in the Dataset have non- overlapping ranges. block_header_flag if set, all Blocks composing the Dataset are preceded by a Block Header if (block_header_flag) { mit_flag if set, the indexing tool called Master Index Table is present cc_ mode_flag if set, two Access Units of a class cannot be separated by Access Units of a different class in the storage device. block_start_code_prefix_flag if set, all Access Unit Headers start by the three bytes 0x000001 } else { ordered_blocks_flag if set, Access Unit are ordered by increasing value of the mapping position of the first read encoded in each Access Unit } seq_count number of reference sequences used in this Dataset reference_ID unique identification number of the reference genome used in the Dataset for (seq=0;seq<seq_count;seq++) { seq_ID[seq] reference sequence identifier } for (seq=0;seq<seq_count;seq++) { seq_blocks[seq] number of Access Units per reference sequence } dataset_type 0 = non-aligned content; 1 = aligned content; 2 = reference num_classes number of data classes encoded in the Dataset for (ci=0;ci<num_classes;ci++) { clid[ci] identifies the class of data carried by the Access Unit num_descriptors[ci] maximum number of Descriptors per Class encoded in the Dataset for (desc_ID=0;desc_ID<num_descriptors[ci];desc_ID++) { descriptor_ID[ci][desc_ID] unambiguous genomic descriptor identifier as defined in Table 1 } alphabet_ID identifier of the alphabet used to encode the Class U data signatures num_U_clusters number of clusters of class U reads U_signature_constant_length 1 = constant length; 0 = variable length U_signature_size size in bits of each integer representing an encoded signature if (U_signature_constant_length){ U_signature_length length of class U clusters signatures as number of nucleotides } num_U_access_units total number of Access Units in the Dataset containing encoded data of class U multiple_alignment_flag if set it indicates the presence of multiple alignments in the Dataset multiple_signature_base default number of signatures for reads of class U tflag[0] flag indicating the presence of a threshold thres for the corresponding reference sequence thres[0] for (i=1;i<seq_count;i++) { tflag[i] flag indicating the presence of a threshold thres for the corresponding reference sequence if(tflag[i] == 1) thres[i] threshold indicating the maximum difference between the access unit covered region and the access unit range else /* tflag[i] == 0 */ thres[i] = thres[i−1] if tflag is not set for a reference sequence, the previous value apply } } - A sequence read (i.e. a DNA segment) referred to a given reference sequence can be fully expressed optionally using a subset formed of various combinations of the following descriptors:
-
TABLE 3 Genomic descriptors and their meaning descriptor_id short name description 1 pos the mapping position of a read on a reference sequence 2 pair in case of paired end reads, it represents how the reads are paired 3 rlen the length of a sequence read 4 rcomp the DNA or RNA strand the reads was mapped on 5 mmpos the position of mismatches (i.e. substitutions, deletion and insertions) in aligned reads with respect to reference sequences 6 mmtype the types of mismatches with respect to reference sequences at the associated positions 7 clips the bases that could not be mapped on the reference sequence by the mapping procedure and either kept (“soft clipped” bases) or discarded (“hard clipped” bases) 8 flags mapping flags to enable the aligner to further specify the result of the mapping process such as if the sequence read is a PCR or optical duplicate 9 mmap the multiple mapping positions that are associated to a single read or read pair by the mapping procedure 10 msar the identification of the existence of spliced reads (i.e. reads that when split in chunks find mapping positions with higher degrees of matching accuracy than when they are mapped as single contiguous read mapped on a single position on a reference sequence) 11 ureads the representation of sequence reads that cannot be mapped at any position of the reference sequence with specified degrees of matching accuracy, 12 rtype descriptor used to signal the subset of descriptors used to encode sequence reads that cannot be mapped at any position of the reference sequence with specified degrees of matching accuracy 13 rftp the position of mismatches between a contig and a reference sequence. Positions of mismatches are terminated by a special terminator character. 14 rftt the type of mismatches between a contig and a reference sequence. 15 rgroup label associated to each compressed sequence read used to create group of reads sharing the same label 16 mscore score per alignment. It is used to represent mapping/alignment score per read generated by genomic sequence reads aligners. - For class U, the descriptor clips identifies those parts of the reads (typically the edges) that do not match, with a specified set of matching accuracy constraints, with the “internal” references.
- Descriptor ureads is used to encode verbatim the reads that cannot be mapped on any available reference being it a pre-existing (i.e. “external” like an actual reference genome) or an “internal” reference sequence.
- This classification creates groups of descriptors (syntax elements) that can be used to univocally represent genome sequence reads. The table below summarizes the syntax elements needed for each class of reads aligned with “external” (i.e. “pre-existing”) or “internal” (i.e. “constructed”) references. The asterisk “*” indicates mandatory descriptors always present in all encoded read for each class.
-
TABLE 4 Genomic descriptors necessary to represent each class of data. descriptors short data classes descriptor_id name P N M I U HM 1 pos X* X* X* X* X* X* 2 pair X X X X 3 rlen X X X X X X 4 rcomp X* X* X* X* X* X* 5 mmpos X* X* X* X* X 6 mmtype X* X* X 7 clips X* X* X 8 flags X* X* X* X* X* X* 9 mmap X X 10 msar X X 11 ureads X X X* 12 rtype X 13 rftp X X X X X X 14 rftt X X X X X X 15 rgroup X X X X X X 16 mscore X X X X X - Reads belonging to class P are characterized and can be perfectly reconstructed by only a position, a reverse complement information and an offset between mates in case they have been obtained by a sequencing technology yielding mated pairs, some flags and a read length.
- The next section further details how these descriptors are defined for classes P, N, M and I while for class U they are described in a following section Class HM is applied to read pairs only and it is a special case for which one read belongs to class P, N, M or I and the other to class U.
- The pos descriptor is used to calculate the absolute mapping position on a Reference Sequence of the left-most mapped base of a Genomic Record. The value of each pos descriptor represents the difference between the coordinates, on the Reference Sequence, of the left-most mapping base of a Genomic Record and the previous one.
FIG. 4 shows an example of calculation of the pos descriptor for a mapped read pair. - The first value of the pos descriptor in each coded Block is always 0 since no differential coding is possible for the first mapped read or read pair coded in an Access Unit. The absolute position of the first mapped read or read pair coded in an Access Unit is contained in the Access Unit Header.
- The absolute position on the Reference Sequence of the left-most mapped base of the nth Genomic Record is therefore calculated as:
-
- where p0 is the mapping value retrieved from the Access Unit Header for the first Genomic Record in the Access Unit.
- In order to calculate the absolute position on a Reference Sequence of a base the following formula applies:
-
p=p start +n del −n ins +d start+delta - where
-
- p is the absolute position on the Reference Sequence of the base
- pstart is the mapping position of the Genomic Record covering the base
- nins is the number of inserted bases preceding the base in the same Genomic Record
- ndel is the number of deleted bases preceding the base in the same Genomic Record
- dstart is the offset of the base in the Genomic Record from the Genomic Record
- Position
-
- delta is the (signed) pairing distance between two reads in a read pair. This has to be used only for positions in the second read of the Genomic Record.
- NOTE In case of paired end reads, when calculating the offset dstart from the Genomic Record Position, the two reads are considered contiguous. The relative reads position is taken into account when adding delta.
- An example of calculation of the mapping position p for one base on a Reference Sequence is shown in
FIG. 5 . - The rcomp descriptor conveys information about the strandedness of reads. Each bit of a decoded rcomp descriptor is a flag indicating if the read is on the forward (bit set to 0) or reverse (bit set to 1) strand.
FIG. 6 shows values and semantics of the rcomp descriptor for paired end reads. In the figure r1 is read 1 and r2 is read 2. According to the mapping position of each read, the rcomp descriptor can have four different values. -
TABLE 5 Values and meaning of the rcomp descriptor. Semantics Value Single read Paired reads rc1 0x00 read on both reads on forward forward strand (strand 1) strand rc2 0x01 read on first read on forward reverse strand, second on strand reverse strand rc3 0x02 reserved first read on reverse strand, second on forward strand rc4 0x03 reserved both reads on reverse strand (strand 2) 0x04 . . . reserved reserved 0xff - The flag descriptor is a set of flaas as described in Table 6.
-
TABLE 6 Semantics for each bit of the flag descriptor. bit position from LSB Semantics 0 read is PCR or optical duplicate 1 read fails platform/ vendor quality checks 2 read mapped in proper pair 3 reserved 4 reserved 5 reserved 6 reserved 7 reserved - The mmpos descriptor represents the position, within a read or read pairs, of a mismatch with respect to the reference sequence. The position is represented as a distance from the position of the previous mismatch in the Genomic Record. The position of the first mismatch is represented as the distance from the left-most mapped base in the Genomic Record.
- In case of paired reads, or in general records containing more than one genomic segment, the gaps between consecutive segments are not considered in the calculation of distances between consecutive mismatches.
- If the encoded pair contains both read 1 and read 2, the positions of mismatches in
read 2 are offset by the length ofread 1. For example in case of reads with constant length equal to 100, if the first mismatches in the pair is inread 2 at position 44, the first mmpos descriptor decoded for this Genomic Record assumes the value 144. - If the described pair is missing read 1 (either because it is encoded in another block, or because
read 2 is unpaired), the mismatch positions are not offset by the length ofread 1. For example in case of fixed read length 100, if the first mutation inread 2 is at position 44, but read 2 is unpaired, the first mmpos descriptor decoded for this Genomic Record assumes the value 44. - Each mmpos descriptor is associated to an mmtype descriptor representing the type of mismatch occurring in the decoded read or read pair at the position calculated using mmpos.
- An example of how to calculate mismatches positions in a read pair is provided in
FIG. 7 where len1 is the length of read1. - The absolute position on a Reference Sequence of the ith mismatch in a Genomic Record shall be calculated as shown in Table 7.
-
TABLE 7 How to calculate the absolute position of mismatches in a Genomic Record. First mismatch ith mismatch (i > 0) Single read mmabs0 = p0 + mmabsi = mmabsi−1 + mmpos0 mmposi Read pair first mismatch in mmabs0 = p0 + if(( Σmmposi < len1 || read 1 mmpos0 (Σmmposi ≥ len1 && first mismatch in mmabso0 = p0 + Σmmposi−1 < len1)) read 2 mmpos0 + delta mmabsi = mmabsi−1 + mmposi else mmabsi = mmabsi−1 + mmposi + delta - In Table 7 the following variables are defined:
-
- mmabsi is the absolute position in the Reference Sequence of the ith mismatch in the read or read pair
- mmposi is the ith value of the mmpos descriptor in the Genomic Record
- len, is the length of
read 1 in a read pair - delta is the pairing distance between
read 1 and read 2 calculated as defined for the pair descriptor and shown inFIG. 5
- Sequences of mmpos descriptors referring to a Genomic Record are terminated with a reserved terminator value which cannot be interpreted as mismatch position.
- The mmtype descriptor specifies the type of mismatch occurring in the decoded read at the position calculated using the associated mmpos descriptor.
- The mmtype descriptor does not have a reserved value for the terminator since each Genomic Record shall contain the same number of mmtype and mmpos descriptors.
- Table 8 lists the values and corresponding semantics of the mmtype descriptor according to the used alphabet.
-
TABLE 8 mmtype descriptor values and semantics according to the used alphabet clips Value Semantics alphabet_id = 0 alphabet_id = 1 alphabet_id = 2 alphabet_id = 3 0x00 Substitute with ‘A’ Substitute with Substitute with ‘A’ Substitute with ‘A’ ‘A’ 0x01 Substitute with ‘C’ Substitute with Substitute with ‘C’ Substitute with ‘C’ ‘C’ 0x02 Substitute with ‘G’ Substitute with Substitute with ‘G’ Substitute with ‘G’ ‘G’ 0x03 Substitute with ‘T’ Substitute with Substitute with ‘U’ Substitute with ‘T’ ‘U’ 0x04 Substitute with ‘N’ Substitute with Substitute with ‘N’ Substitute with ‘N’ ‘N’ 0x05 Deletion Deletion Deletion Deletion 0x06 Insert ‘A’ Substitute with Insert ‘A’ Substitute with ‘R’ ‘R’ 0x07 Insert ‘C’ Substitute with Insert ‘C' Substitute with ‘Y’ ‘Y’ 0x08 Insert ‘G’ Substitute with Insert ‘G’ Substitute with ‘S’ ‘S’ 0x09 Insert ‘T’ Substitute with Insert ‘U’ Substitute with ‘W’ ‘W’ 0x0A Insert ‘N’ Substitute with Insert ‘N’ Substitute with ‘K’ ‘K’ 0x0B N/A Substitute with N/A Substitute with ‘M’ ‘M’ 0x0C N/A Substitute with N/A Substitute with ‘B’ ‘B’ 0x0D N/A Substitute with N/A Substitute with ‘D’ ‘D’ 0x0E N/A Substitute with N/A Substitute with ‘H’ ‘H’ 0x0F N/A Substitute with N/A Substitute with ‘V’ ‘V’ 0x10 N/A Substitute with ‘—’ N/A Substitute with ‘—’ 0x11 N/A Insert ‘A’ N/A Insert ‘A’ 0x12 N/A Insert ‘C’ N/A Insert ‘C’ 0x13 N/A Insert ‘G’ N/A Insert ‘G’ 0x14 N/A Insert ‘T’ N/A Insert ‘T’ 0x15 N/A Insert ‘N’ N/A Insert ‘N’ 0x16 N/A Insert ‘R’ N/A Insert ‘R’ 0x17 N/A Insert ‘Y’ N/A Insert ‘Y’ 0x18 N/A Insert ‘S’ N/A Insert ‘S’ 0x19 N/A Insert ‘W’ N/A Insert ‘W’ 0x1A N/A Insert ‘K’ N/A Insert ‘K’ 0x1B N/A Insert ‘M’ N/A Insert ‘M’ 0x1C N/A Insert ‘B’ N/A Insert ‘B’ 0x1D N/A Insert ‘D’ N/A Insert ‘D’ 0x1E N/A Insert ‘H’ N/A Insert ‘H’ 0x1F N/A Insert ‘V’ N/A Insert ‘V’ 0x20 N/A Insert ‘—’ N/A Insert ‘—’ - The clips descriptor is used to represent clipped bases (a.k.a. soft or hard clips) in mapped reads or read pairs. This descriptor encodes soft clips as sequences of ASCII characters with additional elements to identify the position of the clipped bases in the read or read pair. In case of hard clips only the position and the number of clipped bases is encoded. Each descriptor contains a Genomic Record identifier followed by information related to the clipped bases position in the Genomic Record and the actual clipped bases in case of soft clips.
- Syntax and semantics of the clips descriptor is provided in Table 9 and Table 10.
-
TABLE 9 Fields composing the clips descriptor. Size in bytes Semantics Remarks Soft clips 4 Genomic Record identifier (sequential counter of the repeated up to 2N Genomic Records encoded in the current Access Unit) times per Genomic 1 flag indicating the position (0x00 start of read1, 0x01 = Record where N is end of read 1, 0x02 = start ofread 2, 0x03 = end ofthe number of read2) of clipped bases segments in the M sequence of M bases encoded as ASCII characters. Genomic Record. 1 (optional) soft clips terminator (0xfe) present only if in the same If K is the number of Genomic Record more than 1 soft clips sequence is clipped segments, present this is present only after the first K-1 sections 1 Genomic Record terminator (0xff) present only after the Kth (last) sections Hard clips 4 Genomic Record identifier (sequential counter of the repeated up to 2N Genomic Records encoded in the current Access Unit) times per Genomic 1 flag indicating the position (0x04 start of read1, 0x05 = Record where N is end of read 1, 0x06 = start ofread 2, 0x07 = end ofthe number of read2) of clipped bases segments in the 4 number of hard clipped bases Genomic Record. 1 (optional) terminator (0xfe) present only if in the same Genomic If K is the number of Record more than 1 clipped sequence is present clipped segments, this is present only after the first K-1 sections 1 Genomic Record terminator (0xff) present only after the Kth (last) sections -
TABLE 10 Syntax of the clips descriptor. clips( ) { record_id do{ clips_pos if(clips_pos <= 3){ do{ soft_clipped_base } while(sclipped_base != 0xff OR sclipped_base != 0xfe) } else{ hard_clipped_bases } } while(sclipped_base != 0xff) } - record_id is a counter of Genomic Records encoded in the current Access Unit.
- clips_pos represents the position of the next clipped bases in the read or read pair. Values for positions have the following meaning:
-
clips_pos value semantics 0x00 soft clips before start of read 10x01 soft clips after end of read 10x02 soft clips before start of read 20x03 soft clips after end of read2 0x04 hard clips before start of read 10x05 hard clips after end of read 10x06 hard clips before start of read 20x07 hard clips after end of read2 - soft_clipped_base is one of the symbols of the alphabet identified by alphabet_id.
- hard_clipped_bases represent the number of hard clipped bases at the position indicated by the corresponding clips_pos;
- The ureads descriptor represents verbatim sequence reads as a sequence of ASCII characters belonging to the currently used alphabet identified by alphabet_id.
- The rlen descriptor is used only in case of variable length reads when reads_length=0 in the Parameter Set defined in this disclosure.
- A decoded rlen descriptor represents the length of the current sequence read as number of bases including soft clips.
- The information required to reconstruct paired reads is encoded using the pair descriptor. The pairing information associating one genomic segment to another can be expressed in three ways:
-
- 1. if both reads are mapped on the same reference and coded in the same genomic record, the pairing distance is defined as the distance between the left-most mapped base of Read1 and the left-most mapped base of Read2. An example of pairing distance is shown in
FIG. 8 . - 2. as an absolute mapping position of the second read on the same reference sequence as the first read.
- 3. as an absolute mapping position of the second read on a different reference sequence than the first read
- 1. if both reads are mapped on the same reference and coded in the same genomic record, the pairing distance is defined as the distance between the left-most mapped base of Read1 and the left-most mapped base of Read2. An example of pairing distance is shown in
- The pairing information is encoded as described in
2 and 3 above when the first two bytes of a decoded pair descriptor have one of the values listed in Table 11.points -
FIG. 8 shows how the Genomic Record Length is defined as the number of genomic positions on a reference sequence separating the left-most mapped base from the right-most mapped base of a read or read pair. In case of read pairs this is the number of genomic position on the reference sequence separating the left-most base ofread 1 from the right-most base of its mate read 2 when both reads are mapped on the same reference sequence. The “Pairing Distance” is defined as the difference between the left-most position of Read1 and the left-most position of Read2. The “Pairing Distance” is represented as a signed integer valued of the pair descriptor. - When aligning reads to a reference sequence, read 2 can be mapped to a position that is smaller (e.g. to the left) than the mapping position of
read 1; in this case, the pairing distance used incase 1 above will be negative. This implies that the information about reads strandedness is encoded in the sign of the pairing distance descriptor. -
TABLE 11 Reserved values for the read distance descriptors signaling how the read pair was encoded Value Semantics Followed by 0x7ffd read2 in pair is on the 4 bytes encoding read2 distance as same reference but absolute value on the reference coded separately 0x8003 read1 in pair is on the 4 bytes encoding read1 distance as same reference but absolute value on the reference coded separately 0x7ffe read2 is on a different 2 bytes with a reference identifier, 4 reference bytes with absolute position of read2 0x8002 read1 is on a different 2 bytes with a reference identifier, 4 reference bytes with absolute position of read1 0x7fff read1 unpaired N/A 0x8001 read2 unpaired N/A 0x8000 read unpaired N/A 0x8004 additional alignment 2 bytes with a reference identifier, 4 present on another bytes with absolute position of read1 Reference Sequence - The reads distance is encoded as a 2-bytes signed integer, where:
-
- the LSB is used to represent the sign (if the sign bit is 0, the number is non-negative, if the sign bit is 1 then the number is negative), and
- the remaining 15 bits are used to represent the absolute value of the pairing distance
- This approach enables representing pairing distances in a range from −32766 to 32766. In case the reads are separated by a larger gap, the absolute position shall be encoded in the pair descriptor after the special value 0x7ffd or 0x8003 as defined in Table 11 and the two reads are encoded in two separate Genomic Records (i.e. the pair is “split”).
- The decoding process of the reads distance is shown below:
- sign=ReadsDistance & 0x0001;
- ReadsDistance=ReadsDistance»1;
- if (sign) ReadsDistance=-ReadsDistance;
- The mscore descriptor provides a score per alignment. It shall be used to represent mapping/alignment score per read generated by genomic sequence reads aligners. The score shall be expressed using an exponent and fractional part. The number of bits used to represent the exponent and the fractional part are specified in the encoding parameters (see Parameter Set below). Table 12 shows how this is specified in IEEE RFC 754 for a 11-bits exponent and a 52-bits fractional part.
- The score of each alignment shall be represented by:
-
- One sign bit (S)
- 11 bits for the exponent (E)
- 53 bit for the mantissa (M)
- If the base (radix) to be used for the calculation of scores is 10, the score is calculated as:
- score=−1 s×10E×M
- The rgroup descriptor identifies the read group the Genomic Record belongs to. It is an unsigned 8-bit integer with values going from 0 to num_groups −1. The presence of read groups in an Access Unit is signaled by num_groups >0 in the Parameter Set as defined in the Parameter Set defined below.
- The msar (Multiple Segments Alignment Record) descriptor supports spliced reads and alternative secondary alignments which contain indels or soft clips. msar is intended to convey information on:
-
- a mapped segment length
- a different mapping contiguity (i.e. CIGAR string) for a secondary alignment and/or spliced read
- msar can is used to represent mismatches, insertions, deletions, strandedness and clipped bases of secondary alignments of an aligned read
- The following descriptors are defined for the support of multiple alignments.
- The mmap descriptor is used to signal at how many positions the read or the leftmost read of a pair has been aligned. A Genomic Record containing multiple alignments is associated with one multi-byte mmap descriptor. The first two bytes of a mmap descriptor represent an unsigned integer N which refers to the read as a single segment (if spliced_reads_flag=0 as defined in this disclosure) or instead to all the segments into which the read has been spliced for the several possible alignments (if spliced_reads_flag=1). The value of N represents the number of values of the pos descriptor which are coded for the template in the current record. N is followed by one or more 8-bit unsigned integers Mi as described in this disclosure.
- In case of multiple alignments, the rcomp descriptor defined in this disclosure is used to specify the strandedness of each read alignment using the same syntax specified above.
- In case of multiple alignments, one mscore as defined in this disclosure is assigned to each alignment.
- If no splices are present in the Access Unit, spliced_reads_flag is unset.
- In paired-end sequencing, the mmap descriptor is composed by a 16-bit unsigned integer N followed by one or more 8-bit unsigned integers Mi, with i assuming values from 1 to the number of complete first (here, the left-most) read alignments. For each first read alignment, spliced or not, Mi is used to signal how many segments are used to align the second read (in this case, without splices, this is equal to the number of alignments), and then how many values of the pair descriptor are coded for that alignment of the first read.
- The values of Mi shall be used to calculate P=Σi=1 NMi, which indicates the number of alignments of the second read.
- A special value of Mi (Mi=0) indicates that the alignment of the left-most read is paired with an alignment of the right-most read which is already paired with a kth alignment of the left-most read with k<i (then there is no new alignment detected, which is consistent with the equation above).
- As an example, in the simplest cases:
-
- 1. If there is a single alignment for the leftmost read and two alternative alignments for the rightmost, N assumes the
value 1 and M1 assumes thevalue 2. - 2. If two alternative alignments are detected for the leftmost read but only one for the rightmost, N assumes the
value 2, M1 assumes thevalue 1 and M2 assumes the value 0.
- 1. If there is a single alignment for the leftmost read and two alternative alignments for the rightmost, N assumes the
- When Mi is 0, the associated value of pair shall link to an existing second read alignment; a syntax error will be raised otherwise and the alignment considered broken.
- Example: if the first read has two mapping positions and the second read only one, N is 2, M1 is 1 and M2 is 0 as said earlier. If this is followed by another alternative secondary mapping for the entire template, N assumes the
value 3, and M3 assumes thevalue 1. -
FIG. 9 illustrates the meaning of N, P and Mi in case of multiple alignments without splices andFIG. 10 shows how the pos, pair and mmap descriptors are used to encode the multiple alignments information. - With respect to 10 the following applies:
-
- The right-most read has P=Σi=1 NMi alignments
- Some values of Mi can be =0 when the ith alignment of the left-most read is paired with an alignment of the right-most read that is already paired with a kth alignment of the left-most read with k<i
- One reserved value of the pair descriptor can be present to signal alignments belonging to other AUs ranges. If present it's always the first pair descriptor for the current record
- If the dataset is encoded with spliced reads, the msar descriptor enables representaiton of splices length and strandedness as defined in this disclosure.
- After having decoded the mmap and the msar descriptors, the decoder knows how many reads or read pairs have been encoded to represent the multiple mappings and how many segments are composing each read or read pair mapping. This is shown in
FIG. 11 andFIG. 12 . - With reference to
FIG. 11 the following applies: -
- The left-most read has N1 alignments with N splices (N1≤N).
- N represents the number of splices present in all alignments of the left-most read and it is encoded as first value of the mmap descriptor.
- The right-most read has P=Σi=1 N1 Mi splices, where Mi is the number of splices of the right-most read which are associated in a pair with the ith alignment of the left-most read (1≤i≤N1). In other words P represents the number of splices of the right-most read and is calculated using the N values following the first value of the mmap descriptor.
- N1 and N2 represent the number of alignments of the first and second read and are calculated using the N+P values of the msar descriptor.
- With reference to
FIG. 12 the following applies: -
- The left-most has N1 alignments with N splices (N1≤N). If N1=N AND N2=P no splices would be present.
- The right-most read has P=Σi=1 N1 Mi splices
t j 1≤j≤P and N2 (N2≤P) alignments. - The number of pair descriptors can be calculated as NP=Max(N1, P)+M0 where
- M0 is the number of Mi with value 0
- NP has to be incremented by 1 in case one special pair descriptor indicates the presence of alignments in other AUs.
- The mscore descriptor allows signaling the mapping score of an alignment. In single-end sequencing it will have N1 values per template; in paired-end sequencing it will have a value for each alignment of the entire template (number of different alignments of the first read possibly +the number of further second read alignments, i.e. when Mi−1>0).
- Number of scores=MAX(N1, N2)+M0
- where M0 represent the total number of Mi=0.
- The number of scores associated to each alignment is signalled by the encoding parameter
- as_depth as defined in this disclosure.
-
-
TABLE 13 How to calculate the number of descriptors needed to represent multiple alignments in one Genomic Record in case of multiple alignments without splices. Semantic mmap mscore Effect Read (or paired- Single read: Single read: N values Single read: end read) with N Read pair: MAX(N, Σ(Mi)) the read has multiple multiple Read pair: values + Σ(Mi = 0) mappings and it is encoded as mappings, not N, M, Introducing a separator would a sequence of N consecutive spliced where enable having an arbitrary segments belonging to the 1 ≤ i ≤ N number of scores. Otherwise a 0 class with the highest ID. should be used if not present. N pos descriptors are used These are floating point values as Read pair: defined in this disclosure. the read pair has multiple mappings and it is encoded as a sequence of • N segments for the first read • P = Σ(Mi) pairings to the alignments of the second read N pos descriptors are used N × P pair descriptors are used with NP =Σi=1 N Mi + (no. of Mi = 0) + (optional) 1 The optional pairing descriptor is used when alignments are present on different reference sequences than the one currently encoded N + 1 mmap descriptors read (pair) 0 0 uniquely mapped - Table 14 shows how to calculate the number of descriptors needed to represent multiple alignments in one Genomic Record in case of multiple alignments with splices.
-
TABLE 14 Descriptors used to represent multiple alignments and associated scores. Semantic mmap mscore Effect Read (or paired- Single read: Single read: N1 values Single read: end read) with N Read pair: Max(N1, N2) + Σ(Mi == the read has multiple multiple Read pair: 0) values mappings and it is encoded as mappings, with N, Mi N1 and N2 are calculated using a sequence of N consecutive splices where the N + P msar descriptors segments belonging to the i ≤ i ≤ N P = Σi=1 N1 Mi class with the highest ID. These are floating point values as N pos descriptors are used defined in this disclosure Read pair: the read pair has multiple mappings and it is encoded as a sequence of • N segments for the first read • P = Σ(Mi) pairings to the alignments of the second read N pos descriptors are used read (pair) 0 0 uniquely mapped - It may happen that the alignment process finds alternative mappings to another reference sequence than the one where the primary mapping is positioned.
- For read pairs which are uniquely aligned, a pair descriptor shall be used to represent the absolute read positions when there is for example a chimeric alignment with the mate on another chromosome. The pair descriptor shall be used to signal the reference and the position of the next record containing further alignments for the same template. The last record (e.g. the third if alternative mappings are coded in three different AUs) shall contain the reference and position of the first record.
- In case one or more alignments for the leftmost read in a pair are present on a different reference sequence than the one related to the currently encoded AU, a reserved value of the pair descriptor shall be used (not the same as the one used for alignments present to another reference in case of unique alignment). The reserved value shall be followed by the reference sequence identifier and the position of the leftmost alignment among all those contained in the next AU (i.e. the first decoded value of the pos descriptor for that record).
- When an alternative secondary mapping does not preserve the contiguity of the reference region where the sequence is aligned, it may be impossible to reconstruct the exact mapping generated by the aligner because the actual sequence (and then the descriptors related to mismatches such as substitutions or indels) is only coded for the primary alignment. The msar descriptor shall be used to represent how secondary alignments map on the reference sequence in case they contain indels and/or soft clips. If msar is represented by the special symbol “*” for a secondary alignment, the decoder shall reconstruct the secondary alignment from the primary alignment and the secondary alignment mapping position.
- Raw reads belong to class U only. They are encoded as unmapped reads in aligned datasets.
- Some of the descriptors defined for reads aligned to an external or internal reference are used to encode raw reads. This is motivated by the fact that raw reads are encoded using reference sequences built from the data to be encoded.
-
descriptor_id Descriptor Semantics Comments 1 pos read mapping reads positions on the reference are 0-based position (signed) delta encoded with respect to the previous read 4 rcomp strand information for N bits reads in a template 8 flags used to cover SAM This is a placeholder to understand the type of flags and possibly flags we need to associate to each template. other flags SAM flags can be a starting point. 5 mmpos mismatch position (unsigned) delta encoded with respect to the previous mismatch 6 mmtype type of edit fixed alphabet operations: substitutions deletions insertions 7 clips string of nucleotides ASCII characters + position in the template with variable length (e.g. soft clips) 11 ureads unmapped reads ASCII characters encoded verbatim 3 rlen read lengths 32-bit integer 12 rtype read type This identifies the subset of descriptors needed to decode the read. - The ureads descriptor represents verbatim sequence reads as a sequence of ASCII characters belonging to the currently used alphabet.
- The rtype descriptor is used to signal the subset of descriptors used to encode one unmapped read or read pair in a Genomic Record as shown in Table 15.
- The rtype descriptor also enables mixing reference-based and reference-less compression in the same Dataset. In this scenario rtype=0 signals reference based encoded records, while rtype>0 signals the set of descriptors to be used for reference less compression(in this case descriptors refer to the computed reference. when needed).
-
TABLE 15 Semantics of the rtype descriptor. rtype type of encoded reads description not used aligned reference the entire Dataset is encoded using based reference based compression for mapped reads 0 reference based the Dataset contains both read (pairs) encoded using reference based compression and reference less compression. Descriptors for this record use the external or embedded reference according to the Class of the AU >0 raw and aligned 1 = class P descriptors used reference less 2 = class N descriptors used 3 = class M descriptors used 4 = class I descriptors used 5 = class U descriptors used - In an embodiment, the present invention uses context-adaptive binary arithmetic coding (CABAC) for the compression of the genomic descriptors. CABAC first converts to a binary representation all symbols to be encoded. The process of binarization converts a non-binary-valued symbol (e.g. a mapping position, a mapped read length or a mismatch type) into a binary code prior to arithmetic coding.
- The selection of appropriate binarizations adapted to the statistical properties of each descriptor provides better compression ratios than existing formats based on general purpose compressor applied on blocks of heterogeneous elements. In the following sections these variables are defined:
-
- symVal: non-binary value of the genomic descriptor to be binarized.
- cLength: represents the numbers of bits with which the value is binarized.
- cMax: is the largest possible value to be binarized. Larger values will be truncated.
- While the following binarization tables are calculated for fixed values of these variables, it has to be appreciated that the present principles are not limited to these values, and thus other values can also be used in accordance with the present principles, while maintaining the spirit of the present principles.
- Each binarization algorithm used in this disclosure is identified by an identifier as shown in Table 16.
-
TABLE 16 Type of binarizations and respective identifiers. binarization_id Type of binarization 0 Binary Coding (BI) 1 Truncated Unary (TU) 2 Exponential Golomb (EG) 3 Signed Exponential Golomb (SEG) 4 Truncated Exponential Golomb (TEG) 5 Signed Truncated Exponential Golomb (STEG) 6 Split Unit-wise Truncated Unary (SUTU) 7 Signed Split Unit-wise Truncated Unary (SSUTU) 8 Double Truncated Unary (DTU) 9 Signed Double Truncated Unary (SDTU) - This is a standard binary representation whereby each numerical value is coded in its binary representation. The variable cLength—shown in Table 28 when binarization_id =0—represents the numbers of bits with which the value will be represented.
- A TU binary string is the concatenation of sym Val ones followed by one zero. If sym Val==cMax, the trailing 0-bit is discarded. Table 17 illustrates the bin strings of this truncated unary binarization with cMax=3.
-
TABLE 17 Bin string of the truncated unary binarization with cMax == 3. symVal Binary string 0 0 1 1 0 2 1 1 0 3 1 1 1 binIdx 0 1 2 - The syntax for this binarization process along with arithmetic decoding is described below.
-
decode_cabac_TU(ctxTable, ctxIdx, cMax) { for (binIdx=0; binIdx<cMax; binIdx++) { binValue if (binValue == 0) break } return binIdx } - binValue is the binarized value which can be either 0 or 1.
- The parsing process for genomic descriptors binarized using this technique begins with reading the bits starting at the current location in the bitstream up to and including the first non-zero bit, and counting the number of leading bits that are equal to 0.
- This process is specified as follows:
-
leadingZeroBits= −1 for( b = 0; !b; leadingZeroBits++ ) b = read_bits( 1 ) - The variable sym Val is then assigned as follows:
-
symVal=2 leadingZeroBits−1+read_bits(leadingZeroBits) - where the function call read_bits reads a number of bits from a storage medium equal to the parameter passed as input. The value returned from read_bits(leadingZeroBits) is interpreted as a binary representation of an unsigned integer with the most significant bit written first.
- Table 18 illustrates the structure of the Exp-Golomb code by separating the bit string into “prefix” and “suffix” bits. The “prefix” bits are those bits that are parsed as specified above for the computation of leadingZeroBits, and are shown as either 0 or 1 in the bit string column of Table 18. The “suffix” bits are those bits that are parsed in the computation of symVal and are shown as xi in Table 18, with i in the range of 0 to leadingZeroBits −1, inclusive. Each xi is equal to either 0 or 1.
-
TABLE 18 Binary representations for values of symVal from 0 to 62. Bit string form Range of symVal 1 0 0 1 x0 1 . . . 2 0 0 1 x1 x0 3 . . . 6 0 0 0 1 x2 x1 x0 7 . . . 14 0 0 0 0 1 x3 x2 x1 x0 15 . . . 30 0 0 0 0 0 1 x4 x3 x2 x1 x0 31 . . . 62 . . . . . . - Table 19 illustrates the explicit assignments of bit strings to symVal values.
-
TABLE 19 Exp-Golomb bit strings and symVal in explicit form. Bit string symVal 1 0 0 1 0 1 0 1 1 2 0 0 1 0 0 3 0 0 1 0 1 4 0 0 1 1 0 5 0 0 1 1 1 6 0 0 0 1 0 0 0 7 0 0 0 1 0 0 1 8 0 0 0 1 0 1 0 9 . . . . . . - Depending on the genomic descriptor, the value of a binarized syntax element is decoded using one of the following methods:
-
- The value of the decoded genomic descriptor is equal to the symVal value corresponding to the binarized descriptor
- The value of the decoded genomic descriptor is calculated by applying signed 0-order Exponential-Golomb decoding as defined for example in https://en.wikipedia.org/wiki/Exponential-Golomb_coding with symVal as input.
- According to this binarization method the genomic descriptor is associated to the symVal by ordering the syntax element by its absolute value in increasing order and representing the positive value for a given absolute value with the lower symVal. Table 20 shows the assignment rule.
-
TABLE 20 Assignment of syntax element to symVal for signed Exp-Golomb coded genomic descriptors. symVal syntax element value 0 0 1 1 2 −1 3 2 4 −2 5 3 6 −3 k (−1)k+1 Ceil(k ÷ 2) - This binarization process requires the use of an additional input parameter tegParam which defines how the binarization is calculated.
- Output of this process is the TEG binarization of the syntax element.
- A TEG bin string is the concatenation of 1 (in case of symVal==0) or 2 (in case of symVal>0) types of binarization:
-
- The truncated unary binarization with cMax=tegParam for the value Min(symVal, tegParam)
- If symVal!=0, the exponential golomb binarization for the value Abs(symVal)−tegParam
- Table 21 illustrates the bin strings of this Truncated Exponential Golomb binarization with tegParam==2.
-
TABLE 21 Bin string of the Truncated Exponential Golomb binarization with tegParam = 2. symVal Truncated Unary Exponential Golomb 0 0 — — — — 1 1 0 — — — 2 1 1 1 — — 3 1 1 0 1 0 4 1 1 0 1 1 . . . binIdx 0 1 2 3 4 - This binarization process requires the use of an additional input parameter stegParam. A STEG binary string is the concatenation of 1 (in case of symVal ==0) or 2 (for other cases) binarizations:
-
- 1. The truncated exponential golomb binarization for Abs(symVal)
- 2. If symVal!=0, a one-bit flag equal to 1 (if symVal<0) or equal to 0 (if symVal>0). Table 22 illustrates the bin strings of this Signed Truncated Exponential Golomb binarization with stegParam=2.
-
TABLE 22 Binary string of the Signed Truncated Exponential Golomb binarization with stegParam = 2. Truncated Exponential Golomb symVal Truncated Exponential . . . Unary Golomb Sign Flag −4 1 1 0 1 1 1 −3 1 1 0 1 0 1 −2 1 1 1 — — 1 −1 1 0 — — — 1 0 0 — — — — — 1 1 0 — — — 0 2 1 1 1 — — 0 3 1 1 0 1 0 0 4 1 1 0 1 1 0 . . . binIdx 0 1 2 3 4 Max(binIdx) + 1 - This binarization process requires the use of two input parameters splitUnitSize and outputSymSize. outputSymSize must always be a multiple of splitUnitSize.
- The SUTU binary string is a concatenation of repeated TU binarizations, where each TU binarization is applied to portions of symVal which are splitUnitSize bits long. In other words, symVal is represented by x binary string obtained with the TU binarization, where x =outputSymSize/splitUnitSize. The cMax parameter for each binary string is defined as cMax=(1«splitUnitSize)−1.
- Table 23 illustrates the binary strings of split unit-wise truncated unary binarizations with splitUnitSize=2 and outputSymbSize=8.
-
TABLE 23 Bin string of the Split Unit-wise Truncated Unary binarization with splitUnitSize = 2, outputSymSize = 8. TU Instance TU Instance TU Instance TU Instance 1 2 3 4 symVal cMax == 3 cMax == 3 cMax == 3 cMax == 3 0 0 — — 0 — — 0 — — 0 — — 1 1 0 — 0 — — 0 — — 0 — — 3 1 1 1 0 — — 0 — — 0 — — 15 1 1 1 1 1 1 0 — — 0 — — 31 1 1 1 1 1 1 1 0 — 0 — — 63 1 1 1 1 1 1 1 1 1 0 — — binIdx 0 1 2 3 4 5 6 7 8 9 10 11 - The bitstream syntax for this binarization process is described below.
-
TABLE 24 CABAC decoding process for TU binarization. decode_cabac_SUTU(ctxTable, ctxIdx, splitUnitSize, outputSymSize) { output_symb = 0 cMax = (1<<splitUnitSize) − 1 for (i=0; i<outputSymSize; i+=splitUnitSize) { tmp = decode_cabac_TU(ctxTable, ctxIdx, cMax) ctxIdx += cMax output_sym |= (tmp<<i) } return output_sym } - This binarization process requires the use of two input parameters splitUnitSize and outputSymSize.
- The SSUTU binary string is obtained by an extension of the SUTU binarization process with the sign of symVal coded as a separate flag.
-
- The SUTU binarization for the value Abs(symVal).
- If symVal!=0, a one-bit flag equal to 1 (if symVal<0) or equal to 0 (if symVal>0).
- Table 25 illustrates the binary strings of the Signed Split Unit-wise Truncated Unary binarization with splitUnitSize=2, outputSymbSize=8.
-
TABLE 25 Bin string of the Signed Split Unit-wise Truncated Unary binarization with splitUnitSize = 2, outputSymSize = 8. TU TU TU TU Instance 1 Instance 2Instance 3Instance 4 symVal cMax == 3 cMax == 3 cMax == 3 cMax == 3 Sign −63 1 1 1 1 1 1 1 1 1 0 — — 1 −31 1 1 1 1 1 1 1 0 — 0 — — 1 −15 1 1 1 1 1 1 0 — — 0 — — 1 −3 1 1 1 0 — — 0 — — 0 — — 1 −1 1 0 — 0 — — 0 — — 0 — — 1 0 0 — — 0 — — 0 — — 0 — — — 1 1 0 — 0 — — 0 — — 0 — — 0 3 1 1 1 0 — — 0 — — 0 — — 0 15 1 1 1 1 1 1 0 — — 0 — — 0 31 1 1 1 1 1 1 1 0 — 0 — — 0 63 1 1 1 1 1 1 1 1 1 0 — — 0 binIdx 0 1 2 3 4 5 6 7 8 9 10 11 12 - The syntax for this binarization process is described below.
-
decode_cabac_SSUTU(ctxTable, ctxIdx, splitUnitSize, outputSymSize) { output_sym = decode_cabac_SUTU(ctxTable, ctxIdx, splitUnitSize, outputSymSize) if(output_sym > 0) { ctxIdx += ((1<<splitUnitSize) − 1) * (outputSymSize / splitUnitSize) sign_flag if(sign_flag == 1) output_sym = −output_sym } return output_sym } - sign_flag represents the cabac decoding of a bit on context variable identified by ctxldx.
- decode_cabac_SUTU( ) represents the cabac decoding process for the SUTU binarization.
- This binarization process requires the use of two input parameters splitUnitSize and outputSymSize.
- The DTU binary string is a concatenation of two binarizations, namely the TU binarization and the SUTU binarization. The parameter cMax is used for the TU binarization, and parameters splitUnitSize and outputSymSize are used for the SUTU binarization (where its cMax is derived internally).
-
- The first instance of the TU binarization for the value Min(Abs(symVal), cMax).
- If Abs(symVal)>cMax, the second instance of SUTU binarization for the value Abs(symVal)−cMax.
- Table 26 illustrates the binary strings of the Double Truncated Unary binarization with cMax =1, splitUnitSize=2, outputSymSize=8.
-
TABLE 26 Bin string of the Double Truncated Unary binarization with cMax = 1, splitUnitSize = 2, outputSymSize = 8. SUTU Instance: splitUnitSize = TU 2, outputUnitSize = 8 Instance TU TU TU TU sym- cMax = Instance 1Instance 2Instance 3Instance 4 Val 1 cMax = 3 cMax = 3 cMax = 3 cMax = 3 0 0 — — — — — — — — — — — — 1 1 — — — — — — — — — — — — 3 1 1 1 0 0 — — 0 — — 0 — — 15 1 1 1 0 1 1 1 0 — — 0 — — 31 1 1 1 0 1 1 1 1 0 — 0 — — 63 1 1 1 0 1 1 1 1 1 1 0 — — binIdx 0 1 2 3 4 5 6 7 8 9 10 11 12 - The binarization process is described below.
-
decode_cabac_DTU(ctxTable, ctxIdx, cMax, splitUnitSize, outputSymSize) { output_sym = 0 if(cMax > 0) { output_sym = decode_cabac_TU(ctxTable, ctxIdx, cMax) if(output_sym > cMax) { output_sym += decode_cabac_SUTU(ctxTable, ctxIdx+cMax, splitUnitSize, outputSymSize) } } else output_sym = decode_cabac_SUTU(ctxTable, ctxIdx, splitUnitSize, outputSymSize) return output_sym } - decode_cabac_TU( ) represents the cabac decoding process for TU binarization.
- decode_cabac_SUTU( ) represents the cabac decoding process for SUTU binarization.
- This binarization process requires the use of two additional input parameters splitUnitSize and outputSymSize.
- The SDTU binary string is obtained by an extension of the DTU binarization process with the sign of symVal coded as a flag.
-
- The DTU binarization for the value Abs(symVal).
- If symVal!=0, a one-bit flag equal to 1 (if symVal<0) or equal to 0 (if symVal>0).
- Table 27 illustrates the bin strings of double truncated unary binarization with with cMax=1, splitUnitSize=2, outputSymSize=8.
-
TABLE 27 Bin string of the Signed Double Truncated Unary binarization with with cMax = 1, splitUnitSize = 2, outputSymSize = 8. SUTU Instance: splitUnitSize = 2, outputUnitSize = 8 TU TU TU TU TU Instance Instance 1 Instance 2Instance 3Instance 4 symVal cMax = 1 cMax = 3 cMax = 3 cMax = 3 cMax = 3 Sign −63 1 1 1 0 1 1 1 1 1 1 0 — — 1 −31 1 1 1 0 1 1 1 1 0 — 0 — — 1 −15 1 1 1 0 1 1 1 0 — — 0 — — 1 −3 1 1 1 0 0 — — 0 — — 0 — — 1 −1 1 — — — — — — — — — — — — 1 0 0 — — — — — — — — — — — — — 1 1 — — — — — — — — — — — — 0 3 1 1 1 0 0 — — 0 — — 0 — — 0 15 1 1 1 0 1 1 1 0 — — 0 — — 0 31 1 1 1 0 1 1 1 1 0 — 0 — — 0 63 1 1 1 0 1 1 1 1 1 1 0 — — 0 binIdx 0 1 2 3 4 5 6 7 8 9 10 11 12 13 - The syntax for this binarization process is described below.
-
decode_cabac_SDTU(ctxTable, ctxIdx, cMax, splitUnitSize, outputSymSize) { output_sym = decode_cabac_DTU(ctxTable, ctxIdx, cMax, splitUnitSize, outputSymSize) if(output_sym > 0) { ctxIdx += cMax + ((1<<splitUnitSize) − 1) * (outputSymSize / splitUnitSize) sign_flag if(sign_flag == 1) output_sym = −output_sym } return output_sym } - sign_flag represents the cabac decoding of a bit on context variable identified by ctxIdx.
- decode_cabac_DTU( ) represents the cabac decoding with DTU binarization.
- Each binarization algorithm introduced in the previous sections requires configuration parameters at the encoding and decoding ends. In an embodiment, said configuration parameters are encapsulated in a data structure described in Table 28. Each binarization algorithm is identified by an identifier as listed in Table 16.
-
TABLE 28 Binarization parameters structure Binarization ID parameters 0 cLength 1 cMax 2 — 3 — 4 tegParam 5 stegParam 6 splitUnitSize, outputSymSize 7 splitUnitSize, outputSymSize 8 cMax, splitUnitSize, outputSymSize 9 cMaxsplitUnitSize, outputSymSize - In Table 28 the following semantics applies:
- cMax represents the largest value to be binarized. Larger values will be truncated.
- cLength represents the numbers of bits with which the value is binarized.
- tegParam represents the tegParam variable defined in this disclosure for TEG binarization.
- stegParam represents the stegParam variable defined in this disclosure for STEG binarization.
- splitUnitSize represents the splitUnitSize variable defined in this disclosure for SUTU, SSUTU and DTU binarizations.
- outputSymSize represents the outputSymSize variable defined in this disclosure for SUTU, SSUTU DTU and SDTU binarizations.
- By applying the indicated CABAC binarization to the respective genomic descriptors as indicated in Table 29, the compression performance reported in * no additional information is necessary since it is already available in the compressed representation according to the principles of this disclosure.
- Table 30 can be obtained. The improvement in compression performance of the method described in this disclosure can be appreciated by comparison with the corresponding file sizes of BAM and CRAM approaches and one of the best compressors in literature known as DeeZ (see Numanagic, I., et al “Comparison of high-throughput sequencing data compression tools”, Nature Methods (ISSN: 1548-7091), vol. 13, p. 1005-1008 London: Nature Publishing Group, 2016). It has to be appreciated that the DeeZ, BAM and CRAM compression performance are calculated by adding the size of the compressed reference genome used for alignment to the sizes of the compressed genome sequence data. According to the principles of the present disclosure, the reference genome is embedded in the compressed file. In today practice said compressed reference genome is a FASTA (ASCII text) file compressed using general purpose compressors such as GZIP, LZMA, Bzip2. In the proposed comparison the reference genome hs37d5.fa was compressed using the xz Linux command with the option of maximum compression (-9).
- Table 30 shows the binarization applied to the genomic descriptors defined in this disclosure. When a concatenation of several binarizations is indicated, the different binarizations are applied to the different elements composing each descriptor as defined in this disclosure.
-
TABLE 29 Binarizations associated to each genomic descriptor. descriptor_id short name binarization_id 1 pos 8 (aligned reads) 9 (raw reads) 2 pair 0, 6, 6, 6 3 rlen 6 4 rcomp 1 5 mmpos 0, 6 6 mmtype 0, 0, 1, 1 7 clips 5, 1, 3, 0 8 flags 0 9 mmap 6 10 msar 3 11 ureads 1 12 rtype 1 13 rftp 0, 6 14 rftt 0, 0, 1, 1 15 rgroup 1 16 mscore 1 - An example of binarization of rftp and rftt is provided in this section and illustrated in
FIG. 10 . The descriptors associated to five mismatches between a contig and a reference genome used for alignment are shown below: -
rftp 5 7 12 13 15 rftt C T T C A - Each nucleotide symbol is associated to an integer code:
-
Nucleotide code A 0 C 1 G 2 T 3 N 4 - After transformation the values become:
-
rftp 5 2 5 1 2 rftt 1 3 3 1 0 - The binarized values for rftp are calculates as follows:
-
- 1. The terminator value can be binarized as 0 or 1. Here for this example we select 0.
- 2. If terminator =0 then binarization no. 6 with splitUnitSize =4, outputSymbolSize =12 is used and the following binary strings are associated to the values of rftp
- a. 5=11110
- b. 2=110
- c. 5=11110
- d. 1=10
- e. 2=110
- The binarized values for rftt are calculates as follows:
-
- 1. Knowing the nucleotide present in the reference genome, remove the corresponding symbol from the possible symbols to be encoded. I.e. for the first mismatch of the example, if the corresponding symbol in the reference is a ‘G’, the space of possible symbols to be encoded is 0, 1, 3, 4.
- 2. The frequencies of symbols of the mismatches types on the data to be encoded are measured and indexed from 0 to 3. Index 0 is affected to the most frequent mismatch and
index 3 is affected to the less frequent mismatch. In this example an indexing could be: {0 =>3, 1=>0, 2=>4, 3=>1}
- 3. In the given example the five mismatches would be binarized using the TU binarization as:
-
TU binarization with Symbol index cMax = 3 1 3 111 3 0 0 3 0 0 1 3 0 0 1 10 - With the binarization approach shown above the following compression results are achieved:
-
TABLE 30 Compression performance with respect to state of the art solutions (sizes in bytes). Proposed Compressor BAM CRAM Deez method 9827_2#49.bam 4,755,859,110 3,124,448,497 2,592,665,720 2,164,362,407 (ERR317482)- Low coverage Reference 707,712,316 707,712,316 707,712,316 N/A* genome hs37d5.fa Total 5,463,571,426 3,832,160,813 3,300,378,036 2,164,362,407 NA12878_S1.bam- 117,653,446,187 64,565,636,391 64,334,196,408 47,759,141,388 High Coverage Reference 707,712,316 707,712,316 707,712,316 N/A* genome hs37d5.fa Total 118,361,158,503 65,273,348,707 65,041,908,724 47,759,141,388 *no additional information is necessary since it is already available in the compressed representation according to the principles of this disclosure. - In an embodiment, the parameters needed to encode and decode each Access Unit are encapsulated in a data structure named Parameter Set as defined in Table 31.
-
TABLE 31 Coding parameters for genomic descriptors Parameter Set Parameter name Cardinality Description dataset type 1 type of data encoded in Access Units referring to this encoding parameters. reads length 1 length in nucleotides of sequence reads in case of constant reads length. The value 0 indicates the presence of variable reads length (conveyed by the rlen syntax element as defined in this disclosure). QV depth 1 number of Quality Values associated to each coded nucleotide. 0 means that no Quality Values are encoded. alignment score 1 number of alignments scores associated to each coded depth alignments. 0 means that no alignment scores are encoded. terminator size 1 represents the size in bytes minus one (e.g. 0 = 1 byte) of the terminator symbol used for the mmpos descriptor defined in Error! Reference source not found. terminator value 1 represents the value of the terminator symbol used for the mmpos descriptor defined in Table 1. number of classes 1 number of data classes encoded in all Access Units referring to these encoding parameters. class ID number of identifier associated to one of the data class defined in classes this disclosure (P, N, M, I, HM, U). number of 1 total number of descriptors contained in Access Units descriptors referring to these configuration parameters coding mode ID number of one of the coding modes defined in this disclosure descriptors number of groups 1 number of different values of the rgroup descriptor listed in table 1 present in all Access Units referring to the current encoding parameters. group name number of null-terminated string identifier of a read group. groups multiple alignments 1 flag signaling the presence of multiple alignments in the flag Access Unit. When set to 0 no multiple alignments are present spliced reads flag 1 flag signaling the presence of spliced reads in the Access Unit. When set to 0 no spliced reads are present. multiple signature 1 flag signaling the use of multiple signatures in an base Access Unit containing unmapped sequence reads (Class U). signature size 1 size in bits of each integer representing an encoded signature. score exponent 1 number of bits used to encode the exponent part of the multiple alignments score encoded in the mscore descriptor. As specified in IEEE RFC 754 the value can go from 0 to 11. The mscore descriptor is defined in table 1. score fractional 1 number of bits used to encode the fractional part of the multiple alignments score encoded in the mscore descriptor. As specified in IEEE RFC 754 the value can go from 0 to 52. The mscore descriptor is defined in table 1. contig buffer size 1 size in bytes of the buffer used to build the contig 102 in figure 1. contig buffer count 1 number of reads used to build the contig 102 in figure 1. -
FIG. 13 shows an encoding apparatus according to the principles of this invention. The encoding apparatus receives as input areference genome 1302 and unalignedgenomic sequences 1300, for example produced by a genome sequencing apparatus. Genome sequencing apparatus are known in the art, like the Illumina HiSeq 2500, the Thermo-Fisher Ion Torrent devices or the Oxford Nanopore MinION. Theunaligned sequence data 1300 are fed to a readsalignment unit 1301, which maps the sequences on areference genome 1302. The aligned genomic sequences 303 are then fed to a reference basedcompressor 1305 which generatesgenomic descriptors 1306 representing both mapped and unmapped genomic sequences. Thegenomic descriptors 1306 generated by the reference basedcompressor 1305 are first binarized byseveral binarization units 1307 and then entropy coded byseveral entropy coders 1308. The entropy coded genomic descriptors are then fed to a multiplexing apparatus 1310 to build one or more Access Unit composing acompressed bitstream 1311. The multiplexed bitstream contains as well codingparameters structures 1304 built by a coding parameters encoder 1309. Each Access Unit contains entropy coded descriptors representing alignment information and sequence reads belonging to one class of data as defined in this disclosure. -
FIG. 14 shows a decoding apparatus according to the principles of this disclosure. Ademultiplexing unit 1401 receives a multiplexedbitstream 1400 from a network or a storage element and extracts the entropy coded payload of the Access Units composing said bitstream.Entropy decoders 1402 receive the extracted payloads and decode the different types of genomic descriptors into their binary representations. Said binary representations are then fed to several binary decoders 1404 which generategenomic descriptors 1405. Acoding parameter decoder 1403 receives coding parameters multiplexed with the genomic information and feds them to thedecoding unit 1406.Genomic descriptors 1405 representing genomic sequence reads are fed to a sequence readsreconstruction unit 1406 which reconstructs the alignedgenomic sequences 1407 using anavailable reference genome 1408. - The inventive techniques herewith disclosed may be implemented in hardware, software, firmware or any combination thereof. When implemented in software, these may be stored on a computer medium and executed by a hardware processing unit. The hardware processing unit may comprise one or more processors, digital signal processors, general purpose microprocessors, application specific integrated circuits or other discrete logic circuitry. The techniques of this disclosure may be implemented in a variety of devices or apparatuses, including mobile phones, desktop computers, servers, tablets and similar devices.
Claims (82)
Applications Claiming Priority (9)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/EP2016/074311 WO2018068830A1 (en) | 2016-10-11 | 2016-10-11 | Method and system for the transmission of bioinformatics data |
| PCT/EP2016/074301 WO2018068828A1 (en) | 2016-10-11 | 2016-10-11 | Method and system for storing and accessing bioinformatics data |
| PCT/EP2016/074297 WO2018068827A1 (en) | 2016-10-11 | 2016-10-11 | Efficient data structures for bioinformatics information representation |
| PCT/EP2016/074307 WO2018068829A1 (en) | 2016-10-11 | 2016-10-11 | Method and apparatus for compact representation of bioinformatics data |
| PCT/US2017/017842 WO2018071055A1 (en) | 2016-10-11 | 2017-02-14 | Method and apparatus for the compact representation of bioinformatics data |
| USPCT/US2017/017842 | 2017-02-14 | ||
| PCT/US2017/041579 WO2018071078A1 (en) | 2016-10-11 | 2017-07-11 | Method and apparatus for the access to bioinformatics data structured in access units |
| USPCT/US2017/041579 | 2017-07-11 | ||
| PCT/US2017/066863 WO2018151788A1 (en) | 2017-02-14 | 2017-12-15 | Method and systems for the efficient compression of genomic sequence reads |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| US20200051667A1 true US20200051667A1 (en) | 2020-02-13 |
Family
ID=61905752
Family Applications (6)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US16/341,426 Abandoned US20200042735A1 (en) | 2016-10-11 | 2017-02-14 | Method and system for selective access of stored or transmitted bioinformatics data |
| US16/337,639 Abandoned US20190214111A1 (en) | 2016-10-11 | 2017-07-11 | Method and systems for the representation and processing of bioinformatics data using reference sequences |
| US16/337,642 Active 2038-03-31 US11404143B2 (en) | 2016-10-11 | 2017-07-11 | Method and systems for the indexing of bioinformatics data |
| US16/485,623 Pending US20190385702A1 (en) | 2016-10-11 | 2017-12-14 | Method and systems for the reconstruction of genomic reference sequences from compressed genomic sequence reads |
| US16/485,649 Pending US20200051667A1 (en) | 2016-10-11 | 2017-12-15 | Method and systems for the efficient compression of genomic sequence reads |
| US16/485,670 Pending US20200051665A1 (en) | 2016-10-11 | 2018-02-14 | Method and apparatus for the compact representation of bioinformatics data using multiple genomic descriptors |
Family Applications Before (4)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US16/341,426 Abandoned US20200042735A1 (en) | 2016-10-11 | 2017-02-14 | Method and system for selective access of stored or transmitted bioinformatics data |
| US16/337,639 Abandoned US20190214111A1 (en) | 2016-10-11 | 2017-07-11 | Method and systems for the representation and processing of bioinformatics data using reference sequences |
| US16/337,642 Active 2038-03-31 US11404143B2 (en) | 2016-10-11 | 2017-07-11 | Method and systems for the indexing of bioinformatics data |
| US16/485,623 Pending US20190385702A1 (en) | 2016-10-11 | 2017-12-14 | Method and systems for the reconstruction of genomic reference sequences from compressed genomic sequence reads |
Family Applications After (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US16/485,670 Pending US20200051665A1 (en) | 2016-10-11 | 2018-02-14 | Method and apparatus for the compact representation of bioinformatics data using multiple genomic descriptors |
Country Status (17)
| Country | Link |
|---|---|
| US (6) | US20200042735A1 (en) |
| EP (3) | EP3526694A4 (en) |
| JP (4) | JP2020505702A (en) |
| KR (4) | KR20190073426A (en) |
| CN (6) | CN110168651A (en) |
| AU (3) | AU2017342688A1 (en) |
| BR (7) | BR112019007359A2 (en) |
| CA (3) | CA3040138A1 (en) |
| CL (6) | CL2019000968A1 (en) |
| CO (6) | CO2019003639A2 (en) |
| EA (2) | EA201990917A1 (en) |
| IL (3) | IL265879B2 (en) |
| MX (2) | MX2019004130A (en) |
| PE (7) | PE20191058A1 (en) |
| PH (6) | PH12019550058A1 (en) |
| SG (3) | SG11201903270RA (en) |
| WO (4) | WO2018071054A1 (en) |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20220358697A1 (en) * | 2021-05-10 | 2022-11-10 | Optum Services (Ireland) Limited | Predictive data analysis using image representations of genomic data |
| US12347528B2 (en) | 2020-04-07 | 2025-07-01 | Koninklijke Philips N.V. | System and method for storing and transporting diverse genomic data |
Families Citing this family (44)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| GB2526598B (en) | 2014-05-29 | 2018-11-28 | Imagination Tech Ltd | Allocation of primitives to primitive blocks |
| US11574287B2 (en) | 2017-10-10 | 2023-02-07 | Text IQ, Inc. | Automatic document classification |
| US11030324B2 (en) * | 2017-11-30 | 2021-06-08 | Koninklijke Philips N.V. | Proactive resistance to re-identification of genomic data |
| WO2019191083A1 (en) * | 2018-03-26 | 2019-10-03 | Colorado State University Research Foundation | Apparatuses, systems and methods for generating and tracking molecular digital signatures to ensure authenticity and integrity of synthetic dna molecules |
| US20210158902A1 (en) * | 2018-05-31 | 2021-05-27 | Koninklijke Philips N.V. | System and method for allele interpretation using a graph-based reference genome |
| CN108753765B (en) * | 2018-06-08 | 2020-12-08 | 中国科学院遗传与发育生物学研究所 | A Genome Assembly Method for Constructing Ultra-Long Contiguous DNA Sequences |
| US12210904B2 (en) * | 2018-06-29 | 2025-01-28 | International Business Machines Corporation | Hybridized storage optimization for genomic workloads |
| US11474978B2 (en) * | 2018-07-06 | 2022-10-18 | Capital One Services, Llc | Systems and methods for a data search engine based on data profiles |
| US12300358B2 (en) * | 2018-08-20 | 2025-05-13 | The Board Of Trustees Of The Leland Stanford Junior University | Systems and methods for compressing genetic sequencing data and uses thereof |
| GB2585816A (en) * | 2018-12-12 | 2021-01-27 | Univ York | Proof-of-work for blockchain applications |
| US20210074381A1 (en) * | 2019-09-11 | 2021-03-11 | Enancio | Method for the compression of genome sequence data |
| JP2022551261A (en) * | 2019-10-01 | 2022-12-08 | コーニンクレッカ フィリップス エヌ ヴェ | Systems and methods for efficient identification and extraction of sequence pathways in genome graphs |
| CN110797087B (en) * | 2019-10-17 | 2020-11-03 | 南京医基云医疗数据研究院有限公司 | Sequencing sequence processing method and device, storage medium and electronic equipment |
| BR112022007331A2 (en) * | 2019-10-18 | 2022-07-05 | Koninklijke Philips Nv | METHOD AND APPLIANCE TO CONTROL DATA COMPRESSION |
| US12322477B1 (en) * | 2019-12-04 | 2025-06-03 | John Hayward | Methods of efficiently transforming and comparing recombinable DNA information |
| US12445148B2 (en) | 2020-01-03 | 2025-10-14 | Koninklijke Philips N.V. | System and method for effective compression representation and decompression of diverse tabulated data |
| WO2021156110A1 (en) * | 2020-02-07 | 2021-08-12 | Koninklijke Philips N.V. | Improved quality value compression framework in aligned sequencing data based on novel contexts |
| CN111243663B (en) * | 2020-02-26 | 2022-06-07 | 西安交通大学 | Gene variation detection method based on pattern growth algorithm |
| CN111370070B (en) * | 2020-02-27 | 2023-10-27 | 中国科学院计算技术研究所 | Compression processing method for big data gene sequencing file |
| US12014802B2 (en) * | 2020-03-17 | 2024-06-18 | Western Digital Technologies, Inc. | Devices and methods for locating a sample read in a reference genome |
| US12006539B2 (en) | 2020-03-17 | 2024-06-11 | Western Digital Technologies, Inc. | Reference-guided genome sequencing |
| US11837330B2 (en) | 2020-03-18 | 2023-12-05 | Western Digital Technologies, Inc. | Reference-guided genome sequencing |
| EP3896698A1 (en) * | 2020-04-15 | 2021-10-20 | Genomsys SA | Method and system for the efficient data compression in mpeg-g |
| CN111459208A (en) * | 2020-04-17 | 2020-07-28 | 南京铁道职业技术学院 | Control system and method for electric energy of subway power supply system |
| US12224042B2 (en) | 2020-06-22 | 2025-02-11 | SanDisk Technologies, Inc. | Devices and methods for genome sequencing |
| US12093803B2 (en) * | 2020-07-01 | 2024-09-17 | International Business Machines Corporation | Downsampling genomic sequence data |
| AU2021342166A1 (en) * | 2020-09-14 | 2023-01-05 | Illumina, Inc. | Custom data files for personalized medicine |
| BR112023006194A2 (en) * | 2020-10-06 | 2023-05-09 | Koninklijke Philips Nv | METHOD AND SYSTEM FOR STORING GENOMIC DATA WITHIN A DATA STRUCTURE |
| AU2021357118A1 (en) * | 2020-10-06 | 2023-06-08 | Koninklijke Philips N.V. | Methods and systems for storing genomic data in a file structure comprising protection metadata |
| CN112836355B (en) * | 2021-01-14 | 2023-04-18 | 西安科技大学 | Method for predicting coal face roof pressure probability |
| JP7118199B1 (en) * | 2021-03-26 | 2022-08-15 | エヌ・ティ・ティ・コミュニケーションズ株式会社 | Processing system, processing method and processing program |
| ES2930699A1 (en) * | 2021-06-10 | 2022-12-20 | Veritas Intercontinental S L | GENOMIC ANALYSIS METHOD IN A BIOINFORMATIC PLATFORM (Machine-translation by Google Translate, not legally binding) |
| CN113670643B (en) * | 2021-08-30 | 2023-05-12 | 四川虹美智能科技有限公司 | Intelligent air conditioner testing method and system |
| CN113643761B (en) * | 2021-10-13 | 2022-01-18 | 苏州赛美科基因科技有限公司 | Extraction method for data required by interpretation of second-generation sequencing result |
| US20230187020A1 (en) * | 2021-12-15 | 2023-06-15 | Illumina Software, Inc. | Systems and methods for iterative and scalable population-scale variant analysis |
| EP4490735A1 (en) | 2022-03-08 | 2025-01-15 | Illumina Inc | Multi-pass software-accelerated genomic read mapping engine |
| CN115458050B (en) * | 2022-08-05 | 2026-01-06 | 武汉大学 | Methods, devices, equipment, and storage media for constructing multi-gene discovery networks |
| CN115391284B (en) * | 2022-10-31 | 2023-02-03 | 四川大学华西医院 | Method, system and computer-readable storage medium for rapid identification of genetic data files |
| US20250095789A1 (en) * | 2022-12-02 | 2025-03-20 | City University Of Hong Kong | Reinforcement-learning-based network transmission of compressed genome sequence |
| CN116541348B (en) * | 2023-03-22 | 2023-09-26 | 河北热点科技股份有限公司 | Intelligent data storage method and terminal query integrated machine |
| CN116739646B (en) * | 2023-08-15 | 2023-11-24 | 南京易联阳光信息技术股份有限公司 | Method and system for analyzing big data of network transaction |
| CN117153270B (en) * | 2023-10-30 | 2024-02-02 | 吉林华瑞基因科技有限公司 | Gene second-generation sequencing data processing method |
| CN117708755B (en) * | 2023-12-17 | 2024-06-21 | 重庆文理学院 | Data processing method and device based on ecological environment |
| WO2025137316A1 (en) * | 2023-12-21 | 2025-06-26 | Illumina, Inc. | Sequence data processing, retention, and recovery |
Family Cites Families (54)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6303297B1 (en) * | 1992-07-17 | 2001-10-16 | Incyte Pharmaceuticals, Inc. | Database for storage and analysis of full-length sequences |
| JP3429674B2 (en) | 1998-04-28 | 2003-07-22 | 沖電気工業株式会社 | Multiplex communication system |
| WO2001083691A2 (en) * | 2000-04-12 | 2001-11-08 | The Cleveland Clinic Foundation | System for identifying and analyzing expression of are-containing genes |
| FR2820563B1 (en) * | 2001-02-02 | 2003-05-16 | Expway | COMPRESSION / DECOMPRESSION PROCESS FOR A STRUCTURED DOCUMENT |
| US20040153255A1 (en) * | 2003-02-03 | 2004-08-05 | Ahn Tae-Jin | Apparatus and method for encoding DNA sequence, and computer readable medium |
| DE10320711A1 (en) * | 2003-05-08 | 2004-12-16 | Siemens Ag | Method and arrangement for setting up and updating a user interface for accessing information pages in a data network |
| WO2005024562A2 (en) * | 2003-08-11 | 2005-03-17 | Eloret Corporation | System and method for pattern recognition in sequential data |
| US7805282B2 (en) * | 2004-03-30 | 2010-09-28 | New York University | Process, software arrangement and computer-accessible medium for obtaining information associated with a haplotype |
| US8340914B2 (en) * | 2004-11-08 | 2012-12-25 | Gatewood Joe M | Methods and systems for compressing and comparing genomic data |
| US20130332133A1 (en) * | 2006-05-11 | 2013-12-12 | Ramot At Tel Aviv University Ltd. | Classification of Protein Sequences and Uses of Classified Proteins |
| SE531398C2 (en) | 2007-02-16 | 2009-03-24 | Scalado Ab | Generating a data stream and identifying positions within a data stream |
| KR101369745B1 (en) * | 2007-04-11 | 2014-03-07 | 삼성전자주식회사 | Method and apparatus for multiplexing and demultiplexing asynchronous bitstreams |
| US8832112B2 (en) * | 2008-06-17 | 2014-09-09 | International Business Machines Corporation | Encoded matrix index |
| GB2477703A (en) * | 2008-11-14 | 2011-08-10 | Real Time Genomics Inc | A method and system for analysing data sequences |
| US20100217532A1 (en) * | 2009-02-25 | 2010-08-26 | University Of Delaware | Systems and methods for identifying structurally or functionally significant amino acid sequences |
| CA2779495C (en) * | 2009-10-30 | 2019-04-30 | Synthetic Genomics, Inc. | Encoding text into nucleic acid sequences |
| EP2362657B1 (en) * | 2010-02-18 | 2013-04-24 | Research In Motion Limited | Parallel entropy coding and decoding methods and devices |
| WO2011143231A2 (en) * | 2010-05-10 | 2011-11-17 | The Broad Institute | High throughput paired-end sequencing of large-insert clone libraries |
| CN103384887B (en) * | 2010-05-25 | 2017-01-18 | 加利福尼亚大学董事会 | BAMBAM: Parallel comparative analysis of high-throughput sequencing data |
| CN103329138A (en) * | 2011-01-19 | 2013-09-25 | 皇家飞利浦电子股份有限公司 | Method for processing genomic data |
| WO2012122555A2 (en) * | 2011-03-09 | 2012-09-13 | Lawrence Ganeshalingam | Biological data networks and methods therefor |
| CN103797486A (en) * | 2011-06-06 | 2014-05-14 | 皇家飞利浦有限公司 | Methods for assembling nucleic acid sequence data |
| IL299953B2 (en) * | 2011-06-16 | 2024-01-01 | Ge Video Compression Llc | Context initialization in entropy coding |
| US8707289B2 (en) * | 2011-07-20 | 2014-04-22 | Google Inc. | Multiple application versions |
| WO2013050612A1 (en) * | 2011-10-06 | 2013-04-11 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Entropy coding buffer arrangement |
| JP2015501974A (en) * | 2011-11-07 | 2015-01-19 | インジェヌイティ システムズ インコーポレイテッド | Methods and systems for identification of causal genomic mutations. |
| KR101922129B1 (en) * | 2011-12-05 | 2018-11-26 | 삼성전자주식회사 | Method and apparatus for compressing and decompressing genetic information using next generation sequencing(NGS) |
| US10140683B2 (en) * | 2011-12-08 | 2018-11-27 | Five3 Genomics, Llc | Distributed system providing dynamic indexing and visualization of genomic data |
| EP2608096B1 (en) * | 2011-12-24 | 2020-08-05 | Tata Consultancy Services Ltd. | Compression of genomic data file |
| US9600625B2 (en) * | 2012-04-23 | 2017-03-21 | Bina Technologies, Inc. | Systems and methods for processing nucleic acid sequence data |
| CN103049680B (en) * | 2012-12-29 | 2016-09-07 | 深圳先进技术研究院 | gene sequencing data reading method and system |
| US9679104B2 (en) * | 2013-01-17 | 2017-06-13 | Edico Genome, Corp. | Bioinformatics systems, apparatuses, and methods executed on an integrated circuit processing platform |
| WO2014145503A2 (en) * | 2013-03-15 | 2014-09-18 | Lieber Institute For Brain Development | Sequence alignment using divide and conquer maximum oligonucleotide mapping (dcmom), apparatus, system and method related thereto |
| JP6054790B2 (en) * | 2013-03-28 | 2016-12-27 | 三菱スペース・ソフトウエア株式会社 | Gene information storage device, gene information search device, gene information storage program, gene information search program, gene information storage method, gene information search method, and gene information search system |
| GB2512829B (en) * | 2013-04-05 | 2015-05-27 | Canon Kk | Method and apparatus for encoding or decoding an image with inter layer motion information prediction according to motion information compression scheme |
| WO2014186604A1 (en) * | 2013-05-15 | 2014-11-20 | Edico Genome Corp. | Bioinformatics systems, apparatuses, and methods executed on an integrated circuit processing platform |
| KR101522087B1 (en) * | 2013-06-19 | 2015-05-28 | 삼성에스디에스 주식회사 | System and method for aligning genome sequnce considering mismatch |
| CN103336916B (en) * | 2013-07-05 | 2016-04-06 | 中国科学院数学与系统科学研究院 | A kind of sequencing sequence mapping method and system |
| US20150032711A1 (en) * | 2013-07-06 | 2015-01-29 | Victor Kunin | Methods for identification of organisms, assigning reads to organisms, and identification of genes in metagenomic sequences |
| KR101493982B1 (en) * | 2013-09-26 | 2015-02-23 | 대한민국 | Coding system for cultivar identification and coding method using thereof |
| CN104699998A (en) * | 2013-12-06 | 2015-06-10 | 国际商业机器公司 | Method and device for compressing and decompressing genome |
| US10902937B2 (en) * | 2014-02-12 | 2021-01-26 | International Business Machines Corporation | Lossless compression of DNA sequences |
| US9916313B2 (en) * | 2014-02-14 | 2018-03-13 | Sap Se | Mapping of extensible datasets to relational database schemas |
| WO2015127058A1 (en) * | 2014-02-19 | 2015-08-27 | Hospodor Andrew | Efficient encoding and storage and retrieval of genomic data |
| US9354922B2 (en) * | 2014-04-02 | 2016-05-31 | International Business Machines Corporation | Metadata-driven workflows and integration with genomic data processing systems and techniques |
| US20150379195A1 (en) * | 2014-06-25 | 2015-12-31 | The Board Of Trustees Of The Leland Stanford Junior University | Software haplotying of hla loci |
| GB2527588B (en) * | 2014-06-27 | 2016-05-18 | Gurulogic Microsystems Oy | Encoder and decoder |
| US20160019339A1 (en) * | 2014-07-06 | 2016-01-21 | Mercator BioLogic Incorporated | Bioinformatics tools, systems and methods for sequence assembly |
| US10230390B2 (en) * | 2014-08-29 | 2019-03-12 | Bonnie Berger Leighton | Compressively-accelerated read mapping framework for next-generation sequencing |
| US10116632B2 (en) * | 2014-09-12 | 2018-10-30 | New York University | System, method and computer-accessible medium for secure and compressed transmission of genomic data |
| US20160125130A1 (en) * | 2014-11-05 | 2016-05-05 | Agilent Technologies, Inc. | Method for assigning target-enriched sequence reads to a genomic location |
| CN107851137A (en) * | 2015-06-16 | 2018-03-27 | 汉诺威戈特弗里德威廉莱布尼茨大学 | Method for compressing genomic data |
| CN105956417A (en) * | 2016-05-04 | 2016-09-21 | 西安电子科技大学 | Similar base sequence query method based on editing distance in cloud environment |
| CN105975811B (en) * | 2016-05-09 | 2019-03-15 | 管仁初 | An intelligently aligned gene sequence analysis device |
-
2017
- 2017-02-14 PE PE2019000804A patent/PE20191058A1/en unknown
- 2017-02-14 WO PCT/US2017/017841 patent/WO2018071054A1/en not_active Ceased
- 2017-02-14 KR KR1020197013567A patent/KR20190073426A/en not_active Withdrawn
- 2017-02-14 BR BR112019007359A patent/BR112019007359A2/en not_active IP Right Cessation
- 2017-02-14 AU AU2017342688A patent/AU2017342688A1/en not_active Abandoned
- 2017-02-14 SG SG11201903270RA patent/SG11201903270RA/en unknown
- 2017-02-14 CN CN201780062919.5A patent/CN110168651A/en active Pending
- 2017-02-14 CA CA3040138A patent/CA3040138A1/en not_active Abandoned
- 2017-02-14 WO PCT/US2017/017842 patent/WO2018071055A1/en not_active Ceased
- 2017-02-14 US US16/341,426 patent/US20200042735A1/en not_active Abandoned
- 2017-02-14 JP JP2019540510A patent/JP2020505702A/en not_active Withdrawn
- 2017-02-14 EP EP17859972.6A patent/EP3526694A4/en not_active Withdrawn
- 2017-02-14 MX MX2019004130A patent/MX2019004130A/en unknown
- 2017-07-11 MX MX2019004128A patent/MX2019004128A/en unknown
- 2017-07-11 KR KR1020197013418A patent/KR20190062541A/en not_active Withdrawn
- 2017-07-11 AU AU2017341684A patent/AU2017341684A1/en not_active Abandoned
- 2017-07-11 EP EP17860868.3A patent/EP3526707A4/en not_active Withdrawn
- 2017-07-11 CA CA3040145A patent/CA3040145A1/en not_active Abandoned
- 2017-07-11 CN CN201780063013.5A patent/CN110506272B/en active Active
- 2017-07-11 WO PCT/US2017/041585 patent/WO2018071079A1/en not_active Ceased
- 2017-07-11 EP EP17860980.6A patent/EP3526657A4/en active Pending
- 2017-07-11 EA EA201990917A patent/EA201990917A1/en unknown
- 2017-07-11 EA EA201990916A patent/EA201990916A1/en unknown
- 2017-07-11 PE PE2019000803A patent/PE20191057A1/en unknown
- 2017-07-11 CN CN201780063014.XA patent/CN110121577B/en active Active
- 2017-07-11 BR BR112019007357A patent/BR112019007357A2/en not_active Application Discontinuation
- 2017-07-11 JP JP2019540511A patent/JP7079786B2/en active Active
- 2017-07-11 BR BR112019007360A patent/BR112019007360A2/en not_active Application Discontinuation
- 2017-07-11 PE PE2019000802A patent/PE20191056A1/en unknown
- 2017-07-11 PE PE2019000805A patent/PE20191227A1/en unknown
- 2017-07-11 US US16/337,639 patent/US20190214111A1/en not_active Abandoned
- 2017-07-11 AU AU2017341685A patent/AU2017341685A1/en not_active Abandoned
- 2017-07-11 BR BR112019007363A patent/BR112019007363A2/en not_active Application Discontinuation
- 2017-07-11 IL IL265879A patent/IL265879B2/en unknown
- 2017-07-11 JP JP2019540513A patent/JP2020500383A/en not_active Withdrawn
- 2017-07-11 US US16/337,642 patent/US11404143B2/en active Active
- 2017-07-11 JP JP2019540512A patent/JP2019537172A/en not_active Withdrawn
- 2017-07-11 CN CN201780062885.XA patent/CN110114830B/en active Active
- 2017-07-11 CA CA3040147A patent/CA3040147A1/en not_active Abandoned
- 2017-07-11 SG SG11201903271UA patent/SG11201903271UA/en unknown
- 2017-07-11 WO PCT/US2017/041591 patent/WO2018071080A2/en not_active Ceased
- 2017-07-11 KR KR1020197013419A patent/KR20190069469A/en not_active Withdrawn
- 2017-07-11 SG SG11201903272XA patent/SG11201903272XA/en unknown
- 2017-12-14 PE PE2019001667A patent/PE20200323A1/en unknown
- 2017-12-14 KR KR1020197026863A patent/KR20190117652A/en not_active Withdrawn
- 2017-12-14 US US16/485,623 patent/US20190385702A1/en active Pending
- 2017-12-14 BR BR112019016230A patent/BR112019016230A2/en not_active Application Discontinuation
- 2017-12-14 CN CN201780086529.1A patent/CN110603595B/en active Active
- 2017-12-15 PE PE2019001669A patent/PE20200226A1/en unknown
- 2017-12-15 CN CN201780086770.4A patent/CN110678929B/en active Active
- 2017-12-15 US US16/485,649 patent/US20200051667A1/en active Pending
- 2017-12-15 BR BR112019016232A patent/BR112019016232A2/en not_active Application Discontinuation
-
2018
- 2018-02-14 BR BR112019016236A patent/BR112019016236A2/en unknown
- 2018-02-14 PE PE2019001668A patent/PE20200227A1/en unknown
- 2018-02-14 US US16/485,670 patent/US20200051665A1/en active Pending
-
2019
- 2019-04-08 IL IL265928A patent/IL265928B/en active IP Right Grant
- 2019-04-10 CL CL2019000968A patent/CL2019000968A1/en unknown
- 2019-04-10 CL CL2019000973A patent/CL2019000973A1/en unknown
- 2019-04-10 CL CL2019000972A patent/CL2019000972A1/en unknown
- 2019-04-11 PH PH12019550058A patent/PH12019550058A1/en unknown
- 2019-04-11 PH PH12019550059A patent/PH12019550059A1/en unknown
- 2019-04-11 PH PH12019550060A patent/PH12019550060A1/en unknown
- 2019-04-11 CO CONC2019/0003639A patent/CO2019003639A2/en unknown
- 2019-04-11 PH PH12019550057A patent/PH12019550057A1/en unknown
- 2019-04-11 CO CONC2019/0003595A patent/CO2019003595A2/en unknown
- 2019-04-11 IL IL265972A patent/IL265972A/en unknown
- 2019-04-11 CO CONC2019/0003638A patent/CO2019003638A2/en unknown
- 2019-04-15 CO CONC2019/0003842A patent/CO2019003842A2/en unknown
- 2019-08-12 CL CL2019002277A patent/CL2019002277A1/en unknown
- 2019-08-12 CL CL2019002276A patent/CL2019002276A1/en unknown
- 2019-08-12 CL CL2019002275A patent/CL2019002275A1/en unknown
- 2019-08-13 PH PH12019501879A patent/PH12019501879A1/en unknown
- 2019-08-13 PH PH12019501881A patent/PH12019501881A1/en unknown
- 2019-09-12 CO CONC2019/0009922A patent/CO2019009922A2/en unknown
- 2019-09-12 CO CONC2019/0009920A patent/CO2019009920A2/en unknown
Non-Patent Citations (4)
| Title |
|---|
| "Joint Call for Proposals for Genomic Information Compression and Storage", INTERNATIONAL ORGANISATION FOR STANDARDISATION ORGANISATION INTERNATIONALE DE NORMALISATION ISO/IEC JTC 1/SC 29/WG 11, ISO/IEC JTC 1/SC 29/WG 11 N16320, ISO/TC 276/WG 5 N99, Geneva, CH – June 2016 (Year: 2016) * |
| "Requirements on Genomic Information Compression and Storage", INTERNATIONAL ORGANISATION FOR STANDARDISATION ORGANISATION INTERNATIONALE DE NORMALISATION ISO/IEC JTC 1/SC 29/WG 11 CODING OF MOVING PICTURES AND AUDIO, ISO/IEC JTC 1/SC 29/WG 11 N16323, ISO/TC 276/WG 5 N97, Geneva, CH – June 2016 (Year: 2016) * |
| "White paper on genomic information compression and storage", INTERNATIONAL ORGANISATION FOR STANDARDISATION ORGANISATION INTERNATIONALE DE NORMALISATION ISO/IEC JTC1/SC29/WG11 CODING OF MOVING PICTURES AND AUDIO, ISO/IEC JTC1/SC29/WG11 MPEG2014/N15047, October 2014, Strasbourg, France (Year: 2014) * |
| Panneel, Jens. "Implementation and optimization of CABAC in." (2015). (Year: 2015) * |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US12347528B2 (en) | 2020-04-07 | 2025-07-01 | Koninklijke Philips N.V. | System and method for storing and transporting diverse genomic data |
| US20220358697A1 (en) * | 2021-05-10 | 2022-11-10 | Optum Services (Ireland) Limited | Predictive data analysis using image representations of genomic data |
| US12406413B2 (en) * | 2021-05-10 | 2025-09-02 | Optum Services (Ireland) Limited | Predictive data analysis using image representations of genomic data |
Also Published As
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US20200051667A1 (en) | Method and systems for the efficient compression of genomic sequence reads | |
| KR102729412B1 (en) | Efficient compression method and system for genome sequence reads | |
| KR102733786B1 (en) | Method and device for compressing and representing bioinformatics data using multiple genome descriptors | |
| JP2020509473A (en) | Compact representation method and apparatus for biological information data using a plurality of genome descriptors | |
| EP3583250B1 (en) | Method and systems for the efficient compression of genomic sequence reads | |
| CN110663022B (en) | Methods and apparatus for compact representation of bioinformatics data using genomic descriptors | |
| AU2017399715A1 (en) | Method and systems for the reconstruction of genomic reference sequences from compressed genomic sequence reads | |
| HK40014527B (en) | Method and systems for the efficient compression of genomic sequence reads | |
| HK40014527A (en) | Method and systems for the efficient compression of genomic sequence reads | |
| NZ757185B2 (en) | Method and apparatus for the compact representation of bioinformatics data using multiple genomic descriptors | |
| EA043338B1 (en) | METHOD AND DEVICE FOR COMPACT REPRESENTATION OF BIOINFORMATION DATA USING SEVERAL GENOMIC DESCRIPTORS | |
| EA040022B1 (en) | METHOD AND DEVICE FOR COMPACT REPRESENTATION OF BIOINFORMATICS DATA |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: APPLICATION UNDERGOING PREEXAM PROCESSING |
|
| AS | Assignment |
Owner name: GENOMSYS SA, SWITZERLAND Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:ALBERTI, CLAUDIO;BALUCH, MOHAMED KHOSO;SIGNING DATES FROM 20190718 TO 20190720;REEL/FRAME:050168/0408 |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: NON FINAL ACTION MAILED |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: NON FINAL ACTION MAILED |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: NON FINAL ACTION MAILED |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: FINAL REJECTION MAILED |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: NON FINAL ACTION MAILED |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: FINAL REJECTION COUNTED, NOT YET MAILED |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: FINAL REJECTION MAILED |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: FINAL REJECTION MAILED |
|
| AS | Assignment |
Owner name: KONINKLIJKE PHILIPS N.V., NETHERLANDS Free format text: ASSIGNMENT OF ASSIGNOR'S INTEREST;ASSIGNOR:GENOMSYS SA;REEL/FRAME:072825/0666 Effective date: 20251103 Owner name: KONINKLIJKE PHILIPS N.V., NETHERLANDS Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:GENOMSYS SA;REEL/FRAME:072825/0666 Effective date: 20251103 |