Mohammed Alser, Joel Lindegger, Can Firtina, Nour Almadhoun, Haiyu Mao, Gagandeep Singh, Juan Gomez-Luna, Onur Mutlu.
Abstract
We now need more than ever to make genome analysis more intelligent. We need to read, analyze, and interpret our genomes not only quickly, but also accurately and efficiently enough to scale the analysis to population level. There currently exist major computational bottlenecks and inefficiencies throughout the entire genome analysis pipeline, because state-of-the-art genome sequencing technologies are still not able to read a genome in its entirety. We describe the ongoing journey in significantly improving the performance, accuracy, and efficiency of genome analysis using intelligent algorithms and hardware architectures. We explain state-of-the-art algorithmic methods and hardware-based acceleration approaches for each step of the genome analysis pipeline and provide experimental evaluations. Algorithmic approaches exploit the structure of the genome as well as the structure of the underlying hardware. Hardware-based acceleration approaches exploit specialized microarchitectures or various execution paradigms (e.g., processing inside or near memory) along with algorithmic changes, leading to new hardware/software co-designed systems. We conclude with a foreshadowing of future challenges, benefits, and research directions triggered by the development of both very low cost yet highly error-prone new sequencing technologies and specialized hardware chips for genomics. We hope that these efforts and the challenges we discuss provide a foundation for future work in making genome analysis more intelligent.
Keywords: Genome analysis; Hardware acceleration; Hardware/software co-design; Read mapping
Year: 2022 PMID: 36090814 PMCID: PMC9436709 DOI: 10.1016/j.csbj.2022.08.019
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 6.155
Fig. 1. Overview of a typical genome analysis pipeline. 1) Genomic sequencing data is first obtained by sequencing a new sample, downloading from publicly available databases, or computer simulation. Sequencing starts with (A) extracting the DNA, (B) fragmenting it, (C) preparing its library, and (D) using sequencing machines to produce raw sequencing data. 2) Raw sequencing data needs to be converted into read sequences over the DNA alphabet (A, C, G, T) using basecalling techniques, which depend on the sequencing technology. 3) To ensure high-quality sequencing reads, a quality control step filters out low-quality subsequences of a read or an entire read sequence. 4) The read mapping step locates each read sequence in a reference genome. Read mapping consists of four steps: (A) indexing the reference genome, (B) extracting seeds from each read and locating common seeds with the index, (C) pre-alignment filtering to discard dissimilar sequence pairs, and (D) performing sequence alignment for every sequence pair that passes the filtering. 5) Detecting and inferring genomic variations usually consists of three steps: (A) processing the read mapping data to increase its quality, (B) classifying variations between mapped reads and a reference genome, and (C) identifying genomic variations.
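The four read mapping steps in the caption can be illustrated with a toy Python sketch. This is only a minimal illustration under simplifying assumptions, not how BWA-MEM2, minimap2, or pbmm2 work internally: the index is a plain k-mer hash table, the pre-alignment filter is a simple base-count lower bound on edit distance, and alignment is a textbook edit-distance computation against a window of the same length as the read. All function names are hypothetical.

```python
from collections import Counter, defaultdict

def build_index(reference, k):
    """Step A: map every k-mer of the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def candidate_locations(read, index, k):
    """Step B: use each k-mer of the read as a seed and collect implied read starts."""
    candidates = set()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], ()):
            start = pos - offset
            if start >= 0:
                candidates.add(start)
    return sorted(candidates)

def passes_filter(read, window, max_edits):
    """Step C: each edit changes the base-count difference by at most 2, so
    ceil(difference / 2) is a lower bound on the edit distance."""
    diff = Counter(read)
    diff.subtract(Counter(window))
    lower_bound = (sum(abs(v) for v in diff.values()) + 1) // 2
    return lower_bound <= max_edits

def edit_distance(a, b):
    """Step D: Levenshtein distance by dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def map_read(read, reference, index, k, max_edits):
    """Run steps B to D and return (start, edit distance) for accepted locations."""
    hits = []
    for start in candidate_locations(read, index, k):
        window = reference[start:start + len(read)]
        if not passes_filter(read, window, max_edits):
            continue
        dist = edit_distance(read, window)
        if dist <= max_edits:
            hits.append((start, dist))
    return hits

reference = "ACGTACGTTAGCCGATAGGCTA"
index = build_index(reference, k=4)
print(map_read("TAGCCGTT", reference, index, k=4, max_edits=1))
```

The filter never rejects a true match within `max_edits`, because it only computes a lower bound; its purpose is to skip the more expensive alignment step for pairs that are clearly too dissimilar.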
Summary of the main differences between the three types of prominent sequencing technologies. We choose the most capable instrument as a representative of each sequencing technology.
| | Short Reads | Ultra-long Reads | Accurate Long Reads |
|---|---|---|---|
| Leading company | Illumina | Oxford Nanopore Technologies | Pacific Biosciences |
| Representative instrument | Illumina NovaSeq 6000 | ONT PromethION 48 | PacBio Sequel IIe |
| Instrument weight | 481 kg | 28 kg | 362 kg |
| Instrument dimension (W × H × D in cm) | 80 × 94.5 × 165.6 | 59 × 19 × 43 | 92.7 × 167.6 × 86.4 |
| Instrument cost | 3Y | Y | 1.6Y |
| Read length (bp) | 100–300 | 100–2 M | 10 K–30 K |
| Read length in a single data file | Fixed | Variable | Modest Variability |
| Accuracy | 99.9 % | 90 %–98 % | 99.9 % |
| Maximum sequencing throughput per run | 6 Tb | 14 Tb | 35 Gb (160 Gb of raw sequencing data) |
| Number of flow cells operating simultaneously | 2 | 48 | 1 |
| Hands-on library preparation time | 45 min–2 h | 10 min–3 h | 6 h |
| Turnaround library preparation time | 1.5–6.5 h | 24 h | 24 h |
| Sequencing run time (including basecalling) | 44 h | 72 h | 30 h |
Y can be as low as USD 300,000. The cost does not include consumables (e.g., flow cells and reagents) needed for each sequencing run.
Consensus accuracy.
Per SMRT cell. Sequel IIe can process up to 8 SMRT cells sequentially, slide 25 https://www.pacb.com/wp-content/uploads/HiFi_Sequencing_and_Software_v10.1_Release_Technical_Overview_for_Sequel_II_System_and_Sequel_IIe_System_Users-Customer-Training-01.pdf.
Includes both hands-on library preparation time and waiting time.
Fig. 2. Performance comparison of the five main steps of the genome analysis pipeline.
Summary of the main differences between the three types of prominent sequencing technologies. We choose the most capable instrument as a representative of each sequencing technology.
| | Short Reads | Ultra-long Reads | Accurate Long Reads |
|---|---|---|---|
| Type of raw sequencing data (before basecalling) | Multiple images of fluorescence intensities for each sequencing cycle | Electrical signal for each DNA segment | Fluorescence traces captured continuously into a 30-hour movie |
| Input file format for basecalling | BCL or CBCL | FAST5 | BAM |
| Expected size of basecalling input file | One CBCL file of size 350 MB per cycle, lane, and surface | 10x the size of the corresponding FASTQ file | Subreads.BAM of size 0.5–1.5 TB |
| Basecalling algorithm | BCL2FASTQ | Guppy/Bonito (deep neural networks) | CCS |
| Basecalling time | 48 minutes | 142 minutes | 24 hours |
| Number of basecalled bases | 83.5 Gb | 20 Gb | 200 Gb of HiFi yield |
BCL2FASTQ based on SRR2890933, 1.67 billion reads (full 8 lanes) at 50 bp read length [113].
Using a single V100 GPU, adapted from [111].
Using CCS v6.0 for a 25 KBases library on 2x64 cores at 2.4 GHz, adapted from [114].
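The short-read base count in the table follows directly from the footnote (1.67 billion reads at 50 bp). The sketch below reproduces that arithmetic and derives a rough basecalling rate per row. The helper names are hypothetical, and the three rates are not directly comparable, since each row was measured with a different tool on different hardware.

```python
# Figures copied from the basecalling table and its footnotes.
SHORT_READ_COUNT = 1_670_000_000  # 1.67 billion reads (BCL2FASTQ footnote)
SHORT_READ_LENGTH_BP = 50         # 50 bp read length (same footnote)

def total_bases(read_count, read_length_bp):
    """Total number of basecalled bases for fixed-length reads."""
    return read_count * read_length_bp

def gigabases_per_minute(gigabases, minutes):
    """Average basecalling rate over the whole run."""
    return gigabases / minutes

short_read_gb = total_bases(SHORT_READ_COUNT, SHORT_READ_LENGTH_BP) / 1e9

rates = {
    "short reads (BCL2FASTQ)": gigabases_per_minute(short_read_gb, 48),
    "ultra-long reads (Guppy/Bonito)": gigabases_per_minute(20, 142),
    "accurate long reads (CCS)": gigabases_per_minute(200, 24 * 60),
}

print(f"short-read bases: {short_read_gb} Gb")
for name, rate in rates.items():
    print(f"{name}: {rate:.3f} Gb/min")
```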
Evaluation of three state-of-the-art read mappers (BWA-MEM2, minimap2, and pbmm2) on three types of sequencing data (short reads, ultra-long reads, and accurate long reads, respectively).
| | Short reads | Ultra-long reads | Accurate long reads |
|---|---|---|---|
| Read mapping tool | BWA-MEM2 | minimap2 | pbmm2 |
| Version | 2.2.1 | 2.24-r1122 | 1.7.0 |
| Reference genome | Human genome (HG38), GCA_000001405.15, FASTA size 3.2 GB | | |
| Indexing time (CPU) | 2002 | 163 | 144 |
| Indexing peak memory | 72.3 GB | 11.4 GB | 14.4 GB |
| Indexing size* | 17 GB | 7.3 GB | 5.7 GB |
| Read set (accession number) | | | |
| Number of reads | 1,430,362,384 | 6,724,033 | 1,289,591 |
| Number of bases | 144,466,600,784 | 82,196,263,791 | 24,260,611,730 |
| Read mapping time | 868.9 h | 48.6 h | 17.5 h |
| Mapping peak memory | 131.2 GB | 9.6 GB | 20 GB |
| Number of mapped reads | 2,842,576,947 | 3,854,572 | 1,286,256 |
| Mapping throughput (input bases/mapping time) | 92,369 bases/ | 469,801 bases/ | 385,090 bases/ |
| Number of output mappings (SAM) | 2,875,143,231 | 8,322,218 | 1,396,899 |
| File size of output mappings (SAM) | 222.6 GB | 190.2 GB | 54.4 GB |
For paired-end read mapping.
After excluding secondary, supplementary, and unmapped alignments using SAMtools and SAM flag 2308.
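The SAM flag value 2308 in the footnote is the sum of three standard FLAG bits: 4 (unmapped), 256 (secondary), and 2048 (supplementary). The sketch below decodes the mask and also recomputes the mapping throughput row. Two assumptions are made and are not stated in the table: the throughput is taken to be in bases per second, and for the paired-end short reads the listed base count is taken to cover one end of each pair, so it is doubled. Under those assumptions the recomputed values agree with the table to within rounding of the mapping times. Function names are hypothetical.

```python
# Standard FLAG bits from the SAM format specification.
SAM_FLAG_BITS = {
    0x1: "paired", 0x2: "proper pair", 0x4: "unmapped", 0x8: "mate unmapped",
    0x10: "reverse strand", 0x20: "mate reverse strand",
    0x40: "first in pair", 0x80: "second in pair",
    0x100: "secondary", 0x200: "QC fail", 0x400: "duplicate",
    0x800: "supplementary",
}

def decode_flag(flag):
    """List the names of all bits set in a SAM FLAG value, lowest bit first."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]

def keep_record(flag, exclude_mask=2308):
    """Keep a record only if none of the excluded bits is set."""
    return flag & exclude_mask == 0

def bases_per_second(bases, hours):
    """Mapping throughput, ASSUMING the table reports bases per second."""
    return bases / (hours * 3600)

print(decode_flag(2308))

# ASSUMPTION: the short-read base count is per end, hence the factor of 2.
throughput = {
    "BWA-MEM2": bases_per_second(2 * 144_466_600_784, 868.9),
    "minimap2": bases_per_second(82_196_263_791, 48.6),
    "pbmm2": bases_per_second(24_260_611_730, 17.5),
}
for tool, value in throughput.items():
    print(f"{tool}: {value:,.0f} bases/s")
```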
Evaluation of variant calling with the DeepVariant tool on the read mapping results of three types of sequencing data: short reads, ultra-long reads, and accurate long reads.
| | Short reads | Ultra-long reads | Accurate long reads |
|---|---|---|---|
| Variant calling tool | DeepVariant | PEPPER + DeepVariant | DeepVariant |
| Version | 1.3.0 | 0.8 | 1.3.0 |
| Total number of bases in input SAM file | 250,103,434,512 | 56,958,985,752 | 23,944,354,059 |
| Phase 1 (make examples) CPU Time ( | 250,359 | 1,136,356 | 230,066 |
| Phase 2 (call variants) CPU Time ( | 473,962 | 3,480,419 | 549,272 |
| Phase 3 (post-process variants) CPU Time ( | 2,193 | 2,765 | 6,201 |
| Total Run Time ( | 746,514 | 4,619,540 | 785,539 |
| Number of Called Variants (PASS) | 4,644,980 | 6,054,168 | 4,589,024 |
| Number of Called Variants (RefCall) | 1,313,292 | 6,722,265 | 2,603,968 |
| Variant Calling Throughput (Number of called variants / | 7.9 variants/ | 2.77 variants/ | 9.2 variants/ |
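The throughput row appears consistent with dividing the number of called variants (PASS plus RefCall) by the total run time. The following sketch recomputes it from the table. Treat this formula as an assumption inferred from the numbers, and the result as variants per unit of run time, in whatever time unit the table uses; the dictionary layout and function name are hypothetical.

```python
# Values copied from the variant calling table.
RESULTS = {
    "short reads": {"pass": 4_644_980, "refcall": 1_313_292, "total_time": 746_514},
    "ultra-long reads": {"pass": 6_054_168, "refcall": 6_722_265, "total_time": 4_619_540},
    "accurate long reads": {"pass": 4_589_024, "refcall": 2_603_968, "total_time": 785_539},
}

def variant_throughput(row):
    """ASSUMED formula: (PASS + RefCall variants) / total run time."""
    return (row["pass"] + row["refcall"]) / row["total_time"]

throughputs = {name: variant_throughput(row) for name, row in RESULTS.items()}
for name, value in throughputs.items():
    print(f"{name}: {value:.2f} variants per time unit")
```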