Altti Ilari Maarala, Ossi Arasalo, Daniel Valenzuela, Veli Mäkinen, Keijo Heljanko.
Abstract
Computational pan-genomics uses information from multiple individual genomes in large-scale comparative analysis. Genetic variation between case-control groups, ethnic groups, or species can be discovered thoroughly using pan-genomes of such subpopulations. Whole-genome sequencing (WGS) data volumes are growing rapidly, which makes genomic data compression and indexing methods very important. Although space-efficient compression and indexing methods for repetitive sequences exist, the deployed compression methods are often sequential and computationally time-consuming, and they do not provide efficient sequence alignment on vast collections of genomes such as pan-genomes. To perform rapid analytics on ever-growing genomic data, compression and indexing methods have to exploit distributed and parallel computing more efficiently. Instead of strict genome data compression methods, we focus on the efficient construction of a compressed index for pan-genomes. A compressed hybrid index enables fast sequence alignment to several genomes at once while shrinking the index size significantly compared to traditional indexes. We propose a scalable distributed compressed hybrid-indexing method for large genomic data sets that enables pan-genome-based sequence search and read alignment. We show the scalability of our tool, DHPGIndex, with experiments on a distributed Apache Spark-based computing cluster comprising 448 cores distributed over 26 nodes. The experiments were performed with both human and bacterial genomes. DHPGIndex built a BLAST index for a human pan-genome of n = 250 genomes with an 870:1 compression ratio (CR) in 342 minutes and a Bowtie2 index with 157:1 CR in 397 minutes. For a human pan-genome of n = 1,000 genomes, the BLAST index was built in 1520 minutes with 532:1 CR and the Bowtie2 index in 1938 minutes with 76:1 CR. Bowtie2 aligned 14.6 GB of paired-end reads to the compressed (n = 1,000) index in 31.7 minutes on a single node.
Compressing the GenBank database (n = 13,375,031 sequences, 488 GB) into a BLAST index took 575 minutes and resulted in a CR of 62:1. BLASTing 189,864 CRISPR-Cas9 gRNA target sequences (23 MB in total) against the compressed index of the human pan-genome (n = 1,000) finished in 45 minutes on a single node. 30 MB of mixed bacterial sequences (n = 599) were BLASTed against the compressed index of the 488 GB GenBank database (n = 13,375,031) in 26 minutes on 25 nodes. 78 MB of mixed sequences (n = 4,167) were BLASTed against the compressed index of an 18 GB E. coli sequence database (n = 745,409) in 5.4 minutes on a single node.
Year: 2021 PMID: 34343181 PMCID: PMC8330939 DOI: 10.1371/journal.pone.0255260
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Distributed Relative Lempel-Ziv compression.
Fig 2. Distributed compression pipeline for hybrid-index.
Fig 3. Distributed hybrid-indexing with BLAST.
Fig 4. Sequence alignment with hybrid-index.
Compressing a human pan-genome (n = 250) with increasing RLZ dictionary length (% of pan-genome length).
| Dict. length (%) | DRLZ (min) | Compressed size (GB) | CR | Bpc |
|---|---|---|---|---|
| 1 | 111 | 52.2 | 13.8 | 0.072 |
| 15 | 269 | 4.8 | 150.5 | 0.0066 |
| 30 | 326 | 2.8 | 258.0 | 0.0039 |
| 45 | 398 | 1.9 | 380.3 | 0.0026 |
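The trade-off in this table (a longer dictionary costs more compression time but yields a smaller output) follows from how relative Lempel-Ziv (RLZ) parsing works: the text is factorized greedily into the longest substrings that also occur in the dictionary, so a dictionary covering more of the variation needs fewer phrases. The toy sketch below illustrates the general technique only. It is not DHPGIndex's distributed implementation; the function names `rlz_parse` and `rlz_decode` are hypothetical, and the quadratic longest-match search stands in for the suffix-array or FM-index lookup a real tool would use.

```python
from typing import List, Tuple, Union

Phrase = Union[Tuple[int, int], str]  # (dictionary position, length) or a literal


def rlz_parse(dictionary: str, text: str) -> List[Phrase]:
    """Greedy RLZ parse of `text` relative to `dictionary` (toy, quadratic)."""
    phrases: List[Phrase] = []
    i = 0
    while i < len(text):
        best_pos, best_len = -1, 0
        # Naive search for the longest dictionary substring matching text[i:].
        for start in range(len(dictionary)):
            length = 0
            while (start + length < len(dictionary)
                   and i + length < len(text)
                   and dictionary[start + length] == text[i + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = start, length
        if best_len == 0:
            phrases.append(text[i])  # symbol absent from the dictionary
            i += 1
        else:
            phrases.append((best_pos, best_len))
            i += best_len
    return phrases


def rlz_decode(dictionary: str, phrases: List[Phrase]) -> str:
    """Rebuild the text by copying each phrase out of the dictionary."""
    parts = []
    for phrase in phrases:
        if isinstance(phrase, str):
            parts.append(phrase)
        else:
            pos, length = phrase
            parts.append(dictionary[pos:pos + length])
    return "".join(parts)


text = "ACGTTTGAACGTAC"
long_parse = rlz_parse("ACGTACGTTTGA", text)   # → [(4, 8), (0, 6)]
short_parse = rlz_parse("ACGT", text)          # shorter dictionary, more phrases
print(len(long_parse), len(short_parse))
```

With the longer dictionary the 14-symbol text collapses into two phrases, while the 4-symbol dictionary forces many short copies, which mirrors the direction of the CR column above.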
Compressing human pan-genomes using a reference sequence size of 30%.
| n | Size (GB) | DRLZ (min) | Compressed size (GB) | CR | Bpc |
|---|---|---|---|---|---|
| 250 | 722.5 | 326 | 2.8 | 258.0 | 0.0039 |
| 500 | 1445 | 678 | 6.8 | 208.0 | 0.0048 |
| 750 | 2167.5 | 1049 | 11.6 | 187.4 | 0.0053 |
| 1000 | 2890 | 1421 | 16 | 176.8 | 0.0057 |
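As a quick reading aid for this table, the short sketch below recomputes two derived quantities from the printed values: the DRLZ time per genome, which shows how close the scaling is to linear, and the reciprocal of CR, which agrees with the Bpc column to the printed precision in every row. The variable and function names are illustrative only.

```python
# (n, DRLZ minutes, reported CR, reported Bpc) copied from the table above
rows = [
    (250, 326, 258.0, 0.0039),
    (500, 678, 208.0, 0.0048),
    (750, 1049, 187.4, 0.0053),
    (1000, 1421, 176.8, 0.0057),
]


def minutes_per_genome(n: int, minutes: float) -> float:
    """Average DRLZ compression time spent per genome."""
    return minutes / n


per_genome = [minutes_per_genome(n, minutes) for n, minutes, _, _ in rows]
inverse_cr = [1.0 / cr for _, _, cr, _ in rows]

for (n, _, _, bpc), t, inv in zip(rows, per_genome, inverse_cr):
    print(f"n={n:>4}  {t:.2f} min/genome  1/CR={inv:.4f}  Bpc={bpc}")
```

The per-genome time rises only slowly with n (from roughly 1.3 to 1.4 minutes), so compression time grows slightly faster than linearly over this range.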
Building a complete hybrid-index for a human pan-genome using distributed indexing.
| n | Size (GB) | Tool | Indexing (min) | Index (GB) | CR | Bpc |
|---|---|---|---|---|---|---|
| 250 | 722.5 | Bowtie2 | 71 | 4.6 | 157.07 | 0.00637 |
| 250 | 722.5 | BLAST | 16 | 0.83 | 870.48 | 0.00115 |
| 500 | 1445 | Bowtie2 | 222 | 16.2 | 87.03 | 0.0115 |
| 500 | 1445 | BLAST | 64 | 2.6 | 542.31 | 0.00184 |
| 1000 | 2890 | Bowtie2 | 517 | 38 | 76.05 | 0.0131 |
| 1000 | 2890 | BLAST | 107 | 5.3 | 532.08 | 0.00188 |
The Bowtie2 index is created on a single node. The BLAST index is created in parallel on 22 nodes from chromosomal kernels. The kernelization step is included in the indexing time and was performed in parallel on 22 nodes in both cases.
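The end-to-end build times quoted in the abstract appear to be the DRLZ compression time plus the indexing time reported here. The sketch below checks that reading, assuming the DRLZ minutes from the 30% reference-size table (326 for n = 250 and 1421 for n = 1,000); the helper name `total_build_minutes` is hypothetical.

```python
# DRLZ compression time in minutes (30 % reference size), keyed by n
drlz_minutes = {250: 326, 1000: 1421}

# Indexing time in minutes, kernelization included, keyed by (n, tool)
indexing_minutes = {
    (250, "Bowtie2"): 71,
    (250, "BLAST"): 16,
    (1000, "Bowtie2"): 517,
    (1000, "BLAST"): 107,
}


def total_build_minutes(n: int, tool: str) -> int:
    """Compression plus indexing time for one pan-genome size and tool."""
    return drlz_minutes[n] + indexing_minutes[(n, tool)]


totals = {key: total_build_minutes(*key) for key in indexing_minutes}
for (n, tool), minutes in totals.items():
    print(f"n={n:>4}  {tool:<8} {minutes} min")
```

Three of the four sums reproduce the abstract exactly (342, 397, and 1938 minutes). The n = 1,000 BLAST sum comes to 1528 minutes, slightly above the 1520 minutes quoted in the abstract.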
Fig 5. Summary of compressing and indexing complete human pan-genomes with distributed and non-distributed methods.
DRLZ compression and indexing of bacterial pan-genomes with BLAST.
| Tool | Pan-genome (size) | Seqs. | DRLZ | Indexing | Index size | CR | Bpc |
|---|---|---|---|---|---|---|---|
| BLAST | E. coli (18 GB) | 745k | 33 min | 2 min | 0.177 GB | 101.69 | 0.0098 |
| BLAST | GenBank (488 GB) | 13.4M | 551 min | 24 min | 7.9 GB | 61.78 | 0.0162 |
| Bowtie2 | E. coli (18 GB) | 745k | 33 min | 8 min | 0.356 GB | 50.56 | 0.0198 |
| Bowtie2 | GenBank (488 GB) | 13.4M | 551 min | 58 min | 46 GB | 10.61 | 0.0943 |
The Bowtie2 index is created on a single node from a concatenated kernel. The BLAST index is created in parallel on 25 nodes from distributed kernels. The kernelization step is included in the indexing time.
Aligning sequences to a compressed human pan-genome index of n genomes.
| Tool | n | Query sequences | Time (min) | Mapped |
|---|---|---|---|---|
| Bowtie2 | 1000 | 2x28.86M (2x7.3 GB) | 31.7 | 10.12M |
| BLAST(megablast) | 1000 | 189.9k (23 MB) | 45.2 | 639k |
The effect of CHIC aligner parameters on the number of pan-genome mapped reads (n = 10). Each cell gives the number of mapped reads with the running time in minutes in parentheses.
| n | Reads (size) | default | -sALL | bowtie2 -a |
|---|---|---|---|---|
| 10 | 2x14.69M (2x949 MB) | 0.126M (0.6) | 530.14M (12.3) | 8154.46M (302) |
The CHIC aligner with Bowtie2 was run with three different parameter settings: default to find primary matches only, -sALL to find primary+secondary matches, and bowtie2 -a option (reports all approximate alignments) to find primary+secondary matches also from approximate read alignments.
Aligning bacterial sequences to a compressed index with BLAST (blastn).
| Pan-genome (seqs., size) | Query seqs. (size) | Matches, primary (min) | Matches, -sALL (min) |
|---|---|---|---|
| E. coli (745k, 18 GB) | 4.2k (78 MB) | 1741k (5.38) | 1872k (7.55) |
| GenBank (13.4M, 488 GB) | 599 (30 MB) | 13.79M (26.17) | 43.60M (44.56) |
The CHIC aligner was run with two different parameter settings: default to find primary matches only, and -sALL to find primary+secondary matches.
Aligning next-generation sequencing reads to a compressed index with Bowtie2.
| Pan-genome (seqs., size) | Reads (size) | Mapped, default (min) | Mapped, -sALL (min) | Mapped, bowtie2 -a (min) |
|---|---|---|---|---|
| E. coli (745k, 18 GB) | 3.1M (792 MB) | 73 (5.94) | 73 (5.98) | 2.2k (31.47) |
| GenBank (13.4M, 488 GB) | 27.2M (4334 MB) | 1.07k (12) | 41.5k (24) | 228.4k (92) |
The CHIC aligner has been executed with three different parameter settings: default to find primary matches only, -sALL to find primary+secondary matches, and bowtie2 -a option (reports all approximate alignments) to find primary+secondary matches also from approximate read alignments.