Alan Kuhnle, Taher Mun, Christina Boucher, Travis Gagie, Ben Langmead, Giovanni Manzini.
Abstract
Short-read aligners predominantly use the FM-index, which can easily index one or a few human genomes. However, it does not scale well to indexing collections of thousands of genomes. Driving this issue are the two chief components of the index: (1) a rank data structure over the Burrows-Wheeler Transform (BWT) of the string, which allows us to find the interval in the string's suffix array (SA), and (2) a sample of the SA that, when used with the rank data structure, allows us to access the SA. The rank data structure can be kept small even for large genomic databases by run-length compressing the BWT, but until recently there was no known means of keeping the SA sample small without greatly slowing down access to the SA. Now that an SA sample has been defined (SODA 2018) that takes about the same space as the run-length compressed BWT, we have the design for efficient FM-indexes of genomic databases but are faced with the problem of building them. In 2018, we showed how to build the BWT of large genomic databases efficiently (WABI 2018), but the problem of building the sample efficiently was left open. We compare our approach to state-of-the-art methods for constructing the SA sample and demonstrate that it is the fastest and most space-efficient method on highly repetitive genomic databases. Lastly, we apply our method to indexing partial and whole human genomes and show that it improves over the FM-index-based Bowtie method with respect to both memory and time, and over the hybrid index-based CHIC method with respect to query time and memory required for indexing.
Keywords: Burrows–Wheeler Transform; indexing; pan-genomics; r-index
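To make the two components named in the abstract concrete, the following is a minimal Python sketch of a toy FM-index: a rank table over the BWT drives backward search to find the SA interval, and a sampled SA plus LF-mapping steps recovers text positions. It is an illustration under stated assumptions, not the method of the paper: the rank table is uncompressed, suffix sorting is naive, and the SA is sampled at regular text positions rather than with the sample from SODA 2018. All function names and the `sample_rate` parameter are hypothetical.

```python
def build_fm_index(text, sample_rate=4):
    """Build a toy FM-index for a text ending in the sentinel '$'."""
    n = len(text)
    # Naive suffix sorting: fine for a toy example, not for genomes.
    sa = sorted(range(n), key=lambda i: text[i:])
    # BWT[i] is the character preceding the i-th smallest suffix (cyclically).
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # C[c] = number of characters in the text strictly smaller than c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i] = occurrences of c in bwt[:i]; an uncompressed rank table.
    occ = {c: [0] * (n + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (c == ch)
    # Assumption: regular sampling, keeping SA values divisible by sample_rate.
    sa_sample = {row: pos for row, pos in enumerate(sa) if pos % sample_rate == 0}
    return bwt, C, occ, sa_sample


def backward_search(pattern, C, occ, n):
    """Return the half-open SA interval [lo, hi) of suffixes prefixed by pattern."""
    lo, hi = 0, n
    for ch in reversed(pattern):
        if ch not in C:
            return 0, 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0, 0
    return lo, hi


def locate_row(row, bwt, C, occ, sa_sample):
    """Recover SA[row] by LF-mapping backwards until a sampled row is reached."""
    steps = 0
    while row not in sa_sample:
        ch = bwt[row]
        row = C[ch] + occ[ch][row]  # LF-mapping: move one text position left
        steps += 1
    return sa_sample[row] + steps


def find_all(text, pattern, sample_rate=4):
    """Return the sorted starting positions of pattern in text."""
    bwt, C, occ, sa_sample = build_fm_index(text, sample_rate)
    lo, hi = backward_search(pattern, C, occ, len(text))
    return sorted(locate_row(r, bwt, C, occ, sa_sample) for r in range(lo, hi))


print(find_all("GATTACAGATTACA$", "TTA"))  # [2, 9]
```

The trade-off the abstract describes is visible here: a larger `sample_rate` shrinks `sa_sample` but lengthens the LF walk in `locate_row`, which is why regular sampling cannot be both small and fast on large collections.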
Year: 2020 PMID: 32181684 PMCID: PMC7185338 DOI: 10.1089/cmb.2019.0309
Source DB: PubMed Journal: J Comput Biol ISSN: 1066-5277 Impact factor: 1.479
FIG. 1. Runtime and peak memory usage for construction of full SA. (a) Salmonella, 1 thread. (b) chr19, 1 thread. (c) chr19, 16 threads. (d) Peak memory, bigbwt. SA, suffix array.
FIG. 2. Runtime and peak memory usage for construction of SA sample. (a) Salmonella, 1 thread. (b) chr19, 1 thread. (c) chr19, 16 threads. (d) Peak memory, bigbwt.
FIG. 3. Scalability of r-index, Bowtie, and CHIC (RLZ compressed, FM-index kernel) against chr19 haplotype collection size and total sequence length (megabases) with respect to (a) index construction time (seconds), (b) index construction peak memory (megabytes), (c) index disk space (megabytes), and (d) locate time (seconds) for 100,000 queries of 100 base pairs each. Four different CHIC indexes were used, with different combinations of prefix size and maximum query length, each labeled as CHIC_(prefix size)p_(max query length).
FIG. 4. Peak index-building memory for r-index when indexing successively larger collections of 1KG individuals and whole-genome long-read assemblies (LRA). 1KG, 1000 genomes.
Sequence Length and n/r Statistic with Respect to Number of Whole Genomes for the First 6 Collections in the 1000 Genomes and Long-Read Assembly Series
| No. of genomes | Sequence length (MB), 1KG | Sequence length (MB), LRA | n/r, 1KG | n/r, LRA |
|---|---|---|---|---|
| 1 | 6072 | 6072 | 1.86 | 1.86 |
| 2 | 12,144 | 12,484 | 3.70 | 3.58 |
| 3 | 18,217 | 17,006 | 5.38 | 4.83 |
| 4 | 24,408 | 22,739 | 7.13 | 6.25 |
| 5 | 30,480 | 28,732 | 8.87 | 7.80 |
| 6 | 36,671 | 34,420 | 10.63 | 9.28 |
1KG, 1000 genomes; LRA, long-read assembly.
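The n/r statistic in the table is the ratio of the text length n to the number r of equal-character runs in its BWT. It grows as more similar genomes are added (from 1.86 to 10.63 for the 1KG series). The sketch below computes it for toy strings. It is an illustration only: the naive suffix sort, the function names, and the use of exact copies of a short made-up sequence as a stand-in for a repetitive collection are all assumptions, not the paper's data or construction method.

```python
from itertools import groupby


def bwt(text):
    """BWT of a text ending in the sentinel '$', via naive suffix sorting."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return "".join(text[i - 1] for i in sa)


def count_runs(s):
    """Number of maximal runs of equal characters in s."""
    return sum(1 for _ in groupby(s))


def n_over_r(text):
    """Ratio of text length n to the number of BWT runs r."""
    return len(text) / count_runs(bwt(text))


# Small worked example: the BWT of "banana$" is "annb$aa", which has 5 runs.
print(bwt("banana$"), count_runs(bwt("banana$")))  # annb$aa 5

# Hypothetical toy "genome"; repeating it mimics a highly repetitive collection.
genome = "GATTACAGGCTA"
single = n_over_r(genome + "$")
many = n_over_r(genome * 20 + "$")
print(many > single)  # True
```

Because r stays nearly constant while n grows with each added copy, a run-length compressed BWT (and an SA sample of comparable size) grows far more slowly than the collection itself.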