Camille Marchet, Christina Boucher, Simon J Puglisi, Paul Medvedev, Mikaël Salson, Rayan Chikhi.
Abstract
High-throughput sequencing data sets are usually deposited in public repositories (e.g., the European Nucleotide Archive) to ensure reproducibility. As the amount of data has reached petabyte scale, repositories do not allow one to perform online sequence searches, even though such a feature would be highly useful to investigators. Toward this goal, several computational approaches have been introduced in the last few years to index and query large collections of data sets. Here, we present an accessible survey of these approaches, which are generally based on representing data sets as sets of k-mers. We review their properties, introduce a classification, and present their general intuition. We summarize their performance and highlight their current strengths and limitations.
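To make the idea of representing data sets as sets of k-mers concrete, here is a minimal sketch in Python. It is not taken from any of the surveyed methods: the function names (`kmers`, `build_index`, `query`), the toy sequences, and the `threshold` parameter are all assumptions made for illustration. It stores each data set as a plain, uncompressed set of k-mers and reports the data sets that contain at least a given fraction of the query's k-mers.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(datasets, k):
    """Map each data set name to the union of the k-mers of its sequences.

    `datasets` is a hypothetical dict: name -> list of sequence strings.
    """
    return {
        name: set().union(*(kmers(s, k) for s in seqs))
        for name, seqs in datasets.items()
    }


def query(index, seq, k, threshold=0.8):
    """Return the names of data sets holding at least `threshold` of the
    query's k-mers. The threshold value is an illustrative assumption."""
    q = kmers(seq, k)
    if not q:
        return []  # query shorter than k: nothing to look up
    hits = []
    for name, kmer_set in index.items():
        shared = len(q & kmer_set) / len(q)
        if shared >= threshold:
            hits.append(name)
    return sorted(hits)


# Toy collection of two data sets (made-up sequences).
toy = {"A": ["ACGTACGT"], "B": ["TTTTGGGG"]}
index = build_index(toy, k=4)
result = query(index, "CGTAC", k=4)
```

Plain Python sets like these would not scale to the petabyte-sized collections discussed above; the sketch only shows the query semantics that an index over k-mer sets needs to support.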
Year: 2020 PMID: 33328168 PMCID: PMC7849385 DOI: 10.1101/gr.260604.119
Source DB: PubMed Journal: Genome Res ISSN: 1088-9051 Impact factor: 9.043
Figure 1. Timeline, main relationships, and application highlights for the methods covered in this survey.
Figure 2. Overview of the building blocks of a set of k-mer sets. We classified strategies into color-aggregative approaches and k-mer aggregative approaches (second column). The top row of the figure indicates the general categories of components of each method: the type of k-mer set, the way multiple sets are combined, and an optional compression scheme. Each subsequent row describes one of the surveyed methods. The cells in this figure are methodological choices, potentially common across methods; hence, many cells are joined.
Summary of the existing color-aggregative methods and some of their features
Summary of the existing k-mer aggregative methods and some of their features
Overview of the best achievable performance in terms of space and time requirements to build indices