Katherine Yelick, Aydın Buluç, Muaaz Awan, Ariful Azad, Benjamin Brock, Rob Egan, Saliya Ekanayake, Marquita Ellis, Evangelos Georganas, Giulia Guidi, Steven Hofmeyr, Oguz Selvitopi, Cristina Teodoropol, Leonid Oliker.
Abstract
Genomic datasets are growing dramatically as the cost of sequencing continues to decline and small sequencing devices become available. Enormous community databases store and share these data with the research community, but some genomic data analysis problems require large-scale computational platforms to meet both their memory and computational requirements. These applications differ from the scientific simulations that dominate the workload on high-end parallel systems today, and they place different requirements on programming support, software libraries and parallel architectural design. For example, they involve irregular communication patterns such as asynchronous updates to shared data structures. We consider several problems in high-performance genomics analysis, including alignment, profiling, clustering and assembly for both single genomes and metagenomes. We identify some of the common computational patterns or 'motifs' that help inform parallelization strategies and compare our motifs to some of the established lists, arguing that at least two key patterns, sorting and hashing, are missing. This article is part of a discussion meeting issue 'Numerical algorithms for high-performance computational science'.
Keywords: bioinformatics; high-performance data analytics; parallel computing
Year: 2020 PMID: 31955674 PMCID: PMC7015300 DOI: 10.1098/rsta.2019.0394
Source DB: PubMed Journal: Philos Trans A Math Phys Eng Sci ISSN: 1364-503X Impact factor: 4.226
Figure 1. A spectrum of regularity with different patterns of communication and synchronization. Data analysis is often at the two extremes.
Figure 2. Seven parallelism motifs in genomic data analysis.
Figure 3. Dependencies of various machine learning methods upon linear algebraic primitives. The three grey boxes on the top are unsupervised methods while the two white boxes include supervised methods. Examples of algorithms in each group are in parentheses: non-negative matrix factorization (NMF), principal component analysis (PCA), Markov cluster algorithm (MCL), convex correlation selection method (CONCORD) and a low-rank matrix factorization (CX).
A comparison of motifs for parallel computing, including our own set for genomic data analysis.
| dense matrix | dense matrix | dense and sparse matrix | dense matrix |
| sparse matrix | sparse matrix | | sparse matrix |
| structured grid | structured grid | | |
| unstructured grid | unstructured grid | | |
| spectral methods | spectral methods | | |
| particle methods | N-body | Gen. N-body | Gen. N-body |
| Monte Carlo | MapReduce | basic statistics | basic operations (a) |
| | finite-state machine | | |
| | graph traversal | graph theoretic | graph traversal |
| | dynamic prog. | alignment | alignment |
| | backtracking search | | |
| | graphical models | | |
| | combinatorial | | |
| | | optimization | |
| | | integration | |
| | | | hash tables |
| | | | sorting |
(a) Basic operations include string parsing, string identity and 2-bit encoding of DNA sequences.
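As an illustration of the basic operations named in the footnote and of the hashing pattern highlighted in the abstract, the following is a minimal Python sketch, not the authors' implementation. It packs DNA bases into two bits each and counts k-mers in a hash table keyed by the packed value. The base-to-code mapping and the function names `encode`, `decode` and `count_kmers` are assumptions made for this example; ambiguous bases such as `N` are not handled and would raise a `KeyError`.

```python
from collections import Counter

# Assumed mapping: any fixed assignment of the four bases to 0..3 would work.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def encode(seq: str) -> int:
    """Pack a DNA string into an integer, two bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def decode(value: int, length: int) -> str:
    """Unpack `length` bases from a 2-bit packed integer."""
    bases = []
    for _ in range(length):
        bases.append(BASES[value & 3])
        value >>= 2
    return "".join(reversed(bases))


def count_kmers(seq: str, k: int) -> Counter:
    """Count k-mers in a hash table keyed by their 2-bit encodings."""
    counts = Counter()
    if k <= 0 or len(seq) < k:
        return counts
    mask = (1 << (2 * k)) - 1      # keeps only the last k bases
    kmer = encode(seq[:k])
    counts[kmer] += 1
    for base in seq[k:]:
        # Slide the window: shift in the new base, drop the oldest one.
        kmer = ((kmer << 2) | CODE[base]) & mask
        counts[kmer] += 1
    return counts
```

For example, `count_kmers("ACGTACGT", 4)` records five 4-mers in total, and the entry for `encode("ACGT")` has count 2. In a distributed setting the same packed key could be hashed to choose an owning process, which is one way the irregular, asynchronous updates described in the abstract arise.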