Brendan Lawlor1,2, Richard Lynch1, Micheál Mac Aogáin3, Paul Walsh2.
Abstract
Background: Bioinformatic research is increasingly dependent on large-scale datasets, accessed from either private or public repositories. An example of a public repository is the National Center for Biotechnology Information's (NCBI's) Reference Sequence (RefSeq) database. These repositories must decide in what form to make their data available. Unstructured data can be put to almost any use, but access to them is hard to scale. Highly structured data offer improved performance for specific algorithms but limit the wider usefulness of the data. We present an alternative: lightly structured data stored in Apache Kafka in a way that is amenable to parallel access and streamed processing, including subsequent transformations into more highly structured representations. We contend that this approach could provide a flexible and powerful nexus of bioinformatic data, bridging the gap between low structure on one hand, and high performance and scale on the other. To demonstrate this, we present a proof-of-concept version of NCBI's RefSeq database using this technology. We measure the performance and scalability characteristics of this alternative with respect to flat files.
Year: 2018 PMID: 29635394 PMCID: PMC5906921 DOI: 10.1093/gigascience/giy036
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Figure 1: Anatomy of an Apache Kafka topic and partitions (from the Apache Kafka website).
Figure 2: Consumer groups (from the Apache Kafka website).
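The idea behind consumer groups is that the partitions of a topic are divided among the members of a group, so that each partition is read by exactly one member and the members can work in parallel. The following is a minimal pure-Python sketch of that idea using a simple round-robin split. The function name `assign_partitions` and the agent names are hypothetical, and the sketch does not reproduce Kafka's built-in assignment strategies.

```python
from collections import defaultdict


def assign_partitions(partitions, consumers):
    """Spread partitions over the consumers of one group, round-robin.

    Illustrative only: this is not Kafka's own assignment code.
    Consumers that receive no partition are simply absent from the result.
    """
    if not consumers:
        raise ValueError("a consumer group needs at least one member")
    assignment = defaultdict(list)
    for i, partition in enumerate(sorted(partitions)):
        owner = consumers[i % len(consumers)]  # next member in turn
        assignment[owner].append(partition)
    return dict(assignment)


if __name__ == "__main__":
    # Six partitions shared by three hypothetical agents.
    print(assign_partitions(range(6), ["agent-a", "agent-b", "agent-c"]))
```

Adding members to the group shrinks each member's share of partitions, which is the property that lets reads scale with the number of agents, up to the number of partitions.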
Figure 3: NCBI download agent.
Figure 4: GC content agent.
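GC content is the fraction of bases in a sequence that are guanine (G) or cytosine (C). As a rough illustration of the per-sequence computation such an agent might perform, here is a minimal sketch. The function names are hypothetical, and the sketch assumes the fraction is taken over all characters of the sequence; how the authors' agent treats ambiguous bases is not stated here.

```python
def gc_content(sequence: str) -> float:
    """Return the fraction of characters in `sequence` that are G or C."""
    seq = sequence.upper()  # accept lower-case (soft-masked) input
    if not seq:
        return 0.0  # avoid dividing by zero on an empty sequence
    gc = seq.count("G") + seq.count("C")
    return gc / len(seq)


def gc_for_records(records):
    """Lazily map (identifier, sequence) pairs to (identifier, GC fraction)."""
    for identifier, sequence in records:
        yield identifier, gc_content(sequence)


if __name__ == "__main__":
    sample = [("seq1", "ATGCGC"), ("seq2", "atat")]
    for identifier, fraction in gc_for_records(sample):
        print(identifier, fraction)
```

Because each sequence is processed independently, this kind of computation lends itself to being applied record by record to a stream.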
Download times (seconds)
| T/A | DL_B | DL_4 | DL_8 | DL_12 |
|---|---|---|---|---|
| 4 | 94 | 96 | 116 | 99 |
| 8 | 216 | 179 | 165 | 152 |
| 12 | 333 | 381 | 174 | 146 |
| 16 | 445 | 508 | 259 | 202 |
| 20 | | | 395 | 214 |
| 24 | | | 484 | 275 |
T/A: number of threads (Benchmark) or agents (Field of Genes). DL_B: download on Benchmark. DL_4: download on Field of Genes cluster of size 4, etc.
Figure 5: Scalability of download.
GC content times (seconds)
| T/A | GC_B | GC_4 | GC_8 | GC_12 | Sequences |
|---|---|---|---|---|---|
| 4 | 207 | 260 | 260 | 270 | 1.56 × 10^5 |
| 8 | 335 | 275 | 261 | 262 | 3.1 × 10^5 |
| 12 | 498 | 326 | 276 | 249 | 4.58 × 10^5 |
| 16 | 664 | 475 | 288 | 274 | 6.16 × 10^5 |
| 20 | | | 315 | 284 | 7.7 × 10^5 |
| 24 | | | 348 | 286 | 9.27 × 10^5 |
T/A: number of threads (Benchmark) or agents (Field of Genes). GC_B: GC content on Benchmark. GC_4: GC content on Field of Genes cluster of size 4, etc.
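A processing rate in sequences per second can be obtained from the table by dividing the number of sequences in a row by the corresponding time. The sketch below does this for the first timing column, which the legend order suggests is the Benchmark; that column assignment is an assumption, as are the names `benchmark_rows` and `processing_rate`.

```python
# (threads, seconds, sequences) from the first timing column of the table.
# Assumption: the first timing column is the Benchmark, as the legend order suggests.
benchmark_rows = [
    (4, 207, 1.56e5),
    (8, 335, 3.1e5),
    (12, 498, 4.58e5),
    (16, 664, 6.16e5),
]


def processing_rate(sequences: float, seconds: float) -> float:
    """Sequences processed per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return sequences / seconds


# Rate for each thread count, keyed by number of threads.
rates = {threads: processing_rate(n, t) for threads, t, n in benchmark_rows}

if __name__ == "__main__":
    for threads, rate in rates.items():
        print(f"{threads:>2} threads: {rate:,.0f} seq/sec")
```

The same division applies to the other timing columns, using the sequence count from the same row.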
Figure 6: Scalability of GC content.
Figure 7: Processing rates of GC content (seq/sec).
Figure 8: Example use context of Apache Kafka.