Rodrigo Aniceto, Rene Xavier, Valeria Guimarães, Fernanda Hondo, Maristela Holanda, Maria Emilia Walter, Sérgio Lifschitz.
Abstract
Rapid advances in high-throughput sequencing techniques have created interesting computational challenges in bioinformatics. One of them is the management of the massive amounts of data generated by automatic sequencers. We need to deal with the persistency of genomic data, particularly the storage and analysis of these large-scale processed data. Finding an alternative to the commonly used relational database model has therefore become a compelling task. Other data models may be more effective when dealing with very large amounts of nonconventional data, especially for write and retrieval operations. In this paper, we discuss the Cassandra NoSQL database approach for storing genomic data. We perform an analysis of persistency and I/O operations with real data, using the Cassandra database system. We also compare the results obtained with a classical relational database system and another NoSQL database approach, MongoDB.
Year: 2015 PMID: 26558254 PMCID: PMC4629038 DOI: 10.1155/2015/502795
Source DB: PubMed Journal: Int J Genomics ISSN: 2314-436X Impact factor: 2.326
Figure 1. Database schema.
Figure 2. Stages of insertion.
Figure 3. Stages of extraction.
Cells files.
| File | File number | Size | Number of lines |
|---|---|---|---|
| Liver cells files | 1 | 9,0 GB | 850.933 |
| Liver cells files | 2 | 4,0 GB | 358.841 |
| Liver cells files | 3 | 3,2 GB | 286.563 |
| Kidney cells files | 4 | 6,9 GB | 648.612 |
| Kidney cells files | 5 | 3,8 GB | 335.973 |
| Kidney cells files | 6 | 5,3 GB | 475.210 |
Times to insert and extract sequences from the database.
| File | Size | Insertion, Cassandra (2) | Insertion, Cassandra (4) | Extraction, Cassandra (2) | Extraction, Cassandra (4) |
|---|---|---|---|---|---|
| 1 | 9,0 GB | 14 m 30 s 645 ms | 11 m 44 s 105 ms | 23 m 37 s 964 ms | 15 m 04 s 158 ms |
| 2 | 4,0 GB | 6 m 10 s 471 ms | 5 m 05 s 710 ms | 9 m 41 s 018 ms | 7 m 34 s 523 ms |
| 3 | 3,2 GB | 5 m 05 s 914 ms | 4 m 51 s 823 ms | 7 m 39 s 188 ms | 6 m 02 s 648 ms |
| 4 | 6,9 GB | 11 m 25 s 899 ms | 8 m 27 s 630 ms | 14 m 25 s 120 ms | 10 m 00 s 031 ms |
| 5 | 3,8 GB | 6 m 09 s 417 ms | 4 m 42 s 386 ms | 8 m 37 s 890 ms | 6 m 05 s 487 ms |
| 6 | 5,3 GB | 8 m 43 s 330 ms | 8 m 05 s 215 ms | 12 m 23 s 855 ms | 9 m 03 s 041 ms |
Figure 4. Comparison between inserts (time × file number).
Figure 5. Comparison between extractions (time × file number).
PostgreSQL and Cassandra results.
| Database | Insertion | Extraction |
|---|---|---|
| PostgreSQL | 1 h 51 m 54 s | 28 m 27 s |
| Cassandra (2) | 52 m 5 s | 1 h 16 m 25 s |
| Cassandra (4) | 42 m 56 s | 53 m 49 s |
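The Cassandra totals in this table appear to be the sums of the six per-file times listed earlier, truncated to whole seconds. The following is a minimal sketch that checks this arithmetic; the helper names `to_ms` and `format_total` and the `PER_FILE` dictionary are illustrative assumptions, not part of the paper.

```python
import re

# Per-file times for files 1-6, copied from the per-file timing table.
PER_FILE = {
    "Insertion, Cassandra (2)": [
        "14 m 30 s 645 ms", "6 m 10 s 471 ms", "5 m 05 s 914 ms",
        "11 m 25 s 899 ms", "6 m 09 s 417 ms", "8 m 43 s 330 ms",
    ],
    "Insertion, Cassandra (4)": [
        "11 m 44 s 105 ms", "05 m 05 s 710 ms", "4 m 51 s 823 ms",
        "8 m 27 s 630 ms", "4 m 42 s 386 ms", "8 m 05 s 215 ms",
    ],
    "Extraction, Cassandra (2)": [
        "23 m 37 s 964 ms", "9 m 41 s 018 ms", "7 m 39 s 188 ms",
        "14 m 25 s 120 ms", "8 m 37 s 890 ms", "12 m 23 s 855 ms",
    ],
    "Extraction, Cassandra (4)": [
        "15 m 04 s 158 ms", "7 m 34 s 523 ms", "6 m 02 s 648 ms",
        "10 m 00 s 031 ms", "6 m 05 s 487 ms", "9 m 03 s 041 ms",
    ],
}

# Milliseconds per unit; "ms" is listed first in the regex so it is not read as "m".
UNIT_MS = {"h": 3_600_000, "m": 60_000, "s": 1_000, "ms": 1}


def to_ms(text):
    """Convert a duration such as '14 m 30 s 645 ms' to milliseconds."""
    pairs = re.findall(r"(\d+)\s*(ms|h|m|s)\b", text)
    return sum(int(number) * UNIT_MS[unit] for number, unit in pairs)


def format_total(ms):
    """Format milliseconds as '[H h ]M m S s', truncating to whole seconds."""
    hours, rest = divmod(ms // 1000, 3600)
    minutes, seconds = divmod(rest, 60)
    prefix = f"{hours} h " if hours else ""
    return f"{prefix}{minutes} m {seconds} s"


for label, times in PER_FILE.items():
    print(label, "->", format_total(sum(to_ms(t) for t in times)))
```

Under these assumptions the four sums come to 52 m 5 s, 42 m 56 s, 1 h 16 m 25 s, and 53 m 49 s, which agree with the Cassandra rows of the table.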
Figure 6. Comparison between Cassandra and PostgreSQL.
MongoDB and Cassandra final results.
| Database | Insertion | Extraction |
|---|---|---|
| MongoDB | 45 m 17 s | 19 m 13 s |
| Cassandra (2) | 52 m 5 s | 1 h 16 m 25 s |
| Cassandra (4) | 42 m 56 s | 53 m 49 s |
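To compare the three systems on a common scale, the reported totals can be converted to seconds and expressed as ratios. The sketch below does only that arithmetic on the values from the two result tables; the names `to_seconds`, `relative_time`, and `TOTALS` are hypothetical helpers introduced for illustration.

```python
import re

UNIT_S = {"h": 3600, "m": 60, "s": 1}

# (insertion, extraction) totals as reported in the result tables.
TOTALS = {
    "PostgreSQL": ("1 h 51 m 54 s", "28 m 27 s"),
    "MongoDB": ("45 m 17 s", "19 m 13 s"),
    "Cassandra (2)": ("52 m 5 s", "1 h 16 m 25 s"),
    "Cassandra (4)": ("42 m 56 s", "53 m 49 s"),
}
INSERTION, EXTRACTION = 0, 1


def to_seconds(text):
    """Convert a duration such as '1 h 51 m 54 s' to whole seconds."""
    pairs = re.findall(r"(\d+)\s*([hms])\b", text)
    return sum(int(number) * UNIT_S[unit] for number, unit in pairs)


def relative_time(database, baseline, column):
    """Return how many times longer `database` took than `baseline`."""
    return to_seconds(TOTALS[database][column]) / to_seconds(TOTALS[baseline][column])


for name in TOTALS:
    ins = relative_time(name, "Cassandra (4)", INSERTION)
    ext = relative_time(name, "Cassandra (4)", EXTRACTION)
    print(f"{name}: insertion x{ins:.2f}, extraction x{ext:.2f} vs Cassandra (4)")
```

A ratio above 1 means the database took longer than Cassandra (4) for that operation; a ratio below 1 means it was faster.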
Figure 7. Comparison between Cassandra and MongoDB.