Maria Luiza Mondelli, Thiago Magalhães, Guilherme Loss, Michael Wilde, Ian Foster, Marta Mattoso, Daniel Katz, Helio Barbosa, Ana Tereza R de Vasconcelos, Kary Ocaña, Luiz M R Gadelha.
Abstract
Advances in sequencing techniques have led to exponential growth in biological data, demanding the development of large-scale bioinformatics experiments. Because these experiments are computation- and data-intensive, they require high-performance computing techniques and can benefit from specialized technologies such as Scientific Workflow Management Systems and databases. In this work, we present BioWorkbench, a framework for managing and analyzing bioinformatics experiments. The framework automatically collects provenance data, including both performance data from workflow execution and data from the scientific domain of the workflow application. Provenance data can be analyzed through a web application that abstracts a set of queries to the provenance database, simplifying access to provenance information. We evaluate BioWorkbench using three case studies: SwiftPhylo, a phylogenetic tree assembly workflow; SwiftGECKO, a comparative genomics workflow; and RASflow, a RASopathy analysis workflow. We analyze each workflow from both computational and scientific domain perspectives by querying a provenance and annotation database. Some of these queries are available as a pre-built feature of the BioWorkbench web application. Through the provenance data, we show that the framework is scalable and achieves high performance, reducing the execution time of the case studies by up to 98%. We also show how the application of machine learning techniques can enrich the analysis process.
Keywords: Bioinformatics; Data analytics; Profiling; Provenance; Scientific workflows
Year: 2018 PMID: 30186700 PMCID: PMC6119457 DOI: 10.7717/peerj.5551
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
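The abstract reports a reduction in execution time of up to 98%. As a rough illustration of what that figure implies, the sketch below converts between a fractional time reduction and the equivalent speedup, using the standard definitions speedup = T_serial / T_parallel and reduction = 1 - T_parallel / T_serial. The function names are illustrative and do not come from BioWorkbench.

```python
def speedup_from_reduction(reduction: float) -> float:
    """Convert a fractional execution-time reduction (0 <= r < 1) into a speedup factor."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in the interval [0, 1)")
    return 1.0 / (1.0 - reduction)


def reduction_from_speedup(speedup: float) -> float:
    """Convert a speedup factor (> 0) into a fractional execution-time reduction."""
    if speedup <= 0.0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


if __name__ == "__main__":
    # A 98% reduction means the parallel run takes 2% of the original time.
    print(round(speedup_from_reduction(0.98), 2))
```

Under these definitions, a 98% reduction corresponds to a speedup of about 50 times over the baseline execution.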
Figure 1. BioWorkbench layered conceptual architecture.
Figure 2. Main entities in the conceptual model of the Swift provenance database.
Figure 3. BioWorkbench web interface displaying information about a RASopathy analysis workflow (RASflow) execution.
Figure 4. SwiftPhylo workflow modeling.
Figure 5. SwiftPhylo execution time and speedup.
Figure 6. SwiftPhylo workflow Gantt chart expressing its parallelism level.
Figure 7. SwiftGECKO workflow modeling.
Average duration (s) and the amount of data read and written by each activity of SwiftGECKO.
| Activity | Duration (s) | GB read | GB written |
|---|---|---|---|
| hits | 60,058 | 455.24 | 111.36 |
| sortHits | 4,457.3 | 111.36 | 111.36 |
| FragHits | 3,793.3 | 94.7 | 0.32 |
| filterHits | 2,402 | 111.36 | 83.09 |
| csvGenerator | 697.8 | 0.006 | 0.004 |
| combineFrags | 425.8 | 0.32 | 0.16 |
| w2hd | 425.1 | 7.05 | 11.67 |
| sortWords | 76.6 | 7.05 | 7.05 |
| words | 55.9 | 0.29 | 7.05 |
| reverseComplement | 11.8 | 0.15 | 0.14 |
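One way to read the table above is to ask which activity dominates the profile. The following minimal sketch computes each activity's share of the summed average durations, using the values copied from the table. It assumes the summed averages are a reasonable proxy for relative cost; they are not the workflow's wall-clock time, since activities may run many times and in parallel. The names `AVERAGE_DURATION_S` and `duration_shares` are illustrative.

```python
# Average duration in seconds per activity, copied from the table above.
AVERAGE_DURATION_S = {
    "hits": 60058.0,
    "sortHits": 4457.3,
    "FragHits": 3793.3,
    "filterHits": 2402.0,
    "csvGenerator": 697.8,
    "combineFrags": 425.8,
    "w2hd": 425.1,
    "sortWords": 76.6,
    "words": 55.9,
    "reverseComplement": 11.8,
}


def duration_shares(durations):
    """Return each activity's fraction of the summed average durations."""
    total = sum(durations.values())
    return {name: value / total for name, value in durations.items()}


shares = duration_shares(AVERAGE_DURATION_S)

if __name__ == "__main__":
    # Print activities from the largest share to the smallest.
    for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
        print(f"{name:18s} {share:6.1%}")
```

With these values, `hits` accounts for roughly 83% of the summed averages, which suggests it is the natural first target for optimization.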
Information gain ratio values of the attributes used for the analysis.
| Information gain ratio | Attribute |
|---|---|
| 1 | total_read |
| 0.1842 | word_length |
| 0.1611 | total_fasta_size |
| 0.1208 | total_genomes |
| 0.0421 | length |
| 0.0271 | similarity |
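For readers unfamiliar with the metric, the information gain ratio of an attribute is its information gain divided by the entropy of the attribute's own value distribution (the split information), as used in C4.5-style decision trees. The sketch below implements that textbook definition for categorical attributes on a toy dataset. The toy values and function names are assumptions made for illustration; they are not the data or the tooling behind the table above.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of categorical values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def gain_ratio(attribute, labels):
    """Information gain of `attribute` with respect to `labels`, divided by split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(attribute, labels):
        groups.setdefault(value, []).append(label)
    # Expected entropy of the labels after partitioning by the attribute.
    conditional = sum(len(group) / total * entropy(group) for group in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(attribute)
    return info_gain / split_info if split_info > 0 else 0.0


# Toy data (hypothetical): a class label and two candidate attributes.
labels = ["fast", "fast", "slow", "slow"]
perfect = ["low", "low", "high", "high"]   # fully determines the label
uninformative = ["a", "b", "a", "b"]       # independent of the label

if __name__ == "__main__":
    print(gain_ratio(perfect, labels), gain_ratio(uninformative, labels))
```

In this toy case the attribute that fully determines the class gets a ratio of 1 and the independent attribute gets 0, which are the two ends of the scale the table's values fall on.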
J48 predictive model.
BFTree predictive model.
Figure 8RASopathy analysis workflow modeling.
Figure 9Database entities for storing scientific domain annotations in RASflow.
Alignment rates resulting from the analysis of patients in RASflow.
| Patient | Alignment rate (%) |
|---|---|
| P1.log | 93.95 |
| P2.log | 94.52 |
| P3.log | 94.41 |
| P4.log | 94.48 |
| P5.log | 94.62 |
| P6.log | 94.58 |
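Domain annotations like the alignment rates above lend themselves to simple aggregate queries once they are stored in a relational database. The sketch below loads the table's values into an in-memory SQLite database and summarizes them. The table name `alignment_annotation` and its columns are hypothetical and chosen only for illustration; the actual BioWorkbench annotation schema is not reproduced here.

```python
import sqlite3

# Alignment rates (%) per patient log, copied from the table above.
ALIGNMENT_RATES = [
    ("P1.log", 93.95),
    ("P2.log", 94.52),
    ("P3.log", 94.41),
    ("P4.log", 94.48),
    ("P5.log", 94.62),
    ("P6.log", 94.58),
]

# Hypothetical schema, not the real BioWorkbench annotation tables.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE alignment_annotation (patient TEXT PRIMARY KEY, alignment_rate REAL)"
)
conn.executemany("INSERT INTO alignment_annotation VALUES (?, ?)", ALIGNMENT_RATES)


def alignment_summary(connection):
    """Return (minimum, maximum, mean) alignment rate across all patients."""
    return connection.execute(
        "SELECT MIN(alignment_rate), MAX(alignment_rate), AVG(alignment_rate) "
        "FROM alignment_annotation"
    ).fetchone()


lowest, highest, mean = alignment_summary(conn)

if __name__ == "__main__":
    print(f"min={lowest:.2f} max={highest:.2f} mean={mean:.2f}")
```

A query of this kind makes it easy to spot a sample whose alignment rate falls well below the others before downstream variant analysis.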
Mutation list with the biotype and the name of the transcribed genes in the final VCF file of a patient.
| Patient | Gene | Transcript | Name | Biotype |
|---|---|---|---|---|
| P1.log | ENSE00001768193 | ENST00000341065 | SAMD11-001 | protein_coding |
| P1.log | ENSE00003734555 | ENST00000617307 | SAMD11-203 | protein_coding |
| P1.log | ENSE00003734555 | ENST00000618779 | SAMD11-206 | protein_coding |
| P1.log | ENSE00001864899 | ENST00000342066 | SAMD11-010 | protein_coding |
| P1.log | ENSE00003734555 | ENST00000622503 | SAMD11-208 | protein_coding |