Amir Szitenberg, Max John, Mark L Blaxter, David H Lunt.
Abstract
The reproducibility of experiments is key to the scientific process, and particularly necessary for accurate reporting of analyses in data-rich fields such as phylogenomics. We present ReproPhylo, a phylogenomic analysis environment developed to ensure experimental reproducibility, to facilitate the handling of large-scale data, and to assist methodological experimentation. Reproducibility and instantaneous repeatability are built into the ReproPhylo system and require no user intervention or configuration, because the system stores the experimental workflow as a single, serialized Python object containing explicit provenance and environment information. This 'single file' approach ensures the persistence of provenance across iterations of the analysis, with changes automatically managed by the version control program Git. This file and a Git repository are the primary reproducibility outputs of the program. In addition, ReproPhylo produces an extensive human-readable report and generates a comprehensive experimental archive file, both of which are suitable for submission with publications. The system facilitates thorough experimental exploration of both parameters and data. ReproPhylo is a platform-independent CC0 Python module and is easily installed as a Docker image or a WinPython self-sufficient package, with a Jupyter Notebook GUI, or as a slimmer version in a Galaxy distribution.
Year: 2015 PMID: 26335558 PMCID: PMC4559436 DOI: 10.1371/journal.pcbi.1004447
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Summary of the Python module structure.
| Module feature | Description |
|---|---|
| Locus | Descriptor of the name, aliases, feature type and sequence type of an analysed locus |
| Project | Container for the input, intermediate and output datasets, and their metadata. Structured using Locus and Concatenation objects |
| method categories | |
| Read | Read data and metadata in any Biopython compatible format or tabular format for metadata |
| Filter | Filter sequences based on length, GC content or ID |
| edit_metadata | Programmatically manipulate sequence metadata |
| Align | Conduct sequence alignment(s) configured by a Conf object |
| Trim | Conduct alignment trimming configured by a Conf object |
| Tree | Conduct tree reconstruction(s) configured by a Conf object |
| Annotate | Annotate and root trees based on metadata stored in the Project |
| Write | Write files containing sequences, alignments, trees or metadata in any Biopython format |
| View | View alignments, statistics plots, occupancy tables etc. in the browser |
| Fetch | Copy a Project attribute (e.g. a tree or alignment object) into an independent variable |
| Conf | A set of classes for configuring the different analytic steps |
| | Contains alignment and sequence parameters of the data in the Project |
| Methods | |
| Sort | Sort the loci based on one of the available parameters |
| Plot | Plot parameter boxplots |
| Slice | Produce a supermatrix with certain parameter limits |
| Slide | Create supermatrices by a sliding window approach along a gradient of a given parameter |
| Concatenation | Descriptor of the locus and OTU composition of a supermatrix |
| method categories | |
| Add | Add the concatenation to the analysis |
| Make | Prepare a supermatrix based on the instructions |
| | |
| list_loci | List the loci found in a GenBank (gb) file, synonymize their names and choose among them |
| Report | Write human readable report containing detailed methods and results |
| Pickle | Serialize/deserialize a Project object |
| Exonerate | Functions to run exonerate, yielding metadata-rich gb files |
| Bayestraits | Invokes BayesTraits using a Project object as the input source for both trees and traits |
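To make the Filter category in the table above concrete, here is a minimal sketch in plain Python of filtering sequences by length and GC content. It does not use ReproPhylo's own API: the function names `gc_content` and `filter_records`, the thresholds and the toy records are all hypothetical and serve only as an illustration.

```python
def gc_content(seq):
    """Return the fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = sum(seq.count(base) for base in "GC")
    acgt = gc + sum(seq.count(base) for base in "AT")
    return gc / acgt if acgt else 0.0


def filter_records(records, min_length=0, max_length=None, min_gc=0.0, max_gc=1.0):
    """Keep (id, sequence) pairs whose length and GC content fall within the limits.

    Hypothetical helper for illustration; not part of ReproPhylo.
    """
    kept = []
    for seq_id, seq in records:
        if len(seq) < min_length:
            continue
        if max_length is not None and len(seq) > max_length:
            continue
        if not (min_gc <= gc_content(seq) <= max_gc):
            continue
        kept.append((seq_id, seq))
    return kept


# Toy records, invented for this example.
records = [
    ("seq1", "ATGCGC"),    # length 6, GC fraction 4/6
    ("seq2", "ATATATAT"),  # length 8, GC fraction 0
    ("seq3", "GC"),        # length 2, GC fraction 1
]
kept = filter_records(records, min_length=4, min_gc=0.3)
print([seq_id for seq_id, _ in kept])  # ['seq1']
```

Only `seq1` passes both limits: `seq2` has too little GC and `seq3` is too short.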
Fig 1. A typical ReproPhylo workflow.
This illustration shows the flow of data (blue arrows) and metadata (red arrows) through the phylogenetic analysis. Numbers on arrows correspond to code snippets in S1 Methods. Asterisks indicate an automatic pickle and Git checkpoint. The user can toggle between these checkpoints indefinitely using a built-in ReproPhylo function.
Fig 2. The phylogenetic workflow as a single Python object.
(A) The workflow is contained in a single object with bins (attributes) for the raw data and metadata, as well as for the various workflow analyses and forks. These are made provenance-explicit with unique IDs and names. (B) Analyses are invoked via commands that modify the workflow object. A command can invoke batch analysis for all the relevant data in the object. For example, the command ‘align’ applies to all the unaligned datasets. Commands can be limited to certain datasets using IDs, and customized using options. (C) Provenance survives version changes. The workflow object can be serialized (pickled) and then committed to a version control repository as a single file. Reverting to a previous output version also reverts the intermediate steps leading to it. Forks can be made post hoc from the all-inclusive, provenance-explicit (pickled) workflow object.
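The single-object idea in this caption can be sketched with the Python standard library alone. The following is a hedged illustration, not ReproPhylo's implementation: the dictionary layout, the function names (`new_workflow`, `log_step`, `align`, `save_checkpoint`, `load_checkpoint`) and the toy "alignment" that merely pads sequences with gaps are all assumptions made for the example.

```python
import os
import pickle
import platform
import sys
import tempfile


def new_workflow(records):
    """Create a hypothetical workflow object: one dict with named bins."""
    return {"records": dict(records), "alignments": {}, "provenance": []}


def log_step(workflow, step, **params):
    """Append a provenance entry: the step, its parameters and the environment."""
    workflow["provenance"].append({
        "step": step,
        "params": params,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    })


def align(workflow, method_id="toy_pad"):
    """Toy stand-in for alignment: right-pad every sequence with gaps."""
    width = max((len(s) for s in workflow["records"].values()), default=0)
    workflow["alignments"][method_id] = {
        seq_id: seq.ljust(width, "-") for seq_id, seq in workflow["records"].items()
    }
    log_step(workflow, "align", method_id=method_id,
             n_sequences=len(workflow["records"]))


def save_checkpoint(workflow, path):
    """Serialize data, results and provenance together into one file."""
    with open(path, "wb") as handle:
        pickle.dump(workflow, handle)


def load_checkpoint(path):
    """Restore a workflow, including its intermediate steps, from one file."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Toy sequences, invented for this example.
workflow = new_workflow({"taxonA": "ATGC", "taxonB": "ATGCAA"})
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "workflow.pkl")
    save_checkpoint(workflow, path)   # checkpoint taken before the alignment
    align(workflow)                   # modifies the workflow object in place
    reverted = load_checkpoint(path)  # revert to the earlier checkpoint

print(workflow["alignments"]["toy_pad"]["taxonA"])      # ATGC--
print(reverted["alignments"], reverted["provenance"])   # {} []
```

Because results and provenance live in the same serialized object, restoring the file restores both at once. Committing that one file to a Git repository after each step is what would give the version history described in panel (C).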
Fig 3. Exploratory phylogenomic analysis of a Lepidoptera dataset.
(A) A nucleotide dataset of 26 species from Kawahara and Breinholt [40] was reanalyzed. Loci were sorted by their median, 75th percentile and 25th percentile entropy values (centre panel). For each locus, a box plot was generated. The medians are denoted by brown dots. The boxes (blue) represent the 25th to 75th percentiles. Whiskers (black) represent values that lie outside the box, within a range 1.5 times as long as the box (a range that is null when the box itself has a null range). Trees (insets A 1–6) were reconstructed from 200-locus windows with a 50-locus overlap between neighbouring windows. The windows are represented by black and gray horizontal bars, each with an arrow pointing to the tree generated from it. In trees 1–6, dark blue highlights denote Rhopalocera (butterfly) taxa, and light blue, gray and yellow highlights denote clades I, III and IV respectively (sensu Kawahara and Breinholt [40]). Bullets on nodes represent bootstrap percentages (BP). Blue bullets represent maximal support. Other support values above 80% are denoted by gray bullets. (B-D) Three pairwise tree divergence metrics were calculated and are presented as heatmaps, with the most divergent tree pairs denoted by dark blue and identical tree pairs by a white box. While the scales are not comparable among the metrics, the relative differences are. The metrics are (B) the Symmetric Distance of Robinson-Foulds [44], (C) the Branch Distance [45] and (D) the evolutionary rate corrected Branch Distance [45].
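The windowing scheme in panel (A) can be sketched as follows: 200-locus windows with a 50-locus overlap advance 150 loci at a time along the sorted list. This is an illustration under stated assumptions rather than the authors' code. The function names are hypothetical, the entropy values are toy data, the 500-locus list is an arbitrary size chosen for the example, and keeping a shorter final window is an assumption the caption does not settle.

```python
from statistics import median, quantiles


def sort_loci_by_entropy(entropy_by_locus):
    """Order locus names by (median, 75th percentile, 25th percentile) entropy."""
    def key(name):
        values = entropy_by_locus[name]
        q25, _, q75 = quantiles(values, n=4)  # quartile cut points
        return (median(values), q75, q25)
    return sorted(entropy_by_locus, key=key)


def sliding_windows(sorted_loci, window=200, overlap=50):
    """Split an ordered list into windows sharing `overlap` items with the next.

    Assumption: a final, shorter window is kept if loci remain uncovered.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    step = window - overlap
    windows = []
    start = 0
    while start < len(sorted_loci):
        windows.append(sorted_loci[start:start + window])
        if start + window >= len(sorted_loci):
            break  # the last locus is covered; stop here
        start += step
    return windows


# Toy per-column entropy values for three hypothetical loci.
toy_entropy = {
    "a": [0.9, 0.7, 0.8],
    "b": [0.1, 0.2, 0.3],
    "c": [0.5, 0.5, 0.4],
}
print(sort_loci_by_entropy(toy_entropy))  # ['b', 'c', 'a']

# Placeholder list of 500 already-sorted loci.
windows = sliding_windows(list(range(500)), window=200, overlap=50)
print([len(w) for w in windows])          # [200, 200, 200]
print(windows[1][0], windows[2][0])       # 150 300
```

Each window would then be concatenated into its own supermatrix and passed to tree reconstruction, giving one tree per window as in insets A 1–6.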