
Keeping it light: (re)analyzing community-wide datasets without major infrastructure.

Harriet Alexander, Lisa K Johnson, C Titus Brown

Abstract

DNA sequencing technology has revolutionized the field of biology, shifting it from a data-limited to a data-rich state. Central to the interpretation of sequencing data are the computational tools and approaches that convert raw data into biologically meaningful information. Both the tools and the generation of data are actively evolving, yet the practice of re-analyzing previously generated data with new tools is not commonplace. Re-analysis of existing data provides an affordable means of generating new information and will likely become more routine within biology, yet it necessitates a new set of considerations for best practices and resource development. Here, we discuss several practices that we believe to be broadly applicable when re-analyzing data, especially when done by small research groups.


Year:  2019        PMID: 30544142      PMCID: PMC6350038          DOI: 10.1093/gigascience/giy159

Source DB:  PubMed          Journal:  Gigascience        ISSN: 2047-217X            Impact factor:   6.524


Background

Advances in high-throughput, next-generation sequencing technologies have catapulted biology into a new computational era. In fields that leverage sequencing data, the primary limiting step in biological inquiry has increasingly shifted from data generation to data analysis. Concomitant with the increasing emphasis on the computational processing of these data is the advancement of the tools available for such analyses; new computational approaches are constantly being created, tested, and proven worthy of use. Yet, outside of computational lab groups, the practice of re-analyzing previously generated data with new tools and approaches is not commonplace. Such re-analysis has great utility and will become more routine within the life sciences, yet it necessitates a new set of considerations for best practices and resource development.

Our interest in the issues surrounding re-analysis was spurred by a large-scale sequencing project: the Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP), which generated 678 transcriptomes spanning 396 different strains of marine microbial eukaryotes [1]. This dataset is an invaluable resource within the oceanographic community [1,2], as it greatly expands the accessible genetic information base for marine protistan life. Moreover, the MMETSP has created a uniquely useful test dataset for computational biologists: it spans a broad evolutionary range of organisms, and all 678 transcriptomes were prepared and sequenced in a consistent way [2]. The sequencing project, completed in 2014, was originally assembled by the National Center for Genome Resources using a custom pipeline that employed the best computational tools available at the time [3,4].

Since the original MMETSP analysis, new tools and techniques for the assembly of de novo transcriptomes from RNA sequencing data have been described, and preexisting tools have been improved upon [5]. Moreover, new annotation tools and databases have become available. The transcriptome assembly project described in [6] was designed to create a streamlined and reproducible assembly framework that not only enables the re-analysis of these datasets but also facilitates easy and rapid re-analyses in the future. Secondary data products of sequencing, such as annotated assemblies, should be viewed as hypotheses generated from the underlying biology rather than as some immutable "truth"; as such, these secondary data products can continue to be improved as new tools are developed. For example, [7] described several limitations and challenges of current assembly technology and developed an improved Oyster River Protocol, which we could use to generate another, perhaps improved, MMETSP assembly. Ultimately, such iterations on the original raw data have the potential to improve the secondary data products, that is, the assembled transcriptomes and associated annotations relied upon by the broader community for biological inquiry. Through this process, we developed several practices that we believe to be broadly applicable when re-analyzing data, especially when done by small research groups.

Storage of Secondary Data Products

Funding agencies and academic journals now mandate the deposition of raw data into digital repositories (e.g., the National Center for Biotechnology Information's Sequence Read Archive (SRA) and Gene Expression Omnibus, and the European Nucleotide Archive). Thus, to date, the majority of the sequence data that have been generated and published are openly available online for reference and use in other studies. The sharing and availability of raw data from high-throughput sequencing studies have been largely managed through the development of archival services such as the SRA, which was established as part of the International Nucleotide Sequence Database Collaboration [8]. The SRA currently contains more than 1.8 × 10^16 bases of information, roughly 7 × 10^15 of which are open access.

While the SRA is a tremendous resource for biological inquiry, a major problem remains: raw sequencing data are not the most directly useful form of sequencing data. Rather, biologists rely heavily upon the computationally generated secondary products of sequencing reads (e.g., assembled transcriptomes or genomes, annotations, and associated count-based data). There is a dearth of these secondary products in central, publicly accessible databases such as the Transcriptome Shotgun Assembly (TSA) Sequence Database. In fact, a substantial proportion of these data products might be aptly categorized as "dark data," as they are largely undiscoverable and often archived independently in association with a publication or on private servers. Even more limiting, however, is that the guidelines for public databases such as the TSA specifically state that "assemblies from sequences not directly sequenced by the submitter" should not be uploaded, thereby excluding the possibility that reassembled datasets could be made available and directly linked to preexisting BioProjects, BioSamples, TSAs, and SRA entries (https://www.ncbi.nlm.nih.gov/genbank/tsa/).

From the perspective of our MMETSP re-analysis, we argue that the community needs more than a place to put the primary and secondary data products associated with a single publication. Ideally, the results of each re-analysis would be deposited in a discoverable location with a coherent archival procedure that is lab independent, easily searchable, and "forward discoverable" (i.e., when a new version of a data product is released, old versions point to the new version). Moreover, such an archival platform would ideally document the full provenance of the secondary data product. Movement toward this kind of data archival system is being made both through the development of alternative scientific data publication models (e.g., the Research Object [9]) and through the integration of metadata models (such as the Resource Description Framework) into existing scientific databases such as those of the European Bioinformatics Institute [10]. However, policies surrounding secondary data products will need to change.

Directly linking secondary data products to workflow provenance

In the absence of a community database specifically for the type of secondary product produced in this analysis, we opted to upload the assemblies, annotations, and counts to Zenodo (https://zenodo.org), a scientific data repository founded by CERN, which provided a DOI for the assemblies (https://doi.org/10.5281/zenodo.740440). The header information for each assembly was modified to contain this DOI. We then created a GitHub repository containing the scripts used to generate the assemblies and archived it with Zenodo, which generated a single DOI for the project (https://doi.org/10.5281/zenodo.594854). As such, the scripts used in the generation of the transcriptomes are directly linked, through a unique DOI, to the data products listed in the repository. Because the scripts are easily accessible, they can be tweaked to re-analyze the primary sequence data using different parameters or tools, and the new pipeline and output files can be archived again with Zenodo using the same approach. Zenodo will then automatically indicate the presence of other versions of a given repository, so that a user can be sure to use the newest version of an assembly. In the future, such an approach might be further complemented by the integration of a JSON Linked Data (JSON-LD) file detailing the metadata for the assembly product, such as the pipeline used and previous versions of the assemblies.
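To make this concrete, the sketch below illustrates one way such a linkage could be scripted. It is a minimal example rather than the pipeline used in [6]: the file names, the schema.org terms in the JSON-LD record, and the SRA placeholder link are assumptions made for illustration, while the two DOIs are those cited above.

```python
import json

# Hypothetical inputs: one MMETSP assembly and the DOIs described in the text.
ASSEMBLY_IN = "MMETSP0090.trinity.fasta"        # example file name (assumption)
ASSEMBLY_OUT = "MMETSP0090.trinity.doi.fasta"
ASSEMBLY_DOI = "https://doi.org/10.5281/zenodo.740440"   # assemblies archive
PIPELINE_DOI = "https://doi.org/10.5281/zenodo.594854"   # archived scripts repo

# 1. Append the assembly DOI to every FASTA header so provenance travels
#    with the sequences themselves.
with open(ASSEMBLY_IN) as fin, open(ASSEMBLY_OUT, "w") as fout:
    for line in fin:
        if line.startswith(">"):
            fout.write(line.rstrip("\n") + f" doi:{ASSEMBLY_DOI}\n")
        else:
            fout.write(line)

# 2. Write a small JSON-LD-style metadata record (schema.org Dataset terms,
#    used here only as an illustration) linking the assembly to the archived
#    pipeline and to the raw reads it was derived from.
metadata = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "MMETSP0090 re-assembled transcriptome",
    "identifier": ASSEMBLY_DOI,
    "isBasedOn": "https://www.ncbi.nlm.nih.gov/sra",  # placeholder for the SRA run
    "citation": PIPELINE_DOI,
    "version": "1.0",
}

with open("MMETSP0090.metadata.jsonld", "w") as out:
    json.dump(metadata, out, indent=2)
```

Embedding the DOI in every header means the provenance travels with any subset of sequences extracted later, and the metadata record can be deposited alongside the assembly in the same Zenodo archive.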

Conclusion

The GitHub-Zenodo framework presented here represents an efficient way for small research groups (e.g., a graduate student) to host and link both the code and the results from large-scale re-analysis projects in a publicly accessible way. The direct linking of protocols and metadata to output data products is paramount in the data-heavy future of scientific advancement.

We also identified several lingering issues surrounding large-scale re-analysis. Actual computation on these large datasets is non-trivial, as it requires access to facilities with sufficiently large, high-memory machines. Amazon Web Services instances and other "cloud" platforms, including XSEDE, provide flexible computing options and are broadly accessible; cloud-based systems, however, tend to be more expensive per computation hour than local resources. High-performance computing (HPC) resources at local institutions represent another potential source of computing capacity. However, HPC systems can be temperamental and may balk at larger, more node-consuming procedures; moreover, bioinformatics tools may be poorly optimized for HPC environments. Trinity, used in our pipeline, creates many small files for each run, and this repeatedly caused disk slowdowns on our HPC system. The re-analysis by [6] attempted to use both cloud and HPC resources but ultimately found that the HPC provided the most consistent, scalable automation for running hundreds of jobs in a cost-efficient manner. More generally, we see no one-size-fits-all solution for identifying and optimizing the scientific cyberinfrastructure requirements of projects that require significant scaling; such considerations must be made on a project-by-project basis given the resources available to each lab.

Beyond the optimization of computational resources, we see a significant opportunity for scientific advancement with high-throughput sequencing projects in making data products "forward discoverable," because this makes it possible to improve downstream work without significant upstream investment. In an ideal future, a researcher might be automatically notified when a dataset that she is actively working on is updated or changed. This presents many social and technical challenges that will need to be solved if we are to take full advantage of public datasets.

Abbreviations

DOI: Digital Object Identifier; HPC: high-performance computing; MMETSP: Marine Microbial Eukaryote Transcriptome Sequencing Project; SRA: Sequence Read Archive; TSA: Transcriptome Shotgun Assembly.

Competing interests

The authors declare that they have no competing interests.

Funding

Funding was provided by the Gordon and Betty Moore Foundation (award GBMF4551 to C.T.B.).

Author contributions

Conceptualized by H.A., L.K.J., and C.T.B. Written by H.A. and C.T.B. Edited and revised by H.A., C.T.B., and L.K.J. All authors read and approved the final manuscript.
References

1.  CAP3: A DNA sequence assembly program.

Authors:  X Huang; A Madan
Journal:  Genome Res       Date:  1999-09       Impact factor: 9.043

2.  ABySS: a parallel assembler for short read sequence data.

Authors:  Jared T Simpson; Kim Wong; Shaun D Jackman; Jacqueline E Schein; Steven J M Jones; Inanç Birol
Journal:  Genome Res       Date:  2009-02-27       Impact factor: 9.043

3.  Probing the evolution, ecology and physiology of marine protists using transcriptomics.

Authors:  David A Caron; Harriet Alexander; Andrew E Allen; John M Archibald; E Virginia Armbrust; Charles Bachy; Callum J Bell; Arvind Bharti; Sonya T Dyhrman; Stephanie M Guida; Karla B Heidelberg; Jonathan Z Kaye; Julia Metzner; Sarah R Smith; Alexandra Z Worden
Journal:  Nat Rev Microbiol       Date:  2016-11-21       Impact factor: 60.633

4.  The Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP): illuminating the functional diversity of eukaryotic life in the oceans through transcriptome sequencing.

Authors:  Patrick J Keeling; Fabien Burki; Heather M Wilcox; Bassem Allam; Eric E Allen; Linda A Amaral-Zettler; E Virginia Armbrust; John M Archibald; Arvind K Bharti; Callum J Bell; Bank Beszteri; Kay D Bidle; Connor T Cameron; Lisa Campbell; David A Caron; Rose Ann Cattolico; Jackie L Collier; Kathryn Coyne; Simon K Davy; Phillipe Deschamps; Sonya T Dyhrman; Bente Edvardsen; Ruth D Gates; Christopher J Gobler; Spencer J Greenwood; Stephanie M Guida; Jennifer L Jacobi; Kjetill S Jakobsen; Erick R James; Bethany Jenkins; Uwe John; Matthew D Johnson; Andrew R Juhl; Anja Kamp; Laura A Katz; Ronald Kiene; Alexander Kudryavtsev; Brian S Leander; Senjie Lin; Connie Lovejoy; Denis Lynn; Adrian Marchetti; George McManus; Aurora M Nedelcu; Susanne Menden-Deuer; Cristina Miceli; Thomas Mock; Marina Montresor; Mary Ann Moran; Shauna Murray; Govind Nadathur; Satoshi Nagai; Peter B Ngam; Brian Palenik; Jan Pawlowski; Giulio Petroni; Gwenael Piganeau; Matthew C Posewitz; Karin Rengefors; Giovanna Romano; Mary E Rumpho; Tatiana Rynearson; Kelly B Schilling; Declan C Schroeder; Alastair G B Simpson; Claudio H Slamovits; David R Smith; G Jason Smith; Sarah R Smith; Heidi M Sosik; Peter Stief; Edward Theriot; Scott N Twary; Pooja E Umale; Daniel Vaulot; Boris Wawrik; Glen L Wheeler; William H Wilson; Yan Xu; Adriana Zingone; Alexandra Z Worden
Journal:  PLoS Biol       Date:  2014-06-24       Impact factor: 8.029

5.  Re-assembly, quality evaluation, and annotation of 678 microbial eukaryotic reference transcriptomes.

Authors:  Lisa K Johnson; Harriet Alexander; C Titus Brown
Journal:  Gigascience       Date:  2019-04-01       Impact factor: 6.524

6.  Archiving next generation sequencing data.

Authors:  Martin Shumway; Guy Cochrane; Hideaki Sugawara
Journal:  Nucleic Acids Res       Date:  2009-12-03       Impact factor: 16.971

7.  Full-length transcriptome assembly from RNA-Seq data without a reference genome.

Authors:  Manfred G Grabherr; Brian J Haas; Moran Yassour; Joshua Z Levin; Dawn A Thompson; Ido Amit; Xian Adiconis; Lin Fan; Raktima Raychowdhury; Qiandong Zeng; Zehua Chen; Evan Mauceli; Nir Hacohen; Andreas Gnirke; Nicholas Rhind; Federica di Palma; Bruce W Birren; Chad Nusbaum; Kerstin Lindblad-Toh; Nir Friedman; Aviv Regev
Journal:  Nat Biotechnol       Date:  2011-05-15       Impact factor: 54.908

8.  The Oyster River Protocol: a multi-assembler and kmer approach for de novo transcriptome assembly.

Authors:  Matthew D MacManes
Journal:  PeerJ       Date:  2018-08-03       Impact factor: 2.984

