Future-Proofing Your Microbiology Resource Announcements Genome Assembly for Reproducibility and Clarity.

David A Baltrus^1, Christina A Cuomo^2, John J Dennehy^3,4, Julie C Dunning Hotopp^5,6,7, Julia A Maresca^8, Irene L G Newton^9, David A Rasko^5,6, Antonis Rokas^10,11,12, Simon Roux^13, Jason E Stajich^14.

Abstract

Descriptions of resources, like the genome assemblies reported in Microbiology Resource Announcements, are often frozen at their time of publication, yet they will need to be interpreted in the midst of continually evolving technologies. It is therefore important to ensure that researchers accessing published resources have access to all of the information required to repeat, interpret, and extend these original analyses. Here, we provide a set of suggestions to help make certain that published resources remain useful and repeatable for the foreseeable future.

Year: 2019 PMID: 31488541 PMCID: PMC6728651 DOI: 10.1128/MRA.00954-19

Source DB: PubMed Journal: Microbiol Resour Announc ISSN: 2576-098X

EDITORIAL

There are many ways to sequence and assemble a genome, and the number of available sequencing and assembly platforms seems to grow every week. Within sequencing platforms, library preparation, chemistry, and error profiles frequently change. Our primary goal as Microbiology Resource Announcements (MRA) editors is to ensure that a manuscript’s techniques and protocols are thoroughly documented so that readers can understand the strengths and weaknesses not only of a particular genome assembly but also of the underlying raw data. Given the importance of clear workflows and reproducible data in validating scientific results (1–3), we want to ensure that all of the relevant data contributing to an assembly are available to other researchers so that they can (i) reproduce the study’s results, (ii) extend the available data and incorporate them into other genome assemblies, or (iii) repurpose public data for use in alternative analyses. While many of these current best practices have been incorporated into the Instructions to Authors, in this opinion piece we aim to explain the themes behind certain instructions and give examples, with the goal of increasing reproducibility across groups and utility for future users. We also note that groups have proposed sets of standards for isolate genomes (4), 16S rRNA/18S rRNA/other amplicons (5), and single-cell amplified genomes (SAGs) and metagenome-assembled genomes (MAGs) (6), and that the recommendations from those proposals are highly relevant to and compatible with the points raised in this editorial.

Strain provenance and culture conditions.

Even before DNA extraction, it is important to document how isolates were obtained, cultured, and maintained. When and where was the strain isolated? What was the culture collection source? Has the strain been passaged since its isolation or acquisition from a culture collection? Was a single colony or plaque picked to amplify the culture? What kind of medium and growth conditions were used to grow the organism prior to genome extraction? Deviations across these steps may not matter for the quality of the resulting genome assemblies per se, but they can influence relevant measurements, such as estimates of the number of polymorphisms relative to reference isolates, and secondary patterns in which users might be interested, such as methylation status. Numerous studies have demonstrated that common reference strains can accumulate changes simply because they are maintained independently across laboratories (e.g., see reference 7). Assembly of hypervariable genomic regions can also be significantly affected by polymorphisms that arise during culturing of strains prior to genomic extraction (8, 9). Other data, such as geographical coordinates, can be valuable to epidemiologists studying pathogen spread or evolutionary biologists studying isolation by distance. The more data provided about the sample’s provenance, the more useful the resource will be to future researchers.
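The questions above can be treated as a checklist. The following Python sketch shows one way to hold them in a machine-readable record that could accompany a submission. The class name, field names, and helper functions are hypothetical and chosen only for illustration; they are not an MRA requirement or an existing metadata schema.

```python
import json
from dataclasses import dataclass, asdict, fields
from typing import Optional


@dataclass
class StrainProvenance:
    """Illustrative provenance record; all field names are assumptions."""
    strain: str
    isolation_date: Optional[str] = None      # when the strain was isolated
    isolation_site: Optional[str] = None      # where the strain was isolated
    latitude: Optional[float] = None          # geographical coordinates
    longitude: Optional[float] = None
    collection_source: Optional[str] = None   # culture collection, if any
    passaged_since_acquisition: Optional[bool] = None
    single_colony_picked: Optional[bool] = None
    growth_medium: Optional[str] = None
    growth_conditions: Optional[str] = None


def missing_fields(record):
    """Return the names of fields that have not been filled in (still None)."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


def to_json(record):
    """Serialize the record so it can be deposited alongside the raw data."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)
```

Calling `missing_fields` on a record before submission gives a quick list of provenance details that still need to be described in the manuscript.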

Sample preparation.

We often follow well-established protocols or use commercially available kits when extracting genomic DNA, and as such, it is commonplace and acceptable at MRA to provide references to specific methods or to state that procedures followed standard manufacturer protocols. However, these kits and protocols often include nonstandard or optional steps (e.g., addition of RNase); where possible, the inclusion of such steps should be documented in manuscripts because they can affect assembly quality. Likewise, it is valuable to report the type of kit used to create a sequencing library or prepare samples (Nextera/TruSeq, LSK108/RBK004, etc.), the flow cell model or chemistry (FLO-MIN106, R9.4 pore, P6C4, etc.), whether reads were multiplexed (and if so, what software was used to demultiplex or trim adaptors), and whether other DNA was sequenced in the same flow cell as part of the same run. Documenting these steps not only helps reconcile biases that may influence genome assembly but also enables researchers to gauge the potential for contaminating reads to be incorporated into the reported genomes. When contracting with a commercial center or core, it is important to identify that center or core and also to verify that it will provide the information required for publication. Such requirements currently include information about library construction methods, sequencing methods, sequencing platforms, and the steps taken to perform quality control on reads. The sequencing of viruses may require additional information depending on the type of genome (linear or circular) and nucleic acid species (RNA or DNA). Different sample preparation strategies have different error profiles. For example, converting RNA genomes into cDNA prior to PCR amplification and Sanger sequencing has different strengths and weaknesses than applying sequence-independent, single-primer amplification (SISPA) followed by Illumina sequencing. Specifying the sample preparation strategies used can help other researchers understand the limitations of the sequencing effort.

Sequencing technologies.

DNA sequencing technologies and assembly pipelines are rapidly changing. The best way to buffer against changes in genome assembly practices is to require that raw reads be deposited in a publicly available database, such as the NCBI Sequence Read Archive (SRA). Within reason, it is best if these data are posted in the least manipulated form so that researchers can derive information from them in whatever way they would like. For instance, removing microbiome reads as contamination from a eukaryotic genome sequencing project could prevent secondary analysis of the microbiome of that eukaryote. It is especially critical that the data underlying assemblies built from reads generated by fast-changing technologies, like Oxford Nanopore devices, be extensively documented and accessible. Because options for base calling from signals are rapidly changing and improving for this platform, deposition of fast5 files into the SRA is critical for enabling future users to independently call bases or search for nucleotide modifications in the raw signals. For the same reason, even if the assembly is based solely on the fastq reads produced by the MinKNOW pipeline, it is crucial to document the versions of the base callers used within the pipeline, along with all relevant parameters (there are now options for “fast” or “high-accuracy” base calls). Last, given the variety of options currently available within the MinKNOW software, it is critical to document which reads were passed on to the assembly and the methods and cutoffs applied for filtering (e.g., were the reads taken only from the “pass” folder, or were reads from the “fail” folder also included?).
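As an illustration of documenting filtering cutoffs, the Python sketch below filters reads from a FASTQ stream and returns a small report that records the exact thresholds applied. It is a generic example under stated assumptions: it does not reproduce the pass/fail criteria used by MinKNOW, it assumes four-line FASTQ records with Phred+33 quality encoding, it uses a simple arithmetic mean of quality scores, and the function names and report keys are hypothetical.

```python
import io
import json


def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return                      # end of stream
        if not header.strip():
            continue                    # skip stray blank lines
        seq = handle.readline().rstrip("\n")
        handle.readline()               # '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header[1:].strip(), seq, qual


def mean_phred(qual, offset=33):
    """Arithmetic mean of Phred scores (assumes Phred+33 encoding)."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)


def filter_reads(handle, min_mean_q, min_length):
    """Keep reads meeting both cutoffs; return kept names and a report."""
    kept = []
    total = 0
    for name, seq, qual in read_fastq(handle):
        total += 1
        if len(seq) >= min_length and mean_phred(qual) >= min_mean_q:
            kept.append(name)
    # The report records the cutoffs so they can be cited in the manuscript.
    report = {
        "min_mean_q": min_mean_q,
        "min_length": min_length,
        "quality_encoding": "phred+33",
        "reads_total": total,
        "reads_kept": len(kept),
    }
    return kept, report
```

Writing the returned report to a file with `json.dumps(report, indent=2)` keeps the cutoffs and read counts together with the filtered data.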

Towards fully reproducible genome assemblies.

The more documentation that authors provide within each manuscript, the greater the possibility that results can be completely reproduced across labs and over time. We advocate for openness in terms of methods, sharing of all data, and deposition of relevant scripts described in manuscripts, and there are several ways that authors can achieve full transparency in these areas. We suggest that relevant and informative log files produced by software pipelines, which include information helpful for interpreting assembly metrics and pipeline dependencies, be made available through a publicly accessible data deposition archive like figshare or GitHub (https://guides.github.com/activities/citable-code/), linked to Zenodo (https://zenodo.org/), to enable documentation with digital object identifiers (DOIs). For instance, program packages like Unicycler (10) and Shovill (https://github.com/tseemann/shovill) output verbose log files that include parameters and versions of programs used in these packages, as well as inherent information such as how many rounds of Pilon (11) polishing each assembly underwent. Ultimately, the best solution possible is to post relevant information that can be used for benchmarking and quality control in accessible digital notebooks using programs like RMarkdown (12) or Jupyter (13) so that they are linked to DOIs that can be referenced in the manuscript.
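One lightweight way to support this kind of transparency is to deposit a manifest alongside the log files. The Python sketch below builds a JSON manifest that lists each deposited file with its size and SHA-256 checksum, together with tool versions and parameters supplied by the author. The function names and manifest keys are hypothetical, and the tool versions are passed in by hand rather than parsed from any particular program's log, since log formats differ between tools and releases.

```python
import hashlib
import json
import platform
from pathlib import Path


def sha256sum(path, chunk_size=1 << 20):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(files, tool_versions, parameters):
    """Collect checksums, author-supplied tool versions, and parameters."""
    return {
        "python_version": platform.python_version(),
        "tool_versions": dict(tool_versions),   # e.g. copied from log files
        "parameters": dict(parameters),         # e.g. number of polishing rounds
        "files": [
            {
                "name": Path(p).name,
                "bytes": Path(p).stat().st_size,
                "sha256": sha256sum(p),
            }
            for p in files
        ],
    }


def write_manifest(manifest, out_path):
    """Write the manifest as indented JSON with sorted keys."""
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump(manifest, fh, indent=2, sort_keys=True)
```

A manifest like this can be archived with the logs and scripts so that later readers can confirm they are working from the same files that were used for the reported assembly.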

10 in total

1. Genome diversity of Pseudomonas aeruginosa PAO1 laboratory strains.

Authors: Jens Klockgether; Antje Munder; Jens Neugebauer; Colin F Davenport; Frauke Stanke; Karen D Larbig; Stephan Heeb; Ulrike Schöck; Thomas M Pohl; Lutz Wiehlmann; Burkhard Tümmler
Journal: J Bacteriol Date: 2009-12-18 Impact factor: 3.490

2. Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications.

Authors: Pelin Yilmaz; Renzo Kottmann; Dawn Field; Rob Knight; James R Cole; Linda Amaral-Zettler; Jack A Gilbert; Ilene Karsch-Mizrachi; Anjanette Johnston; Guy Cochrane; Robert Vaughan; Christopher Hunter; Joonhong Park; Norman Morrison; Philippe Rocca-Serra; Peter Sterk; Manimozhiyan Arumugam; Mark Bailey; Laura Baumgartner; Bruce W Birren; Martin J Blaser; Vivien Bonazzi; Tim Booth; Peer Bork; Frederic D Bushman; Pier Luigi Buttigieg; Patrick S G Chain; Emily Charlson; Elizabeth K Costello; Heather Huot-Creasy; Peter Dawyndt; Todd DeSantis; Noah Fierer; Jed A Fuhrman; Rachel E Gallery; Dirk Gevers; Richard A Gibbs; Inigo San Gil; Antonio Gonzalez; Jeffrey I Gordon; Robert Guralnick; Wolfgang Hankeln; Sarah Highlander; Philip Hugenholtz; Janet Jansson; Andrew L Kau; Scott T Kelley; Jerry Kennedy; Dan Knights; Omry Koren; Justin Kuczynski; Nikos Kyrpides; Robert Larsen; Christian L Lauber; Teresa Legg; Ruth E Ley; Catherine A Lozupone; Wolfgang Ludwig; Donna Lyons; Eamonn Maguire; Barbara A Methé; Folker Meyer; Brian Muegge; Sara Nakielny; Karen E Nelson; Diana Nemergut; Josh D Neufeld; Lindsay K Newbold; Anna E Oliver; Norman R Pace; Giriprakash Palanisamy; Jörg Peplies; Joseph Petrosino; Lita Proctor; Elmar Pruesse; Christian Quast; Jeroen Raes; Sujeevan Ratnasingham; Jacques Ravel; David A Relman; Susanna Assunta-Sansone; Patrick D Schloss; Lynn Schriml; Rohini Sinha; Michelle I Smith; Erica Sodergren; Aymé Spo; Jesse Stombaugh; James M Tiedje; Doyle V Ward; George M Weinstock; Doug Wendel; Owen White; Andrew Whiteley; Andreas Wilke; Jennifer R Wortman; Tanya Yatsunenko; Frank Oliver Glöckner
Journal: Nat Biotechnol Date: 2011-05 Impact factor: 54.908

3. The genome-sequenced variant of Campylobacter jejuni NCTC 11168 and the original clonal clinical isolate differ markedly in colonization, gene expression, and virulence-associated phenotypes.

Authors: Erin C Gaynor; Shaun Cawthraw; Georgina Manning; Joanna K MacKichan; Stanley Falkow; Diane G Newell
Journal: J Bacteriol Date: 2004-01 Impact factor: 3.490

4. The minimum information about a genome sequence (MIGS) specification.

Authors: Dawn Field; George Garrity; Tanya Gray; Norman Morrison; Jeremy Selengut; Peter Sterk; Tatiana Tatusova; Nicholas Thomson; Michael J Allen; Samuel V Angiuoli; Michael Ashburner; Nelson Axelrod; Sandra Baldauf; Stuart Ballard; Jeffrey Boore; Guy Cochrane; James Cole; Peter Dawyndt; Paul De Vos; Claude DePamphilis; Robert Edwards; Nadeem Faruque; Robert Feldman; Jack Gilbert; Paul Gilna; Frank Oliver Glöckner; Philip Goldstein; Robert Guralnick; Dan Haft; David Hancock; Henning Hermjakob; Christiane Hertz-Fowler; Phil Hugenholtz; Ian Joint; Leonid Kagan; Matthew Kane; Jessie Kennedy; George Kowalchuk; Renzo Kottmann; Eugene Kolker; Saul Kravitz; Nikos Kyrpides; Jim Leebens-Mack; Suzanna E Lewis; Kelvin Li; Allyson L Lister; Phillip Lord; Natalia Maltsev; Victor Markowitz; Jennifer Martiny; Barbara Methe; Ilene Mizrachi; Richard Moxon; Karen Nelson; Julian Parkhill; Lita Proctor; Owen White; Susanna-Assunta Sansone; Andrew Spiers; Robert Stevens; Paul Swift; Chris Taylor; Yoshio Tateno; Adrian Tett; Sarah Turner; David Ussery; Bob Vaughan; Naomi Ward; Trish Whetzel; Ingio San Gil; Gareth Wilson; Anil Wipat
Journal: Nat Biotechnol Date: 2008-05 Impact factor: 54.908

5. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement.

Authors: Bruce J Walker; Thomas Abeel; Terrance Shea; Margaret Priest; Amr Abouelliel; Sharadha Sakthikumar; Christina A Cuomo; Qiandong Zeng; Jennifer Wortman; Sarah K Young; Ashlee M Earl
Journal: PLoS One Date: 2014-11-19 Impact factor: 3.240

6. Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads.

Authors: Ryan R Wick; Louise M Judd; Claire L Gorrie; Kathryn E Holt
Journal: PLoS Comput Biol Date: 2017-06-08 Impact factor: 4.475

7. Measuring the reproducibility and quality of Hi-C data.

Authors: Galip Gürkan Yardımcı; Hakan Ozadam; Michael E G Sauria; Oana Ursu; Koon-Kiu Yan; Tao Yang; Abhijit Chakraborty; Arya Kaul; Bryan R Lajoie; Fan Song; Ye Zhan; Ferhat Ay; Mark Gerstein; Anshul Kundaje; Qunhua Li; James Taylor; Feng Yue; Job Dekker; William S Noble
Journal: Genome Biol Date: 2019-03-19 Impact factor: 13.583

8. Reproducible big data science: A case study in continuous FAIRness.

Authors: Ravi Madduri; Kyle Chard; Mike D'Arcy; Segun C Jung; Alexis Rodriguez; Dinanath Sulakhe; Eric Deutsch; Cory Funk; Ben Heavner; Matthew Richards; Paul Shannon; Gustavo Glusman; Nathan Price; Carl Kesselman; Ian Foster
Journal: PLoS One Date: 2019-04-11 Impact factor: 3.240

9. Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea.

Authors: Robert M Bowers; Nikos C Kyrpides; Ramunas Stepanauskas; Miranda Harmon-Smith; Devin Doud; T B K Reddy; Frederik Schulz; Jessica Jarett; Adam R Rivers; Emiley A Eloe-Fadrosh; Susannah G Tringe; Natalia N Ivanova; Alex Copeland; Alicia Clum; Eric D Becraft; Rex R Malmstrom; Bruce Birren; Mircea Podar; Peer Bork; George M Weinstock; George M Garrity; Jeremy A Dodsworth; Shibu Yooseph; Granger Sutton; Frank O Glöckner; Jack A Gilbert; William C Nelson; Steven J Hallam; Sean P Jungbluth; Thijs J G Ettema; Scott Tighe; Konstantinos T Konstantinidis; Wen-Tso Liu; Brett J Baker; Thomas Rattei; Jonathan A Eisen; Brian Hedlund; Katherine D McMahon; Noah Fierer; Rob Knight; Rob Finn; Guy Cochrane; Ilene Karsch-Mizrachi; Gene W Tyson; Christian Rinke; Alla Lapidus; Folker Meyer; Pelin Yilmaz; Donovan H Parks; A M Eren; Lynn Schriml; Jillian F Banfield; Philip Hugenholtz; Tanja Woyke
Journal: Nat Biotechnol Date: 2017-08-08 Impact factor: 54.908

10. Genomic plasticity enables phenotypic variation of Pseudomonas syringae pv. tomato DC3000.

Authors: Zhongmeng Bao; Paul V Stodghill; Christopher R Myers; Hanh Lam; Hai-Lei Wei; Suma Chakravarthy; Brian H Kvitko; Alan Collmer; Samuel W Cartinhour; Peter Schweitzer; Bryan Swingle
Journal: PLoS One Date: 2014-02-06 Impact factor: 3.240
