CLIMB (the Cloud Infrastructure for Microbial Bioinformatics): an online resource for the medical microbiology community.

Thomas R Connor¹, Nicholas J Loman², Simon Thompson³, Andy Smith², Joel Southgate¹, Radoslaw Poplawski²,³, Matthew J Bull¹, Emily Richardson², Matthew Ismail⁴, Simon Elwood-Thompson⁵, Christine Kitchen⁶, Martyn Guest⁶, Marius Bakke⁷, Samuel K Sheppard⁸, Mark J Pallen⁷.

Abstract

The increasing availability and decreasing cost of high-throughput sequencing have transformed academic medical microbiology, delivering an explosion in available genomes while also driving advances in bioinformatics. However, many microbiologists are unable to exploit the resulting large genomics datasets because they do not have access to relevant computational resources and to an appropriate bioinformatics infrastructure. Here, we present the Cloud Infrastructure for Microbial Bioinformatics (CLIMB) facility, a shared computing infrastructure that has been designed from the ground up to provide an environment where microbiologists can share and reuse methods and data.

Keywords: bioinformatics; cloud computing; infrastructure; metagenomics; population genomics; virtual laboratory

Year: 2016 PMID: 28785418 PMCID: PMC5537631 DOI: 10.1099/mgen.0.000086

Source DB: PubMed Journal: Microb Genom ISSN: 2057-5858

Data Summary

The paper describes a new, freely available public resource and therefore no data have been generated. The resource can be accessed at http://www.climb.ac.uk. Source code for software developed for the project can be found at http://github.com/MRC-climb/.

Impact Statement

Technological advances mean that genome sequencing is now relatively simple, quick and affordable. However, handling large genome datasets remains a significant challenge for many microbiologists, with substantial requirements for computational resources and expertise in data storage and analysis. This has led to fragmentary approaches to software development and data sharing that reduce the reproducibility of research and limit opportunities for bioinformatics training. Here, we describe a nationwide electronic infrastructure that has been designed to support the UK microbiology community, providing simple mechanisms for accessing large, shared, computational resources designed to meet the bioinformatics needs of microbiologists.

Introduction

Genome sequencing has transformed the scale of questions that can be addressed by biological researchers. Since the publication of the first bacterial genome sequence over 20 years ago (Fleischmann et al., 1995), there has been an explosion in the production of microbial genome sequence data, fuelled most recently by high-throughput sequencing (Loman & Pallen, 2015). This has placed microbiology at the forefront of data-driven science (Marx, 2013). As a consequence, there is now huge demand for physical and computational infrastructures to produce, analyse and share microbiological software and datasets and a requirement for trained bioinformaticians who can use genome data to address important questions in microbiology (Chang, 2015). It is worth stressing that microbial genomics, with its focus on the extensive variation seen in microbial genomes, brings challenges altogether different from the analysis of the larger but less variable genomes of humans, animals or plants. One solution to the data-deluge challenge is for every microbiology research group to establish its own dedicated bioinformatics hardware and software. However, this entails considerable upfront infrastructure costs and great inefficiencies of effort, while also encouraging a working-in-silos mentality, which makes it difficult to share data and pipelines and thus hard to replicate research. Cloud computing provides an alternative approach that facilitates the use of large genome datasets in biological research (Chang, 2015). The cloud-computing approach incorporates a shared online computational infrastructure, which spares the end user from worrying about technical issues such as the installation, maintenance and, even, the location of physical computing resources, together with other potentially troubling issues such as systems administration, data sharing, scalability, security and backup.
At the heart of cloud computing lies virtualization, an approach in which a physical computing set-up is re-purposed into a scalable system of multiple independent virtual machines, each of which can be pre-loaded with software, customized by end users and saved as snapshots for re-use by others on the infrastructure. Ideally, such an infrastructure also provides large-scale data storage and compute capacity on demand, reducing costs to the public purse by optimizing utilization of hardware and avoiding resources sitting idle while still capitalizing on the economies of scale. The potential for cloud computing in biological research has been recognized by funding organizations and has seen the development of nationwide resources such as iPlant (Goff et al., 2011) (now CyVerse), NECTAR (http://nectar.org.au) and XSEDE (Towns et al.) that provide researchers with access to large cloud infrastructures. Here, we describe a new facility, designed specifically for microbiologists, to provide a computational and bioinformatics infrastructure for the UK’s academic medical microbiology community, facilitating training, access to hardware and sharing of data and software.

Resource overview

The Cloud Infrastructure for Microbial Bioinformatics (CLIMB) facility is intended as a general solution to pressing issues in big-data microbiology. The resource comprises a core physical infrastructure (Fig. 1), combined with three key features making the cloud suitable for microbiologists.

Fig. 1.

Overview of the system. (a) The sites where the computational hardware is based. (b) High-level overview of the system and how the different software components connect to one another. (c) Compute hardware present at each of the four sites. (d) Hardware comprising the Ceph storage system at each site. (e) Type and role of network hardware used at each site.

First, CLIMB provides a single environment, complete with pipelines and datasets that are co-located with computing resource. This makes the process of accessing published packages and sequence data simpler and faster, improving reuse of software and data. Second, CLIMB has been designed with training in mind. Rather than having trainees configure personal laptops or face challenges in gaining access to shared high-performance computing resources, we provide training images on virtual machines that have all the necessary software installed and we provide each trainee with his or her own personal server to continue learning after the workshop concludes. Third, by bringing together expert bioinformaticians, educators and biologists in a unified system, CLIMB provides an environment where researchers across institutions can share data and code, permitting complex projects iteratively to be remixed, reproduced, updated and improved.

The CLIMB core infrastructure is a cloud system running the free open-source cloud operating system OpenStack (http://www.openstack.org). This system allows us to run over 1000 virtual machines at any one time, each preconfigured with a standard user configuration. Across the cloud, we have access to over 78 terabytes of RAM. Specialist users can request access to one of our 12 high-memory virtual machines each with 3 terabytes of RAM for especially large, complex analyses (Fig. 1). The system is spread over four sites to enhance its resilience and is supported by local scratch storage of 500 terabytes per site employing IBM’s Spectrum Scale storage (formerly GPFS).
The system is underpinned by a large shared object storage system that provides approximately 2.5 petabytes of data storage, which may be replicated between sites. This storage system, running the free open-source Ceph system (http://www.redhat.com/en/technologies/storage/ceph), provides a place to store and share very large microbial datasets – for comparison, the bacterial component of the European Nucleotide Archive is currently around 400 terabytes in size. The CLIMB system can be coupled to sequencing services; for example, sequence data generated by the MicrobesNG service (http://www.microbesng.uk) have been made available within the CLIMB system.

Resource performance

To assess the performance of the CLIMB system in comparison to traditional high-performance computing (HPC) systems and similar cloud systems, we undertook a small-scale benchmarking exercise (Fig. 2). Compared to the Raven HPC resource at Cardiff (running Intel processors a generation behind those in CLIMB), performance on CLIMB was generally good, offering a relative increase in performance of up to 38 % on tasks commonly undertaken by microbial bioinformaticians. The CLIMB system also compares well to cloud servers from major providers, offering better aggregate performance than Microsoft Azure A8 and Google N1S2 virtual machines. The results also reveal a number of features that may be relevant to where a user chooses to run their analysis. CLIMB performs worse than Raven when running BEAST (Drummond & Rambaut, 2007), and provides a limited increase in performance for the package nhmmer, suggesting that while it is possible to run these analyses on CLIMB, other resources – such as local HPC facilities – might be more appropriate as these are optimized for computationally intensive workloads. Conversely, the largest performance increases are observed for Prokka (Seemann, 2014), Snippy and PhyML (Guindon et al.), which encompass some of the most commonly used analyses undertaken in microbial genomics. It is also interesting to note that both commercial clouds offer excellent performance relative to Raven for two workloads: MUSCLE (Edgar, 2004) and PhyML. The source of this performance is difficult to pinpoint, but it is possible that these workloads may be more similar to the sort of workloads that these cloud services have been designed to handle. On the basis of the performance results more generally, however, CLIMB is likely to offer a number of performance benefits over local resources for many microbial bioinformatics workloads.

Fig. 2.

Relative performance of virtual machines running on cloud services, compared to the Cardiff University HPC system, Raven. (a) Values for each package are the mean of the wall time taken for 10 runs performed on Raven, divided by the mean wall time of 40 runs undertaken on the virtual machine on the named service. Values greater than 1 are faster than Raven, values less than 1 are slower. (b) The raw wall time values for the named software on each of the systems. The data generated as part of the benchmarking exercise is included in Supplementary File 1.
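
The ratio described in panel (a) can be sketched in a few lines of Python. This is only an illustration, assuming wall times are recorded in seconds; the function name `relative_performance` and the sample timings are hypothetical and are not taken from Supplementary File 1.

```python
from statistics import mean


def relative_performance(raven_wall_times, vm_wall_times):
    """Mean Raven wall time divided by mean virtual-machine wall time.

    A value above 1 means the virtual machine was faster than Raven;
    a value below 1 means it was slower.
    """
    if not raven_wall_times or not vm_wall_times:
        raise ValueError("both systems need at least one timed run")
    return mean(raven_wall_times) / mean(vm_wall_times)


# Made-up wall times in seconds (illustration only, not the paper's data).
raven_runs = [120.0, 118.0, 122.0]
vm_runs = [100.0, 98.0, 102.0]

ratio = relative_performance(raven_runs, vm_runs)
print(f"relative performance: {ratio:.2f}")
```

With these invented timings the virtual machine comes out at 1.2 times the speed of Raven, which on the scale of Fig. 2(a) would correspond to a 20 % increase in performance.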

Providing a single environment for training, data and software sharing

The CLIMB system is accessed through the Internet, via a simple set of web interfaces enabling the sharing of software on virtual machines (Fig. 3). Users request a virtual machine via a web form. Each virtual machine makes available the microbial version of the Genomics Virtual Laboratory (Fig. 3) (Afgan et al., 2015). This includes a set of web tools [Galaxy (Pond et al.), Jupyter Notebook (Perez, 2015) and RStudio, with an optional PacBio SMRT portal], as well as a set of pipelines and tools that can be accessed via the command line. This standardized environment provides a common platform for teaching, while the base image provides a versatile platform that can be customized to meet the needs of individual researchers. To provide user support and documentation, a CLIMB discussion forum (http://discourse.climb.ac.uk) is available. The forum contains a number of tutorials demonstrating functionality. Initial tutorials cover topics including shotgun metagenomics, genome-wide association studies, ancient DNA, the Nullarbor public health microbiology pipeline and setting up a simple BLAST server with SequenceServer (Priyam et al.).

Fig. 3.

CLIMB virtual machine launch workflow. A user, on logging in to the Bryn launcher interface, is presented with a list of the virtual machines they are running and is able to stop, reboot or terminate them (a). Users launch a new Genomics Virtual Laboratory (GVL) server with a minimal interface, specifying a name, the server ‘flavour’ (user or group) and an access password (b). On booting, the user accesses a webserver running on the GVL instance, which gives access to various services that are started automatically (c). The GVL provides access to Cloudman, a Galaxy server, an administration interface, Jupyter notebook and RStudio (d, top to bottom).

System access

Users register at our website, using their UK academic credentials (http://bryn.climb.ac.uk/). Upon registration, users have two modes of access. The first is to launch an instance running a preconfigured virtual machine, with a set of predefined pipelines or tools, which includes the Genomics Virtual Laboratory. The second is aimed at expert bioinformaticians and developers who may want to develop their own virtual machines from a base image. To enable this, we also allow users to access the system via a dashboard, similar to that provided by Amazon Web Services, where users can specify the size and type of virtual machine that they would like, with the system then provisioning it on demand. To share the resource fairly, users will have individual quotas that can be increased on request. Irrespective of quota size, access to the system is free of charge to UK academic users.
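
The idea of provisioning on demand within a per-user quota can be illustrated with a short Python sketch. It is not CLIMB's implementation: the quota dimensions (instances, virtual CPUs and RAM), the class names and all the numbers are assumptions chosen only to show how a request might be checked against a quota before a virtual machine is launched.

```python
from dataclasses import dataclass


@dataclass
class Quota:
    """Hypothetical per-user limits; the fields are illustrative assumptions."""
    max_instances: int
    max_vcpus: int
    max_ram_gb: int


@dataclass
class Usage:
    """Resources a user already has allocated."""
    instances: int = 0
    vcpus: int = 0
    ram_gb: int = 0


def can_launch(quota: Quota, usage: Usage, vcpus: int, ram_gb: int) -> bool:
    """Return True if one more virtual machine of this size fits the quota."""
    return (
        usage.instances + 1 <= quota.max_instances
        and usage.vcpus + vcpus <= quota.max_vcpus
        and usage.ram_gb + ram_gb <= quota.max_ram_gb
    )


# Invented numbers: a user with one 8-vCPU, 32 GB machine already running.
quota = Quota(max_instances=2, max_vcpus=16, max_ram_gb=64)
usage = Usage(instances=1, vcpus=8, ram_gb=32)

print(can_launch(quota, usage, vcpus=8, ram_gb=32))   # fits exactly
print(can_launch(quota, usage, vcpus=16, ram_gb=32))  # exceeds the vCPU limit
```

In this sketch, a request that exceeds the quota would be the point at which a user asks for an increase, as described above.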

Future directions and wider impact

CLIMB is likely to accumulate a library of images and datasets that will be of wide use to researchers within the UK and elsewhere. Currently it does not provide a simple system to export instances or data. However, it is already possible to export images and data from CLIMB, and we plan to develop systems to enable the rapid, simple sharing of virtual machines and data with other clouds (such as Amazon Web Services). While the computational resources are principally for UK researchers (and international collaborators working with UK-based Principal Investigators), by making virtual machines and data available to other services we provide a key mechanism for the international community to benefit from the resource. Nor is CLIMB functioning in a vacuum; the project is already engaged with academic cloud providers elsewhere in the world (such as NECTAR), and we are actively working with the European Bioinformatics Institute to examine ways in which virtual machines and complete datasets can be better shared and reused – another key output that is likely to be of considerable value to the international community. One of the challenges in this area is the fact that virtual machine images are monolithic entities that may package operating system software, installed packages, scripts and datasets into a single unit. However, recent advances in virtualization enable sharing of software and dependencies via lightweight ‘containers’ that are considerably smaller than virtual machines. Platforms such as Docker (https://www.docker.com/) and rkt (https://coreos.com/rkt/) provide new approaches that may be suitable for sharing complex bioinformatics pipelines without the need to package the operating system software. Such approaches may make it easier to integrate pipelines from multiple sources on a single virtual machine. As part of the development of CLIMB we hope to be able to support containers soon. 
Finally, CLIMB has been designed to allow the addition of hardware and other sites. It is our hope that as the system is used, it will also be built upon – expanding both its computational capacity and the number of sites involved – to extend the community that it is able to serve, adding in international sites and capacity for supporting researchers examining questions in other areas of data-intensive biology.

Conclusion

CLIMB is probably the largest computer system dedicated to microbiology in the world. The system has already been used to address microbiological questions featuring bacteria (Connor et al.) and viruses (Quick et al., 2016). CLIMB has been designed from the ground up to meet the needs of microbiologists, providing a core infrastructure that is presented in a simple, intuitive way. Individual elements of the system – such as the large shared storage and extremely large memory systems – provide capabilities that are usually not available locally to microbiologists within most UK institutions, while the shared nature of the system provides new opportunities for data and software sharing that have the potential to enhance research reproducibility in data-intensive biology. Cloud computing clearly has the potential to revolutionize how biological data are shared and analysed. We hope that the microbiology research community will capitalize on these new opportunities by exploiting the CLIMB facility.

References (10 of 13 shown)

1. MUSCLE: multiple sequence alignment with high accuracy and high throughput.

Authors: Robert C Edgar
Journal: Nucleic Acids Res Date: 2004-03-19

2. Core services: Reward bioinformaticians.

Authors: Jeffrey Chang
Journal: Nature Date: 2015-04-09

3. Biology: The big challenges of big data.

Authors: Vivien Marx
Journal: Nature Date: 2013-06-13

4. Prokka: rapid prokaryotic genome annotation.

Authors: Torsten Seemann
Journal: Bioinformatics Date: 2014-03-18

5. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd.

Authors: R D Fleischmann; M D Adams; O White; R A Clayton; E F Kirkness; A R Kerlavage; C J Bult; J F Tomb; B A Dougherty; J M Merrick
Journal: Science Date: 1995-07-28

6. Twenty years of bacterial genome sequencing.

Authors: Nicholas J Loman; Mark J Pallen
Journal: Nat Rev Microbiol Date: 2015-11-09

7. The iPlant Collaborative: Cyberinfrastructure for Plant Biology.

Authors: Stephen A Goff; Matthew Vaughn; Sheldon McKay; Eric Lyons; Ann E Stapleton; Damian Gessler; Naim Matasci; Liya Wang; Matthew Hanlon; Andrew Lenards; Andy Muir; Nirav Merchant; Sonya Lowry; Stephen Mock; Matthew Helmke; Adam Kubach; Martha Narro; Nicole Hopkins; David Micklos; Uwe Hilgert; Michael Gonzales; Chris Jordan; Edwin Skidmore; Rion Dooley; John Cazes; Robert McLay; Zhenyuan Lu; Shiran Pasternak; Lars Koesterke; William H Piel; Ruth Grene; Christos Noutsos; Karla Gendler; Xin Feng; Chunlao Tang; Monica Lent; Seung-Jin Kim; Kristian Kvilekval; B S Manjunath; Val Tannen; Alexandros Stamatakis; Michael Sanderson; Stephen M Welch; Karen A Cranston; Pamela Soltis; Doug Soltis; Brian O'Meara; Cecile Ane; Tom Brutnell; Daniel J Kleibenstein; Jeffery W White; James Leebens-Mack; Michael J Donoghue; Edgar P Spalding; Todd J Vision; Christopher R Myers; David Lowenthal; Brian J Enquist; Brad Boyle; Ali Akoglu; Greg Andrews; Sudha Ram; Doreen Ware; Lincoln Stein; Dan Stanzione
Journal: Front Plant Sci Date: 2011-07-25

8. BEAST: Bayesian evolutionary analysis by sampling trees.

Authors: Alexei J Drummond; Andrew Rambaut
Journal: BMC Evol Biol Date: 2007-11-08

9. Genomics Virtual Laboratory: A Practical Bioinformatics Workbench for the Cloud.

Authors: Enis Afgan; Clare Sloggett; Nuwan Goonasekera; Igor Makunin; Derek Benson; Mark Crowe; Simon Gladman; Yousef Kowsar; Michael Pheasant; Ron Horst; Andrew Lonie
Journal: PLoS One Date: 2015-10-26

10. Real-time, portable genome sequencing for Ebola surveillance.

Authors: Joshua Quick; Nicholas J Loman; Sophie Duraffour; Jared T Simpson; Ettore Severi; Lauren Cowley; Joseph Akoi Bore; Raymond Koundouno; Gytis Dudas; Amy Mikhail; Nobila Ouédraogo; Babak Afrough; Amadou Bah; Jonathan Hj Baum; Beate Becker-Ziaja; Jan-Peter Boettcher; Mar Cabeza-Cabrerizo; Alvaro Camino-Sanchez; Lisa L Carter; Juiliane Doerrbecker; Theresa Enkirch; Isabel Graciela García Dorival; Nicole Hetzelt; Julia Hinzmann; Tobias Holm; Liana Eleni Kafetzopoulou; Michel Koropogui; Abigail Kosgey; Eeva Kuisma; Christopher H Logue; Antonio Mazzarelli; Sarah Meisel; Marc Mertens; Janine Michel; Didier Ngabo; Katja Nitzsche; Elisa Pallash; Livia Victoria Patrono; Jasmine Portmann; Johanna Gabriella Repits; Natasha Yasmin Rickett; Andrea Sachse; Katrin Singethan; Inês Vitoriano; Rahel L Yemanaberhan; Elsa G Zekeng; Racine Trina; Alexander Bello; Amadou Alpha Sall; Ousmane Faye; Oumar Faye; N'Faly Magassouba; Cecelia V Williams; Victoria Amburgey; Linda Winona; Emily Davis; Jon Gerlach; Franck Washington; Vanessa Monteil; Marine Jourdain; Marion Bererd; Alimou Camara; Hermann Somlare; Abdoulaye Camara; Marianne Gerard; Guillaume Bado; Bernard Baillet; Déborah Delaune; Koumpingnin Yacouba Nebie; Abdoulaye Diarra; Yacouba Savane; Raymond Bernard Pallawo; Giovanna Jaramillo Gutierrez; Natacha Milhano; Isabelle Roger; Christopher J Williams; Facinet Yattara; Kuiama Lewandowski; Jamie Taylor; Philip Rachwal; Daniel Turner; Georgios Pollakis; Julian A Hiscox; David A Matthews; Matthew K O'Shea; Andrew McD Johnston; Duncan Wilson; Emma Hutley; Erasmus Smit; Antonino Di Caro; Roman Woelfel; Kilian Stoecker; Erna Fleischmann; Martin Gabriel; Simon A Weller; Lamine Koivogui; Boubacar Diallo; Sakoba Keita; Andrew Rambaut; Pierre Formenty; Stephan Gunther; Miles W Carroll
Journal: Nature Date: 2016-02-03
