RTK: efficient rarefaction analysis of large datasets.

Paul Saary¹, Kristoffer Forslund¹, Peer Bork¹,²,³,⁴, Falk Hildebrand¹.

Abstract

MOTIVATION: The rapidly expanding microbiomics field is generating increasingly large datasets that characterize the microbiota in diverse environments. Although classical numerical ecology methods provide a robust statistical framework for their analysis, currently available software is inadequate for large datasets and for some computationally intensive tasks, such as rarefaction and associated analyses.
RESULTS: Here we present a software package for rarefaction analysis of large count matrices, as well as estimation and visualization of diversity, richness and evenness. Our software is designed for ease of use and operates at least 7x faster than existing solutions while requiring 10x less memory.
AVAILABILITY AND IMPLEMENTATION: C++ and R source code (GPL v.2) as well as binaries are available from https://github.com/hildebra/Rarefaction and from CRAN (https://cran.r-project.org/).
CONTACT: bork@embl.de or falk.hildebrand@embl.de.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2017 PMID: 28398468 PMCID: PMC5870771 DOI: 10.1093/bioinformatics/btx206

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

A common task in ecology and microbiomic data analysis is to count and compare the occurrences of different organisms across samples, resulting in taxa count matrices. Accounting for biases due to uneven sampling depth between sites or time points is a major analytical challenge. Rarefaction is a data normalization technique designed to cope with such unequal sampling effort: every sample is subsampled to the same rarefaction depth, thus simulating equal sampling effort. This allows the calculation of comparable diversity estimators and enables collector's curves, which are used to estimate total expected diversity. Although several rarefaction implementations exist in microbiomics (e.g. vegan (Oksanen et al.), QIIME (Caporaso et al., 2010), mothur (Schloss et al., 2009)), these often work poorly for very large datasets because of memory requirements, processing limitations and program design (see Supplementary Material), so that custom parsing scripts and special hardware are needed to perform rarefactions. Here, we present the rarefaction toolkit (RTK), which can perform fast rarefaction on very large datasets comprising millions of features even on a laptop computer, computes estimates of ecological diversity and provides appropriate visualizations of the results.
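To illustrate why subsampling to a common depth makes richness comparable, the following minimal sketch computes the classical analytic expectation of richness in a random subsample drawn without replacement (Hurlbert's formula). It is not RTK's method, which draws repeated random subsamples; the function name `expected_richness` and the toy counts are assumptions made purely for illustration.

```python
from math import comb

def expected_richness(counts, depth):
    """Expected number of distinct features in a random subsample of
    `depth` individuals drawn without replacement from `counts`."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    if not 0 <= depth <= total:
        raise ValueError("depth must lie between 0 and the sample total")
    denom = comb(total, depth)
    # A feature with count c is missed entirely with probability
    # C(total - c, depth) / C(total, depth).
    return sum(1 - comb(total - c, depth) / denom for c in counts)

# Two hypothetical samples of the same community at different depths.
shallow = [50, 30, 15, 4, 1]          # 100 counts in total
deep = [500, 300, 150, 40, 10]        # 1000 counts in total
depth = min(sum(shallow), sum(deep))  # common rarefaction depth
print(expected_richness(shallow, depth), expected_richness(deep, depth))
```

At the common depth, the deeper sample no longer gains richness simply because more individuals were counted, which is the bias that rarefaction is meant to remove.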

2 Implementation

RTK is implemented in C++11 with an optional R interface and has two principal run modes, ‘memory’ and ‘swap’, the latter using temporary files to reduce the memory footprint. Using asynchronous thread management, RTK can make use of modern multi-core processors. The algorithm transforms the input counts into a vector of feature occurrences and shuffles it using the Mersenne Twister (Matsumoto and Nishimura, 1998) random number generator. A subset of this shuffled vector, of length equal to the desired rarefaction depth, is used to construct the rarefied sample and to estimate diversity. Multiple rarefactions are calculated by reusing unused parts of the shuffled vector, guaranteeing unique sampling without wasting computational resources. From the rarefied matrix, evenness, three diversity and five richness estimators are computed (see Supplementary Text). The R package ‘RTK’ provides an interface to the C++ RTK and visualizations of its results, using the Rcpp package (Eddelbuettel and François, 2011).
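The shuffle-and-slice idea described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not RTK's C++ code: the names `rarefy_repeated`, `richness` and `shannon` are hypothetical, reshuffling once the unused part of the vector is too short is an assumption, and observed richness and the Shannon index merely stand in for the estimators listed in the Supplementary Text.

```python
import math
import random
from collections import Counter

def rarefy_repeated(counts, depth, repeats, seed=42):
    """Return `repeats` rarefied samples (Counters), each holding `depth` counts."""
    # One vector entry per observed occurrence of a feature.
    occurrences = [feature for feature, n in counts.items() for _ in range(n)]
    if not 0 < depth <= len(occurrences):
        raise ValueError("depth must be positive and at most the sample total")
    rng = random.Random(seed)  # CPython's generator is a Mersenne Twister
    rarefied = []
    start = len(occurrences)   # forces a shuffle before the first draw
    for _ in range(repeats):
        if start + depth > len(occurrences):
            # Assumption: reshuffle once the unused part is too short.
            rng.shuffle(occurrences)
            start = 0
        # Consecutive slices of one shuffle never share an occurrence.
        rarefied.append(Counter(occurrences[start:start + depth]))
        start += depth
    return rarefied

def richness(sample):
    """Number of features observed at least once."""
    return sum(1 for n in sample.values() if n > 0)

def shannon(sample):
    """Shannon diversity index using the natural logarithm."""
    total = sum(sample.values())
    return -sum(n / total * math.log(n / total) for n in sample.values() if n > 0)

sample = {"otu_a": 6, "otu_b": 3, "otu_c": 1}  # toy counts, not real data
for r in rarefy_repeated(sample, depth=4, repeats=2):
    print(dict(r), richness(r), round(shannon(r), 3))
```

Because one shuffle serves several slices, the cost of shuffling is shared across repeats rather than paid for each one.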

3 Comparison to existing software

We used three tests to compare the performance and memory consumption of RTK with those of vegan 2.4, mothur 1.38.1 and QIIME 1.9.1 on a Linux cluster with 1 TB RAM, using a single core. Other rarefaction programs were considered, but were not suited for high-throughput analysis (see Supplementary Material). Four published metagenomic datasets of different sizes were used. Two were human gut 16S OTU count tables, termed Yatsunenko (Yatsunenko et al., 2012) and HMP (Huttenhower et al., 2012), both processed with the LotuS pipeline (Hildebrand et al., 2014). We also reanalyzed two metagenomic datasets, termed Guinea pig gut (Hildebrand et al., 2012) and Tara from Tara Oceans (Sunagawa et al., 2015), using publicly available gene count matrices (see Supplementary Table S1 for statistics). First, we computed the mean ecosystem richness over 20 rarefactions. For all dataset sizes, RTK outperformed the other programs with regard to speed and memory requirement (Fig. 1, Supplementary Table S2). To rarefy the Tara gene matrix, all other programs required prohibitively large amounts of memory (>256 GB), while RTK in swap mode required only a fraction of this (<10 GB) and also provided a 5-fold increase in speed (Table 1, Fig. 1). Second, we tested performance when the number of repeated rarefactions to the same depth varied (Supplementary Fig. S2). vegan, mothur and QIIME showed a linear increase in runtime with increasing repeats, whereas RTK runtime remained almost constant. Last, we tested multicore performance (only available in RTK), which reduced RTK runtime by a factor of three using 8 cores (see Supplementary Fig. S3).
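The first test (mean richness over 20 rarefactions, at a depth of 95% of the lowest sample count as stated in the Fig. 1 caption) can be written down as a small, hedged sketch. The function name `mean_richness`, its parameters and the toy table are assumptions for illustration. It uses one independent draw per repeat, so its cost grows with the number of repeats, and it is not the code of any of the benchmarked programs.

```python
import random

def mean_richness(table, repeats=20, fraction_percent=95, seed=1):
    """`table` maps sample name -> {feature: count}.
    Returns (depth, {sample name: mean observed richness})."""
    totals = {name: sum(counts.values()) for name, counts in table.items()}
    # Common depth: a fixed percentage of the smallest sample total.
    depth = min(totals.values()) * fraction_percent // 100
    rng = random.Random(seed)
    result = {}
    for name, counts in table.items():
        occurrences = [f for f, n in counts.items() for _ in range(n)]
        # One independent subsample without replacement per repeat.
        observed = [len(set(rng.sample(occurrences, depth))) for _ in range(repeats)]
        result[name] = sum(observed) / repeats
    return depth, result

toy_table = {  # hypothetical counts, not taken from the benchmark datasets
    "s1": {"a": 10, "b": 10},
    "s2": {"a": 30, "b": 5, "c": 5},
}
depth, richness_by_sample = mean_richness(toy_table)
print(depth, richness_by_sample)
```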

Fig. 1

(A) Speed and memory requirements of different rarefaction programs. Four datasets were rarefied 20 times to 95% of the lowest sample count. The time and memory consumption of our implementation are consistently below those observed using mothur, vegan or QIIME for the same purpose. vegan failed to process the Tara table (see Supplementary Material). (B) Plotting of collector's curves as well as of rarefaction curves is implemented in the R package. (Color version of this figure is available at Bioinformatics online.)

Table 1.

Time and memory consumption when rarefying the Tara gene abundance matrix five times to 2.3 M counts per sample, from 139 M counts on average per sample

Software (mode)	Runtime (h:min)	Max. memory	Success
RTK (memory)	3:50	140 GB	successful
RTK (swap)	3:30	8.5 GB	successful
R RTK (memory)	3:30	140 GB	successful
R RTK (swap)	3:05	8.7 GB	successful
QIIME	21:50	339 GB	successful
vegan	–	387 GB	failed
mothur	17:30	262 GB	successful

Note: While RTK could return the rarefied data, mothur only reports diversity.
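The ratios implied by Table 1 can be checked with a few lines of arithmetic. The sketch below copies runtimes and peak memory from the table and compares QIIME and mothur against RTK in swap mode. The helper `to_minutes` and the dictionary layout are assumptions for illustration, and vegan is omitted because it did not finish.

```python
def to_minutes(h_mm):
    """Convert an 'h:mm' runtime string to minutes."""
    hours, minutes = h_mm.split(":")
    return int(hours) * 60 + int(minutes)

# (runtime, max. memory in the table's unit), copied from Table 1.
table1 = {
    "RTK (swap)": ("3:30", 8.5),
    "QIIME": ("21:50", 339),
    "mothur": ("17:30", 262),
}

ref_time, ref_mem = table1["RTK (swap)"]
for tool in ("QIIME", "mothur"):
    runtime, memory = table1[tool]
    speedup = to_minutes(runtime) / to_minutes(ref_time)
    print(f"{tool}: {speedup:.1f}x the runtime and "
          f"{memory / ref_mem:.1f}x the memory of RTK (swap)")
```

With these values, mothur's runtime is 5.0 times that of RTK in swap mode, consistent with the 5-fold figure quoted in the text.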

4 Discussion

Rarefaction is a standard data normalization technique in numerical ecology and is also useful for avoiding false-positive detection of rare features when comparing unequally sampled data (Supplementary Fig. S4, Supplementary Text). The rapid expansion in the size of microbiomic datasets makes rarefaction difficult to employ because of speed and memory limitations. Here we present a software solution that is well suited for state-of-the-art microbiomics applications. It provides diversity estimators together with related visualizations and statistics, is easy and free to use, and scales better than presently available tools.

References (7 in total)

1. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities.

Authors: Patrick D Schloss; Sarah L Westcott; Thomas Ryabin; Justine R Hall; Martin Hartmann; Emily B Hollister; Ryan A Lesniewski; Brian B Oakley; Donovan H Parks; Courtney J Robinson; Jason W Sahl; Blaz Stres; Gerhard G Thallinger; David J Van Horn; Carolyn F Weber
Journal: Appl Environ Microbiol Date: 2009-10-02 Impact factor: 4.792

2. Ocean plankton. Structure and function of the global ocean microbiome.

Authors: Shinichi Sunagawa; Luis Pedro Coelho; Samuel Chaffron; Jens Roat Kultima; Karine Labadie; Guillem Salazar; Bardya Djahanschiri; Georg Zeller; Daniel R Mende; Adriana Alberti; Francisco M Cornejo-Castillo; Paul I Costea; Corinne Cruaud; Francesco d'Ovidio; Stefan Engelen; Isabel Ferrera; Josep M Gasol; Lionel Guidi; Falk Hildebrand; Florian Kokoszka; Cyrille Lepoivre; Gipsi Lima-Mendez; Julie Poulain; Bonnie T Poulos; Marta Royo-Llonch; Hugo Sarmento; Sara Vieira-Silva; Céline Dimier; Marc Picheral; Sarah Searson; Stefanie Kandels-Lewis; Chris Bowler; Colomban de Vargas; Gabriel Gorsky; Nigel Grimsley; Pascal Hingamp; Daniele Iudicone; Olivier Jaillon; Fabrice Not; Hiroyuki Ogata; Stephane Pesant; Sabrina Speich; Lars Stemmann; Matthew B Sullivan; Jean Weissenbach; Patrick Wincker; Eric Karsenti; Jeroen Raes; Silvia G Acinas; Peer Bork
Journal: Science Date: 2015-05-22 Impact factor: 47.728

3. QIIME allows analysis of high-throughput community sequencing data.

Authors: J Gregory Caporaso; Justin Kuczynski; Jesse Stombaugh; Kyle Bittinger; Frederic D Bushman; Elizabeth K Costello; Noah Fierer; Antonio Gonzalez Peña; Julia K Goodrich; Jeffrey I Gordon; Gavin A Huttley; Scott T Kelley; Dan Knights; Jeremy E Koenig; Ruth E Ley; Catherine A Lozupone; Daniel McDonald; Brian D Muegge; Meg Pirrung; Jens Reeder; Joel R Sevinsky; Peter J Turnbaugh; William A Walters; Jeremy Widmann; Tanya Yatsunenko; Jesse Zaneveld; Rob Knight
Journal: Nat Methods Date: 2010-04-11 Impact factor: 28.547

4. Structure, function and diversity of the healthy human microbiome.

Authors:
Journal: Nature Date: 2012-06-13 Impact factor: 49.962

5. A comparative analysis of the intestinal metagenomes present in guinea pigs (Cavia porcellus) and humans (Homo sapiens).

Authors: Falk Hildebrand; Tine Ebersbach; Henrik Bjørn Nielsen; Xiaoping Li; Si Brask Sonne; Marcelo Bertalan; Peter Dimitrov; Lise Madsen; Junjie Qin; Jun Wang; Jeroen Raes; Karsten Kristiansen; Tine Rask Licht
Journal: BMC Genomics Date: 2012-09-28 Impact factor: 3.969

6. Human gut microbiome viewed across age and geography.

Authors: Tanya Yatsunenko; Federico E Rey; Mark J Manary; Indi Trehan; Maria Gloria Dominguez-Bello; Monica Contreras; Magda Magris; Glida Hidalgo; Robert N Baldassano; Andrey P Anokhin; Andrew C Heath; Barbara Warner; Jens Reeder; Justin Kuczynski; J Gregory Caporaso; Catherine A Lozupone; Christian Lauber; Jose Carlos Clemente; Dan Knights; Rob Knight; Jeffrey I Gordon
Journal: Nature Date: 2012-05-09 Impact factor: 49.962

7. LotuS: an efficient and user-friendly OTU processing pipeline.

Authors: Falk Hildebrand; Raul Tadeo; Anita Yvonne Voigt; Peer Bork; Jeroen Raes
Journal: Microbiome Date: 2014-09-30 Impact factor: 14.650
