Alex L Mitchell, Maxim Scheremetjew, Hubert Denise, Simon Potter, Aleksandra Tarkowska, Matloob Qureshi, Gustavo A Salazar, Sebastien Pesseat, Miguel A Boland, Fiona M I Hunter, Petra Ten Hoopen, Blaise Alako, Clara Amid, Darren J Wilkinson, Thomas P Curtis, Guy Cochrane, Robert D Finn.
Abstract
EBI metagenomics (http://www.ebi.ac.uk/metagenomics) provides a free-to-use platform for the analysis and archiving of sequence data derived from the microbial populations found in a particular environment. Over the past two years, EBI metagenomics has increased the number of datasets analysed 10-fold. In addition to increased throughput, the underlying analysis pipeline has been overhauled to include new or updated tools and reference databases. Of particular note is a new workflow for taxonomic assignments that has been extended to include assignments based on both the large and small subunit RNA marker genes and to encompass all cellular micro-organisms. We also describe the addition of metagenomic assembly as a new analysis service. Our pilot studies have produced over 2400 assemblies from datasets in the public domain. From these assemblies, we have produced a searchable, non-redundant protein database of over 50 million sequences. To improve access to the data stored within the resource, we have developed a programmatic interface that exposes the analysis results and associated sample metadata. Finally, we have integrated the results of a series of statistical analyses that provide estimations of diversity and sample comparisons.
Year: 2018 PMID: 29069476 PMCID: PMC5753268 DOI: 10.1093/nar/gkx967
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Illustration of the number of projects and runs analysed from each biome. The number of projects and runs from different study types are shown on consecutive log axes. This figure was produced using the iTOL server (44).
Figure 2. Schematic representations of the EBI metagenomics pipeline versions 3.0 (A) and 4.0 (B). Tools and reference databases updated in each release are indicated by a magenta circle and described in detail within the text. Processing steps are shown in coloured rounded boxes (yellow, blue, green), tools in dark grey boxes and databases in light grey boxes. Input and output files are shown as white squares. The combined gene caller component is indicated as CGC.
Figure 3. Krona plots showing taxonomic classification of run ERR771104 from Ocean Sampling Day 2014 (ENA project accession PRJEB8682). (A) Produced using version 2.0 of the pipeline and (B) using version 4.0. Prokaryotic taxonomic lineages are shown in red, eukaryotic in blue and unclassified in grey. The total number of 16S rRNA/SSU input sequences was similar in each case (976 with version 2.0 versus 1008 with version 4.0).
Figure 4. Correlation between temperature (A) and depth (B) and photosynthesis-related GO term counts, normalized by number of InterPro annotations, for Tara Oceans project PRJEB1787. Metadata and annotations were retrieved from the API and combined on the fly to generate the visualizations.
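The normalization and correlation described in the Figure 4 caption can be sketched as follows. This is a minimal illustration only: the function names `normalise_counts` and `pearson_r` are hypothetical, the sample values are invented and are not taken from PRJEB1787, and the use of Pearson's coefficient is an assumption, since the caption does not say which correlation measure the authors used. Retrieval of metadata and annotations from the API is not shown.

```python
from math import sqrt


def normalise_counts(go_counts, interpro_totals):
    """Divide each sample's GO term count by its total number of InterPro annotations."""
    if len(go_counts) != len(interpro_totals):
        raise ValueError("one InterPro total is needed per GO count")
    return [count / total for count, total in zip(go_counts, interpro_totals)]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 values each")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-sample values, invented purely for illustration.
temperatures = [10.0, 15.0, 20.0, 25.0]          # sample metadata (degrees C)
go_counts = [120, 150, 260, 300]                 # photosynthesis-related GO term counts
interpro_totals = [10000, 10000, 13000, 12000]   # total InterPro annotations per sample

normalised = normalise_counts(go_counts, interpro_totals)
r = pearson_r(temperatures, normalised)
print(f"Pearson r between temperature and normalised GO counts: {r:.3f}")
```

Dividing by the InterPro annotation total keeps samples with very different annotation yields comparable before they are correlated with a metadata field such as temperature or depth.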
Figure 5. HMMER search results using the assembled peptide database. Searching the full length subdivision of the assembled peptide database with an arginine deiminase from Streptococcus sanguinis SK1057 (UniProt identifier: F2BTU6) identified over 800 sequences with a significant match (E-value < 1e–10) to the query sequence, with <9% (78 sequences) having an identical counterpart in UniProtKB.
Figure 6. Growth of metagenomics data housed in ENA and processed by EBI Metagenomics (EMG). This graph shows the cumulative growth of environmental data in the two resources (ENA: solid lines, EMG: dashed lines) according to two different metrics: numbers of samples (blue) and number of bases (orange).