Thomas C A Hitch, Christopher J Creevey.
Abstract
BACKGROUND: The consensus emerging from the study of microbiomes is that they are far more complex than previously thought, requiring better assemblies and increasingly deep sequencing. However, current metagenomic assembly techniques regularly fail to incorporate all, or in some cases even the majority, of the sequence information generated for many microbiomes, negating this effort. This can especially bias the information gathered about, and the perceived importance of, the minor taxa in a microbiome.
Keywords: Assembly; Genomics; Metagenome
Year: 2018 PMID: 29361904 PMCID: PMC5781261 DOI: 10.1186/s12859-018-2028-2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 A flow-chart of the steps used by Spherical. The blue circle encompasses all processes that are carried out iteratively until the results meet the ‘user defined criteria’. The ‘user defined criteria’ are any user options that indicate a point at which Spherical should stop iterating. The width of each arrow indicates the possible decrease in file size, depending on the sub-sample size selected by the user. 1; User input data (usually quality-controlled sequencing files). 2; Spherical takes a random subset of the input sequencing data. The size of the subset is determined by the user. 3; An assembly of the subset is generated. 4; The number of reads aligning to the combined assembly is determined. 5; If the number of reads aligning meets the user criteria, Spherical moves to step 7; otherwise it continues to step 6. 6; Reads that do not align to the combined assembly are used as input for the next round. 7; Spherical exits and combines the assemblies from the individual iterations into a single file.
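The loop in Fig. 1 can be sketched in a few lines of Python. This is a minimal illustration of the control flow only, not Spherical's actual code: the function and parameter names (`iterative_assembly`, `subsample_fraction`, `target_aligned`, `max_iterations`) are hypothetical, and `toy_assemble` and `toy_aligns` are toy stand-ins for a real assembler and read aligner.

```python
import random


def iterative_assembly(reads, assemble, aligns, subsample_fraction=1.0,
                       target_aligned=0.9, max_iterations=10, seed=0):
    """Sketch of the Fig. 1 loop; all names here are illustrative assumptions."""
    rng = random.Random(seed)
    combined = []              # contigs pooled across iterations (step 7 output)
    remaining = list(reads)    # step 1: user input data
    total = len(remaining)
    aligned_fraction = 0.0
    iterations = 0
    while remaining and iterations < max_iterations:
        iterations += 1
        # Step 2: take a random subset whose size is set by the user.
        k = max(1, int(len(remaining) * subsample_fraction))
        subset = rng.sample(remaining, k)
        # Step 3: assemble the subset and add it to the combined assembly.
        combined.extend(assemble(subset))
        # Step 4: determine which reads align to the combined assembly.
        unaligned = [r for r in remaining if not aligns(r, combined)]
        aligned_fraction = 1 - len(unaligned) / total
        # Step 5: stop once the user-defined criterion is met.
        if aligned_fraction >= target_aligned:
            break
        # Step 6: unaligned reads become the input for the next round.
        remaining = unaligned
    return combined, aligned_fraction


def toy_assemble(subset):
    # Toy stand-in for an assembler: every distinct read becomes a "contig".
    return sorted(set(subset))


def toy_aligns(read, contigs):
    # Toy stand-in for an aligner: a read aligns if it occurs within a contig.
    return any(read in contig for contig in contigs)


reads = ["ACGT", "ACGT", "TTGA", "GGCC", "GGCC", "CATG"]
contigs, fraction = iterative_assembly(
    reads, toy_assemble, toy_aligns, subsample_fraction=0.5, target_aligned=1.0
)
```

Passing the assembler and aligner as callables keeps the loop independent of any particular tool; in practice each would wrap an external program. The `max_iterations` guard is an added assumption so the sketch terminates even if no further reads can be aligned.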
Information on each metagenomics dataset tested using Spherical
| Environment | Dataset size (Gbp) | MGRAST project ID |
|---|---|---|
| Chicken Caecum | 0.06 | 101 |
| Human oral cavity | 0.63 | 128 |
| Yucatan groundwater | 29.00 | 5969 |
For each metagenomic dataset, the table states the source environment, the dataset's size in gigabase pairs (Gbp) and its MGRAST project ID
Assembly statistics comparing dataset assemblies for each method
| Dataset | Method | RAM usage (Gb) | Alignment (%) | False bases (%) | Longest contig | Number of contigs |
|---|---|---|---|---|---|---|
| Cecum | Normalised | 119 | 29.5 | 0.01 | 831 | 103,618 |
| | Base assembly | 1 | 29.5 | 0.01 | 831 | 103,618 |
| | Metavelvet | 2 | 29.1 | 0.07 | 831 | 103,618 |
| | Spherical (1) | 2 | 30.9 | 0.04 | 831 | 138,995 |
| Oral | Normalised | 14 | 8.1 | 0.01 | 3337 | 1,825,177 |
| | Base assembly | 25 | 13.0 | 0.02 | 4548 | 1,178,611 |
| | Metavelvet | 15 | 13.0 | 0.07 | 4548 | 1,178,611 |
| | Spherical (1) | 5 | 24.6 | 0.19 | 2380 | 1,053,802 |
| Ground water | Normalised | 361 | 52.8 | 3.86 | 117,274 | 5,721,819 |
| | Base assembly | 376 | 52.0 | 3.84 | 117,274 | 5,772,465 |
| | Metavelvet | 376 | 52.0 | 4.04 | 117,274 | 5,772,461 |
| | Spherical (1) | 377 | 59.7 | 2.89 | 117,274 | 13,312,643 |
| | Spherical (0.25) | 129 | 51.5 | 3.50 | 104,353 | 7,851,021 |
| | Spherical (0.033) | 107 | 49.8 | 3.78 | 53,836 | 7,145,998 |
The first column indicates the dataset used, whilst the second column identifies the assembly method. For Spherical assemblies, the sub-sample size is stated in brackets in the method column. The final five columns give the computational requirements of each assembly (RAM usage) and statistics about the resulting assemblies, e.g. number of contigs and alignment (%)
Fig. 2 Taxonomic breakdown of each iteration for the chicken caecum (a), human oral cavity (b) and Yucatan groundwater (c) Spherical assemblies (sub-sample size = 1) at the class level. Each bar represents the number of reads that could be assigned to a taxonomic class within the assembly from each iteration. The colours represent the different taxonomic classes identified in the legend on the right. The letters represent the results of the significance tests: bars with the same letter are not significantly different according to the chi-squared (χ²) test for homogeneity
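For readers unfamiliar with the test named in the caption, the sketch below computes the chi-squared statistic for homogeneity between two iterations. The read counts are hypothetical and chosen only for illustration; they are not taken from the paper, and the function name `chi2_homogeneity` is an assumption. Only the statistic and degrees of freedom are computed here; a p-value would additionally require the chi-squared distribution (for example via `scipy.stats.chi2_contingency`).

```python
def chi2_homogeneity(table):
    """Chi-squared statistic and degrees of freedom for a table of counts.

    Rows are groups (e.g. iterations), columns are categories (e.g. classes).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if every row shared the same class proportions.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof


# Hypothetical read counts per class for two iterations (illustrative only).
different = [[10, 20], [20, 10]]
identical = [[10, 20], [10, 20]]

stat_diff, dof_diff = chi2_homogeneity(different)
stat_same, dof_same = chi2_homogeneity(identical)
```

A statistic of zero, as for the `identical` table, means both iterations have exactly the same class proportions; larger values indicate more divergent taxonomic profiles.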