Matteo Chiara, Antonio Placido, Ernesto Picardi, Luigi Ruggiero Ceci, David Stephen Horner, Graziano Pesole.
Abstract
BACKGROUND: Expression screening of environmental DNA (eDNA) libraries is a popular approach for the identification and characterization of novel microbial enzymes with promising biotechnological properties. In such "functional metagenomics" experiments, inserts selected on the basis of activity assays are sequenced with high-throughput sequencing technologies. Assembly is followed by gene prediction, annotation and identification of candidate genes, which are subsequently evaluated for biotechnological applications.
Keywords: Assembly; Candidate genes; Functional annotation; Functional metagenomics; Galaxy; Workflow
Year: 2018 PMID: 29329522 PMCID: PMC5767027 DOI: 10.1186/s12864-017-4369-z
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
List of bioinformatics tools and resources currently incorporated within A-GAME
| Analysis step | Tool |
|---|---|
| Quality trimming | FastQC |
| Quality trimming | Pear |
| Quality trimming | Flash |
| Quality trimming | Trimmomatic |
| Quality trimming | FastX |
| Assembly | Megahit |
| Assembly | SPAdes |
| Assembly | Abyss |
| Assembly | Velvet |
| Assembly | Meta-Velvet |
| Assembly | Meta-SPAdes |
| Gene prediction | Glimmer |
| Gene prediction | Augustus |
| Gene prediction | Prokka (Prodigal) |
| Gene prediction | Metagenemark |
| Functional annotation | Interpro |
| Functional annotation | PFAM |
| Functional annotation | Blast+ suite |
| Short read mapping | Bowtie2 |
| Short read mapping | bwa |
| Scaffolding | SSpace |
| Scaffolding | Sopra |
Fig. 1 Schematic of workflows for the analysis of metagenomic eDNA data available in A-GAME. Bioinformatics analysis of functional metagenomics data requires pre-processing of raw data, sequence assembly and post-processing of contigs, gene model prediction and functional annotation. A-GAME offers 4 pre-configured workflows, Fosmid1 to Fosmid4, that automate all the steps of the analysis and differ only in the combination of tools used. The 4 workflows are represented as flow diagrams; the tools used to perform individual steps are reported in red
Fig. 2 Re-assembly of barcoded sequence data from the Lam et al. dataset. Assemblies of barcoded sequence data from Lam et al. [16] were used to evaluate the results achieved by different tools and pipelines in the analysis of pooled sequencing data. Clones among the 92 subjected to individual sequencing by Lam et al. [16] that showed lack of coverage or absence of end-tag sequences were discarded from these analyses. Of the 61 inserts for which sequencing information for both end tags was available ("complete", indicated in black), 46 were matched correctly to both ends (fully assigned), 11 were matched to only 1 of 2 ends (partially assigned) and 2 did not show any similarity to the available insert termini sequences. Of the 12 clones for which only "partial" sequencing information (a single end tag, indicated in red) was available, 10 were recognized and assigned to the proper end tag, while 2 are missing. Only assigned and partially assigned contigs were included in the benchmark dataset used for the evaluation
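The assignment categories used in this caption and in the comparison table that follows can be illustrated with a short sketch. It is not the A-GAME implementation: the function name `classify_insert`, the tag identifiers and the input layout (a set of sequenced end tags per insert plus, for each contig, the set of end tags it matches) are all assumptions made for illustration, and the similarity search that produces the matches is not shown.

```python
from typing import Iterable, List, Set


def classify_insert(available_tags: Iterable[str],
                    contig_tag_hits: List[Set[str]]) -> str:
    """Classify one insert from the end tags matched by its contigs.

    available_tags  -- end tags sequenced for this insert (1 or 2 tags)
    contig_tag_hits -- one set per contig, listing the end tags it matches
    Both inputs are hypothetical; they stand in for the result of a
    similarity search between contigs and Sanger end-tag sequences.
    """
    available = set(available_tags)
    # End tags of this insert that are matched by at least one contig.
    assigned = set().union(*contig_tag_hits) & available

    if not assigned:
        return "unassigned"
    if len(available) == 1:
        # Only a single end tag exists, and it was matched.
        return "1 of 1 ends"
    if len(assigned) == 1:
        # Both end tags exist, but only one was matched.
        return "partial"
    # Both end tags were matched: by one contig or by several?
    if any(available <= set(hits) for hits in contig_tag_hits):
        return "complete"
    return "2 of 2 ends"


if __name__ == "__main__":
    print(classify_insert({"left", "right"}, [{"left", "right"}]))
    print(classify_insert({"left", "right"}, [{"left"}, {"right"}]))
```

Keeping the classification separate from the sequence search makes the categories easy to tabulate per assembler, which is how the comparison table is organized.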
Comparison of workflows for the assembly and annotation of eDNA insert data using Lam et al. [16] pooled inserts
| | Insert assignment based on end-tags | | | | Completeness | | | Computational requirements | |
|---|---|---|---|---|---|---|---|---|---|
| Workflow | Complete (a) | 2 of 2 ends (b) | 1 of 1 ends (c) | Partial (d) | % assembled (e) | % of reference proteins (f) | Assembly N50 | CPU time (h) | RAM peak (GB) |
| Original assembly | 40 | 6 | 10 | 11 | 100.00 | 100.00 | 36,113 | NA | NA |
| SPAdes (F1) | 34 | 13 | 9 | 11 | 88.16 | 86.35 | 34,329 | 2.03 | 5.3 |
| Velvet (F2) | 18 | 27 | 10 | 12 | 66.58 | 64.79 | 32,942 | 1.67 | 6.61 |
| MEGAHIT (F3) | 30 | 19 | 8 | 10 | 95.14 | 92.13 | 34,446 | 1.21 | 3.22 |
| MetaVelvet (F4) | 19 | 26 | 9 | 13 | 74.64 | 73.38 | 33,150 | 1.75 | 7.01 |
| meta-SPAdes (F5) | 34 | 13 | 9 | 11 | 88.16 | 86.35 | 34,329 | 2.03 | 5.3 |
| MOCAT2 | 19 | 27 | 8 | 13 | 67.62 | 65.92 | 25,246 | 1.91 | 4.51 |
| Parallel META2 | 12 | 34 | 6 | 15 | 40.48 | 37.87 | 26,408 | 2.36 | 3.11 |
| Original from Lam et al. | 19 | 28 | 7 | 13 | 72.47 | 69.97 | 33,347 | NA | NA |
(a) Inserts assembled into a single contig matching both end tags
(b) Inserts assembled into multiple contigs, both end tags are assigned
(c) Inserts for which only a single end tag is available and is assigned
(d) Inserts for which both ends are available but only one is assigned
(e) Percentage of the reference assembly represented in the pooled assembly
(f) Percentage of proteins from the reference assembly recovered in the pooled assembly
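The table reports an assembly N50 for each workflow. As a reminder of what that statistic measures, the sketch below computes N50 under its usual definition: the length of the contig at which the running sum of contig lengths, taken from longest to shortest, first reaches half of the total assembly length. The function name `n50` and the example lengths are illustrative and are not taken from the paper.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the N50 of a set of contig lengths.

    Contigs are visited from longest to shortest; N50 is the length of
    the contig at which the cumulative length first reaches half of the
    total assembly length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("N50 is undefined for an empty assembly")


if __name__ == "__main__":
    # Toy contig lengths, for illustration only.
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

A higher N50 indicates that more of the assembly sits in long contigs, but it says nothing about correctness, which is why the table pairs it with the end-tag and completeness columns.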
Sensitivity and specificity of the FosBin tool with real coverage and simulated coverage levels
| | | Real coverage | | Simulated coverage | |
|---|---|---|---|---|---|
| No. of fosmids (a) | No. of fragments (b) | Sensitivity | Specificity | Sensitivity | Specificity |
| 8 | 2 | 0.748 | 0.874 | 0.808 | 0.904 |
| 8 | 3 | 0.766 | 0.922 | 0.822 | 0.941 |
| 8 | 5 | 0.713 | 0.943 | 0.779 | 0.956 |
| 12 | 2 | 0.717 | 0.858 | 0.782 | 0.891 |
| 12 | 3 | 0.726 | 0.909 | 0.790 | 0.930 |
| 12 | 5 | 0.701 | 0.940 | 0.769 | 0.954 |
| 18 | 2 | 0.656 | 0.828 | 0.730 | 0.865 |
| 18 | 3 | 0.635 | 0.878 | 0.711 | 0.904 |
| 18 | 5 | 0.589 | 0.918 | 0.670 | 0.934 |
(a) Number of inserts included in the simulated pool
(b) Number of distinct fragments generated from each insert
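Sensitivity and specificity have their standard definitions, TP / (TP + FN) and TN / (TN + FP). This record does not state how true and false positives were counted for FosBin, so the sketch below makes an explicit assumption: every pair of fragments is scored, and a pair counts as positive when both fragments are placed in the same bin. The function name `pairwise_sens_spec` and the dictionary inputs are hypothetical.

```python
from itertools import combinations
from typing import Dict, Hashable, Tuple


def pairwise_sens_spec(true_insert: Dict[str, Hashable],
                       predicted_bin: Dict[str, Hashable]) -> Tuple[float, float]:
    """Pairwise sensitivity and specificity of a binning result.

    true_insert   -- fragment id -> insert the fragment really comes from
    predicted_bin -- fragment id -> bin assigned by the clustering tool
    Assumption: a pair of fragments is a "positive" when both fall in the
    same bin; this scoring scheme is illustrative, not FosBin's own.
    """
    tp = fp = tn = fn = 0
    for a, b in combinations(sorted(true_insert), 2):
        same_truth = true_insert[a] == true_insert[b]
        same_pred = predicted_bin[a] == predicted_bin[b]
        if same_truth and same_pred:
            tp += 1
        elif same_truth:
            fn += 1
        elif same_pred:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


if __name__ == "__main__":
    truth = {"c1": "A", "c2": "A", "c3": "B", "c4": "B"}
    bins = {"c1": 1, "c2": 1, "c3": 1, "c4": 2}
    print(pairwise_sens_spec(truth, bins))
```

Under this scheme, merging fragments from different inserts lowers specificity, while splitting one insert across bins lowers sensitivity.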
Fig. 3 Identification of candidate genes/proteins. A-GAME offers a collection of tools and utilities for the functional annotation of proteins that can be used to facilitate the identification of genes with enzymatic activities of interest. a Concise report of PFAM protein domain annotation. The report is presented as an HTML page in a FASTA-like format. Protein sequences are associated with their PFAM domains and a brief description. Hyperlinks to PFAM are included in the report to facilitate the retrieval of more complete information. For datasets where contig clustering (FosBin) or Sanger end-tag data are available, the annotation of each inferred cluster is reported in a dedicated HTML page. Users can navigate through individual reports using hyperlinks provided at the top of the main page. b Keyword search of PFAM domain annotation. Candidate genes with enzymatic activities of interest can be retrieved by performing a keyword search of the PFAM domain annotation with the dedicated utility in A-GAME. Proteins matching user-provided keywords are listed in a dedicated report. c Characterization of candidate proteins. The Interpro suite can be used for more specific annotation of functional domains in user-selected subsets of genes, generating pages that contain graphical depictions with descriptions and web links to the corresponding databases. A-GAME also allows users to perform similarity searches against a local database containing more than 2500 RefSeq bacterial proteomes to identify homologs of reconstructed ORFs
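The keyword search described in panel b amounts to filtering domain annotations by their description text. The sketch below shows that idea only; it does not reproduce the A-GAME utility. The record layout, the function name `search_annotations` and the example proteins are assumptions made for illustration.

```python
from typing import Dict, Iterable, List


def search_annotations(records: Iterable[Dict[str, str]],
                       keywords: Iterable[str]) -> List[Dict[str, str]]:
    """Return annotation records whose description matches any keyword.

    Matching is a case-insensitive substring test. Each record is a dict
    with hypothetical keys "protein", "pfam" and "description".
    """
    wanted = [k.lower() for k in keywords]
    return [
        rec for rec in records
        if any(k in rec["description"].lower() for k in wanted)
    ]


if __name__ == "__main__":
    # Toy annotation table; protein names are invented for illustration.
    annotations = [
        {"protein": "orf_001", "pfam": "PF00150",
         "description": "Cellulase (glycosyl hydrolase family 5)"},
        {"protein": "orf_002", "pfam": "PF00005",
         "description": "ABC transporter"},
        {"protein": "orf_003", "pfam": "PF00561",
         "description": "alpha/beta hydrolase fold"},
    ]
    for hit in search_annotations(annotations, ["hydrolase"]):
        print(hit["protein"], hit["pfam"], hit["description"])
```

A substring match is deliberately permissive: it suits a first pass over candidates, which can then be examined in more detail with the domain-level annotation described in panel c.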