Marc Bailly-Bechet, Etienne G J Danchin, Carole Belliardo, Georgios D Koutsovoulos, Corinne Rancurel, Mathilde Clément, Justine Lipuma.
Abstract
Over the last decades, metagenomics has highlighted the diversity of microorganisms in environmental and host-associated samples. Most public metagenomics repositories use annotation pipelines tailored for prokaryotes, regardless of the taxonomic origin of contigs. Consequently, eukaryotic contigs, which have intrinsically different gene features, are not optimally annotated. Using a bioinformatics pipeline, we filtered 7.9 billion contigs from 6,872 soil metagenomes in the JGI's IMG/M database to identify eukaryotic contigs. We re-annotated genes using eukaryote-tailored methods, yielding 8 million eukaryotic proteins and over 300,000 orphan proteins lacking homology in public databases. Comparing our gene predictions with the initial JGI ones on the same contigs, we confirmed that our pipeline improves the completeness and contiguity of eukaryotic proteins in soil metagenomes. The improved quality of eukaryotic proteins, combined with a more comprehensive assignment method, yielded more reliable taxonomic annotation. This dataset of eukaryotic soil proteins with improved completeness, quality and taxonomic annotation reliability is of interest to any scientist studying the composition, biological functions and gene flux in soil communities involving eukaryotes.
Year: 2022 PMID: 35710557 PMCID: PMC9203802 DOI: 10.1038/s41597-022-01420-4
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 8.501
Fig. 1 Our eukaryotic protein prediction pipeline, from soil metagenomic contigs to a final dataset of taxonomically annotated proteins, with the number of contigs, proteins and metagenomes at each step.
Metrics to assess the contiguity of the 6,872 ‘Terrestrial’ and ‘Plant-associated’ metagenome-assembled genomes datasets from the IMG/M server of the JGI including the number of proteins predicted by Prodigal from IMG/M.
| Data | Metric | Min | Mean | Median | Max |
|---|---|---|---|---|---|
| Raw | Number of contigs per metagenome | 1 | 1,160,141 | 294,105 | 39,582,895 |
| Raw | Contig length (bp) | 3 | 497 | 296 | 5,373,015 |
| Raw | Number of genes per contig | 0 | 1 | 1 | 5,459 |
| Filtered | Number of contigs per metagenome | 1 | 115,615 | 22,307 | 3,625,639 |
| Filtered | Contig length (bp) | 1,000 | 1,985 | 1,350 | 5,373,015 |
| Filtered | Number of genes per contig | 1 | 3 | 2 | 5,459 |
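
The 'Filtered' rows above have a minimum contig length of 1,000 bp and a minimum of one gene per contig. The following is a minimal sketch of that kind of filter and of the summary statistics reported in the table. The two thresholds are read off the table minima and are an assumption about the filter, not a confirmed description of the authors' implementation; the tuple layout, function names and toy contigs are hypothetical.

```python
from statistics import mean, median

# Assumed thresholds, inferred from the 'Filtered' minima in the table above.
MIN_CONTIG_LENGTH_BP = 1_000
MIN_GENES_PER_CONTIG = 1


def filter_contigs(contigs):
    """Keep (contig_id, length_bp, n_genes) tuples that pass both thresholds."""
    return [
        c for c in contigs
        if c[1] >= MIN_CONTIG_LENGTH_BP and c[2] >= MIN_GENES_PER_CONTIG
    ]


def summarize(values):
    """Return the four statistics used in the table: min, mean, median, max."""
    return {
        "min": min(values),
        "mean": mean(values),
        "median": median(values),
        "max": max(values),
    }


# Toy contigs (hypothetical values, not taken from the dataset).
contigs = [
    ("c1", 296, 0),    # too short and no gene
    ("c2", 1000, 1),   # exactly at both thresholds: kept
    ("c3", 1350, 2),   # kept
    ("c4", 5000, 0),   # long enough but no predicted gene
    ("c5", 3600, 6),   # kept
]

kept = filter_contigs(contigs)
length_stats = summarize([length for _, length, _ in kept])
```

On real data the same two functions could be applied per metagenome to reproduce the 'contigs per metagenome' rows as well.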
Fig. 2 Phylogenetic tree of Augustus ab initio models showing the deeper taxonomic nodes used in the first step of the contig model selection.
Fig. 3 Phylogenetic tree focused on Magnoliopsida clades displaying the Augustus model distribution supporting the assignment of ab initio gene model by dataset (blue = Plant-associated, orange = Terrestrial 1, green = Terrestrial 2).
Fig. 4 Number of Augustus-predicted proteins and their taxonomic distribution per Augustus model kingdom by dataset (a) on all contigs (b) on eukaryotic contigs validated by Diamond (blue = Plant-associated, orange = Terrestrial 1, green = Terrestrial 2).
Taxonomic classification of Augustus predicted proteins in superkingdoms by the Last Common Ancestor algorithm of DIAMOND among each dataset.
| Clade | Plant-associated | Terrestrial 1 | Terrestrial 2 | Total | % |
|---|---|---|---|---|---|
| Prokaryote | 12,271,986 | 11,564,201 | 20,560,428 | 44,396,615 | 47.6 |
| Eukaryote | 4,986,024 | 1,951,235 | 1,064,070 | 8,001,326 | 8.6 |
| Viruses | 23,743 | 25,409 | 70,942 | 120,094 | 0.1 |
| Undetermined | 4,511,252 | 29,664,147 | 6,655,739 | 40,831,138 | 43.7 |
| Total | 21,793,005 | 43,204,992 | 28,351,179 | 93,349,176 | 100 |
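
As a small worked check, the `%` column can be recomputed from the `Total` column of the table above. This sketch only restates the table's arithmetic; the variable and function names are illustrative.

```python
# Counts copied from the 'Total' column of the table above.
totals = {
    "Prokaryote": 44_396_615,
    "Eukaryote": 8_001_326,
    "Viruses": 120_094,
    "Undetermined": 40_831_138,
}
GRAND_TOTAL = 93_349_176  # 'Total' row, 'Total' column


def percentages(counts, grand_total):
    """Share of each clade in percent, rounded to one decimal as in the table."""
    return {clade: round(100 * n / grand_total, 1) for clade, n in counts.items()}


shares = percentages(totals, GRAND_TOTAL)
for clade, share in shares.items():
    print(f"{clade}: {share}%")
```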
Fig. 5 Krona representation of taxonomic assignment provided by the last common ancestor algorithm of DIAMOND for the 8 million eukaryotic proteins predicted by our homemade pipeline using Augustus (HTML file: available on Supplementary Data[42]), and the pie chart of taxonomic ranks of retrieved lineages.
Data record, information about files available on public repository DATA INRAE[49].
| File name | Type | Size | Path | Description |
|---|---|---|---|---|
| eukaryotic_proteins.aa | fasta | 3GB | . | 8 M validated eukaryotic proteins predicted with Augustus in contigs from Terrestrial and Plant-associated metagenomic data from JGI |
| eukaryotic_proteins_taxonomy.txt | text file | 1.9GB | . | Taxonomic information for the 8 M validated eukaryotic proteins from the last common ancestor algorithm of Diamond |
| orphan_Euka.aa | fasta | 79MB | . | Orphan proteins from contigs with over half of eukaryotic proteins |
| eukaryotic_proteins_clustered.aa | fasta | 1.8GB | . | 4.6 M representative clusters of the 8 M eukaryotic proteins |
| eukaryotic_proteins_clustered.tsv | TSV | 614MB | . | Composition of eukaryotic protein clusters |
| orphan_proteins_clustered.aa | fasta | 66MB | . | 288,612 representative clusters of orphan proteins |
| orphan_proteins_clustered.tsv | TSV | 27MB | . | Composition of orphan protein clusters |
| eukaryotic_proteins_taxonomy_krona.html | html | 1.7MB | ./Supplementary Data | Krona representation of the taxonomy of the 8 M validated eukaryotic proteins from the last common ancestor algorithm of Diamond |
| Supplementary_data_1.txt | text file | 158KB | ./Supplementary Data | List of metagenome identifiers of processed data from JGI |
| Supplementary_data_Figures.pdf | pdf | 323KB | ./Supplementary Data | Fig.; Fig. |
| Supplementary_data_tables.pdf | pdf | 51KB | ./Supplementary Data | Table; Table; Table; Table |
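
Several of the files above are multi-gigabyte FASTA files, so streaming them record by record is usually preferable to loading them whole. The following is a minimal sketch of such a reader using only the Python standard library. The function names are hypothetical and the toy records are invented; the header layout of the real files is not described here, so the sketch makes no assumption about it beyond the leading `>`.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA text stream, one record at a time."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequences may be wrapped over several lines
    if header is not None:
        yield header, "".join(chunks)


def length_summary(handle):
    """Return (number of proteins, total number of residues) without keeping sequences in memory."""
    n_proteins = n_residues = 0
    for _, sequence in read_fasta(handle):
        n_proteins += 1
        n_residues += len(sequence)
    return n_proteins, n_residues


# Toy stand-in for a file such as eukaryotic_proteins.aa (invented records).
toy = io.StringIO(">prot_1 example\nMKT\nLLV\n>prot_2\nMSE\n")
n_proteins, n_residues = length_summary(toy)
```

With a downloaded file, the same call would be `length_summary(open("eukaryotic_proteins.aa"))`, ideally inside a `with` block.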
Fig. 6 Distribution of protein lengths of Augustus predictions (blue) versus Prodigal predictions (orange). Proteins from Augustus are significantly longer than those from Prodigal (see text).
Fig. 7 Complete and Fragmented BUSCO scores of the 1,093 metagenomes with single-copy universally conserved genes report a significantly better recovery of genes from eukaryotic microorganisms with Augustus than with Prodigal (see text).
Fig. 8 Annotated taxa of Arbuscular Mycorrhizal Fungal (AMF) proteins with the last common ancestor algorithm of Diamond after protein prediction with Augustus. The number of proteins is shown for each taxon. The ratio of the taxonomic rank of annotations across AMF lineages is shown in a pie chart.
BUSCO scores and FASTA file information for several gene prediction methods: (1) Augustus with a mixture of models as in our paper, (2) Augustus with the Fusarium model, (3) Augustus with the Zebrafish model, (4) MetaEuk with the NR database, (5) MetaEuk with the Swissprot database and (6) Prodigal. All scores are computed on the same metagenome used as reference.
| # | Model | BUSCO: Complete (%) | BUSCO: Complete single (%) | BUSCO: Complete duplicated (%) | BUSCO: Fragmented (%) | BUSCO: Missing (%) | FASTA: Nb. of proteins | FASTA: Total nb. of AA | FASTA: Nb. of AA/protein |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Mix | 100 | 12.9 | 87.1 | 0 | 0 | 63,986 | 25,941,958 | 405 |
| 2 | Fusarium | 98.4 | 12.5 | 85.9 | 1.2 | 0.4 | 87,508 | 36,614,755 | 418 |
| 3 | Zebrafish | 96.1 | 23.9 | 72.2 | 3.1 | 0.8 | 152,796 | 43,294,314 | 283 |
| 4 | MetaEuk nr | 100 | 1.6 | 98.4 | 0 | 0 | 119,085 | 34,031,250 | 286 |
| 5 | MetaEuk swp | 97.6 | 8.2 | 89.4 | 0.8 | 1.6 | 34,906 | 12,112,481 | 347 |
| 6 | Prodigal | 77.3 | 36.9 | 40.4 | 20 | 2.7 | 271,456 | 37,520,032 | 138 |
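
The last column of the table above is the total number of amino acids divided by the number of proteins, rounded to the nearest integer. The sketch below recomputes it from the two preceding columns as a consistency check; the names are illustrative and the numbers are copied from the table.

```python
# (model, number of proteins, total amino acids, reported AA per protein)
rows = [
    ("Mix", 63_986, 25_941_958, 405),
    ("Fusarium", 87_508, 36_614_755, 418),
    ("Zebrafish", 152_796, 43_294_314, 283),
    ("MetaEuk nr", 119_085, 34_031_250, 286),
    ("MetaEuk swp", 34_906, 12_112_481, 347),
    ("Prodigal", 271_456, 37_520_032, 138),
]


def mean_protein_length(n_proteins, n_amino_acids):
    """Mean protein length in amino acids, rounded to the nearest integer."""
    return round(n_amino_acids / n_proteins)


recomputed = {model: mean_protein_length(n, aa) for model, n, aa, _ in rows}

for model, _, _, reported in rows:
    print(f"{model}: recomputed {recomputed[model]}, reported {reported}")
```

The recomputed values make the contrast in the table easy to see: Prodigal predicts the most proteins but with the shortest mean length.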
| Measurement(s) | gene prediction objective |
| Technology Type(s) | Bioinformatics |
| Sample Characteristic - Organism | Eukaryota |
| Sample Characteristic - Environment | bulk soil • rhizosphere • rhizosphere environment |
| Sample Characteristic - Location | world |