Florence McLean, Duncan Berger, Dominik R Laetsch, Hillel T Schwartz, Mark Blaxter.
Abstract
Background: Genome assembly and annotation remain exacting tasks. As the tools available for these tasks improve, it is useful to return to data produced with earlier techniques to assess their credibility and correctness. The entomopathogenic nematode Heterorhabditis bacteriophora is widely used to control insect pests in horticulture. The genome sequence for this species was reported to encode an unusually high proportion of unique proteins and a paucity of secreted proteins compared to other related nematodes.

Findings: We revisited the H. bacteriophora genome assembly and gene predictions to determine whether these unusual characteristics were biological or methodological in origin. We mapped an independent resequencing dataset to the genome and used the blobtools pipeline to identify potential contaminants. While present (0.2% of the genome span, 0.4% of predicted proteins), assembly contamination was not significant. Re-prediction of the gene set using BRAKER1 and published transcriptome data generated a predicted proteome that was very different from the published one. The new gene set had a much reduced complement of unique proteins, better completeness values that were in line with other related species' genomes, and an increased number of proteins predicted to be secreted.

Conclusions: It is thus likely that methodological issues drove the apparent uniqueness of the initial H. bacteriophora genome annotation and that similar contamination and misannotation issues affect other published genome assemblies.
Year: 2018 PMID: 29617768 PMCID: PMC5906903 DOI: 10.1093/gigascience/giy034
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Figure 1: Taxon-annotated GC-coverage plot of the H. bacteriophora assembly. Bottom left (main) panel: Each scaffold or contig is represented by a single filled circle, placed according to its GC proportion (X axis) and its coverage by reads from the Illumina resequencing project (Y axis). The fill color of the circle indicates the taxon of the top BLASTn hit in the NCBI nt database for that scaffold. The colors are annotated in the top right-hand key, which indicates taxon assignment and (in brackets) the number of contigs and scaffolds so assigned, their total span, and their N50 length. The circles are scaled to scaffold length, as indicated in the key at the base of the main panel. Right panel: Nucleotide span in kb at each coverage level. Top panel: Nucleotide span in kb at each GC proportion.
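The two coordinates that place a scaffold in a plot like Figure 1 can be illustrated with a minimal Python sketch. This is not the blobtools implementation; the function names, the toy sequences, and the per-base depth lists are all hypothetical and serve only to show how a GC proportion and a mean read coverage might be derived for each scaffold.

```python
def gc_proportion(seq: str) -> float:
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def mean_coverage(per_base_depth) -> float:
    """Arithmetic mean of per-base read depth over a scaffold."""
    depths = list(per_base_depth)
    return sum(depths) / len(depths) if depths else 0.0


# Hypothetical toy input: scaffold name -> (sequence, per-base read depth).
toy_scaffolds = {
    "scaffold_A": ("ATGCGCNNAT", [80, 82, 85, 90, 88, 86, 0, 0, 84, 85]),
    "scaffold_B": ("GGGCCCGCGC", [2, 3, 3, 2, 4, 3, 3, 2, 3, 3]),
}

# Each scaffold becomes one (x, y) point: GC proportion vs. mean coverage.
points = {
    name: (gc_proportion(seq), mean_coverage(depth))
    for name, (seq, depth) in toy_scaffolds.items()
}

for name, (gc, cov) in points.items():
    print(f"{name}\tGC={gc:.2f}\tcoverage={cov:.1f}")
```

Ambiguous bases (N) are excluded from the GC denominator here; that is a design choice of this sketch, not a statement about how the published plot was computed.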
Contamination screening of the H. bacteriophora assembly
| Number of scaffolds | Sum of scaffold spans (bp) | Mean coverage (a) | Best matches in NCBI nt database | Assignment |
|---|---|---|---|---|
| 12 | 99,556 | 2.8 | | Bacterial culture contaminant (b) |
| 4 | 4,709 | 0.1 | | Symbiont culture contaminant (b) |
| 2 | 2,144 | 756.0 | Poorly annotated mitochondrial matches | |
| 22 | 3,051,844 | 69.6 | Mariner transposons in Metazoa, especially Hymenoptera and Platyhelminthes | |
| 10 | 334,100 | 76.6 | Low score match to several histone H3.3 across Metazoa | |
| 7 | 713,932 | 56.5 | Chance nucleotide matches to conserved genes in other taxa | |

(a) The average read coverage of the whole assembly was 85.3.
(b) These scaffolds were removed by the low-coverage filter.
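The effect of a low-coverage filter on the scaffold groups in this table can be sketched as follows. The scaffold counts, spans, and mean coverages are taken from the table; the cutoff value `HYPOTHETICAL_MIN_COVERAGE` is an assumption made purely for illustration, since the threshold actually used is not stated here.

```python
from typing import NamedTuple


class ScaffoldGroup(NamedTuple):
    n_scaffolds: int
    span_bp: int
    mean_coverage: float
    note: str


# Values copied from the contamination screening table.
groups = [
    ScaffoldGroup(12, 99_556, 2.8, "Bacterial culture contaminant"),
    ScaffoldGroup(4, 4_709, 0.1, "Symbiont culture contaminant"),
    ScaffoldGroup(2, 2_144, 756.0, "Poorly annotated mitochondrial matches"),
    ScaffoldGroup(22, 3_051_844, 69.6, "Mariner transposons in Metazoa"),
    ScaffoldGroup(10, 334_100, 76.6, "Low score match to histone H3.3"),
    ScaffoldGroup(7, 713_932, 56.5, "Chance nucleotide matches"),
]

# ASSUMPTION: illustrative cutoff only; the real threshold is not given.
HYPOTHETICAL_MIN_COVERAGE = 10.0

removed = [g for g in groups if g.mean_coverage < HYPOTHETICAL_MIN_COVERAGE]
kept = [g for g in groups if g.mean_coverage >= HYPOTHETICAL_MIN_COVERAGE]

removed_scaffolds = sum(g.n_scaffolds for g in removed)
removed_span = sum(g.span_bp for g in removed)

print(f"removed: {removed_scaffolds} scaffolds, {removed_span:,} bp")
for g in kept:
    print(f"kept: {g.n_scaffolds} scaffolds at {g.mean_coverage}x ({g.note})")
```

With any cutoff between 2.8 and 56.5, only the two groups marked with footnote (b) fall below the threshold, which is consistent with the table.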
Figure 2: Comparisons of BRAKER1/soft-masked and original gene predictions from H. bacteriophora. A, B) Frequency histograms of intron count (A) and protein length (B) in BRAKER1/soft-masked (blue) and published (yellow) protein coding gene predictions. Outlying proteins longer than 2,500 amino acids (n = 40) and genes containing more than 60 introns (n = 20) are not shown. C) Frequency histogram of the proportion of each BRAKER1 gene prediction overlapped by a published gene prediction at the nucleotide level. D) Comparison of singleton, proteome-specific, and shared proteins in the published and BRAKER1/soft-masked protein sets. E) Counts of noncanonical GC/AG introns in gene predictions from the published and BRAKER1 H. bacteriophora gene sets and the model nematode Caenorhabditis elegans (WS258). Counts are of genes that contain at least one noncanonical GC/AG intron, binned by the number of noncanonical introns per gene.
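The kind of tally shown in panel E can be sketched in Python. This is an illustrative reconstruction, not the authors' script: it assumes intron sequences are already extracted in the sense orientation of the transcript, and the function names and toy sequences are hypothetical. An intron is called canonical when it starts with GT and ends with AG, and noncanonical GC/AG when it starts with GC and ends with AG.

```python
from collections import Counter


def intron_class(intron_seq: str) -> str:
    """Classify an intron by its donor and acceptor dinucleotides."""
    s = intron_seq.upper()
    if len(s) < 4:
        return "other"
    donor, acceptor = s[:2], s[-2:]
    if acceptor == "AG" and donor == "GT":
        return "GT-AG"
    if acceptor == "AG" and donor == "GC":
        return "GC-AG"
    return "other"


def genes_by_gc_ag_count(introns_by_gene: dict) -> dict:
    """Map k -> number of genes with exactly k GC-AG introns, for k >= 1."""
    histogram = Counter()
    for introns in introns_by_gene.values():
        k = sum(1 for seq in introns if intron_class(seq) == "GC-AG")
        if k >= 1:
            histogram[k] += 1
    return dict(histogram)


# Hypothetical toy data: gene name -> list of intron sequences.
toy_introns = {
    "gene_a": ["GTAAGTTTTCAG", "GCAAGTTTTCAG"],
    "gene_b": ["GCAAGATTTAG", "GCATGTTTCAG", "GTAAAAAAG"],
    "gene_c": ["GTAAGTTTTCAG"],
}

print(genes_by_gc_ag_count(toy_introns))
```

Genes with no GC-AG introns (such as `gene_c` in the toy data) are left out of the histogram, mirroring the caption's restriction to genes with at least one such intron.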
Comparison of the published and BRAKER1/soft-masked protein coding gene predictions
| Prediction set | Published | BRAKER1/soft-masked |
|---|---|---|
| Number of protein coding genes predicted | 20,964 | 15,747 |
| Mean protein length (amino acids) | 218.8 | 344.5 |
| Number of single exon genes | 1,728 | 2,326 |
| Mean number of exons per gene (a) | 5.9 | 7.8 |
| Proportion of noncanonical (GC-AG) introns | 8.87% | 0.79% |
| Percentage mapping to publicly available transcriptome reads | | |
| | 80.45% | 84.26% |
| | 37.18% | 58.03% |
| BUSCO score for proteome | | |
| | 47.8% | 94% |
| | 34.7% | 4.3% |
| Number of proteins with no hits in UniRef90 | 8,962 | 2,889 |
| Protein singletons in clustering | 5,442 | 1,112 |
| Conserved, single-copy orthologues (b) | | |
| | 2,089 | 2,330 |
| | 377 | 141 |
| | 184 | 84 |

(a) Number of exons: number of coding DNA sequence (CDS) entries per gene for BRAKER1 predictions. CDS features, not exons, are outputted by AUGUSTUS in GFF files.
(b) The list of strict one-to-one orthologues was augmented with protein clusters where 75% of species had single-copy representatives (“fuzzy-1-to-1” orthologues identified by KinFin).
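The "fuzzy-1-to-1" criterion in the second footnote can be illustrated with a short Python sketch. This is a simplified reading of the rule as worded here (at least 75% of species contribute exactly one protein to the cluster) and is not KinFin's implementation, which may apply further conditions; the function name, species labels, and copy counts are hypothetical.

```python
def is_fuzzy_one_to_one(copy_counts: dict, all_species: list,
                        min_fraction: float = 0.75) -> bool:
    """True if at least `min_fraction` of species have exactly one copy.

    `copy_counts` maps species -> number of proteins in the cluster;
    species absent from the mapping are treated as having zero copies.
    """
    if not all_species:
        return False
    single_copy = sum(1 for sp in all_species if copy_counts.get(sp, 0) == 1)
    return single_copy / len(all_species) >= min_fraction


# Hypothetical species set and clusters.
species = ["sp1", "sp2", "sp3", "sp4"]

strict_cluster = {"sp1": 1, "sp2": 1, "sp3": 1, "sp4": 1}   # strict 1-to-1
fuzzy_cluster = {"sp1": 1, "sp2": 1, "sp3": 1, "sp4": 2}    # 3 of 4 single-copy
rejected_cluster = {"sp1": 1, "sp2": 1, "sp4": 2}           # 2 of 4 single-copy

for name, cluster in [("strict", strict_cluster),
                      ("fuzzy", fuzzy_cluster),
                      ("rejected", rejected_cluster)]:
    print(name, is_fuzzy_one_to_one(cluster, species))
```

Strict one-to-one clusters pass the same test trivially, so a single predicate covers both the strict list and its fuzzy augmentation in this sketch.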
Figure 3: Maximum likelihood phylogeny of selected rhabditine (Clade V) nematodes. A supermatrix of aligned amino acid sequences from orthologous loci from both H. bacteriophora predictions and a set of 23 rhabditine (Clade V) nematodes (see Supporting Data, Orthofinder_analysis) was analyzed with RAxML using a PROTGAMMAGTR amino acid substitution model. Pristionchus spp. were designated as the outgroup. Bootstrap support values (100 bootstraps performed) were 100 for all branches except one.
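A supermatrix of the kind described in this caption is built by concatenating per-locus alignments species by species. The following Python sketch shows the idea under stated assumptions: each locus is already aligned, a species missing from a locus is padded with gap characters, and the function name, taxon labels, and toy sequences are hypothetical. It does not reproduce the authors' pipeline.

```python
def build_supermatrix(locus_alignments: list, species: list) -> dict:
    """Concatenate per-locus alignments into one sequence per species.

    Each element of `locus_alignments` maps species -> aligned sequence;
    all sequences within a locus must have the same length. Species
    missing from a locus are filled with '-' for that locus.
    """
    parts = {sp: [] for sp in species}
    for index, alignment in enumerate(locus_alignments):
        lengths = {len(seq) for seq in alignment.values()}
        if len(lengths) != 1:
            raise ValueError(f"locus {index} is empty or not aligned")
        width = lengths.pop()
        for sp in species:
            parts[sp].append(alignment.get(sp, "-" * width))
    return {sp: "".join(chunks) for sp, chunks in parts.items()}


# Hypothetical toy loci with three taxa.
toy_loci = [
    {"taxon_A": "MK-L", "taxon_B": "MKAL"},
    {"taxon_A": "GG", "taxon_C": "GA"},
]
toy_species = ["taxon_A", "taxon_B", "taxon_C"]

supermatrix = build_supermatrix(toy_loci, toy_species)
for sp, seq in supermatrix.items():
    print(f">{sp}\n{seq}")
```

Every row of the resulting matrix has the same length, which is what a tree-inference program expects from a concatenated alignment.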