Aaron J Robinson, Hajnalka E Daligault, Julia M Kelliher, Erick S LeBrun, Patrick S G Chain.
Abstract
Public sequencing databases are invaluable resources to biological researchers, but assessing data veracity as well as the curation and maintenance of such large collections of data can be challenging. Genomes of eukaryotic organelles, such as chloroplasts and other plastids, are particularly susceptible to assembly errors and misrepresentations in these databases due to their close evolutionary relationships with bacteria, which may co-occur within the same environment, as can be the case when sequencing plants. Here, based on sequence similarities with bacterial genomes, we identified several suspicious chloroplast assemblies present in the National Institutes of Health (NIH) Reference Sequence (RefSeq) collection. Investigations into these chloroplast assemblies reveal examples of erroneous integration of bacterial sequences into chloroplast ribosomal RNA (rRNA) loci, often within the rRNA genes, presumably due to the high similarity between plastid and bacterial rRNAs. The bacterial lineages identified within the examined chloroplasts as the most likely source of contamination are either known associates of plants, or co-occur in the same environmental niches as the examined plants. Modifications to the methods used to process untargeted 'raw' shotgun sequencing data from whole genome sequencing efforts, such as the identification and removal of bacterial reads prior to plastome assembly, could eliminate similar errors in the future.
Keywords: chloroplast; genome repositories; plastome; public sequence databases; sequence contamination
Year: 2022 PMID: 35096026 PMCID: PMC8793683 DOI: 10.3389/fgene.2021.821715
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
FIGURE 1. Alignments showing similarity between Platanthera chloroplast and Klebsiella variicola rRNA sequences when (A) P. chlorantha (NC_044626.1) and (B) K. variicola (NZ_CP054254.1) are used as references.
FIGURE 2. Phylogenetic trees of the 16S (A) and 23S (B) sequences, showing relationships between bacterial (black), cyanobacterial (blue), chloroplast sequences (green) and the R. parvula sequences (red). The original R. parvula chloroplast 16S and 23S sequences (2 copies each) from RefSeq assembly (NC_031180.2) are included as well as the corrected sequences obtained after removal of bacterial reads from the original sequencing dataset. NCBI accession identifiers and sequence ranges are shown in parentheses. Branches with bootstrap support greater than 75% (100 bootstrap replicates) are shown.
FIGURE 3. Alignment between rRNA sequences from the K. alvarezii chloroplast (reference), K. alvarezii whole genome assembly and Serratia plymuthica.
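
The remedy proposed in the abstract, identifying and removing bacterial reads before plastome assembly, can be illustrated with a toy sketch. The abstract does not say which tools or thresholds were used, so everything below is an assumption made for illustration: the function names (`kmers`, `classify_read`, `filter_reads`), the k-mer comparison itself, and the default `k` are hypothetical and are not the authors' method. Because plastid and bacterial rRNAs are highly similar, the sketch compares each read against both references and discards it only when the bacterial reference explains it better, rather than discarding every read with any bacterial similarity.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq (uppercased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def classify_read(read, plastid_kmers, bacterial_kmers, k):
    """Label a read by which reference shares more of its k-mers.

    Hypothetical rule for illustration only: ties and reads with no
    bacterial advantage are never labeled "bacterial".
    """
    read_kmers = kmers(read, k)
    plastid_hits = len(read_kmers & plastid_kmers)
    bacterial_hits = len(read_kmers & bacterial_kmers)
    if bacterial_hits > plastid_hits:
        return "bacterial"
    if plastid_hits > 0:
        return "plastid"
    return "unassigned"


def filter_reads(reads, plastid_ref, bacterial_ref, k=8):
    """Keep every read that is not better explained by the bacterial reference."""
    plastid_kmers = kmers(plastid_ref, k)
    bacterial_kmers = kmers(bacterial_ref, k)
    return [
        read for read in reads
        if classify_read(read, plastid_kmers, bacterial_kmers, k) != "bacterial"
    ]
```

In practice this step would more likely be done with a read mapper or a taxonomic classifier run against full bacterial and plastid reference genomes; exact k-mer matching on raw strings ignores sequencing errors, reverse complements, and quality scores, and is shown here only to make the competitive-assignment idea concrete.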