
Misidentification of genome assemblies in public databases: The case of Naumovozyma dairenensis and proposal of a protocol to correct misidentifications.

Aimilia A Stavrou^1,2, Verónica Mixão^3,4, Teun Boekhout^1,2, Toni Gabaldón^3,4,5.

Abstract

Online sequence databases such as NCBI GenBank serve as a tremendously useful platform for researchers to share and reuse published data. However, submission systems lack controls for errors such as organism misidentification, which, once entered in the database, can be propagated and mislead downstream analyses. Here we present an illustrative case in which Candida albicans from a clinical sample was misidentified as Naumovozyma dairenensis based on whole-genome shotgun data. Analyses of phylogenetic markers, read mapping and single nucleotide polymorphisms served to correct the identification. We propose that the routine use of such analyses could help to detect misidentifications arising from unsupervised analyses and correct them before they enter the databases. Finally, we discuss broader implications of such misidentifications and the difficulty of correcting them once they are in the records.

Keywords: Candida albicans; Naumovozyma dairenensis; misidentification; public databases

Year: 2018 PMID: 29320804 PMCID: PMC6001429 DOI: 10.1002/yea.3303

Source DB: PubMed Journal: Yeast ISSN: 0749-503X Impact factor: 3.239

INTRODUCTION

Part of our current work is the identification of species‐specific genomic regions of human pathogenic yeasts. We have identified one such region, which is specific to Candida albicans and shows high similarity to its orthologue in Candida dubliniensis. The region of interest is part of the ECE1 gene, first described by Birse, Irwin, Fonzi, and Sypherd (1993). The species‐specificity of this region is easy to demonstrate by aligning the first described ECE1 sequence in GenBank (accession number L17087). However, during our work, when performing a BLASTn search against the whole‐genome shotgun (WGS) contigs database, as described in the ‘Materials and methods’ section, one of the hits was a region of the Naumovozyma dairenensis assembly associated with the BioProject PRJNA267549 (Roach, Burton, Lee, et al., 2015). This assembly was obtained by Roach et al. (2015), who applied WGS to identify pathogen isolates from patients in the clinical care unit. The study unveiled cryptic transmissions between patients and potential novel pathogenic strains. Among the strains identified, the one named 763_NDAI was claimed to correspond to N. dairenensis. As this species is only distantly related to C. albicans and C. dubliniensis (Kurtzman, Fell, & Boekhout, 2011), the BLASTn hit seemed suspicious, prompting us to investigate this finding further. Here, we describe this investigation and discuss the implications of the presence of misidentified sequences in public databases.

MATERIALS AND METHODS

Detection of a possible misidentification

To verify the specificity of the ECE1 region of interest to C. albicans, BLASTn was used (Altschul, Gish, Miller, Myers, & Lipman, 1990) with the following parameters: Database – Nucleotide collection, MegaBLAST; Max Target sequences – 500; and the rest of the parameters set to default. The search was then repeated with the Database option changed to WGS (Organism Fungi: Taxid 4751). In this second search, one of the hits did not correspond to any Candida species, but to the N. dairenensis 763_NDAI strain assembly associated with the BioProject PRJNA267549 (Roach et al., 2015). Therefore, we decided to check whether this strain was correctly identified.

Phylogenetic confirmation of the N. dairenensis assembly identification

To verify whether the 763_NDAI strain was correctly identified, we used phylogenetic markers traditionally used to identify filamentous fungi and yeasts, i.e. ribosomal DNA, elongation factor, DNA‐directed RNA polymerase II and additional regions from the N. dairenensis‐type strain CBS 421. These are the partial ribosomal DNA regions (small subunit ribosomal RNA gene, partial sequence; internal transcribed spacer 1, 5.8S ribosomal RNA gene, and internal transcribed spacer 2, complete sequence; and large subunit ribosomal RNA gene; accession number AJ229072), actin1 (accession number AF527937; Kurtzman & Robnett, 2003), DNA‐directed RNA polymerase II (RPB2; accession number AF527908) and translation elongation factor 1 alpha (TEF1‐α; accession number AF402046; Goddard & Burt, 1999). To determine the possible regions of 763_NDAI assembly corresponding to these regions, a BLASTn search of the sequences was performed against the WGS database selecting BioProject: PRJNA267549 (Roach et al., 2015). Then, we downloaded the top match of the aligned sequences and performed a BLASTn search against both nucleotide and WGS databases selecting organism Ascomycota Taxid: 4890. The matches of these sequences and the N. dairenensis strain 763_NDAI are shown in Table 1.

Table 1

BLASTn results for Naumovozyma dairenensis CBS 421 against Naumovozyma dairenensis strain 763_NDAI

Strain	Locus	Accession number	NCBI database	Query coverage	Identity
CBS 421	Partial rDNA	AJ229072	WGS	37%	88%
CBS 421	Actin1	AF527937	WGS	100%	86%
CBS 421	RPB2	AF527908	WGS	97%	67%
CBS 421	TEF1α	AF402046	WGS	99%	89%

WGS = Whole‐genome shotgun.


Genomic confirmation of the N. dairenensis assembly identification

To confirm that the WGS reads from 763_NDAI deposited in the Sequence Read Archive (Leinonen, Sugawara, & Shumway, 2011; Roach et al., 2015) actually correspond to a C. albicans strain, as suggested by the previous BLASTn hits, and not to N. dairenensis, we performed read mapping and single nucleotide polymorphism (SNP) calling against two available reference genomes of C. albicans, strains SC5314 (van het Hoog, Rast, Martchenko, et al., 2007) and WO‐1 (Butler, Rasmussen, Lin, et al., 2009), and against the reference genome of N. dairenensis CBS 421 (Gordon, Armisén, Proux‐Wéra, et al., 2011). Briefly, 763_NDAI reads were inspected for quality with FastQC, and reads with quality lower than 10 and/or length shorter than 31 bp were filtered out. The remaining reads were mapped with BWA‐MEM v0.7.15 (Li, 2013) against the SC5314 and WO‐1 genomes, both retrieved from the Candida Genome Database (http://www.candidagenome.org/), the first on 4 August 2017 and the second on 12 January 2017, and against the N. dairenensis CBS 421 genome, retrieved from GenBank (https://www.ncbi.nlm.nih.gov/genbank) on 16 January 2017. Picard v2.1.1 (http://broadinstitute.github.io/picard) was used to sort reads by coordinates, mark duplicates and obtain the mapping statistics. The mapping was inspected with IGV version 2.0.30 (Robinson, Thorvaldsdóttir, Winckler, et al., 2011). GATK v3.6 (McKenna, Hanna, Banks, et al., 2010) was used to call variants. The HaplotypeCaller tool was run with --genotyping_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 30 -ploidy 2. A file containing only SNPs was generated with the SelectVariants tool. The resulting file was filtered with the VariantFiltration tool using the following parameters: --clusterSize 5 --clusterWindowSize 20 --genotypeFilterName "heterozygous" --genotypeFilterExpression "isHet == 1" --filterName "bad_quality" -filter "QD < 2.0 || MQ < 40 || FS > 60.0 || HaplotypeScore > 13.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0".
Mapping coverage was determined with SAMtools v0.1.18 (Li, Handsaker, Wysoker, et al., 2009). To calculate the number of SNPs per kilobase, only positions covered by one or more reads were counted towards the genome size. To this end, bedtools genomecov (Quinlan & Hall, 2010) was used to count the reference positions without any mapped read, and this number was subtracted from the total number of bases.
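
The SNP density calculation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' script: the function name, argument names and input numbers are hypothetical, and it assumes that the SNP count and the number of uncovered reference positions have already been obtained from the filtered variant file and from the coverage report.

```python
def snps_per_kb(n_snps: int, reference_bases: int, uncovered_bases: int) -> float:
    """SNP density computed over covered reference positions only.

    reference_bases: total number of bases in the reference genome.
    uncovered_bases: positions with no mapped read (assumed to come from a
                     per-position coverage report).
    """
    covered_bases = reference_bases - uncovered_bases
    if covered_bases <= 0:
        raise ValueError("no covered positions in the reference")
    # Divide the SNP count by the covered genome size expressed in kilobases.
    return n_snps / (covered_bases / 1000.0)


# Hypothetical numbers for illustration only (not taken from the study).
density = snps_per_kb(n_snps=28_000, reference_bases=14_300_000, uncovered_bases=300_000)
print(f"{density:.2f} SNPs/kb")
```

Restricting the denominator to covered positions avoids underestimating divergence when part of the reference has no reads mapped.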

RESULTS

As mentioned above, a BLASTn search with the ECE1 region used in our work, which is specific to C. albicans, against the Nucleotide database returns significant hits first to C. albicans and then, with lower scores, to C. dubliniensis. When the database is switched to WGS, 38 BLASTn hits are returned: 35 C. albicans strains, two C. dubliniensis strains and the N. dairenensis 763_NDAI strain. To investigate a possible misidentification, we performed BLASTn searches against the NCBI databases using 763_NDAI sequences corresponding to several widely used phylogenetic markers as queries. For all of these marker sequences, the BLASTn analysis showed 99–100% identity to C. albicans strains in the nucleotide database, and 100% identity to the N. dairenensis strain 763_NDAI and to other C. albicans strains in the NCBI WGS database (Altschul et al., 1990), suggesting that the 763_NDAI strain may indeed be C. albicans (Table 2).

Table 2

BLASTn results for aligned regions of Naumovozyma dairenensis strain 763_NDAI against the Nucleotide database

Locus	NCBI Database	Species	Query Coverage	Identity
Partial rDNA	Nucleotide	C. albicans	100%	100%
Actin1	Nucleotide	C. albicans	100%	99%
RPB2	Nucleotide	C. albicans	100%	100%
TEF1α	Nucleotide	C. albicans	100%	99%

To confirm the results above with sequences from the whole genome of the putatively misidentified strain, we mapped raw reads from the 763_NDAI assembly to the reference genomes of the C. albicans strains SC5314 (van het Hoog et al., 2007) and WO‐1 (Butler et al., 2009), and to the reference genome of N. dairenensis CBS 421 (Gordon et al., 2011). Only 1.2% of the reads mapped to the N. dairenensis CBS 421 reference, while 98% and 96.7% of the reads were successfully mapped to the SC5314 and WO‐1 genomes, respectively. SNP calling against the C. albicans strains identified a larger divergence with SC5314 (8.71 SNPs/kb) and WO‐1 (8.75 SNPs/kb), with 4.01 and 4.06 homozygous SNPs/kb, respectively. In any case, this latter divergence value is within the range of previously reported differences among C. albicans strains of different clades, which show an average of 3.7 SNPs/kb and 4.26 homozygous SNPs/kb in pair‐wise comparisons (Hirakawa, Martinez, Sakthikumar, et al., 2015).
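
The read‐mapping comparison reported above lends itself to a simple consistency check. The following Python sketch is only an illustration of that reasoning: the mapping percentages are those reported in this section, whereas the function name and the 90% threshold are assumptions introduced here, not part of the original analysis.

```python
from typing import List, Optional, Tuple

# (species, strain, percentage of reads mapped), as reported in this section.
MAPPING: List[Tuple[str, str, float]] = [
    ("Naumovozyma dairenensis", "CBS 421", 1.2),
    ("Candida albicans", "SC5314", 98.0),
    ("Candida albicans", "WO-1", 96.7),
]


def best_supported_species(mapping: List[Tuple[str, str, float]],
                           min_pct: float = 90.0) -> Optional[str]:
    """Return the species of the reference with the highest mapping rate.

    min_pct is a hypothetical cut-off: if even the best reference falls
    below it, no species is considered supported and None is returned.
    """
    species, _strain, pct = max(mapping, key=lambda record: record[2])
    return species if pct >= min_pct else None


declared = "Naumovozyma dairenensis"
supported = best_supported_species(MAPPING)
mismatch = supported is not None and supported != declared
print(f"declared: {declared}; supported by mapping: {supported}; mismatch: {mismatch}")
```

Such a check presupposes that a reference genome of the true species is among the candidates; when no reference reaches the threshold, the result is inconclusive rather than a confirmation of the declared species.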

DISCUSSION

The BLASTn results presented in this work, as well as the comparative genomics analysis performed, point to a misidentification of the 763_NDAI strain, which seems to belong to the species C. albicans. It is worth mentioning that N. dairenensis has never been reported as a human pathogen, nor even as a human commensal. A search in PubMed Central returned 31 articles, none of which refers to N. dairenensis as a commensal or pathogen. The type strain CBS 421 was isolated from dried fruit, and other strains have been found on maize (Kurtzman et al., 2011). In contrast, C. albicans is one of the main commensal yeasts of humans (Mayer, Wilson, & Hube, 2013) and a major cause of human yeast infection (Brown, Denning, Gow, et al., 2012). After confirming the suspected misidentification, we set out to have it corrected in the record. We contacted the editors of the journal in which the paper containing this misidentification was published and presented our case. Upon further inspection of the identities of other strains in this paper, the journal found several other cases of misidentification and worked with the authors of the original article to correct them. A correction of the paper was later published (Roach, Burton, Lee, et al., 2017). However, the misidentified strain reported in this work was not included in the corrected list. Even when formally published, misidentifications can take a long while to be corrected in public databases. A recent example is the misidentification of the sequences of strain NRRL 62431 as Penicillium aurantiogriseum (Yang, Zhao, Barrero, et al., 2014), which was shown to be Penicillium expansum more than two years ago (Ballester, Marcet‐Houben, Levin, et al., 2015), yet was only recently corrected in the GenBank database.
Another example is the case of Geotrichum candidum strain 3C (Polev, Bobrov, Eneyskaya, & Kulminskaya, 2014) that was more than one year ago inferred to be a misidentified Pezizomycotina species (Shen, Zhou, Kominek, et al., 2016), and it has not yet been corrected in public databases. This genome information was used in subsequent studies (Borisova, Eneyskaya, Bobrov, et al., 2015; Hittinger, Rokas, Bai, et al., 2015), which highlights how misidentifications in databases are propagated. In a recent article, Nguyen and Boekhout (2017) raise the question of misidentifications in hybrid strains by reporting the case of incorrect nomenclature for Saccharomyces uvarum, which was for several years reduced to a variety of Saccharomyces bayanus. As the same authors suggest, and although in 2005 there was already an indication that these should be considered two different species (Nguyen & Gaillardin, 2005), in some databases this misidentification remains. Indeed, at the time of writing this article, in the NCBI Assembly database S. uvarum MCYC 623 and its spore clone 623‐6C appear as belonging to different species, as the first one is correctly labelled as S. uvarum (ASM16699v1; Kellis, Patterson, Endrizzi, Birren, & Lander, 2003) and the second is considered S. bayanus (ASM16703v1; Cliften, Sudarsanam, Desikan, et al., 2003). Whole genome sequencing data are currently flooding the databases and they undeniably provide a tremendous amount of valuable information. However, as in this case, the need for careful quality control of automated analyses and of curated databases is raised once more. Unfortunately, although misidentifications like the ones we presented here are rather easy to detect with comparative analyses, as shown, the errors are difficult to eliminate from the databases. Therefore, these errors can be propagated to several other studies, thus compromising their quality. 
To avoid such situations, it is important to include identification checkpoints in databases before making the sequences publicly available. For instance, in the case of WGS, the inspection of specific regions known to be good phylogenetic markers could help solve the problem. Here we would like to propose a general standard practice (see Box). Published literature discusses the advantages and limitations of using particular regions as phylogenetic markers for certain fungal groups (Capella‐Gutierrez, Kauff, & Gabaldón, 2014; Schoch, Seifert, Huhndorf, et al., 2012). However, alternatives have also been proposed; for Penicillium spp., for instance, cytochrome c oxidase subunit 1 and β‐tubulin can serve as alternatives to the rDNA region (Frisvad & Samson, 2004). Therefore, we believe that for the majority of cases this solution could avoid erroneous identifications such as the ones described here.

Fast and reliable identification of fungi and yeasts in WGS data.

To avoid misidentification of organisms, certain steps can be followed that will, in the majority of cases, ensure the correct identification of an unknown organism, assuming that the species has been described and that some genome information for it is available in public databases. Since the correction of entries usually takes a considerable amount of time, incorporating good phylogenetic markers into the standard practice for sequence submission to public databases will greatly help to reduce the number of misidentifications. For this procedure to be more efficient, the description of new species should also be accompanied by deposition of the sequences of the selected markers in the databases. As general guidelines, we suggest some checkpoints before acceptance of sequences:

1. The source of the strain is crucial. The habitat from which a strain is isolated can give a first indication of its identity. If not, the information can at least be used to cross‐check the result (see the Naumovozyma dairenensis–Candida albicans case). If a strain is available, its deposition in a public culture collection is essential.

2. The D1–D2 domains of the large subunit ribosomal DNA (LSU‐rDNA) or the internal transcribed spacers (ITS) are to be used in the majority of cases, as they are popular regions for identification and databases contain a considerable number of such sequences from a wide variety of organisms. Identification at the genus level is possible most of the time with a simple BLASTn search against the NCBI database.

3. When D1–D2 does not provide an unambiguous species identification, additional markers should be used. The literature offers alternatives to D1–D2 and ITS depending on the fungus or yeast species (Stielow, Lévesque, Seifert, et al., 2015; Vu, Groenewald, Szoke, et al., 2016), and the choice of the most appropriate marker usually depends on the genus of the organism. These alternatives are usually housekeeping genes and include, but are not limited to, actin1, DNA‐directed RNA polymerase II, translation elongation factor 1α and β‐tubulin. Recently, a set of four widespread marker genes able to phylogenetically resolve species relationships in dikaryotic fungi has been proposed (Capella‐Gutierrez et al., 2014).

We envision that an automated procedure to detect and compare such marker regions in a submitted sequence could be included in the submission process of public databases and used to detect potential misidentifications directly upon submission. A warning issued to the submitter or to database curators, coupled with a provisional halt of the submission, would allow clear‐cut errors to be corrected before the sequences are publicly available.
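
As a rough illustration of what such an automated check might look like, the Python sketch below flags marker loci whose best database hit belongs to a species other than the one declared by the submitter. It is a sketch under stated assumptions, not an existing submission tool: the `MarkerHit` record, the `conflicting_loci` function and the coverage and identity thresholds are all hypothetical. The example data are the values listed in Table 2.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MarkerHit:
    """Best database hit for one marker region of a submitted assembly."""
    locus: str
    top_hit_species: str
    query_coverage: float  # percent
    identity: float        # percent


def conflicting_loci(declared_species: str,
                     hits: List[MarkerHit],
                     min_coverage: float = 90.0,
                     min_identity: float = 97.0) -> List[str]:
    """Return loci whose high-confidence best hit contradicts the declared species.

    The thresholds are illustrative assumptions; suitable values would
    depend on the marker and on the taxonomic group.
    """
    return [
        hit.locus
        for hit in hits
        if hit.query_coverage >= min_coverage
        and hit.identity >= min_identity
        and hit.top_hit_species != declared_species
    ]


# Values from Table 2 (763_NDAI marker regions against the Nucleotide database).
TABLE_2_HITS = [
    MarkerHit("Partial rDNA", "Candida albicans", 100.0, 100.0),
    MarkerHit("Actin1", "Candida albicans", 100.0, 99.0),
    MarkerHit("RPB2", "Candida albicans", 100.0, 100.0),
    MarkerHit("TEF1α", "Candida albicans", 100.0, 99.0),
]

flagged = conflicting_loci("Naumovozyma dairenensis", TABLE_2_HITS)
if flagged:
    print("WARNING: declared species contradicted by markers:", ", ".join(flagged))
```

In a real submission pipeline, a non‐empty list would trigger the warning and provisional halt described above, leaving the final decision to the submitter or a curator.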

REFERENCES (32 in total)

1. Nuclear ribosomal internal transcribed spacer (ITS) region as a universal DNA barcode marker for Fungi.

Authors: Conrad L Schoch; Keith A Seifert; Sabine Huhndorf; Vincent Robert; John L Spouge; C André Levesque; Wen Chen
Journal: Proc Natl Acad Sci U S A Date: 2012-03-27 Impact factor: 11.205

2. Characterization of Saccharomyces uvarum (Beijerinck, 1898) and related hybrids: assessment of molecular markers that predict the parent and hybrid genomes and a proposal to name yeast hybrids.

Authors: Huu-Vang Nguyen; Teun Boekhout
Journal: FEMS Yeast Res Date: 2017-03-01 Impact factor: 2.796

3. Genome, Transcriptome, and Functional Analyses of Penicillium expansum Provide New Insights Into Secondary Metabolism and Pathogenicity.

Authors: Ana-Rosa Ballester; Marina Marcet-Houben; Elena Levin; Noa Sela; Cristina Selma-Lázaro; Lourdes Carmona; Michael Wisniewski; Samir Droby; Luis González-Candelas; Toni Gabaldón
Journal: Mol Plant Microbe Interact Date: 2015-03 Impact factor: 4.171

4. BEDTools: a flexible suite of utilities for comparing genomic features.

Authors: Aaron R Quinlan; Ira M Hall
Journal: Bioinformatics Date: 2010-01-28 Impact factor: 6.937

5. Integrative genomics viewer.

Authors: James T Robinson; Helga Thorvaldsdóttir; Wendy Winckler; Mitchell Guttman; Eric S Lander; Gad Getz; Jill P Mesirov
Journal: Nat Biotechnol Date: 2011-01 Impact factor: 54.908

6. Misidentification of genome assemblies in public databases: The case of Naumovozyma dairenensis and proposal of a protocol to correct misidentifications.

Authors: Aimilia A Stavrou; Verónica Mixão; Teun Boekhout; Toni Gabaldón
Journal: Yeast Date: 2018-02-22 Impact factor: 3.239

7. Evolution of pathogenicity and sexual reproduction in eight Candida genomes.

Authors: Geraldine Butler; Matthew D Rasmussen; Michael F Lin; Manuel A S Santos; Sharadha Sakthikumar; Carol A Munro; Esther Rheinbay; Manfred Grabherr; Anja Forche; Jennifer L Reedy; Ino Agrafioti; Martha B Arnaud; Steven Bates; Alistair J P Brown; Sascha Brunke; Maria C Costanzo; David A Fitzpatrick; Piet W J de Groot; David Harris; Lois L Hoyer; Bernhard Hube; Frans M Klis; Chinnappa Kodira; Nicola Lennard; Mary E Logue; Ronny Martin; Aaron M Neiman; Elissavet Nikolaou; Michael A Quail; Janet Quinn; Maria C Santos; Florian F Schmitzberger; Gavin Sherlock; Prachi Shah; Kevin A T Silverstein; Marek S Skrzypek; David Soll; Rodney Staggs; Ian Stansfield; Michael P H Stumpf; Peter E Sudbery; Thyagarajan Srikantha; Qiandong Zeng; Judith Berman; Matthew Berriman; Joseph Heitman; Neil A R Gow; Michael C Lorenz; Bruce W Birren; Manolis Kellis; Christina A Cuomo
Journal: Nature Date: 2009-06-04 Impact factor: 49.962

8. Sequencing, biochemical characterization, crystal structure and molecular dynamics of cellobiohydrolase Cel7A from Geotrichum candidum 3C.

Authors: Anna S Borisova; Elena V Eneyskaya; Kirill S Bobrov; Suvamay Jana; Anton Logachev; Dmitrii E Polev; Alla L Lapidus; Farid M Ibatullin; Umair Saleem; Mats Sandgren; Christina M Payne; Anna A Kulminskaya; Jerry Ståhlberg
Journal: FEBS J Date: 2015-10-08 Impact factor: 5.542

9. Assembly of the Candida albicans genome into sixteen supercontigs aligned on the eight chromosomes.

Authors: Marco van het Hoog; Timothy J Rast; Mikhail Martchenko; Suzanne Grindle; Daniel Dignard; Hervé Hogues; Christine Cuomo; Matthew Berriman; Stewart Scherer; B B Magee; Malcolm Whiteway; Hiroji Chibana; André Nantel; P T Magee
Journal: Genome Biol Date: 2007 Impact factor: 13.583

10. Reconstructing the Backbone of the Saccharomycotina Yeast Phylogeny Using Genome-Scale Data.

Authors: Xing-Xing Shen; Xiaofan Zhou; Jacek Kominek; Cletus P Kurtzman; Chris Todd Hittinger; Antonis Rokas
Journal: G3 (Bethesda) Date: 2016-12-07 Impact factor: 3.154
