Literature DB >> 26823635

One fungus, which genes? Development and assessment of universal primers for potential secondary fungal DNA barcodes.

J B Stielow¹, C A Lévesque², K A Seifert², W Meyer³, L Iriny³, D Smits¹, R Renfurm¹, G J M Verkley¹, M Groenewald¹, D Chaduli⁴, A Lomascolo⁵, S Welti⁶, L Lesage-Meessen⁴, A Favel⁵, A M S Al-Hatmi⁷, U Damm⁸, N Yilmaz⁹, J Houbraken¹, L Lombard¹, W Quaedvlieg¹, M Binder¹, L A I Vaas¹⁰, D Vu¹, A Yurkov¹¹, D Begerow¹², O Roehl¹², M Guerreiro¹³, A Fonseca¹³, K Samerpitak¹⁴, A D van Diepeningen¹, S Dolatabadi¹⁵, L F Moreno¹⁶, S Casaregola¹⁷, S Mallet¹⁷, N Jacques¹⁷, L Roscini¹⁸, E Egidi¹⁹, C Bizet²⁰, D Garcia-Hermoso²⁰, M P Martín²¹, S Deng²², J Z Groenewald¹, T Boekhout²³, Z W de Beer²⁴, I Barnes²⁵, T A Duong²⁵, M J Wingfield²⁴, G S de Hoog²⁶, P W Crous²⁷, C T Lewis², S Hambleton², T A A Moussa²⁸, H S Al-Zahrani²⁹, O A Almaghrabi²⁹, G Louis-Seize², R Assabgui², W McCormick², G Omer¹, K Dukik¹, G Cardinali¹⁸, U Eberhardt³⁰, M de Vries¹, V Robert¹.

Abstract

The aim of this study was to assess potential candidate gene regions and corresponding universal primer pairs as secondary DNA barcodes for the fungal kingdom, additional to ITS rDNA as primary barcode. Amplification efficiencies of 14 (partially) universal primer pairs targeting eight genetic markers were tested across > 1 500 species (1 931 strains or specimens) and the outcomes of almost twenty thousand (19 577) polymerase chain reactions were evaluated. We tested several well-known primer pairs that amplify: i) sections of the nuclear ribosomal RNA gene large subunit (D1-D2 domains of 26/28S); ii) the complete internal transcribed spacer region (ITS1/2); iii) partial β -tubulin II (TUB2); iv) γ-actin (ACT); v) translation elongation factor 1-α (TEF1α); and vi) the second largest subunit of RNA-polymerase II (partial RPB2, section 5-6). Their PCR efficiencies were compared with novel candidate primers corresponding to: i) the fungal-specific translation elongation factor 3 (TEF3); ii) a small ribosomal protein necessary for t-RNA docking; iii) the 60S L10 (L1) RP; iv) DNA topoisomerase I (TOPI); v) phosphoglycerate kinase (PGK); vi) hypothetical protein LNS2; and vii) alternative sections of TEF1α. Results showed that several gene sections are accessible to universal primers (or primers universal for phyla) yielding a single PCR-product. Barcode gap and multi-dimensional scaling analyses revealed that some of the tested candidate markers have universal properties providing adequate infra- and inter-specific variation that make them attractive barcodes for species identification. Among these gene sections, a novel high fidelity primer pair for TEF1α, already widely used as a phylogenetic marker in mycology, has potential as a supplementary DNA barcode with superior resolution to ITS. Both TOPI and PGK show promise for the Ascomycota, while TOPI and LNS2 are attractive for the Pucciniomycotina, for which universal primers for ribosomal subunits often fail.

Entities: CellLine Chemical Disease Gene Mutation Species

Keywords: DNA barcoding; ITS supplement; molecular taxonomy; phylogeny; species identification; universal primers

Year: 2015 PMID： 26823635 PMCID： PMC4713107 DOI： 10.3767/003158515X689135

Source DB: PubMed Journal: Persoonia ISSN： 0031-5850 Impact factor: 11.051

INTRODUCTION

Identification and classification of eukaryotes increasingly depends on DNA sequences of standardised genetic markers, a concept known as DNA barcoding (Hebert et al. 2003a, b, Hebert & Gregory 2005, Meyer & Paulay 2005, Schindel & Miller 2005, Schoch et al. 2012). An intense debate is ongoing concerning whether the identification of organisms of unresolved alpha taxonomy is amenable to DNA barcoding, because in silico-based identification requires gene sequences that accurately reflect natural classifications (Eberhardt 2010, Schlick-Steiner et al. 2010). If this prior condition is not adequately fulfilled a posteriori nesting of ‘unknowns’ among known fungal taxa is impossible. DNA barcoding has evolved, despite its unresolved theoretical and taxonomic issues, as a standard procedure in organism identification among various disciplines of modern biology (Tautz et al. 2003, Shokralla et al. 2014, Stockinger et al. 2014, Stoeckle & Thaler 2014). The milestone paper on amplification and direct sequencing of fungal ribosomal RNA genes for phylogenetics by White et al. (1990, now > 15 000 citations), described primers that allowed simple, rapid PCR-based amplification and Sanger sequencing of sections of the fungal rDNA operon. Inspired by the efforts to resolve bacterial taxonomy using 16S sequences summarised by Woese (1987) and further elaborated by Weisburg et al. (1991) and Stackebrandt & Goebel (1994), the White et al. paper generated an explosion of phylogenetic research that now dominates fungal taxonomy. The modern concept of fungal DNA barcoding does not differ substantially from approaches proposed more than two decades ago. Sanger DNA sequencing of nuclear rDNA domains remains the most widely accepted approach in molecular mycology to classify and to identify unknown fungal specimens or cultures. The first comprehensive database of fungal DNA barcodes was established for yeasts (Kurtzman & Robnett 1998, Fell et al. 2000, Scorzetti et al. 2002) and revolutionised the approach to species recognition in that group. Where once a single taxonomist could identify only a handful of yeast strains in a week because of the plethora of physiological tests required, now it is possible to process hundreds. The result has been a rapid increase in the number of recognised yeast species, and a wealth of ecological data (Kurtzman et al. 2015). The broad acceptance of DNA barcoding today relies on public repositories such as the INSDC (http://www.insdc.org/), which accessions hundreds of thousands of sequence entries (Schoch et al. 2012, 2014). In a concerted action with the Consortium for the Barcode Of Life (CBOL), the Fungal Barcoding Consortium (Schoch et al. 2012) ratified the ITS as the universal DNA barcode for the fungal kingdom using the same gene section proposed by White et al. (1990) more than 20 years earlier. This study (Schoch et al. 2012) was widely accepted by the community and already has been highly cited > 850 times. Abandoning dual nomenclature of pleomorphic fungi in favour of a single name system in the nomenclatural code was a radical change (Gams & Jaklitsch 2011, Taylor 2011), partly enabled by the possibility to unequivocally demonstrate phylogenetic relationships between asexual and sexual states that were formerly classified and named separately in parallel systems under the provisions of the former Article 59 (Guadet et al. 1989, Bruns et al. 1991, Berbee & Taylor 1992, Reynolds & Taylor 1993). This change to the botanical code was driven by DNA data, ITS and other rDNA sequences in particular, and demonstrated the impact of DNA sequencing on our understanding of genetic diversity, species identification, delimitation and nomenclatural changes in mycology. Despite the strength and impact of rDNA ITS as the sanctioned universal fungal DNA barcode, its resolution of higher taxonomic level relationships is inferior to many protein-coding genes such as RPB1, RPB2 or TUB2 (Nilsson et al. 2006, Seifert 2009, Begerow et al. 2010, Schoch et al. 2014). However, as a barcode ITS outperforms alternative loci because of its highly robust PCR amplification fidelity (> 90 % success rates), a Probability of Correct Identification (PCI) of about 70 %, and applicability to a wide range of sample conditions. Although many alternatives to ITS were considered by the mycological community, its sanctioning as the primary barcode marker rested on its practicality and reliability, not on the highly desired ‘resolution power’ (Schoch et al. 2012, 2014). Many mycologists would prefer one or several universal, but phylogenetically informative loci as barcodes, with higher species resolution power than is feasible with ITS. The ideal genetic marker would have high inter- and low intra-species sequence divergence, i.e. a discrete barcode gap (Schoch et al. 2012, Samerpitak et al. 2015), and accurately reflect higher-level taxonomic affiliations. It could then serve as a substitute for ITS, or as a supplementary, secondary or tertiary marker in concert with ITS as the primary barcode. Molecular taxonomists thus continue to search for genes conserved enough to allow reliable priming but sufficiently variable to yield highly resolved and well-supported phylograms and useful barcode gaps (Schmitt et al. 2009, Feau et al. 2011, Lewis et al. 2011, Robert et al. 2011, Walker et al. 2012, Capella-Gutierrez et al. 2014). Although the concepts of phylogenetic markers and DNA barcodes differ in principle, they overlap in application. Their ideal characteristics include adequate species resolution, ease of amplification, absence of extreme length variation, the presence of only single copies, and low intra-species variability. Numerous attempts were made to identify loci with suitable primary barcode characteristics. These include efforts targeting: i) Cox1 (or CO1; Seifert et al. 2007, Dentinger et al. 2011, Robideau et al. 2011), the primary barcode for animals ratified by CBOL (Hebert et al. 2003a, b, Schindel & Miller 2005); ii) the AFTOL (http://aftol.org/about.php) genes (e.g. RPB1, RPB2, nucLSU, nucSSU, mtSSU, TEF1α and mtATP6), partially used by James et al. (2006) and evaluated for their barcoding potential by Schoch et al. (2012); iii) non-universal regions such as ND6 (hypothetical protein), CAL (Calmodulin), ACT (Gamma Actin) or TUB2 (Beta Tubulin 2) (Carbone & Kohn 1999, Aveskamp et al. 2009, Lee & Young 2009, Verkley et al. 2014); iv) the minichromosome maintenance complex MCM7 (a DNA helicase) and Tsr1 (a pre-mRNA processing protein homolog), the first loci extracted from genome-based computational predictions (Aguileta et al. 2008, Schmitt et al. 2009) and FG1093 and MS204, selected from a screen of 25 single copy protein coding genes (Walker et al. 2012). The relatively small number of barcode markers now available strongly reflect the past ‘poor man’s approach’ of data-driven gene selection (Eberhardt 2010, Capella-Gutierrez et al. 2014). As emphasised by Eberhardt (2010), group or clade-specific questions continue to require different, more specialised species identification solutions (Balajee et al. 2009, Lumbsch & Leavitt 2011, Gao & Zhang 2013, Heinrichs et al. 2012). This persists because attempts to identify alternatives or substitutes for rDNA sequences that meet the requirements of a single ‘primary barcode’ have not yet succeeded. Discovery of the ‘golden bullet barcode’ seems more unlikely than ever. Inconsistencies between the results of different computational analysis and their identified ‘best’ genes (Tautz et al. 2003, Avise 2004, Feau et al. 2011, Lewis et al. 2011, Robert et al. 2011, Capella-Gutierrez et al. 2014) remain discouraging. Most mycologists agree that a compromise solution of combining ITS with secondary or tertiary, ‘group-specific’ DNA barcode(s) is the most realistic solution. This would combine the universal primer fidelity and high taxon coverage possible with ITS, with equally robust primers for secondary barcodes specific to the group or taxon of interest, enhancing precision of species identification. To take an analogy from medical diagnostics, this is an iterative diagnostic process, where the results of a primary analysis (ITS sequencing) can be used to determine what secondary analyses should be performed (Irinyi et al. 2015a, b). In adopting additional barcodes, the obvious absence of complete reference data is a serious problem, and was one of the main arguments presented against considering Cox1 a primary barcode for fungi (other problems eventually emerged, see e.g. Gilmore et al. 2009). To be truly effective for specialised and non-specialised applications, such as phytopathology (plant diseases control and quarantine), clinical applications (diagnostics, disease control, epidemiology) or environmental studies (ecology, conservation, metagenomics), formally adopted barcodes will need to be highly predictive, and thus comprehensive reference datasets will need to be developed. Ideally, such data should leave almost no margin for error, reduce false-positives even in the absence of a perfect match, and allow for additive compensation via the primary barcode (ITS) when unknowns are likely to occur and reference data for rare taxa is scarce, as is the case in fungal ecology (Tedersoo et al. 2008, Buée et al. 2009, Pawlowski et al. 2012). The selection process for novel high fidelity ‘phylogenetic markers’ and ‘DNA barcodes’ is mostly determined by laboratory practicalities. In silico analyses of complete genomes (Capella-Gutierrez et al. 2014) to identify combinations of genes most informative to establish phylogenetic relationships are conceptually similar. Current marker selection procedures are biased towards ranking single genes, rather than combinations of genes, reducing resolution power. The question arises which approach is optimal for identifying potential secondary or tertiary barcodes and how selection procedures, formerly achieved by ‘trial-and-error’, can be optimised to avoid arbitrary selections? At present, the growth of the number of complete fungal genome sequences and the steady increase of comparative genomic tools allow an unprecedented view on variability among genes and species, on sequence homology and gene synteny. The detection of orthologs and paralogs is crucial for inference of robust molecular classifications. Understanding should go beyond phylogenetics and determination of species limits (Grigoriev et al. 2014, Nagy et al. 2014). A key concept in the present study was the in silico selection process for alternative candidate barcodes, following strategies described by Robert et al. (2011) and Lewis et al. (2011) to search for gene regions in complete fungal genomes under several optimality criteria. Lewis et al. (2011) inferred suitable candidate genes using translated protein sequences (Pfam domains), whereas Robert et al. (2011) focused entirely on nucleic acids. Both studies identified gene sections and identified novel primer pairs that functioned in silico for the genomes available at the time, either as universal fungal primers, or as primers targeting phyla or classes. In our study, the primers were rigorously tested and later distributed for independent verification to contributing laboratories. Departing from the study of Robert et al. (2011), we tested and compared amplification efficiency of two nuclear ribosomal regions (ITS and LSU, D1–D2 domains of 26/28S), the 5’ primed end of β -tubulin2 (TUB2) and γ-actin (ACT), the section ‘6–7’ of the second largest subunit of the RNA-polymerase II gene (RPB2), the commonly used intermediate section of translation elongation factor 1-α (TEF1α) (Sasikumar et al. 2012) corresponding to the section 983–1567 bp (in the rust Puccinia graminis), two novel universal high fidelity TEF1α primer alternatives covering approximately the same region, three sections of the newly identified candidate region and fungus-specific gene, translation elongation factor 3 (TEF3) (Ypma-Wong et al. 1992, Belfield et al. 1995, Belfield & Tuite 2006, Greganova et al. 2011), and one of the small ribosomal proteins required for tRNA transfer, the 60S L10 (L1) (Brodersen & Nissen 2005). From the study of Lewis et al. (2011) based on Pfam domains, we tested ITS against the seven genes that could theoretically amplify a wide range of fungal taxa with a single primer pair, namely, phosphoglycerate kinase (PGK, PF00162), DNA topoisomerase I (TOPI, PF02919), Lipin/Ned1/Smp2 (LNS2, PF08235), Indole-3-glycerol phosphate synthase (IGPS, PF00218), Phosphoribosylaminoimidazole carboxylase PurE domain (PurE, PF00731), Peptide methionine sulphoxide reductase (Msr, PF01625) and Vacuolar ATPase (vATP, PF01992). Overall, universal primer performance was recorded for almost 20 000 individual PCR reactions, and selected sequence data were evaluated using a ‘distance’ – rather than a ‘character’ – matrix approach to accommodate a puristic barcoding concept, since the ‘barcoding-gap’ is traditionally calculated from Kimura-2-parametric distance model (K2P) (Kimura 1980). Our taxon selection covers several major lineages of the fungal kingdom and represents hundreds of species of economic, phytopathological and clinical importance among the Agaricomycotina, Pezizomycotina, Pucciniomycotina, Saccharomycotina and Ustilaginomycotina.

MATERIAL AND METHODS

Fungal cultures and specimens

For the barcode primer search by the complete genome approach of Robert et al. (2011), axenic cultures preserved and deposited in several biological resource centres, private collections and fungaria were used as sources for DNA. Tested fungal cultures are circumscribed according to their higher level taxonomic affiliation, main taxa (species level), quantity per ‘dataset’ source and corresponding collaborator in Table 1. A detailed taxon list for each ‘dataset’ is available upon request. Cultures were either activated from lyophilised or cryopreserved material and inoculated on various media, such as oatmeal (OA) and 2 % malt extract (MEA, Oxoid) agars, prepared according to Crous et al. (2009). Alternatively, preserved cultures were directly used for DNA isolation from lyophilised or cryopreserved stocks. The selected ‘datasets’ covered several important taxa within the Agaricomycotina, Pezizomycotina, Pucciniomycotina, Saccharomycotina and Ustilaginomycotina. In total 1 931 fungal strains and fungarium specimens representing > 1 500 fungal species (exact taxonomic status for some strains was not precisely determined ‘spp.’) were selected for benchmarking newly designed against published primers. For the barcode primer search by the Pfam domain approach of Lewis et al. (2011), fungal cultures were obtained from a number of laboratories at the Eastern Cereal and Oilseed Research Centre, ECORC, of Agriculture and Agri-Food Canada, CEF, Ottawa. They were supplied as fungal cultures from which DNA was extracted or as already purified DNA samples.

Table 1

Overview on datasets and selected taxa.

Dataset	Family	Main taxa	Quantity	Source, Collaborator
Penicillium I	Trichocomaceae	Penicillium spp.	96	CBS, Houbraken
Penicillium II	Trichocomaceae	Penicillium spp., Talaromyces spp.	96	CBS, Yilmaz
Scedosporium I	Microascaceae	Scedosporium apiospermum	30	Inst. Pasteur, Hermoso
Scedosporium II	Microascaceae	Scedosporium spp.	22	Westmead Hospital, Meyer & Irinyi
Onygenales	Arthrodermataceae and Ajellomycetaceae	Trichophyton, Epidermophyton, Arthroderma, Microsporum, Chrysosporium spp.	180	CBS, Dukik, Moreno, De Hoog, Stielow
Ascomycetous yeast I	Saccharomycetaceae	Candida albicans, C. catenulata, C. dubliniensis, C. glabrata, Kodamaea spp., Clavispora spp.	207	Westmead Hospital, Meyer & Irinyi
Ascomycetous yeast II	Saccharomycetaceae	Candida albicans, C. tropicalis, C. orthopsilosis, C. metapsilosis	96	CBS, Groenewald, Boekhout
Ascomycetous yeast III	Saccharomycetaceae	Debaromyces, Millerozyma spp.	24	CIRM-Levures, INRA, Casaregola, Mallet & Jacques
Ascomycetous yeast IV	Saccharomycetaceae	Debaromyces, Millerozyma, Saccharomyces spp.	83	Microbiology Perugia, Cardinali & Roscini
Colletotrichum	Glomerellaceae	Colletotrichum spp.	50	Senkenberg Museum Goerlitz, Damm
Hypocreales	Nectriaceae	Cylindrocarpon, Fusarium, Nectria spp.	96	CBS, Lombard
Fusarium	Nectriaceae	Fusarium fujikuroi sensu stricto	96	CBS, Al-Hatmi, Van Diepeningen
Ochroconis	Sympoventuriaceae	Ochroconis spp.	50	CBS, Samerpitak
Ceratocystis	Ceratocystidaceae	Ceratocystis spp.	21	FABI, Wingfield, De Beer, Barnes & Duong
Coniothyrium	Montagnulaceae	Coniothyrium spp.	96	CBS, Verkley & Stielow
Mycosphaerellaceae	Mycosphaerellaceae	Mycosphaerella, Septoria spp.	45	CBS, Quaedvlieg
Teratosphaeriaceae	Teratosphaeriaceae	Teratosphaeria spp. sensu lato	90	CBS, Egidi, Binder & Quaedvlieg
Polyporaceae	Polyporaceae	Polyporus, Trametes spp.	20	INRA, Chaduli, Lomascolo, Welti & Lesage-Meessen
Russula	Russulaceae	Russula spp.	47	Museum für Naturkunde Stuttgart, Eberhardt
Gasteromycetes	Sclerodermataceae, Phallaceae and Diplocystaceae	Pisolithus, Phallus and Astraeus spp.	32	Royal Botanic Garden Madrid, Martin
Basidiomycetous yeast I	Tremellaceae	Cryptococcus, Rhodotorula spp.	98	Westmead Hospital, Meyer & Irinyi
Basidiomycetous yeast II	Tremellaceae sensu lato	Cryptococcus, Kwoniella, Bullera spp.	174	DSMZ & Ruhr University Bochum, Yrukov & Begerow
Multi-Phyla	Chytridiaceae, Mortierellaceae, Nectriaceae,	Rhizoclosmatium, Mortierella spp., Fusarium spp., Drechslera spp., Cadophora,	182	Agriculture and Agri-Food Canada (AAFC), Ottawa
	Pleosporaceae, Pucciniaceae, Rhizophydiaceae,	Pyrenophora spp., Ulocladium, Epicoccum, Dendryphion, Alternaria, Puccinia spp.,
	Tilletiaceae, Wallemiaceae and 35 more	Rhizophydium spp., Tilletia spp., Wallemia spp. and 40 more

Standard molecular procedures

For the primers inferred from the complete genome through the approach by Robert et al. (2011), total genomic DNA was extracted from living cultures, cells preserved in liquid nitrogen or in lyophilisation and from dried fungarium specimens (Agaricomycotina only) using different variations of one-by-one or high-throughput 96-well plate DNA extraction techniques, routinely used by the respective collaborating laboratories (Ferrer et al. 2001, Ivanova et al. 2006, Yurkov et al. 2012, 2015, Feng et al. 2013, Verkley et al. 2014). DNA extraction and PCR protocols differed between participating laboratories. Primers and amplification conditions used varied between laboratories, an example of the ones used at the CBS is provided. PCR reactions for amplification of the ITS barcode employed primers ITS5/ITS1/ITS1F and ITS4 were performed under standard or semi-nested conditions in 12.5 μL reactions (the CBS-KNAW barcoding lab protocol) containing 2.5 μL purified DNA, 1.25 μL PCR buffer (Takara, Japan, incl. 2.5 mM MgCl2), 1 μL dNTPs (1 mM stock; Takara, Japan), 0.6 μL v/v DMSO (Sigma, Netherlands), forward-reverse primer 0.25 μL each (10 mM stock), 0.06 μL (5 U) Takara HS Taq polymerase, 7.19 μL MilliQ water (White et al. 1990, Stielow et al. 2012, Yurkov et al. 2012). PCR conditions for amplifying partial LSU rDNA using primers LR0R and LR5 differed only by their annealing temperature (55 °C instead of 60 °C) and increased cycle extension time (90 s per cycle). Amplification of partial γ-actin (ACT), covering the more variable 5’-end including two small introns, and partial β -tubulin 2 (TUB2), covering the variable 5’-end with up to four small introns, followed the protocol of Aveskamp et al. (2009) and Carbone & Kohn (1999) using the primers Btub2Fd and Btub4Rd, and ACT-512F, ACT-783R, respectively. TEF1α and RPB2 were amplified following the protocols of Rehner & Buckley (2005) and Liu et al. (1999), respectively (Table 2). PCR products were directly purified using FastAP thermosensitive alkaline phosphatase and shrimp alkaline phosphatase (Fermentas, Thermo Fisher Scientific). Cycle-sequencing reactions were set up using ABI BigDye Terminator v. 3.1 Cycle Sequencing kit (Thermo Fisher Scientific), with the manufacturers’ protocol modified by using a quarter of the recommended volumes, followed by bidirectional sequencing with a 3730xl DNA Analyser (Thermo Fisher Scientific). Sequences were archived, bidirectional reads assembled and manually corrected for sequencing artefacts using BioloMICS software v. 8.0 (www.bio-aware.com) (Vu et al. 2012). Edited sequences were exported to and aligned with MAFFT v. 7.0 (Katoh et al. 2005) and further corrected for indels and SNPs (single nucleotide polymorphisms) by replacing respective positions with ambiguity code letters.

Table 2

Benchmarking primers used to assess performance and versatility of newly designed primers.

Locus	Primer	Oligo nucleotides (5’-3’)	Reference
ITS	ITS5	GGAAGTAAAAGTCGTAACAAGG	Ward & Adams (1998)
	ITS4	TCCTCCGCTTATTGATATGC	White et al. (1990)
	ITS1	TCC GTA GGT GAA CCT GCG G	White et al. (1990)
	ITS1-F	CTT GGT CAT TTA GAG GAA GTA A	Gardes & Bruns (1993)
LSU	LROR	ACCCGCTGAACTTAAGC	Vilgalys & Hester (1990)
	LR5	TCCTGAGGGAAACTTCG
TUB2	Btub2Fd	GTBCACCTYCARACCGGYCARTG	Woudenberg et al. (2009)
	Btub4Rd	CCRGAYTGRCCRAARACRAAGTTGTC
	Bsens	ATCACWCACTCICTIGGTGGTGG	Lesage-Meessen et al. (2011)
	Brev	CATGAAGAARTGIAGACGIGGG
ACT	ACT512f	ATGTGCAAGGCCGGTTTCG	Carbone & Kohn (1999)
	ACT783r	TACGAGTCCTTCTGGCCCAT
	CA5R	GTGAACAATGGATGGACCAGATTCGTCG	Daniel et al. (2001), Daniel & Meyer (2003)
	CA14	AACTGGGATGACATGGAGAAGATCTGGC
RPB2	fRPB2-5f	GAY GAY MGW GAT CAY TTY GG	Liu et al. (1999), Binder & Hibbett (Clark University website)
	fRPB2-7cF	ATG GGY AAR CAA GCY ATG GG
	fRPB2-7cR	CCC ATR GCT TGY TTR CCC AT
TEF1α	EF1-983F	GCY CCY GGH CAY CGT GAY TTY AT	Rehner & Buckley (2005)
	EF1-1567R	ACH GTR CCR ATA CCA CCR ATC TT

For the primers inferred from the complete proteome through the Pfam approach by Lewis et al. (2011), genomic DNA from fungal cultures was extracted using the OmniPrep™ Genomic DNA Extraction Kit (G-Biosciences, St. Louis, Missouri). Fungal tissue was ground into a fine powder in a mortar and pestle with liquid nitrogen and stored at -80 °C. Approximately 50 mg of ground tissue was placed in a 1.7 mL microfuge tube, resuspended in 250 μL of lysis buffer and vortexed for several seconds. An additional 250 μL of lysis buffer was added to the resuspension and the tube was incubated for 15 min at 55–60 °C without the addition of Proteinase K. The samples were cooled to room temperature and 200 μL of chloroform was added to the tube. The tube was mixed by inversion several times and then centrifuged for 10 min at 14 000 ×g. The upper aqueous phase was removed to a new microfuge tube and 100 μL of precipitation solution was added. If no white precipitate was produced an additional 50 μL of precipitation solution was added to the tube. The white precipitate was pelleted by centrifugation and the supernatant was moved to a fresh tube. The genomic DNA was precipitated by the addition of 500 μL of isopropanol to the supernatant and inversion of the tube several times. The genomic DNA was pelleted by centrifugation and washed with 700 μL of 70 % ethanol. The ethanol was decanted and the pellet was air dried for 15 s prior to resuspension in 50 μL of TE Buffer. DNA concentration was determined using the Qubit® 2.0 Fluorometer (Life Technologies, Burlington, Canada) and working solutions were prepared at a concentration of 0.05 ng/μL. PCR was carried out on 0.1 ng of genomic DNA in a total volume of 20 μL using 0.2 mM dNTPs, 0.5 μM of each primer, 1× of Titanium Taq Buffer and Titanium Taq DNA Polymerase (Clontech) using an Eppendorf GradientS thermal cycler. For the ITS primers, an initial denaturation at 95 °C for 3 min was followed by 40 cycles at the following conditions: 30 s at 95 °C, 45 s at 58 °C and 2 min at 72 °C. A final extension at 72 °C for 8 min completed the PCR. For the remaining primers, Touchdown PCR was performed where an initial denaturation at 95 °C for 5 min was followed by 10 cycles of 45 s at 95 °C, 45 s starting at 68 °C and dropping by 1 °C per cycle until a temperature of 58 °C was reached and a 1 min extension at 72 °C. The initial 10 cycles were then followed by 35 cycles of 45 s at 95 °C, 45 s at 58 °C and 1 min at 72 °C. A final extension at 72 °C for 5 min completed the PCR.

In silico selection of gene and protein regions and initial seed primers

Two different genome-mining strategies were employed to identify potential new barcode markers and different strategies were used to develop them further and for validation. For the DNA-based approach, complete genome sequences of 74 fungi were downloaded from public repositories such as Broad, Genoscope and NCBI in 2011, covering most major lineages of the fungal kingdom. The initial approach to find ideal gene regions was described in Robert et al. (2011) as the ‘Ideal Locus Method’ (ILM). An ideal locus is a gene or gene region that would provide a phylogram possibly close to a ‘whole-genome phylogram’ (a phylogeny inferred from single copy orthologous genes). A ‘Best Pair of Primers Method’ (BPPM) was devised to identify short conserved sections at < 1 Kb apart that could serve as kernels to construct forward and reverse primers for species or selected taxonomic groups. Suitability as barcode candidate was assessed by calculating Pearson coefficients against the super matrix, built from overlapping orthologs (Kuramae et al. 2007). PCR primers are normally 18–24 bp long, but in practice only the 3’ end fit is critical for PCR success. Hence, an alternative ‘Short Primer Method’ (SPM) was developed as follows: i) listing all possible 12 bp primers as ‘Length 12’ (maximum to limit calculations); ii) searching for these 12-mer primers in all available fungal genomes, which is considerably faster than searching for primer pairs; iii) maintaining primers with sufficient quality in terms of PCR applicability and wide distribution. Factors in the equation are: F1 = number of most frequent nucleotides / size of the sequence; F2 = comparison of the sequence with itself, with an offset of three nucleotides, this factor is high for repetitive triplets (e.g. ACCACCACC…n); F3 = comparison of the sequence with its reverse complement, with an offset greater than half the sequence size (avoidance of hairpins). All factors (Fn) take a value of 0.25 for a random sequence, and a value of 1.0 for a poor sequence. Thus, the factors F2 and F3 were rescaled with the following equation: F = (F – 0.25) / 0.75, the final equation is given by: Quality = 1.0 – max (F1, F2, F3); iv) using these best primers to search for possible primer pairs; v) using the best primer pairs to extract the intervening DNA sequence. The target length of the intermediate sequence was set to 200 to 1 000 nucleotides, thresholds chosen to ensure a minimum variability and allow ease of amplification and Sanger sequencing. The formal DNA standard recommends amplicons of about 700 bp, reflecting the technical constraints of the Sanger sequencing prevalent when the standard was implemented; vi) comparing the in silico sequences and building distance matrices and trees to compare with a reference matrix and tree. This last step ensures that the selected DNA region produces a relatively coherent phylogeny compared to the reference matrix and that it would also be suitable for discriminating closely related species. The in silico design, execution and publication of the resulting primers based on identification of protein families (see Lewis et al. 2011) was done concurrently with the DNA-based method described above. These primers were extensively tested only at the Agriculture & Agri-Food Canada laboratory on smaller fungal test sets by Levesque, Seifert, Hambleton, Lewis and McCormick, with additional testing of primers done at the CBS-KNAW Fungal Biodiversity Centre. The in silico pipeline for identifying suitable protein sequences for primer design is briefly described below. The approach was inspired by the CARMA algorithm for taxonomic identification of metagenomic sequences (Krause et al. 2008). CARMA matches short environmental gene fragments to Pfam protein families and constructs a taxonomic profile for the sample. The objective is to assign translated input nucleotide sequences to Pfam accessions that can identify single copy gene regions in source organisms and then design degenerate primers from the alignment of these putatively orthologous sequences. This requires pre-processing the Pfam dataset, translating and assigning DNA sequences to Pfam groups, and adding sequences to the reference Pfam alignments. The CARMA pipeline was adapted for use in this project, although the original intent and our application are quite different. Data were processed in two stages. The first stage translates input sequences and assigns them to Pfam accessions; the second stage prepares alignments and attempts primer design for the selected set of taxa. Adding additional or updating data for existing organisms requires repetition of the first stage. Restricting the analysis to a taxonomic subset of the data requires that the final stage be repeated for the selected organisms. The pipeline requires up-to-date versions of the Pfam-A and NCBI taxonomy datasets; if the Pfam dataset is updated, the first stage of processing must be repeated. The input datasets are placed in a single folder for processing, with each organism in a unique directory named by ‘genus_species_strain’. The pipeline first processes source nucleotide sequences to identify Pfam domains within each sequence and creates clusters of conserved domains. Sequences are then translated into protein sequences using the orientation and frame from a translated BLAST search against the Pfam sequences. Each translated sequence is then processed to identify the region corresponding to the matched protein family and the protein sequence is written to a file. This functionality is as originally implemented in the CARMA pipeline. The second stage of the pipeline involves screening the Pfam matches for a user-defined subset of organisms to identify families that contain a single match for each of the selected organisms. Such families are considered to represent conserved, single copy gene regions. Next, these putatively orthologous sequences are added to the corresponding Pfam multiple sequence alignment. Adding the sequences to an existing curated alignment results in a superior alignment to de novo sequence alignment. Original sequences are then removed from the reference alignments so that only targeted organisms remain and the resulting alignment is screened for conserved blocks for primer design. Conserved regions flanking variable sites are identified and only alignments with two or more conserved blocks are selected so that pairs of forward and reverse primers can be sought in the separate blocks. The resulting primers were further screened for protein families containing primer pairs with additional desirable characteristics, such as optimum amplicon length.

Final primer design

Two different genome and proteome mining strategies were employed to identify the initial potential new barcode markers (Lewis et al. 2011, Robert et al. 2011) and different strategies were also used to develop them further and for validation. The selected gene regions by the complete genome approach of Robert et al. (2011) corresponding to a total of 54 alignments (available on request) were manually re-annotated using n- and x-BLAST and alignments resulting in identical annotations were realigned and redundant alignments discarded. These seed primers are given in Table 3. To design PCR primers, non-redundant alignments were imported into BioEdit v. 7.0.52 and visually screened for the most conserved sites to identify suitable primer binding sites. Priority was given to stable 3’ prime ends with at least eight nucleotide binding sites and the least possible amount of degeneracy. For several primers, ‘N’s were replaced with an inosine, ‘I’s, to improve binding stability and to reduce the quantity of primers in the actual synthesized wobble pool. Individual primer pairs are discussed in the results section.

Table 3

Initial seed primers from ‘nucleic acid based’ computational predictions (Robert et al. 2011).

Alignment group	Species in alignment (Al)	Primers (forward-reverse)	Functional annotation
1	51	acaagcgtttct-catcaagttcca	Hypothetical protein
2	41	acatggagaaga-catcaaggagaa	Gamma actin
3	44	accttcttgatg-atgttcttgatg	Translation elongation factor 1α
3b	46	caagaacatgat-catcaagaaggt	Translation elongation factor 1α
4	38	agtacttgtagg-cttggccttgta	60S ribosomal protein L15b
5	51	ggaacttgatgg-agaaacgcttgt	60S ribosomal protein L15b
6	45	ggtatcaccatc-caacaagatgga	Translation elongation factor 1α
7	45	tccatcttgttg-gatggtgatacc	Translation elongation factor 1α
8	36	gtccatcttgtt-tacttgaaggaa	Translation elongation factor 1α
9	56	gttcttggagtc-gtccatcttgtt	Translation elongation factor 1α
10	56	aacaagatggac-ctccaagaacga	Translation elongation factor 1α
11	41	aacaagatggac-tcaccactgaag	Translation elongation factor 1α
11b	42	tggtatctccaa-aacgtcaagaac	Translation elongation factor 1α
12	58	caacaagatgga-ctccaagaacga	Translation elongation factor 1α
13	42	cacttcttcatg-atggacgagatg	60S ribosomal protein L15b
14	37	catatgcttgtc-gactcgtcatct	Putative CD box sno-RNA protein,putative protein S6 kinase
15	43	cttcagtggtga-ccatcttgttga	Translation elongation factor 1α
16	51	gaacttgatggt-agaaacgcttgt	Hypothetical protein
17	36	gttccttcaagt-caacaagatgga	Hypothetical protein
18	36	tccatcttgttg-acttgaaggaac	Translation elongation factor 1α
19	48	tcttgacgttga-gtccatcttgtt	Translation elongation factor 1α
20	48	ttcttgacgttg-gtccatcttgtt	Translation elongation factor 1α
21	39	tcgttcttggag-tcttgatgaagt	Translation elongation factor 1α
21b	53	ttcttggagtca-tccatcttgttg	Translation elongation factor 1α
22	45	ttcttgacgttg-catgttcttgat	Translation elongation factor 1α
23	37	ttccttcaagta-caacaagatgga	Translation elongation factor 1α
24	45	ttcatcaagaac-tcaacgtcaaga	Translation elongation factor 1α
25	41	ttcagtggtgac-atcatgttcttg	Translation elongation factor 1α
26	39	tccttgatttcg-cttcatgacctt	Translation elongation factor 1α
27	37	tcaagaaggtcg-ctccaagaacga	Translation elongation factor 1α
28	46	tcaagaacatga-caacgtcaagaa	Translation elongation factor 1α
29	40	caacaagatgga-aagttcatcaag	Translation elongation factor 1α
30	47	gttcttgacgtt-gtccatcttgtt	Translation elongation factor 1α
31	48	aacaagatggac-caacgtcaagaa	Translation elongation factor 1α
32	39	gacttgatgaac-ccatcttgttga	Translation elongation factor 1α
33	45	catcaagaacat-tcaacgtcaaga	Translation elongation factor 1α
34	52	catcaagaacat-gactccaagaac	Translation elongation factor 1α
35	40	catcaagaaggt-gactccaagaac	Translation elongation factor 1α
36	40	atcaagaacatg-tcaccactgaag	Translation elongation factor 1α
37	51	caagcgtttctc-ccatcaagttcc	Translation elongation factor 1α
38	39	atctccaaggat-ctccaagaacga	Translation elongation factor 1α
39	52	atcaagaacatg-gactccaagaac	Translation elongation factor 1α
40	45	atcaagaacatg-tcaacgtcaaga	Translation elongation factor 1α
41	40	acttgatgaact-tccatcttgttg	Translation elongation factor 1α
42	37	acttcatcaaga-tcaacgtcaaga	Translation elongation factor 1α
43	40	acaagatggaca-gactccaagaac	Translation elongation factor 1α
44	38	aacaagatggac-aagttcatcaag	Translation elongation factor 1α
45	39	aaggtcatgaag-cgaaatcaagga	Translation elongation factor 2
46	42	catctcgtccat-catgaagaagtg	Beta tubulin 2
47	45	gttcttgaactt-cttcatcttcca	Translation elongation factor 3
48	39	tccttgatttcg-ccatcttggaga	Translation elongation factor 3
49	45	tggaagatgaag-aagttcaagaac	Translation elongation factor 3
50	42	tggaagatgaag-tgtcaagaccaa	Translation elongation factor 3
51	42	ttggtcttgaca-cttcatcttcca	Translation elongation factor 3
52	51	tggaacttgatg-gagaaacgcttg	60S ribosomal protein L10 (L1)
53	35	ggtatcaccatc-ctccaagaacga	Translation elongation factor 1α
54	40	ttcatcaagaac-tcaccactgaag	Translation elongation factor 1α

Seed primer sequences highlighted bold, were those qualifying for the last and final cross-laboratory trial.

For the Pfam domain approach by Lewis et al. (2011), a nucleotide search was first conducted within NCBI for Interpro names to help with primer validation, using keywords found within the paper (e.g., PF01625, PGK, IPR001576) and NCBI was searched for more recent accessions using trimmed amino acid sequences for each of the inferred protein regions as queries. The downloaded nucleotide sequences were added to the original data from Lewis et al. (2011). A ClustalW alignment was performed individually on the supplied trimmed sequences and NCBI downloaded sequences. For LNS2, the alignments were primarily from basidiomycete sequences as the objective was to design primers that would have a higher success for this group. The alignments were handled individually to ensure that all possible primers for each gene could be identified. Within each gene alignment, frequency tables were developed for conserved regions. All areas of the protein region that contained, at a minimum, 3 conserved nucleotides (> 85 %) at the 3’ end, were examined for the potential of becoming a primer site. In developing the primers, all base positions were examined visually. Any base with a frequency > 80 % was maintained in the primer. Primers in which there was no base with a frequency > 80 %, were resolved into degenerate primers, such that the minimum frequency > 80 % was maintained. Close to 500 primer pair combinations were tested. The overall number of forward primers designed, of reverse primers designed, and number of primer pairs tested, were 26, 23 and 125 for PGK; 16, 13 and 74 for TOP1; 20, 28 and 49 for LNS2; 24, 13 and 62 for IGPS; 17, 11 and 64 for PurE; 12, 9 and 27 for Msr; and 12, 8 and 32 for vATP, respectively.

Laboratory trials of gene-based primers (phases I and II)

For the primers designed by the complete genome approach from Robert et al. (2011), annealing temperatures were set at 42 °C and slowly ramped up to 52 °C. This allowed us to identify the most universally applicable primer pairs (one-by-one proof-of-principle) and those yielding nothing, or single or multiple fragments. The general PCR cycler setup was: 7 min 95 °C, 1 min 95 °C, 1 min 42–52 °C (+ 0.5 °C/cycle until 52 °C is reached), 2 min 72 °C, the latter three steps repeated 40 times, final elongation at 10 min 72 °C and cooling at 10 °C. The laboratory work was conducted in two initial phases. Phase I included a small set of genomic DNA extracts roughly covering several major fungal lineages (Table 4) representing strains with completely sequenced genomes, thus the same strains used for the in silico work described above (Robert et al. 2011). This trial aimed to identify well-performing primer sets and to assess their versatility, i.e. their ability to amplify the targeted gene from various genera. Primer pairs that yielded no product were discarded. Phase II was conducted with 48 DNA extracts (Table 5) to decrease taxon bias and increase the reliability and consistency of the entire experiment. Only primers that successfully amplified, either highly accurately and/or slightly inaccurately (defined as a single PCR product not exactly the calculated size), were further tested employing the PCR conditions described above. Amplifications that resulted in a single, clear fragment, without any further reaction optimisation, were tested under widely used successful PCR conditions, defined as: 5 min 95 °C, 1 min 95 °C, 1 min 48 or 50 °C, 2 min 72 °C, the latter three steps repeated 40 times, final elongation at 10 min 72 °C and cooling at 10 °C. This PCR protocol is generally applicable to those primers listed in Table 6. Only those pairs yielding single fragments under these conditions were kept for the final phase. Phase III included in depth testing of the best candidate primers, described in the Results section, with large sets of DNA extracts covering multiple species representing economically important genera such as Penicillium sensu stricto (s.str.), Fusarium s.str., or higher level ranks covering several genera within orders such as the Onygenales (Table 1). The selected taxa (number of strains or specimens) for this phase represent those covered by the consortium of participating laboratories. Benchmarking of the best newly designed candidate primers was conducted against commonly used and well-recognised primer pairs (Table 2). All reagents for phases I and II were standardised with enzymes and dNTPs from Takara (Japan), oligonucleotides synthesized by Integrated DNA Technologies (The Netherlands) and PCR reactions ran on SensoQuest PCR cyclers (Germany) as described above under ‘standard molecular procedures’. Reagents for phase III varied among laboratories and institutions according to their internal protocols, and served as a robust verification of the tested primers (detailed protocols can be requested from the respective ‘collaborator’ as indicated in Table 1).

Table 4

Initial test cultures used for primary laboratory trial I for primers inferred from ‘nucleic acid based’ computational predictions (Robert et al. 2011).

CBS number	Taxon
CBS 513.88	Aspergillus niger
CBS 818.72	Aspergillus oryzae var. brunneus
CBS 1954	Candida parapsilosis var. parapsilosis
CBS 115846	Cryphonectria parasitica
CBS 123668	Fusarium oxysporum f.sp. lycopersici
CBS 445.79	Laccaria bicolor
CBS 277.49	Mucor circinelloides f. lusitanicus
FGSC 9596	Nectria haematococca
CBS 708.71	Neurospora crassa
FGSC 10004	Phycomyces blakesleeanus
CBS 405.96	Schizophyllum commune
CBS 142.95	Trichoderma atroviride
CBS 109036	Trichophyton equinum
CBS 127170	Verticillium dahliae
CBS 115943	Zymoseptoria tritici

Table 5

Initial test cultures used for secondary laboratory trial II for primers inferred from ‘nucleic acid based’ computational predictions (Robert et al. 2011).

CBS number	Taxon
CBS 674.68	Ajellomyces dermatitidis
CBS 118699	Alternaria brassicicola
CBS 131.61	Aspergillus flavus var. flavus
CBS 126972	Aspergillus nidulans
FGSC 1144	Aspergillus niger
FGSC 1156	Aspergillus terreus
CBS 136.29	Bipolaris maydis
CBS 114389	Blastomyces dermatitidis
CBS 8758	Candida albicans var. albicans
CBS 113850	Coccidioides immitis
CBS 113843	Coccidioides posadasii
CBS 126970	Coprinus cinereus
CBS 8710	Filobasidiella neoformans
FGSC 9075	Fusarium graminarum
FGSC 9935	Fusarium oxysporum f.sp. lycopersici
CBS 123670	Fusarium verticillioides
FGSC 1089	Gibberella fujikuroi
CBS 287.54	Histoplasma capsulatum
CBS 2605	Lodderomyces elongisporus
CBS 113480	Microsporum canis
CBS 180.27	Neurospora tetrasperma
CBS 223.38	Neurospora tetrasperma
CBS 101191	Neurospora tetrasperma
CBS 372.73	Paracoccidioides brasiliensis
CBS 127171	Parastagonospora nodorum
FGSC 9002	Phanerochaete chrysosporium
CBS 6054	Pichia stipitis
CBS 126969	Podospora pauciseta
CBS 120258	Pseudocercospora fijiensis
CBS 117146	Pyrenophora tritici-repentis
CBS 128304	Pyricularia grisea
CBS 658.66	Pyricularia grisea
CBS 126971	Rhizopus orysae
CBS 124811	Schizophyllum commune
CBS 7116	Schizosaccharomyces japonicus
CBS 484	Sporidiobolus pararoseus
CBS 208.27	Sporobolomyces roseus
CBS 375.48	Talaromyces stipitatus
CBS 693.94	Trichoderma atroviride
CBS 392.92	Trichoderma reesei
CBS 383.78	Trichoderma reesei
CBS 127.97	Trichophyton equinum var. equinum
CBS 668.78	Uncinocarpus reesii
CBS 127172	Ustilago maydis
CBS 127169	Verticillium alboatrum
CBS 599	Yarrowia lipolytica
CBS 732	Zygosaccharomyces rouxii

Table 6

Super primers and best candidate primers inferred from ‘nucleic acid based’ computational predictions (Robert et al. 2011).

Locus	Original primer name	Final primer name	Sequence (5’-3’)
TEF1α	Al33_54_73_F1_forward	EF1-1018F	GAYTTCATCAAGAACATGAT
	Al33_879_859_R2_reverse	EF1-1620R	GACGTTGAADCCRACRTTGTC
	Al_34_EF1_300_F1_forward	EF1-1002F	TTCATCAAGAACATGAT
	AL34_EF1_1050_R_Tail_reverse	EF1-1688R	GCTATCATCACAATGGACGTTCTTGGAG
	Al33_129_148_F2_forward	Al33_alternative_f	GARTTYGARGCYGGTATCTC
	Al28_EF1_400_f	EF1_alterantive_3f	TTYGARGCYGGTATCTC
	Al28_EF1_900_R	EF1_alternative_3r	GAVACRTTCTTGACGTTGAA
60S L10 (L1)	AlGr52_412-433_f1	60S-908R	CTTVAVYTGGAACTTGATGGT
	Algr52_1102_1084_R1	60S-506F	GHGACAAGCGTTTCTCNGG
TEF3	Al50+51_EF3_2900_f	EF3-3185F	TCYGGWGGHTGGAAGATGAAG
	Al50+51_EF3_3300_R	EF3-3538R	YTTGGTCTTGACACCNTC
	Al47_EF3_1650_forward	EF3-3188F	GGHGGHTGGAAGATGAAG
	A47_EF3_2451_R1_reverse	EF3-3984R	TCRTAVSWGTTCTTGAACTT
	Al49_ EF3_44_63_F1_forward	EF3-3186F	CYGGHGGHTGGAAGATGAAG
	Al49_ EF3_846_829_r1_reverse	EF3-3984R2	TCRTAVSWGTTCTTGAAC

For the primers designed from Pfam domains (Lewis et al. 2011), initial PCR analysis of potential barcoding primers was performed on DNA from the ascomycetes Cadophora fastigiata and Pyrenophora teres f. teres using a standard PCR protocol and gradient annealing temperatures, ranging from 56–70 °C, to determine an optimum annealing temperature for further testing. Touchdown PCR (Don et al. 1991) was used to increase primer specificity. The final Touchdown temperatures of 68–58 °C was selected on the basis of the PCR giving the best combination of product concentration and single fragment size produced for two of the three initial ascomycetes tested. For IGPS and vATP, none of the primer pairs passed this first panel. Primer pairs that resulted in acceptable sequences for this first panel were carried forward and tested against a panel of eight organisms: Rhizophydium littoreum (Chytridiomycetes), Mortierella vinacea (Zygomycetes), the basidiomycetes Tilletia indica and Puccinia graminis, and the ascomycetes Pyrenophora teres f. teres, Debaryomyces hansenii, Penicillium verrucosum and Fusarium graminearum. Only primers for PGK, TOP1 and LNS2 were tested further with additional isolates and with tagging of M13 primers to improve sequencing success (Table 7). Throughout the process of primer development, newly acquired long sequences, as well as newly mined results of full length genes from GenBank (performed on a monthly basis) were used to improved primers for shorter fragments.

Table 7

Super primers inferred from ‘protein-based’ computational predictions (Lewis et al. 2011).

Locus	Original primer name	Final primer name	Sequence (5’ to 3’)
PGK	PF00162.1120.M13-8-F	PGK_480-F	TGTAAAACGACGGCCAGTACGATATCCGAGTCGACTTCAAYGTCCC
	PF00162.2081.M13-8-R	PGK_480-R	CAGGAAACAGCTATGACTCGAAGACACCRGGGGGACCGTTCCA
	PF00162.1433.M13-8-F	PGK_483-F	TGTAAAACGACGGCCAGTACGATGAGAACYTGCGHTTCCACRYYGAGGAGGARGG
	PF00162.1793.M13-8-R	PGK_483-R	CAGGAAACAGCTATGACCTTCTTGAAGGTGAARGCCAT
	PF00162A.675.1-F	PGK_511-F	GTYGSTGCYYTGCCMACCATCAA
	PF00162A.1915.1-R	PGK_511-R	ATCTTGTCRGMRACCTTRGCACC
	PF00162.1127.1-F	PGK_533-F	GTYGAYTTCAAYGTYCC
	PF00162.2081.1-R	PGK_533-R	ACACCDGGDGGRCCGTTCCA
TOP1	PF02919.2708.M13.1-F	TOP1_501-F	TGTAAAACGACGGCCAGTACGATACTGCCAAGGTTTTCCGTACHTACAACGC
	PF02919.3469.M13.8-R	TOP1_501-R	CAGGAAACAGCTATGACCCAGTCCTCGTCAACWGACTTRATRGCCCA
LNS2	PF08235.1463.8-F	LNS2_468-F	GGCCATGTGCTGAACATGATCGGHCGWGAYTGGAC
	PF08235.1821.8-R	LNS2_468-R	CGGTTGCCRAAKCCRGCATAGAAKGG

Bases in bold: M13 primers with an ACGAT spacer for the forward primers.

Sequence and data analysis (phase III)

To assure standardised laboratory procedures, a working agreement defining experimental conditions was distributed prior to primer testing in phase III, which defined the basis for recording amplification efficiency. To assemble a data matrix of binary variables, successful amplification was scored as ‘true’ and unsuccessful amplification as ‘false’. Recovery of multiple fragments was also classified as ‘false’. Single fragments of unexpected sizes were assumed to represent length variations (mostly false negatives), a common phenomenon when comparing intron-containing sequences, were classified as ‘true’. This phenomenon does not always reciprocally correlate with an incorrect amplification of targeted gene section or, subsequently, poor downstream (sequencing) success. All data were analysed and visualised with R statistical software (R Core Team 2014) with the ‘lattice’, ‘vcd’ and ‘MASS’ libraries, as well as related packages; the source code is available from Sarkar (2008). Primer amplification efficiencies were visualised as ‘barchart’ and ‘mosaicplot’. Gene maps with primer locations were created with the Qiagen CLC genomics suite (http://www.clcbio.com/products/clc-genomics-workbench/) primers designed by the complete genome approach by Robert et al. (2011) and with Geneious v. R6 (Biomatters http://www.geneious.com/) for the primers designed from Pfam domains (Lewis et al. 2011). For sequences generated from primers designed by the complete genome approach by Robert et al. (2011), sequences were stored as bidirectional reads and edited as described above, using BioloMICS (Vu et al. 2012) at CBS. Various software packages, differing between laboratories, were employed to assemble consensus sequences. Sequences were visually corrected for sequencing artefacts and quality controlled within individually aligned datasets. Quality controlled data was uploaded to the BioloMICS database and pairwise distances for selected (symmetric/orthogonal matrices only) datasets calculated using the optimistic reverse pairwise alignment algorithm. Symmetrical datasets were assembled to include sequence data for at least: complete ITS1-5.8S-ITS2, LSU, TEF1α (either from one of the newly designed TEF1α primer pairs, see Results), or the TEF1α AFTOL primer set EF1-983F / EF1-1567R, Table 2, spanning almost the same section), TEF3, and the section spanning the 60S ribosomal protein L10 (L1). The global dataset comprised 502 strains or specimens representing a fully symmetrical distance matrix with 5 gene partitions. Each dataset was analysed individually to create a single locus similarity matrix to generate an overview UPGMA tree. A second analysis used all loci in a concatenated matrix, followed by a multi-dimensional scaling analysis (MDS), visualised in n-dimensional space. All six similarity matrices (5 loci + 1 concatenated) were compared, to obtain pairwise cophenetic coefficients of correlation (Mantel test) between each similarity matrix (Smouse et al. 1986). A correlation super matrix was subsequently created, which was numerically rescaled and another final MDS performed; the results were plotted in an n-dimensional space to compare the ‘global’ between-loci performance. Eigenvalues for each axis were computed to assign weights to dimensions. Two examples, the Onygenales and Fusarium datasets were analysed individually as described above, with identical loci sampling, except for the Onygenales which was supplemented with ‘γ-Actin’ sequence data to provide examples on ‘local gene optima’. For sequences generated from primers designed by the Pfam approach of Lewis et al. (2011), bidirectional reads were edited manually using Geneious to generate consensus sequences. Sequences were aligned through MAFFT within Geneious and processed through R to compare intra- and inter-species pairwise distances (Schoch et al. 2014). TUB2 and TEF1 sequences of Penicillium and Fusarium strains, respectively, that were already available were added to the dataset as comparison for species resolution. The aligned sequences of each marker were analysed individually with the ‘ape’ package (Paradis et al. 2004) for R (R Core Team 2014) to generate raw pairwise distances for each marker and each pair of strains. The values were separated as intra- or inter-species distances and the function ggplot2 (Wickham 2009) was used with R to generate a barcode gap plot for each gene by overlaying the distribution of the intra- and inter-species distances.

RESULTS

In silico selection of gene regions and manual design of initial seed-primers (gene-based primers)

The distance super matrix, underlying the multiple alignments used to generate the reference topology, was correlated with all individual distance matrices using Pearson coefficient. One third of the genes (29.8 %) produced a phylogeny highly correlated with the super matrix (Robert et al. 2011). Seventy percent of the gene matrices correlated at > 0.70 with the super matrix, and only 25 genes had no or a very low correlation. For the 531-gene matrix, maximum correlations were found for 190 concatenated individual genes. The results from the computational inference indicated that a single or a very small number of genes were sufficient to reflect the reference topology. Unfortunately, none of the alignment sections used later for the design of initial seed-primers scored high among candidates that reflected the ideal reference topology (see Robert et al. 2011). However, key criteria for in silico selection of gene regions that would successfully qualify as secondary barcodes were universally met, including the conservation of primer sites among distant taxa (e.g. Ascomycota and Basidiomycota), with predicted amplicons no longer than 1 000 bp, a technical constraint for Sanger DNA sequencing and a requirement of the barcode standard. Qualification of gene regions was restricted to the 54 alignments that served for seeding novel primers and manual redesign, described above, and reflected a compromise between score against the super-matrix and PCR versatility (1 Kbp cut-off). However, the alignments are essentially the results of the in silico searches described as BPPM and SPM above, and non-redundant alignments were selected. Manually inspected alignments (available upon request) and the identified conserved sites yielded a large number of primers (Table 3), which were tested in trial I and reduced to the best performing candidates (Table 6) for trial II. A more rigorous test of primers successful in trial II was undertaken in trial III (Table 6), using sets of extracts with increased species sampling of specific taxonomic groups. Because of time and resource constraints, some of the putative primers were not tested and should be tested in future studies, e.g. those corresponding to TEF2 (Table 3; Al45, Al48). Gene maps, indicating exact primer locations, and exon and intron boundaries for each gene, were designed for the best performing candidates (Table 6) and for some important ‘standard’ markers (Table 2; the latter supplemented with positional information of commonly used primers). These gene maps represent the fungal rDNA operon (Fig. 1), TUB2 (Fig. 2), ACT (Fig. 3), RPB2 (Fig. 4), TEF1 (Fig. 5), TEF3 (Fig. 6) and 60S L10 (L1) (Fig. 7).

Fig. 1

Structural gene map and positional primer locations for fungal rDNA array.

Fig. 2

Structural gene map and positional primer locations for fungal β-tubulin 2.

Fig. 3

Structural gene map and positional primer locations for γ-actin.

Fig. 4

Structural gene map and positional primer locations for fungal RNA polymerase subunit II.

Fig. 5

Structural gene map and positional primer locations for fungal translation elongation factor-1α.

Fig. 6

Structural gene map and positional primer locations for fungal translation elongation factor 3.

Fig. 7

Structural gene map and positional primer locations for fungal 60S ribosomal protein L 10 (L1).

In silico design of initial seed-primers of protein regions and manual optimization (Pfam based primers)

For the seven protein families found by Lewis et al. (2011) that could be potentially amplified with a single primer pair, a total of 127 forward primers and 105 reverse primer pairs were tested, including some modified with M13 sequencing primers, for a total of 433 primer pairs always tested individually. In silico analysis and manual refinement resulted in six primer pairs that passed all quality filters, which were subsequently used as described above (Table 7). Gene maps for TOP1, PGK and LNS2, indicating primer locations, exon and intron boundaries for each gene are shown in Fig. 8, Fig. 9, Fig. 10, respectively.

Fig. 8

Structural gene map and positional primer locations for fungal TOP1 primarily based on Penicillium chrysogenum genome.

Fig. 9

Structural gene map and positional primer locations for fungal PGK primarily based on Penicillium chrysogenum genome.

Fig. 10

Structural gene map and positional primer locations for fungal LNS2 primarily based on Penicillium chrysogenum genome.

Laboratory testing of in silico designed primer sets designed from gene alignments

Laboratory tests were conducted for 71 sets of primers corresponding to nine distinct regions (or 15 alignment groups) listed in Table 3 for their ability to consistently amplify a short standardised gene region resulting in a single PCR fragment. Trial I With trial I, we identified pairs of primers that resulted in a single PCR product employing a low annealing temperature (Ta), what required adjustment of PCR parameters and conditions. PCR tests with several primer pairs that contained an inosine (= I), instead of third-base wobbles, resulted in almost no visible amplicons. We excluded all such primer pairs from further testing, including those that did not result in any detectable PCR products, and these are not mentioned further. Trial II For trial II, the Ta was unchanged but the set of extracts was expanded to test a broader range of fungi, and a subset of seven primer pairs was selected that gave optimal results during trial I. These primer sets were tested for their reliability and consistently over a broad range of fungal species, with preference for those amplifying the desired marker as single fragment. Some primer sets yielded multiple fragments. Some of the best primer sets, derived from alignment groups 2 (ACT), 33 (TEF1α) and 52 (L1) yielded a strong single signal with almost no secondary fragments. Primers for alignment groups 46 (TUB2) and 49 (TEF3) yielded multiple fragments, but had a strong primary signal. Of our Trial II candidates, the most problematic primers were those for a gene corresponding to a small RNA processing protein, a putative S6 kinase (alignment group 14), which resulted in multiple intense bands. Tailed primer design was a successful approach, resulting in PCR products from evolutionary diverse fungi for TEF1α, a gene until now difficult to amplify on a universal basis with known standard primers. However, because of our stringent testing procedure, only alignment groups highlighted in bold (Table 3) qualified further for trial III, and these selections were strictly based on universal fidelity among all tested strains in trial II. Trial III In the last experimental phase, primer testing was extended to species-specific DNA sets. These extracts were received from 10 collaborating taxonomic experts at CBS (Table 1) in sets of 96 extracts. This more extensive testing evaluated the capacity of sequences yielded by the selected primer pairs for their power to delimit closely related fungal species. The extracts represented relevant groups such as medically or phytopathologically relevant fungi, or were selected because reference datasets were available for comparison, or because the taxa represented complexes of species that were previously shown to be difficult to resolve. The performance of several primer pairs amplifying TEF3 (alignment groups 47, 49, 50 and 51) and newly designed primers for TEF1α (alignment groups 17, 28, 31, 33) was consistently excellent. During trial III, we sequenced larger numbers of PCR products corresponding to TEF3, the 60S L10 (L1) and TEF1α to test the performance of our primers in standardised Sanger sequencing applications. Our results convinced us that capillary sequencing and production of high quality trace files did not require major modifications for the tested PCR primers or amplification/sequence parameters. For all amplicons assessed by Sanger sequencing, the sequences obtained were identical to those recorded in the original in silico inferred alignments for the respective target species. The designated primer pairs for TEF2 (alignment groups 45, 48) gave very poor results and were excluded from further testing. The primer pair for alignment group 28 (TEF1α) initially gave good results, but performed poorly in later experiments and was excluded from further testing. However, in common with the ‘untested’ pairs, the primers for TEF2 and TEF1α (alignment group 28) may be amenable to species-specific redesign (Table 6; Al28 = EF1_alternative_3f, EF1_alternative_3r), and thus represent interesting ‘seed’ candidates for further experiments. Best performing primers (Table 6), were distributed to collaborating colleagues for independent testing on taxon sets (Table 1).

Amplification efficiencies

Optimal visualisation of multivariate data matrices, showing relative proportions of ‘within’ and ‘between’ categories (i.e. datasets vs primers), was achieved with the ‘mosaic plot’ function using ‘lattice’ in R (Sarkar 2008), with our categorical variables encoded as typical ‘survival data’. Each plot is sectioned in rows representing datasets and columns representing primer pairs separated by primers designed by the complete genome approach (Fig. 11a, b) and primers designed by the Pfam approach (Fig. 11c, d). Horizontal expansion of boxes indicates the ratio of PCR reactions within a ‘dataset’ or taxonomic group relative to the global quantity (proportion) of all other PCR values (outcome), and vertical expansion of boxes indicates the ratio of individuals within quantity (proportion) of PCR values (outcome) of a specific primer test (e.g. ITS). Horizontal lines indicate value ‘zero’. When a line with a dot is on the left, it equals ‘zero’ positives (= no amplification), and when the line with a dot is on the right, it indicates ‘zero’ negatives (= 100 % amplification). Darkly shaded boxes indicate proportion of amplifications, relative to no amplification, which are shown by pale grey shaded boxes. Mosaic plots outperform other plot types in visualising large quantities of data and provide a global overview on data proportions only. The frequencies of positive and negative amplification for each primer pair are shown in Fig. 11b and d.

Fig. 11

Mosaic plot, indicating proportions of success rates between datasets (a, b: complete genome approach; c, d: Pfam approach) and within tested primer pairs.

The mosaic plot shows that the overall amplification testing between laboratories to test the primers designed through the complete genome approach (Robert et al. 2011) was strongly biased towards novel primer pairs, revealed by the many horizontal lines in some datasets corresponding to the ACT, TUB2, TEF1α (AFTOL primers), ITS, LSU, and RPB2 pairs (Fig. 11a). Data corresponding to these pairs was mainly obtained from PCR experiments performed at CBS, and lack of data from collaborating laboratories can simply be explained by time and financial constraints. However, the results of these amplification experiments provide an important benchmark, even when inconsistently assessed, and valuable information on the universal performance of certain pairs. Promising results were obtained for several amplicons. The TEF1α region, with the AFTOL pair (EF1-983F/EF1-1567R), enjoyed relatively consistent amplification for different taxon sets, with exceptions being relatively poor results for the Onygenales and rock-inhabiting Teratosphaeriaceae (80 % average). Results for sections of the fungal rDNA operon, especially the complete ITS and partial LSU (D1/D2 domains), were high in efficiency for all taxon sets, 92 % and 91 %, respectively (Fig. 11b). These results confirm the outstanding potential of ITS as primary universal fungal DNA barcode; it performs exceptionally well under standard laboratory conditions. No particular distinction was made between the possible forward primer combinations using ITS5/ITS1/ITS1-F. Similarly, LSU had very similar amplification efficiencies with LR0R/LR5, and augments the useful species-level signal ITS with increased phylogenetic information. The results for RPB2 (fRPB2-5f/fRPB2-7cR) were less encouraging; only five taxon sets were tested for a single primer combination and the global performance of RPB2 can only be partly judged from our data. Two very commonly known gene sections widely employed in fungal phylogenetics (Fig. 2, 3), ACT and TUB2, had relatively consistent amplification efficiencies among datasets (mostly grey shaded boxes), showing that these pairs qualify well for universal amplification (84 % ACT and 86 % TUB2 on average, respectively). Unfortunately, as is obvious from Table 2, we tested different primer pairs for these genes, and no universal performing set was found (CA5R/CA14 for ascomycetous yeasts II only; Bsens/Brev for Polyporaceae only, amplification success data not shown). The problem is apparent with ascomycetous yeasts group II, where experiments were performed with CA5R and CA14 instead of ACT-512F/ACT-783R, and the entirely negative results of the latter PCRs were excluded from the analysis. A similar issue was observed in tests prior to this study for a Polyporaceae taxon set (data not shown), where only Bsens and Brev resulted in positive amplification (Lesage-Meessen et al. 2011). Surprisingly amplification of TUB2 (using Btub2Fd/Btub4Rd) in ascomycetous yeasts group II always resulted in a single PCR product, but these amplicons always yielded noisy trace files. β -Tubulins are known to be duplicated, particularly in basidiomycetous fungi (Ayliffe et al. 2001, Zhao et al. 2014). The inconsistent PCR efficiencies and taxon-specific primer pairs disqualify ACT and TUB2 as universal barcodes. Overall amplification efficiencies of primer pairs were inconsistently recorded in different laboratories and our conclusions on universality remain speculative. The most comprehensive results were retrieved for five of our newly designed primer pairs, EF1-1018F/EF1-1620R, EF1-1002F/1688R, 60S-908R/60S-506F, EF3-3185F/EF3-3538R and EF3-3188F/EF3-3984R. Performance of these three gene targets, as visualised in Fig. 11, varied among these pairs, but successful amplification proved that predictions of universal primer binding sites were accurately computed. Of these, the TEF1α primer candidates tested consistently among all labs. The poorest performance (57 % average) occurred with EF1-1002F/1688R, which had low efficiencies for all ascomycetous and basidiomycetous yeasts, and for the Ceratocystis, Onygenales, Penicillium 2, Russula and Teratosphaeriaceae taxon sets. In contrast, the TEF1α pair EF1-1018F/EF1-1620R had the highest fidelity for all taxon sets and among all tested protein coding genes with an average amplification success of 88 % (Fig. 11), almost equal to ITS (92 %) and LSU (91 %). As a second best among our novel primer pairs, we identified a pair corresponding to a gene encoding the 60S ribosomal protein L10 (L1), previously unused for barcoding or phylogenetic purposes. This primer pair had an average success rate of 77 %, with local optima over 95 %, and performed poorly only for some ascomycetous and basidiomycetous yeasts, and the Coniothyrium, Mycosphaerellaceae and Onygenales sets (∼40–60 %, Fig. 11). The novel region TEF3, although a fungal-specific gene, could not be universally amplified with high success rates, with neither the short nor long sections yielding satisfactory results for fidelity (short EF3-3185F/EF3-3538R, 68 % on average; long EF3-3188F/EF3-3984R, 52 % on average). Nevertheless, promising local optima for both TEF3 pairs were observed. The long section could be retrieved (∼70–90 % success) for the Sordariales (Colletotrichum, Ceratocystis, Scedosporium), Hypocreales (Fusarium s.str., Hypocreales s.lat.), the Dothidiomycetes (Mycosphaerellaceae) and Eurotiales (Penicillium 1, 2). The shorter TEF3 section was amplifiable for a broader spectrum of fungi with relatively high success rates, including the Sordariales, Hypocreales, Onygenales, Eurotiales and even within the basidiomycetes (∼70–90 % success). Despite this, failure to retrieve products for other taxa, such as the yeasts sensu lato, strongly decreased the universal efficiency of both TEF3 primers. We did not globally test a third TEF3 pair, EF3-3186F/EF3-3984R2, because of poor performance in the trial III. Nevertheless some taxon specific efficiency, in particular for Fusarium (Hypocreales) and Scedosporium, (Sordariales) were recorded. For the primers designed by the Pfam domain approach (Lewis et al. 2011), PGK533 primer pair performed the best for this locus, generating fragments of about 1 kb with a success rate about as good as ITS with a wide range of fungi, except for most Basidiomycetes (Fig. 11c, d). The TOP1 primer pair amplified an 800 bp fragment efficiently for most ascomycetes but performed rather poorly for lower fungi and most basidiomycetes, the notable exception being Pucciniomycotina which are very challenging for ITS (Fig. 1c, d). LNS2 generated a short fragment (less than 400 bp) but amplified efficiently in most lower fungi and in all basidiomycetes.

Barcode gap analyses

As expected, many closely related species have a pairwise distance of zero (Fig. 12). PGK, TOP1 and LNS2 all show a higher resolution than ITS. Our data shows that PGK and TOP1 are as good as TUB2 or TEF1α to resolve closely related Penicillium and Fusarium species, respectively (data not shown). There is insufficient data to make such statement for LNS2 as testing for ascomycetes was less of a priority than testing for basidiomycetes.

Fig. 12

Barcode gap analysis for ITS, PGK, TOP1, LNS2, TUB2 and TEF1.

Multi-dimensional scaling of sequence data

Results of the Mantel test (Smouse et al. 1986), comparing pairwise distance matrices of: i) the ‘global 502 taxa dataset’ (Fig. 13a, b, i); ii) ‘Fusarium’ (Fig. 13c, d, g); and iii) ‘Onygenales’ (Fig. 13e, f, h) by pairwise cophenetic coefficients of correlation, were plotted (numerically rescaled 0–1) in n-dimensional space. Rescaled cophenetic coefficients for each dataset are shown in Table 8. Assessment of global performance employing a comparison of TEF1α distances matrices (with sequences obtained from pair EF1-1018F/EF1-1620R; AFTOL pair EF1-983/EF1-1567 combined), TEF3 (sequences obtained from pair EF3-3185F/EF3-3538R, EF3-3188F/EF3-3984R combined), 60S, ITS and LSU for 502 taxa, showed that the first two axes contributed more than 75 % of the total information and the first three axes over 90 %. While 60S and ITS were rendered closely together and apart from LSU, neither TEF1α nor TEF3 were correlated closely to one of the rDNA matrices (Fig. 13a, b). The most significant result from visualising the cophenetic coefficients of correlation is apparent in Fig. 13a and i. The 3D plot and hierarchical clustering indicated that TEF1α correlated optimally with the overall concatenated (ALL) data matrix. Scoring of individual genes (matrices) is equivalent to distances as obvious from the dendrogram in Fig. 13i. The comparison of the ‘Fusarium’ matrices showed the same trend with the first three axes describing the vast majority of variance. Our results show that optimal correlation of the global dataset (ALL) was achieved with the 60S matrix (Fig. 13c, d, g), and the universal barcode ITS correlated poorly with the concatenated (ALL) object. The impact of the second dimension is obvious, because TEF3 is more distant from the concatenated matrix than ITS in the second but not the first dimension. Therefore, the weighted impact of dimensions causes the performance of TEF3 to appear inferior to ITS, presumably an artefact of distance over true character-based (cladistic/phylogenetic) inference (Felsenstein 2003). This could also be related to comparisons of cophenetic coefficients employing the Mantel test (Harmon & Glor 2010). Visualisation of the results from comparing data matrices obtained from the ‘Onygenales’ dataset contradicted those of the ‘502 taxon dataset’. We observed that the combination of two genes, ITS and TEF3, resulted in the ordination highly resembling the one of the concatenated (ALL) matrix. The impact of the third dimension scaled the LSU matrix more distantly than the two latter genes in Fig. 13e but was less pronounced in Fig. 13f. Second closest, and in addition to the two previous datasets, ACT was inferred as inferior to 60S and ITS when compared to the overall dataset. Both elongation factor matrices, TEF1α and TEF3, were inferior to ITS and were both distant from the overall concatenated matrix (ALL). Hierarchical clustering of individual matrices performance against the overall dataset is shown in Fig. 13h. In general, TEF1α performed the best among all ‘502’ investigated taxa when comparing 5+1 gene datasets, which was biased by local optima. The best candidate genes for Fusarium (60S) and Onygenales (ITS and 60S) rendered one of our new candidates as an optimal ‘local’ barcode for these two sets. The pairwise comparison for each dataset and each gene is given in Table 8 by rescaled (0–1) cophenetic correlation coefficients, with key results reflected by MDS analysis (Fig. 13) above.

Fig. 13

Multi-dimensional scaling of rescaled cophenetic correlation coefficients comparing distance matrices.

Table 8

Rescaled cophenetic correlation coefficients from Mantel test comparing distance matrices.

Gene/Dataset	Gene
Onygenales	60S	Actin	All	TEF1α	TEF3	ITS	LSU

60S	1	0.869922	0.945842	0.665818	0.741716	0.878008	0.828932
Actin	0.869922	1	0.86677	0.718523	0.709779	0.833556	0.7986
All	0.945842	0.86677	1	0.703723	0.801272	0.947347	0.879415
TEF1α	0.665818	0.718523	0.703723	1	0.58968	0.656179	0.654457
TEF3	0.741716	0.709779	0.801272	0.58968	1	0.739432	0.751655
ITS	0.878008	0.833556	0.947347	0.656179	0.739432	1	0.822984
LSU	0.828932	0.7986	0.879415	0.654457	0.751655	0.822984	1
Fusarium	60S	All	TEF1α	TEF3	ITS	LSU

60S	1	0.833529	0.748277	0.564056	0.717349	0.818429
All	0.833529	1	0.770656	0.773935	0.766443	0.737515
TEF1α	0.748277	0.770656	1	0.502797	0.616714	0.687664
TEF3	0.564056	0.773935	0.502797	1	0.523543	0.532899
ITS	0.717349	0.766443	0.616714	0.523543	1	0.645249
LSU	0.818429	0.737515	0.687664	0.532899	0.645249	1
Global	60S	All	TEF1α	TEF3	ITS	LSU

60S	1	0.678678	0.621512	0.68173	0.88979	0.748801
All	0.678678	1	0.77138	0.754324	0.698097	0.694224
TEF1α	0.621512	0.77138	1	0.572193	0.68485	0.635149
TEF3	0.68173	0.754324	0.572193	1	0.724653	0.670825
ITS	0.88979	0.698097	0.68485	0.724653	1	0.778558
LSU	0.748801	0.694224	0.635149	0.670825	0.778558	1

DISCUSSION

DNA-barcoding: a conceptual overview

In the early 1960s some milestone papers by Edward & Cavalli-Sforza (1964), Zuckerkandl & Pauling (1965) and Hennig (1965) gave rise to the novel field now known as molecular phylogenetics (Felsenstein 2003), including some decades later ‘DNA barcoding’. The goal of DNA barcoding is to apply high-throughput DNA sequencing technology to large-scale screening of one or a few reference genes to identify unknown specimens to species, and enhance discovery of new species (Hebert et al. 2003a, b, Stoeckle 2003). Proponents of ‘DNA-based taxonomy’ envisage the development of comprehensive databases of reference sequences, centred on voucher specimens or living cultures that represent all described species, against which sequences from newly sampled individuals can be compared. Given the long history of use of molecular markers for this purpose for identifying microbes (e.g. 16S rDNA for prokaryotes, Avise 2004), there is nothing fundamentally novel about DNA barcoding as applied in microbiology and mycology today, except for the increased scale and proposed standardisation. Standardisation by the selection of one or more reference genes is critical and stimulates large-scale phylogenetic analyses, but whether ‘one gene fits it all’ is still an open debate. Initial reactions to the coordinated DNA barcoding movement ranged from great enthusiasm, especially among ecologists (Janzen 2004), to condemnation, largely from traditional taxonomists. The proponents emphasized its application to modern meta-barcoding studies, increasing the precision and efficiency of field studies of rare, diverse and difficult to identify taxa across all kingdoms of organisms, while DNA barcoding is not possible without large-scale and well curated sequence databases (Schoch et al. 2014), founded in traditional systematics and referencing authoritatively identified voucher specimens. DNA barcoding should not be confused or confounded with efforts to resolve the ‘tree of life’ or establish large-scale phylogenies. DNA barcodes often have limited phylogenetic resolution, reflecting the use of a single-gene assay in an attempt to identify an entity to species or reveal inconsistencies between molecular variation and current species delimitations. Thus, while barcoding initiatives use similar knowledge and techniques, resolving phylogenies from species to major higher clade levels requires a different strategy for gene selection. Irrespective of which kingdom of organisms is targeted, the conceptual character and constraints of DNA barcoding always remain identical. We need to separate the two applications: i) molecular diagnostics of individuals relative to described taxa; and ii) DNA-enabled discovery of new species. Both are inherently phylogenetic and rely on a solid taxonomic background, including adequate sampling of variation within species and inclusion of all previously described extant species within a given genus or clade. Accurate species diagnoses rely on careful examination of intra-species variation compared with that observed between species. So a short DNA sequence (barcode) will allow reliable allocation of an individual to a described taxon. Such identifications should be accompanied by a clear statement of statistical support such as branch support in a phylogenetic analysis. This view is in agreement with best taxonomic practice that is acceptance of name changes or subdivisions should be based on multiple lines of evidence. As a strictly designated concept to species identification and discovery, a key theory of DNA barcoding is testing of ‘boundaries’ between taxa established by prior solid taxonomic research (Hebert et al. 2003a, b, Hebert & Gregory 2005, Meyer & Paulay 2005, Schindel & Miller 2005, Schoch et al. 2012).

One fungus, which genes?

In this study, we identified alternative universal gene targets for the fungal kingdom from available fully sequenced genomes (data processed in 2011) using taxonomy-aware in silico pipelines. Two approaches were used to infer potential novel candidates and corresponding primer pairs. Each approach yielded promising, but different candidate barcodes. The search for the ideal gene, using the genomic approach and alignment templates of Robert et al. (2011) was based on five criteria: i) presence in the majority of investigated genomes; ii) potential ease of PCR amplification and Sanger DNA sequencing; iii) restriction to putative single copy orthologs, constrained to genes harbouring conserved sites < 1 Kbp apart; iv) the potential to accurately delimit fungal species; and eventually v) an inherent phylogenetic signal. Often, the distance between conserved nucleotide stretches was too long, or conserved sites were not present across a sufficient proportion of the analysed genomes. Primers were tested and marker sequences we sequenced were the best compromise between technical feasibility, and a holistic ‘one-gene-fits-all’ barcode locus. We identified known genes and some not yet used for either barcoding or phylogenetic applications, resulting in a long list of candidate primers. While most alignment templates were functionally annotated (e.g. TEF1α), only 14 of 54 were identified as putatively novel candidate regions with no known track record in fungal DNA barcoding or phylogenetics. Although TEF1α is already well known to provide superior subordinate taxon resolution in some groups of Ascomycetes such as Trichoderma and Fusarium, and it is sufficiently present in public repositories, there was still a need to improve the universality of primers. From the genomic studies, TEF1α qualified as the only universal barcode candidate. Some of the newly identified genes through the Pfam approach, in particular TOP1 and PGK, which were tested more intensively for sequencing and analyses of barcode gaps, show great promise, especially for ascomycetes, where both performed well in genera with particularly narrow barcode gaps in the ITS, i.e. Penicillium and Fusarium. In general, slowly evolving genes (e.g. TEF1α, RPB2 exons) are more suitable for inference of deep phylogenetic relationships, while genes with higher evolutionary rates (e.g. ITS, TUB2) reflect more recent evolutionary and speciation events. Thus, genes or gene sections reflecting both of these characteristics are most suitable as barcodes if phylogenetic signal is considered important. Fragments of protein-coding genes can potentially meet both requirements, either by incorporating more variable intronic and more conserved exonic sequences, or by providing one level of variability as nucleotide sequences and a more conserved suite for phylogenies as amino acid sequences. Our results show that TEF1α is likely to be one of the most ideal candidates for non-rDNA barcodes because of its balanced ‘trade-off’s’, in particular when it comes to a broader community acceptance of extant users, and versatility among important fungal orders. Finding primer candidates to amplify a standardised DNA fragment with high phylogenetic information content in distantly related species was difficult with the computational resources available in 2011 when our research was conducted. In our in silico search of available genomes of dikaryotic fungi, only seven primer pairs covered 72 species of the 74 species studied. For ascomycetes, six primer pairs were identified as compatible with 57 species studied and 186 primer pairs with 56 species. The universality of our best-performing primers in silico was experimentally verified, at least for Asco- and Basidiomycota. Some primer candidates were located almost adjacent to each other with only a few bases difference at either the 5’ or 3’ primed ends. No new candidates with high local optima such as the MCM7 and Tsr1 primers (Schmitt et al. 2009) were discovered. From the genomic scans, the fragment of the translation elongation factor 1α (TEF1α) gene appears to be the most promising candidate as a universal secondary barcode. We remain cautious about fidelity for basal fungal lineages but we believe our primers for TEF1α are the most optimal compromise between universal taxon applicability and fidelity. In our study we have particularly tested fungal taxa of economic or applied importance. We foresee that a high fidelity secondary DNA barcode would be included in a very large portion of ongoing taxonomic studies. Based on our results and anticipating community acceptance, we propose the use of primers EF1-1018F and EF1-1620R (corresponding to TEF1α) as secondary universal DNA barcode primer pair for the fungal kingdom. Given the results of this extensive four-year multi-lab project, we are convinced that fungal DNA barcoding with ITS and TEF1α for the identification of unknown or validation of ‘primary-barcoded’ specimens will increase the resolution and lead to the development of more complete, structured datasets. Global consensus might also speed up the discovery and description of novel taxa in a standardised way. Another critical question is whether the proposal of a secondary DNA barcode actually is still relevant in a world where next generation sequencing is becoming routine in many laboratories, at a cost that is comparable to Sanger sequencing? The costs for sequencing a complete fungal genome are dropping rapidly. Despite this, the fungal taxonomic community lacks the computational resources and capacities to routinely process complete genomes for specimen identification or for phylogenetic studies, especially in the developing world. Undoubtedly, the rapid increase of publicly available fungal genomes will enhance our understanding of evolutionary processes governing niche adaption, mating or gene family expansion related to pathogenicity and other functional capabilities. Genome plasticity, architecture and high genome sequence identity between sister species will force us to rethink ‘tree thinking’ (Bapteste et al. 2005). Despite this, genome sequencing and its subsequent Big Data explosion have yet to establish a solid basis for molecular species identification, in particular when there is low sequence coverage per taxon, insufficient sampling of species diversity or, reciprocally, when complete genomic sequences may require an updated classification. Rank inflation, i.e. the tendency to reclassify taxa at higher taxonomic levels (such as a previously recognised family as an order) is currently commonplace in fungal taxonomy, and requires a more standardised approach. Agreement on conserved universal protein coding markers, such as TEF1α, that would allow a more precise nesting of unknown fungi among known higher level taxa will be valuable. DNA barcoding has great impact on human and animal health diagnostics. We hope that this paper, with its detailed gene maps and large quantity of tested and yet to be tested primers, will serve as a reference for expanded barcoding (e.g. further testing of the fungal specific translation elongation factor 3 in a clinical setting). It seems unlikely that there will be a similar study of this magnitude, employing PCR, Sanger DNA sequencing and testing so many universal primers on a large collection of fungal taxa. Therefore, we believe that the knowledge obtained during this study would convince the community to ratify TEF1α as secondary barcode in order to ensure its rapid application with the improved primer system, but also to promote the research on other target genes for other lineages of fungi.

75 in total

1. The promise of DNA barcoding for taxonomy.

Authors: Paul D N Hebert; T Ryan Gregory
Journal: Syst Biol Date: 2005-10 Impact factor: 15.683

2. DNA barcoding a useful tool for taxonomists.

Authors: David E Schindel; Scott E Miller
Journal: Nature Date: 2005-05-05 Impact factor: 49.962

Review 3. Integrative taxonomy: a multisource approach to exploring biodiversity.

Authors: Birgit C Schlick-Steiner; Florian M Steiner; Bernhard Seifert; Christian Stauffer; Erhard Christian; Ross H Crozier
Journal: Annu Rev Entomol Date: 2010 Impact factor: 19.686

4. Identification and typing of isolates of Cyphellophora and relatives by use of amplified fragment length polymorphism and rolling circle amplification.

Authors: Peiying Feng; Corné H W Klaassen; Jacques F Meis; M J Najafzadeh; A H G Gerrits Van Den Ende; Liyan Xi; G S de Hoog
Journal: J Clin Microbiol Date: 2013-01-09 Impact factor: 5.948

5. Detection and identification of fungal pathogens by PCR and by ITS2 and 5.8S ribosomal DNA typing in ocular infections.

Authors: C Ferrer; F Colom; S Frasés; E Mulet; J L Abad; J L Alió
Journal: J Clin Microbiol Date: 2001-08 Impact factor: 5.948

6. A Beauveria phylogeny inferred from nuclear ITS and EF1-alpha sequences: evidence for cryptic diversification and links to Cordyceps teleomorphs.

Authors: Stephen A Rehner; Ellen Buckley
Journal: Mycologia Date: 2005 Jan-Feb Impact factor: 2.696

7. Fungus-specific translation elongation factor 3 gene present in Pneumocystis carinii.

Authors: M F Ypma-Wong; W A Fonzi; P S Sypherd
Journal: Infect Immun Date: 1992-10 Impact factor: 3.441

8. Identification and phylogeny of ascomycetous yeasts from analysis of nuclear large subunit (26S) ribosomal DNA partial sequences.

Authors: C P Kurtzman; C J Robnett
Journal: Antonie Van Leeuwenhoek Date: 1998-05 Impact factor: 2.271

9. Rapid genetic identification and mapping of enzymatically amplified ribosomal DNA from several Cryptococcus species.

Authors: R Vilgalys; M Hester
Journal: J Bacteriol Date: 1990-08 Impact factor: 3.490

10. DNA barcoding works in practice but not in (neutral) theory.

Authors: Mark Y Stoeckle; David S Thaler
Journal: PLoS One Date: 2014-07-02 Impact factor: 3.240

108 in total

1. Phylogeny of saprobic microfungi from Southern Europe.

Authors: M Hernández-Restrepo; J Gené; R F Castañeda-Ruiz; J Mena-Portales; P W Crous; J Guarro
Journal: Stud Mycol Date: 2017-05-17 Impact factor: 16.097

2. Different Amplicon Targets for Sequencing-Based Studies of Fungal Diversity.

Authors: Francesca De Filippis; Manolo Laiola; Giuseppe Blaiotta; Danilo Ercolini
Journal: Appl Environ Microbiol Date: 2017-08-17 Impact factor: 4.792

Review 3. Online Databases for Taxonomy and Identification of Pathogenic Fungi and Proposal for a Cloud-Based Dynamic Data Network Platform.

Authors: Peralam Yegneswaran Prakash; Laszlo Irinyi; Catriona Halliday; Sharon Chen; Vincent Robert; Wieland Meyer
Journal: J Clin Microbiol Date: 2017-02-08 Impact factor: 5.948

4. Bioremediation of Dichlorodiphenyltrichloroethane (DDT)-Contaminated Agricultural Soils: Potential of Two Autochthonous Saprotrophic Fungal Strains.

Authors: Fabiana Russo; Andrea Ceci; Flavia Pinzari; Antonietta Siciliano; Marco Guida; Eligio Malusà; Małgorzata Tartanus; Artur Miszczak; Oriana Maggi; Anna Maria Persiani
Journal: Appl Environ Microbiol Date: 2019-10-16 Impact factor: 4.792

5. Generic hyper-diversity in Stachybotriaceae.

Authors: L Lombard; J Houbraken; C Decock; R A Samson; M Meijer; M Réblová; J Z Groenewald; P W Crous
Journal: Persoonia Date: 2016-04-29 Impact factor: 11.051

6. All that glitters is not Ramularia.

Authors: S I R Videira; J Z Groenewald; U Braun; H D Shin; P W Crous
Journal: Stud Mycol Date: 2016-06-29 Impact factor: 16.097

7. A synopsis of the saddle fungi (Helvella: Ascomycota) in Europe - species delimitation, taxonomy and typification.

Authors: I Skrede; T Carlsen; T Schumacher
Journal: Persoonia Date: 2017-09-14 Impact factor: 11.051

8. Mycosphaerellaceae - Chaos or clarity?

Authors: S I R Videira; J Z Groenewald; C Nakashima; U Braun; R W Barreto; P J G M de Wit; P W Crous
Journal: Stud Mycol Date: 2017-09-28 Impact factor: 16.097

9. Unexpected fungal communities in the Rehai thermal springs of Tengchong influenced by abiotic factors.

Authors: Kai-Hui Liu; Xiao-Wei Ding; Nimaichand Salam; Bo Zhang; Xiao-Fei Tang; Baiwan Deng; Wen-Jun Li
Journal: Extremophiles Date: 2018-02-23 Impact factor: 2.395

10. Novel taxa of thermally dimorphic systemic pathogens in the Ajellomycetaceae (Onygenales).

Authors: Karolina Dukik; Jose F Muñoz; Yanping Jiang; Peiying Feng; Lynne Sigler; J Benjamin Stielow; Joanna Freeke; Azadeh Jamalian; Bert Gerrits van den Ende; Juan G McEwen; Oliver K Clay; Ilan S Schwartz; Nelesh P Govender; Tsidiso G Maphanga; Christina A Cuomo; Leandro F Moreno; Chris Kenyon; Andrew M Borman; Sybren de Hoog
Journal: Mycoses Date: 2017-02-07 Impact factor: 4.377