
Selection of marker genes using whole-genome DNA polymorphism analysis.

Harry M Bohle, Toni Gabaldón.

Abstract

Molecular markers serve to assign individual samples to specific groups. Such markers should be easily identified and have high discrimination power: they should be highly conserved within groups while showing sufficient variability between the groups that are to be distinguished. The availability of a large number of complete genomic sequences now enables the informed selection of genes as molecular markers based on the observed patterns of variability. We derived a new scoring system based on observed DNA polymorphic differences, which uses Bayes's theorem as adapted by Willcox. For validation, we applied this system to the problem of identifying individual species within a prokaryotic (Vibrio) and a eukaryotic (Diphyllobothrium) genus. The top-scoring candidate genes, chromosome segregation ATPase and ATPase subunit 6, showed better discrimination power in Vibrio and Diphyllobothrium, respectively, than standard molecular markers (recA, dnaJ and atpA for Vibrio, and 18S rRNA, ITS and COX1 for Diphyllobothrium).


Keywords:  Bayes’s theorem; DNA polymorphism; genome analysis; molecular marker

Year:  2012        PMID: 22474405      PMCID: PMC3315472          DOI: 10.4137/EBO.S8989

Source DB:  PubMed          Journal:  Evol Bioinform Online        ISSN: 1176-9343            Impact factor:   1.625


Background

Molecular methods to assign biological samples to specific groups (eg, taxonomic groups) have largely replaced morphological comparisons, allowing hundreds or even thousands of characters to be compared across samples.1 Historically, numerous DNA-based approaches that sample the whole genome at random have been used to discriminate groups of organisms. These include, among many others, restriction fragment length polymorphism (RFLP) and random amplification of polymorphic DNA (RAPD).2,3 Alternatively, sequences from genes, usually selected for their conserved, housekeeping roles, can be used.2 However, existing markers often provide insufficient resolution or are confounded by homoplasy, homologous recombination and lateral gene transfer.4,5 In recent years, thanks to great advances in sequencing technologies,6,7 the number and diversity of completely sequenced genomes has grown exponentially. This provides the basis for optimizing the selection of marker genes based on the analysis of the whole genetic complement of a given set of organisms. Earlier attempts to use whole-genome information to select marker genes that could best serve as predictors of phylogenetic relatedness include scores based on the level of sequence identity in whole-genome alignments,8 and the selection of unique sequence signatures present in a few species.9 These methods, however, do not exploit the information contained in sequence variability within a species. Here we propose and evaluate an alternative algorithm for the selection of optimal genetic markers, which is based on the comparison of complete genomes. In brief, our strategy ranks genes according to their level of DNA polymorphism within and between defined taxonomic groups.
More specifically, DNA polymorphism is measured as the average number of nucleotide differences per site,10 and a conditional probabilistic statistic based on Bayes's theorem as adapted by Willcox11 is used to prioritize genes, so that genes presenting higher levels of polymorphism between groups but lower variation within a group receive higher scores. To validate the methodology, we apply it to the problem of selecting marker genes for the identification of individual species within a prokaryotic (Vibrio) and a eukaryotic (Diphyllobothrium) genus. Publicly available genomic sequences were analyzed to select high-scoring marker genes, which were subsequently amplified and sequenced in a set of additional, non-sequenced strains of these groups. The discrimination power (DP) of these newly obtained sequences was compared to that of traditional marker genes.

Methods

Sequence data

Complete genome sequences were downloaded from the National Center for Biotechnology Information (NCBI) in GenBank (.GBK) format. These were: (i) chromosome I from the following Vibrio species and strains: V. cholerae (NC_002505), V. vulnificus (NC_004459), V. parahaemolyticus (NC_004603), V. harveyi (NC_009783), V. fischeri (NC_006840), Aliivibrio salmonicida (NC_011312), V. splendidus (NC_011753), V. cholerae (NC_009457), V. cholerae (NC_012578), and V. cholerae (NC_012668); (ii) whole mitochondrial genomes from different Diphyllobothrium species and strains: D. latum (NC_008945), D. nihonkaiense (NC_009463), D. latum (AB269325) and D. latum (DQ985706).

Alignments, polymorphism analysis, and molecular marker score calculation

The genome sequences listed above were divided into four groups: (1) VibrioDS, containing one representative genome for each Vibrio species, with V. cholerae represented by strain NC_002505; (2) VibrioSS, comprising the four different V. cholerae strains; (3) DiphyllobothriumDS, containing one genome per Diphyllobothrium species, with NC_008945 as the D. latum representative; (4) DiphyllobothriumSS, containing all D. latum strains. Each group was aligned with MAUVE v2.3.1 using the progressiveAligner option.12 Output files were re-formatted to an extended multi-FASTA (XMFA) format readable by Variscan with a custom PERL script (XMFA.pl) and analyzed using Variscan v2.0.13 The resulting files were used as input for the molecular marker score calculation, implemented in a custom PERL script (SCORE.pl), using window sizes of 300 bp and 500 bp for Vibrio and Diphyllobothrium, respectively. The final output is a plain text file listing the potential marker genes, sorted in descending order of their scores.
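To make the polymorphism measure concrete, the following minimal Python sketch computes the average number of pairwise nucleotide differences per site (π) for a set of aligned sequences, overall and in consecutive windows. It is an illustration only and not the Variscan implementation: the function names and the toy alignment are hypothetical, and gaps are treated as ordinary characters, which is a simplification.

```python
from itertools import combinations


def nucleotide_diversity(seqs):
    """Average number of pairwise nucleotide differences per site (pi)."""
    if len(seqs) < 2:
        raise ValueError("need at least two aligned sequences")
    length = len(seqs[0])
    if length == 0 or any(len(s) != length for s in seqs):
        raise ValueError("sequences must be non-empty and of equal length")
    pairs = list(combinations(seqs, 2))
    # Count mismatching columns for every pair of sequences.
    total_diffs = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return total_diffs / (len(pairs) * length)


def windowed_pi(seqs, window):
    """pi for consecutive, non-overlapping windows of `window` alignment columns."""
    length = len(seqs[0])
    return [
        (start, nucleotide_diversity([s[start:start + window] for s in seqs]))
        for start in range(0, length, window)
    ]


# Toy alignment (hypothetical data): 3 sequences, 8 columns, 4 pairwise differences.
toy = ["ACGTACGT", "ACGTACGA", "ACGAACGA"]
print(round(nucleotide_diversity(toy), 4))  # 0.1667
print(windowed_pi(toy, 4))
```

Running the same function on a DS alignment and on an SS alignment of the same gene gives the two quantities that the scoring step compares.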

Algorithm

The Bohle-Gabaldón (BG) score is based on the level of DNA polymorphism in the Distinct Species (DS) group and the Same Species (SS) group, as inferred from the average number of nucleotide differences per site (π̂). Not more than one SS group may be considered. Bayes's theorem as adapted by Willcox11 is used as follows. If the number of genome sequences in the DS group is lower than 4 and there is no length constraint for the marker, formula (1) is used. If a molecular marker of a specific size (S) is required, formula (2) is used, where S_i is the length in nucleotides of gene i. If the DS group contains 4 or more whole genomes, it is also possible to include Tajima's D (D), either without a specific size requirement (formula 3) or with one (formula 4); this better accounts for the possibility of rare haplotypes. Based on Willcox's conditions, a higher π̂ between different species (π_i(DS)) and a lower π̂ within the same species (π_i(SS)) are better. For Tajima's D in the DS group, more negative values are preferred. Finally, the target size of the molecular marker (S) is arbitrary; to reduce sequencing costs we selected rather small sizes (300 bp–500 bp).

(1) BG score using DNA polymorphism (fewer than 4 genomes)
(2) BG score using DNA polymorphism and size (fewer than 4 genomes)
(3) BG score using DNA polymorphism and Tajima's D14 (4 genomes or more)
(4) BG score using DNA polymorphism, Tajima's D and size (4 genomes or more)

The maximum value of the score is 1, obtained with π_i(DS) = 1, π_i(SS) = 0, Tajima's D = −2 and S_i = S. The minimum value is 0, obtained with π_i(DS) = 0, π_i(SS) = 1, Tajima's D = +2 and S_i ≠ S.
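As a hedged illustration of the cases without Tajima's D, the sketch below uses the form π_i(DS) × (1 − π_i(SS)) × S / (S + |S − S_i|). This form is an inference, not a quotation of formulas (1) and (2): it satisfies the stated bounds and reproduces the ten scores printed in Table 2 (S = 500 bp) to the printed precision, which suggests but does not prove that it matches formula (2). The function names are hypothetical, and the Tajima's D variants (3) and (4) are deliberately not covered.

```python
def size_term(gene_size, target_size):
    """Assumed size penalty S / (S + |S - S_i|); equals 1 when S_i == S."""
    return target_size / (target_size + abs(target_size - gene_size))


def polymorphism_score(pi_ds, pi_ss, gene_size=None, target_size=None):
    """Inferred score: high between-species and low within-species diversity score best."""
    score = pi_ds * (1.0 - pi_ss)
    if gene_size is not None and target_size is not None:
        score *= size_term(gene_size, target_size)
    return score


# (gene, size in bp, pi(DS), pi(SS), printed score) copied from Table 2.
TABLE2 = [
    ("ATP6", 509, 0.01196, 0.00013, 0.01175),
    ("ND6", 458, 0.01156, 0.00015, 0.01066),
    ("ND3", 356, 0.00944, 0.00019, 0.00733),
    ("ND4L", 260, 0.00833, 0.00028, 0.00563),
    ("COX2", 569, 0.00596, 0.00023, 0.00524),
    ("ND2", 878, 0.00841, 0.00015, 0.00479),
    ("ND4", 1250, 0.01083, 0.00017, 0.00433),
    ("ND1", 890, 0.00719, 0.00022, 0.00404),
    ("ND5", 1568, 0.01115, 0.00047, 0.00355),
    ("COX1", 1565, 0.00720, 0.00004, 0.00230),
]

# Rank genes by the recomputed score, highest first, and show it next to the printed one.
ranked = sorted(
    ((polymorphism_score(ds, ss, size, 500), gene, printed)
     for gene, size, ds, ss, printed in TABLE2),
    reverse=True,
)
for score, gene, printed in ranked:
    print(f"{gene:5s} recomputed={score:.5f} printed={printed:.5f}")
```

The size term penalizes genes both shorter and longer than the target, so a gene with high between-species diversity can still rank low if it is far from the requested amplicon size.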

Experimental validation analysis

Additional Vibrio sequences for the candidate genes were obtained from biological samples stored in the Collection of Aquatic Important Microorganisms (CAIM) at the Center of Research for Nutrition and Development (Mexico). The strains were: V. ordalii CAIM608, V. aestuarianus CAIM592, V. orientalis CAIM332, V. tubiashii CAIM313, V. splendidus CAIM319, V. cyclitrophicus CAIM596, V. fortis CAIM629, V. parahaemolyticus CAIM320, V. harveyi CAIM513, V. rotiferianus CAIM577, V. mytili CAIM528, V. navarrensis CAIM609, V. fluvialis CAIM593, V. agarivorans CAIM615, V. mimicus CAIM602, V. metschnikovii CAIM317, V. vulnificus CAIM610, V. aerogenes CAIM906 and V. neptunius CAIM532. Similarly, additional sequences for candidate Diphyllobothrium marker genes were obtained from samples fixed in ethanol at the Parasitology Institute of the Biology Center of the Czech Republic. These included the strains D. latum TS-07/17, D. pacificum TS-06/30a.b., D. dendriticum TS-04/39, D. nihonkaiense TS-06/236, D. polyrugosum TS-05/58 and D. ditremum TS-02/32.

DNA purification and amplification

Genomic DNA from Vibrio species was purified using the E.Z.N.A. Bacterial DNA Kit (Omega Biotek, USA). Diphyllobothrium samples were diluted (1) in nuclease-free water and macerated with a mortar, and DNA was subsequently purified using the E.Z.N.A. Tissue DNA Kit (Omega Biotek, USA), following the manufacturer's instructions. The final PCR volume was 50 μL, containing 5 μL 10x buffer (20 nM Tris-HCl pH 8.0, 40 nM NaCl, 2 mM sodium phosphate, 0.1 mM EDTA, 1 mM DTT, stabilizers, 50% (v/v) glycerol), 1 μL dNTPs (10 mM), 6 μL MgCl2 (50 mM), 1 μL primers (10 μM), 0.5 μL Platinum Taq DNA polymerase (2.5 U), 5 μL template DNA and 31.5 μL nuclease-free water. Primers for target gene amplification were designed based on the level of observed sequence conservation. The primers used for the Vibrio target gene were forward 5′-ATG GTT TCA ATT AAN GGN TTR CCK CC-3′ and reverse 5′-TTA GAT GTA RAK ATC GAC MCC NA-3′, and those for the Diphyllobothrium target gene were forward 5′-ATG ATC TTT AGT GGT TAT TCA-3′ and reverse 5′-CTA ATG GTC CAC TGA AAA TGA TAA TAT-3′. The thermal profile was as follows: initial activation (2 min, 95 °C), followed by 35 cycles of denaturation (1 min, 95 °C), annealing (1 min, 55 °C) and extension (1 min, 72 °C), and a final extension (4 min, 72 °C). Agarose gel electrophoresis (1.5%) with ethidium bromide staining was used to identify the PCR products from Vibrio (~300 bp) and Diphyllobothrium (~500 bp). PCR products were purified using the MinElute gel extraction kit (QIAGEN, USA) and cloned using the CloneJET PCR cloning kit (Fermentas, USA). This kit includes the positive selection cloning vector pJET1.2/blunt, which contains a lethal gene that is disrupted by ligation of a DNA insert into the cloning site. As a result, only cells with recombinant plasmids are able to propagate. Finally, DNA from the E. coli TOP10 colonies was purified using the E.Z.N.A. Bacterial DNA Kit (Omega Biotek, USA).
Total DNA obtained from clones was amplified using the pJET1.2 forward and reverse primers (CloneJET, Fermentas, USA) with the BigDye Terminator v3.1 Cycle Sequencing Kit (Applied Biosystems, USA), following the manufacturer's instructions. The PCR products were cleaned of unincorporated dyes using the Dye Terminator Removal kit (Omega Biotek, USA) and sequenced on an ABI PRISM 310 machine (Applied Biosystems, USA). The sequences obtained were edited, assembled, aligned and compared using CLC Genomics Workbench v3.5.5 (CLC Bio, Denmark).
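The Vibrio primers contain IUPAC ambiguity codes (N, R, K, M), so each one stands for a pool of distinct oligonucleotides. As a small illustration, the Python sketch below counts the length and degeneracy of the four primers given above. The helper name is hypothetical; the ambiguity table is the standard IUPAC nucleotide code.

```python
# Standard IUPAC nucleotide ambiguity codes and the bases each one represents.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def primer_degeneracy(primer):
    """Return (length, number of distinct sequences encoded) for a primer."""
    seq = primer.replace(" ", "").upper()
    count = 1
    for base in seq:
        count *= len(IUPAC[base])  # KeyError on a non-IUPAC character
    return len(seq), count


PRIMERS = {
    "Vibrio forward": "ATG GTT TCA ATT AAN GGN TTR CCK CC",
    "Vibrio reverse": "TTA GAT GTA RAK ATC GAC MCC NA",
    "Diphyllobothrium forward": "ATG ATC TTT AGT GGT TAT TCA",
    "Diphyllobothrium reverse": "CTA ATG GTC CAC TGA AAA TGA TAA TAT",
}

for name, primer in PRIMERS.items():
    length, count = primer_degeneracy(primer)
    print(f"{name}: {length} nt, {count} sequence(s)")
```

Under these codes the Vibrio forward primer encodes 64 sequences and the reverse primer 32, whereas the Diphyllobothrium primers are not degenerate.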

Molecular marker discrimination power analysis

To prioritize the markers, we developed a simple Discrimination Power (DP) score (5), based on Bayes's theorem as adapted by Willcox,11 which evaluates, for each molecular marker gene (x) analyzed, the maximum identity (ΔI) of each species to any other species, that is, its identity to the closest species. The maximum value of DP is 1 (ie, a perfect molecular marker), reached when the identity to the closest species tends to 0 for each species. The minimum value of DP is 0, reached when the identity to the closest species tends to 1 for each species.
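The DP values reported in Tables 3 and 4 are consistent with a simple product over species of (1 − Id), where Id is the identity of each species to its closest species for the marker. The Python sketch below assumes that form, which is inferred from the tabulated values rather than quoted from formula (5), and recomputes the four Diphyllobothrium scores from the Id columns of Table 4. The function name is hypothetical.

```python
from math import prod  # available in Python 3.8+


def discrimination_power(identities):
    """Product over species of (1 - identity to the closest species)."""
    return prod(1.0 - identity for identity in identities)


# Id columns of Table 4, in species order:
# D. dendriticum, D. ditremum, D. latum, D. nihonkaiense, D. pacificum.
TABLE4_IDENTITIES = {
    "18S rRNA": [0.999, 0.999, 0.999, 0.996, 0.964],
    "COX1": [0.936, 0.902, 0.905, 0.935, 0.849],
    "18S + ITS + 5.8S rRNA": [0.999, 0.999, 0.989, 0.973, 0.853],
    "ATPase6": [0.935, 0.892, 0.935, 0.902, 0.823],
}

# Rank markers from most to least discriminating.
scores = {m: discrimination_power(ids) for m, ids in TABLE4_IDENTITIES.items()}
for marker, dp in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{marker}: DP = {dp:.3e}")
```

Because DP is a product, a single pair of species with near-identical sequences pulls the score of the whole marker toward 0, which is why the rRNA-based markers score so much lower than the protein-coding genes.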

Results

Automated prioritization of marker genes

Publicly available genomes from Vibrio and Diphyllobothrium were downloaded and subjected to the marker gene selection approach described above. For each genus, a list of potential marker genes sorted in descending order of their BG scores was produced. For Vibrio species (Table 1), the best molecular marker is a protein-coding gene with locus tag VC1988 in chromosome 1 of the reference genome V. cholerae NC_002505. This gene encodes a chromosome segregation ATPase, a protein essential for cell division that forms part of a chromosomal segregation complex. In the case of Diphyllobothrium, the analysis of completely sequenced mitochondrial genomes identified the gene encoding subunit 6 of the ATPase complex as the best potential marker gene (Table 2). This enzyme is part of the mitochondrial oxidative phosphorylation system and is essential for the generation of ATP.15

Table 1

The 10 top-scoring marker genes for Vibrio species discrimination using S = 300 bp.

| Score_i | Locus tag | Size (bp) | π_i(DS) | π_i(SS) | Tajima's D (DS) |
|---|---|---|---|---|---|
| 0.00308 | VC1988 | 0.98387 | 0.03469 | 0.00000 | −0.09022 |
| 0.00252 | VC1954 | 0.33667 | 0.05809 | 0.00000 | −0.12885 |
| 0.00238 | VC2163 | 0.78667 | 0.03703 | 0.00000 | −0.08185 |
| 0.00237 | VC2354 | 0.47667 | 0.04847 | 0.00000 | −0.10258 |
| 0.00233 | VC2665 | 0.96667 | 0.03374 | 0.00000 | −0.07132 |
| 0.00222 | VC2189 | 0.59667 | 0.04396 | 0.00000 | −0.08477 |
| 0.00212 | VC1986 | 0.60653 | 0.04145 | 0.00000 | −0.08437 |
| 0.00208 | VC2658 | 0.82189 | 0.03318 | 0.00000 | −0.07621 |
| 0.00207 | VC2652 | 0.56667 | 0.03689 | 0.00000 | −0.09897 |
| 0.00207 | VC1534 | 0.59817 | 0.04150 | 0.00000 | −0.08352 |

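Table 2

The 10 top-scoring marker genes for Diphyllobothrium species discrimination using S = 500 bp.

| Score | Gene | Size (bp) | π_i(DS) | π_i(SS) |
|---|---|---|---|---|
| 0.01175 | ATP6 | 509 | 0.01196 | 0.00013 |
| 0.01066 | ND6 | 458 | 0.01156 | 0.00015 |
| 0.00733 | ND3 | 356 | 0.00944 | 0.00019 |
| 0.00563 | ND4L | 260 | 0.00833 | 0.00028 |
| 0.00524 | COX2 | 569 | 0.00596 | 0.00023 |
| 0.00479 | ND2 | 878 | 0.00841 | 0.00015 |
| 0.00433 | ND4 | 1250 | 0.01083 | 0.00017 |
| 0.00404 | ND1 | 890 | 0.00719 | 0.00022 |
| 0.00355 | ND5 | 1568 | 0.01115 | 0.00047 |
| 0.00230 | COX1 | 1565 | 0.00720 | 0.00004 |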

Experimental Validation

To validate the effectiveness of our approach, we amplified these marker genes from additional strains of known taxonomic assignment for which no genomic sequences are currently available. The effectiveness of the markers, as measured by the Discrimination Power (DP) score described above, was compared to that of common markers used previously for these species. These were atpA,16 dnaJ17 and recA18 for Vibrio, and 18S rRNA, COX1 and 18S rRNA + ITS + 5.8S rRNA19,20 for Diphyllobothrium. Twenty new sequences of the chromosome segregation ATPase gene were obtained from different Vibrio species. Remarkably, this gene showed the best Discrimination Power value (Table 3), with a DP score of 6.3 × 10⁻¹⁴. Standard markers showed lower discrimination powers: dnaJ (DP = 3.5 × 10⁻¹⁹), atpA (DP = 1.1 × 10⁻²⁶) and finally recA (DP = 8.0 × 10⁻²⁷). In the case of Diphyllobothrium, seven new sequences of the ATPase subunit 6 (ATP6) gene were obtained from different species. Again, the marker gene selected by our approach presented the highest Discrimination Power (DP = 7.9 × 10⁻⁶), followed by COX1 (DP = 5.8 × 10⁻⁶), 18S + ITS + 5.8S rRNA (DP = 4.4 × 10⁻¹¹) and 18S rRNA (DP = 1.4 × 10⁻¹³) (Table 4).
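
Table 3

Comparison of prokaryotic molecular marker genes using Discrimination Power scoring.

| Species | Accession number | SC | recA CSC | recA Id | recA (1−Id) | dnaJ CSC | dnaJ Id | dnaJ (1−Id) | atpA CSC | atpA Id | atpA (1−Id) | Chromosome segregation ATPase CSC | Chromosome segregation ATPase Id | Chromosome segregation ATPase (1−Id) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| V. aestuarianus | JN040521 | 1 | 2 | 0.999 | 0.001 | 5 | 0.801 | 0.199 | 8 | 0.899 | 0.101 | 18 | 0.708 | 0.292 |
| V. alginolyticus | NZ_AAPS01000071 | 2 | 1 | 0.999 | 0.001 | 15 | 0.883 | 0.117 | 14 | 0.967 | 0.033 | 14 | 0.809 | 0.191 |
| V. cholerae | NC_002505 | 3 | 11 | 0.921 | 0.079 | 11 | 0.932 | 0.068 | 11 | 0.958 | 0.042 | 11 | 0.876 | 0.124 |
| V. coralliilyticus | NZ_ACZN01000015 | 4 | 12 | 0.971 | 0.029 | 12 | 0.904 | 0.096 | 20, 12 | 0.957 | 0.043 | 13 | 0.779 | 0.221 |
| V. cyclitrophicus | JN040526 | 5 | 18 | 0.924 | 0.076 | 18 | 0.904 | 0.096 | 18 | 0.973 | 0.027 | 18 | 0.844 | 0.156 |
| V. fischeri | NC_006840 | 6 | 16 | 0.876 | 0.124 | 16 | 0.852 | 0.148 | 16 | 0.918 | 0.082 | 16 | 0.761 | 0.239 |
| V. fluvialis | JN040529 | 7 | 3, 11 | 0.861 | 0.139 | 2 | 0.848 | 0.152 | 4 | 0.889 | 0.111 | 1 | 0.673 | 0.327 |
| V. fortis | JN040527 | 8 | 5 | 0.882 | 0.118 | 19 | 0.853 | 0.147 | 15 | 0.942 | 0.058 | 18 | 0.802 | 0.198 |
| V. harveyi | JN040517 | 9 | 15 | 0.979 | 0.021 | 15 | 0.925 | 0.075 | 15 | 0.979 | 0.021 | 14 | 0.868 | 0.132 |
| V. metschnikovii | JN040531 | 10 | 15 | 0.845 | 0.155 | 7 | 0.825 | 0.175 | 20 | 0.835 | 0.165 | 7 | 0.646 | 0.354 |
| V. mimicus | JN040530 | 11 | 3 | 0.921 | 0.079 | 3 | 0.932 | 0.068 | 3 | 0.958 | 0.042 | 3 | 0.876 | 0.124 |
| V. neptunius | JN040535 | 12 | 4 | 0.971 | 0.029 | 4 | 0.904 | 0.096 | 4 | 0.957 | 0.043 | 19 | 0.577 | 0.423 |
| V. orientalis | JN040523 | 13 | 17 | 0.893 | 0.107 | 13 | 0.851 | 0.149 | 19 | 0.974 | 0.026 | 19 | 0.787 | 0.213 |
| V. parahaemolyticus | JN040516 | 14 | 9 | 0.917 | 0.083 | 2 | 0.867 | 0.133 | 15 | 0.970 | 0.030 | 9 | 0.868 | 0.132 |
| V. rotiferianus | JN040518 | 15 | 9 | 0.979 | 0.021 | 9 | 0.925 | 0.075 | 9 | 0.979 | 0.021 | 14 | 0.774 | 0.226 |
| V. salmonicida | NC_011312 | 16 | 6 | 0.876 | 0.124 | 6 | 0.852 | 0.148 | 6 | 0.918 | 0.082 | 6 | 0.761 | 0.239 |
| V. shilonii | NC_ABCH01000040 | 17 | 13 | 0.893 | 0.107 | 13 | 0.826 | 0.174 | 15 | 0.912 | 0.088 | 1 | 0.539 | 0.461 |
| V. splendidus | JN040524 | 18 | 5 | 0.924 | 0.076 | 5 | 0.904 | 0.096 | 5 | 0.973 | 0.027 | 5 | 0.844 | 0.156 |
| V. tubiashii | JN040522 | 19 | 9 | 0.886 | 0.114 | 14, 8 | 0.853 | 0.147 | 13 | 0.935 | 0.065 | 13 | 0.787 | 0.213 |
| V. vulnificus | JN040533 | 20 | 9, 11 | 0.859 | 0.141 | 9 | 0.842 | 0.158 | 17 | 0.904 | 0.096 | 9 | 0.700 | 0.3 |
| Discrimination power score | | | | | 7.980 × 10⁻²⁷ | | | 3.530 × 10⁻¹⁹ | | | 1.070 × 10⁻²⁶ | | | 6.310 × 10⁻¹⁴ |

Notes: The highest score is that of the chromosome segregation ATPase. Accessions JN040516–JN040535 were obtained in this work.

Abbreviations: SC, species code; CSC, closest species code; Id, identity (matching nucleotides/total nucleotides).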

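Table 4

Comparison of eukaryotic molecular marker genes using Discrimination Power scoring.

| Species | Accession number | SC | 18S rRNA CSC | 18S rRNA Id | 18S rRNA (1−Id) | COX1 CSC | COX1 Id | COX1 (1−Id) | 18S + ITS + 5.8S rRNA CSC | 18S + ITS + 5.8S rRNA Id | 18S + ITS + 5.8S rRNA (1−Id) | ATPase6 CSC | ATPase6 Id | ATPase6 (1−Id) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| D. dendriticum | JN040538 | 1 | 2 | 0.999 | 0.001 | 4 | 0.936 | 0.064 | 2 | 0.999 | 0.001 | 3 | 0.935 | 0.065 |
| D. ditremum | JN040539 | 2 | 1 | 0.999 | 0.001 | 3 | 0.902 | 0.098 | 1 | 0.999 | 0.001 | 1, 3 | 0.892 | 0.108 |
| D. latum | JN040536 | 3 | 1, 2 | 0.999 | 0.001 | 4 | 0.905 | 0.095 | 1, 2 | 0.989 | 0.011 | 1 | 0.935 | 0.065 |
| D. nihonkaiense | JN040540 | 4 | 1, 2, 3 | 0.996 | 0.004 | 3 | 0.935 | 0.065 | 1, 2 | 0.973 | 0.027 | 1 | 0.902 | 0.098 |
| D. pacificum | JN040541 | 5 | 1, 2, 3, 4 | 0.964 | 0.036 | 2 | 0.849 | 0.151 | 1, 2 | 0.853 | 0.147 | 1, 3 | 0.823 | 0.177 |
| Discrimination power value | | | | | 1.440 × 10⁻¹³ | | | 5.848 × 10⁻⁶ | | | 4.365 × 10⁻¹¹ | | | 7.914 × 10⁻⁶ |

Notes: The highest score is that of ATPase6. Accessions JN040536–JN040541 were obtained in this work.

Abbreviations: SC, species code; CSC, closest species code; Id, identity (matching nucleotides/total nucleotides).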

Discussion

We have proposed and validated a novel approach for the informed selection of marker genes based on the observed levels of DNA polymorphism10 among whole genomic sequences. Our results indicate that the approach effectively selects marker genes for species differentiation. Besides having greater discrimination power than traditional markers, our markers also reduced the number of species that showed identical sequences for the marker. Nevertheless, in both genera studied, some species remain too closely related to be differentiated with a single marker. For these, a combination of markers, or the selection of specific markers for that group of species within the genus, would be required. Our approach has some minimal requirements. For instance, if the goal is to obtain marker genes for species differentiation in a given genus, a minimum of three different strain genomes belonging to two different species within the genus is required. Moreover, the design of primers may present problems if the sequences are too divergent, although this problem is shared with other approaches. Our approach and scoring system provide a new, powerful tool for exploiting available genome sequences to assist in the selection of marker genes. In both the eukaryotic and prokaryotic genera tested, the theoretical analyses showed excellent correlation with empirical results, and the selected markers performed better than molecular markers previously proposed by different authors for the same species. The adaptation of Bayes's theorem permitted the use of a conditioned statistic that prioritizes genes showing low DNA polymorphism within the same species (different strains) while displaying high DNA polymorphism between different species.

Supplementary Data

The scoring system and the necessary re-formatting scripts have been implemented in PERL. The PERL scripts (SCORE.pl and XMFA.pl) and a user manual for Windows, Linux and Mac are available at http://www.bioinformatics.cl.

References (10 of 20 listed)

Review 1.  Genomic approaches to typing, taxonomy and evolution of bacterial isolates.

Authors:  V Gürtler; B C Mayall
Journal:  Int J Syst Evol Microbiol       Date:  2001-01       Impact factor: 2.747

2.  Gene sequences useful for predicting relatedness of whole genomes in bacteria.

Authors:  Daniel R Zeigler
Journal:  Int J Syst Evol Microbiol       Date:  2003-11       Impact factor: 2.747

3.  Expression of the mitochondrial ATPase6 gene and Tfam in Down syndrome.

Authors:  Sook Hwan Lee; Suman Lee; Hye Sun Jun; Hye Jin Jeong; Won Tae Cha; Yong Sun Cho; Jung Hwan Kim; Seung Yup Ku; Kwang Yul Cha
Journal:  Mol Cells       Date:  2003-04-30       Impact factor: 5.034

4.  VariScan: Analysis of evolutionary patterns from large-scale DNA sequence polymorphism data.

Authors:  Albert J Vilella; Angel Blanco-Garcia; Stephan Hutter; Julio Rozas
Journal:  Bioinformatics       Date:  2005-04-06       Impact factor: 6.937

5.  Phylogenetic analysis of vibrios and related species by means of atpA gene sequences.

Authors:  Cristiane C Thompson; Fabiano L Thompson; Ana Carolina P Vicente; Jean Swings
Journal:  Int J Syst Evol Microbiol       Date:  2007-11       Impact factor: 2.747

Review 6.  Phylogenetic understanding of clonal populations in an era of whole genome sequencing.

Authors:  Talima Pearson; Richard T Okinaka; Jeffrey T Foster; Paul Keim
Journal:  Infect Genet Evol       Date:  2009-05-27       Impact factor: 3.342

7.  Statistical method for testing the neutral mutation hypothesis by DNA polymorphism.

Authors:  F Tajima
Journal:  Genetics       Date:  1989-11       Impact factor: 4.562

8.  Identification of bacteria by computer: theory and programming.

Authors:  W R Willcox; S P Lapage; S Bascomb; M A Curtis
Journal:  J Gen Microbiol       Date:  1973-08

9.  Morphologic and genetic identification of Diphyllobothrium nihonkaiense in Korea.

Authors:  Hyeong-Kyu Jeon; Kyu-Heon Kim; Sun Huh; Jong-Yil Chai; Duk-Young Min; Han-Jong Rim; Keeseon S Eom
Journal:  Korean J Parasitol       Date:  2009-12-01       Impact factor: 1.341

Review 10.  Update on the human broad tapeworm (genus diphyllobothrium), including clinical relevance.

Authors:  Tomás Scholz; Hector H Garcia; Roman Kuchta; Barbara Wicht
Journal:  Clin Microbiol Rev       Date:  2009-01       Impact factor: 26.132
