Matteo Ramazzotti, Luisa Berná, Irene Stefanini, Duccio Cavalieri.
Abstract
The search for genes that capture the genetic relationships among strains or individuals within populations, and their evolutionary history, is gaining a new dimension of complexity with the advancement of next-generation sequencing (NGS) technologies. Sequencing an entire genome uncovers genetic variation in coding and non-coding regions and makes it possible to study Saccharomyces cerevisiae populations at the strain level. Nevertheless, the unfavorable cost-benefit ratio (the amount of detail disclosed by NGS against the time-consuming and expertise-demanding data assembly process) still precludes the application of these techniques to the routine assignment of yeast strains, making the selection of the most reliable molecular markers highly desirable. In this work we propose an original computational approach to discover genes that can be used as descriptors of the population structure. We found 13 genes whose variability can be used to recapitulate the phylogeny obtained from genome-wide sequences. The same approach, which we show to be successful in yeasts, can be generalized to any other population of individuals, given the availability of high-quality genomic sequences and of a clear population structure to be targeted.
Year: 2012 PMID: 22266652 PMCID: PMC3351171 DOI: 10.1093/nar/gks005
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Scheme of the analysis pipeline developed in this work. After the collection of coding sequences from genomic files (one per organism, with uniform annotation for all the orthologous genes), the pipeline converts per-organism files into per-gene files. Multiple sequence alignments are then performed using ClustalW. The alignments can be clipped at the extremities to cope with alternative starts or premature stops and can then, optionally, be subjected to the removal of non-informative loci. The genes are then analyzed as single entities, combined in doublets, triplets or other groupings, or all together, using Phylip (in this work with the Kimura two-parameter distance followed by neighbor-joining) with optional bootstrap support. Finally, the trees are screened for the presence of predefined clusters, allowing the selection of genes or combinations that perfectly satisfy an arbitrary scheme that, in this work, was based on the phylogenetic tree obtained by Liti et al. (12) with a genome-wide survey.
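Two steps of the pipeline in Figure 1 can be illustrated with a minimal Python sketch: the optional removal of non-informative loci and the Kimura two-parameter distance. This is not the authors' code (the work itself relies on Phylip). The function names, the dictionary-of-strings alignment layout and the choice to skip gapped or ambiguous sites are assumptions made for illustration.

```python
import math

# Purine-purine and pyrimidine-pyrimidine substitutions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def drop_invariant_columns(alignment):
    """Keep only columns in which at least two different symbols occur.

    `alignment` maps a strain name to its aligned sequence (hypothetical layout);
    all sequences are assumed to have the same length.
    """
    names = list(alignment)
    columns = zip(*(alignment[name] for name in names))
    kept = [col for col in columns if len(set(col)) > 1]
    return {name: "".join(col[i] for col in kept) for i, name in enumerate(names)}


def k2p_distance(seq_a, seq_b):
    """Kimura two-parameter distance between two aligned sequences.

    Sites with gaps or ambiguity codes are skipped (an assumption of this sketch).
    """
    transitions = transversions = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        if (a, b) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared      # proportion of transitions
    q = transversions / compared    # proportion of transversions
    inner = (1 - 2 * p - q) * math.sqrt(1 - 2 * q) if 1 - 2 * q > 0 else 0.0
    if inner <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(inner)


toy = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "ACGTGCGT"}  # invented data
variable_only = drop_invariant_columns(toy)
distance = k2p_distance(toy["s1"], toy["s2"])
```

Note that the proportions of differences, and therefore the distances, are larger on a variable-sites-only alignment than on the full-length one, so values from the two kinds of alignment are not directly comparable.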
Figure 2. Reproduction of the genome-wide phylogenetic tree with our analysis pipeline and using only SNPs/indels in coding sequences. Colors and legends reflect the criteria used in Liti et al. (12) to allow a direct comparison. (A) Tree reproduced using all genes shared by all strains. (B) Tree obtained with the gene YJL099W. (C) Tree obtained with the gene YPR152C. (D) Tree obtained with the gene YJL057C. (E) Tree obtained with the gene YNL161W (branches have been scaled in cladogram mode to appreciate strain resolution).
Table 1. A compendium of the trees analyzed in this work, obtained with the genes shared by the 39 S. cerevisiae strains under analysis
| k | Theoretical combinations | Tested combinations | SNP/indel-based: percent matching | SNP/indel-based: genes | Full-length: percent matching | Full-length: genes |
|---|---|---|---|---|---|---|
| 1 | 5850 | 5809 | 0.2 | 12 | 0.05 | 3 |
| 2 | 17,108,325 | 16,869,336 | 1.84 | 5715 (5498) | 0.61 | 5452 (4753) |
| 3 | 33,349,828,200 | 1,000,000 | 5.35 | 5681 (5632) | 2.05 | 5325 (5222) |
The table reports the theoretical number of simple combinations obtainable with all genes taken k at a time, the number of combinations actually tested, and the percentage of matching trees (see the ‘Materials and Methods’ section for a definition of ‘matching’) from phylogenetic analyses based on SNP/indel-only and full-length multiple sequence alignments.
a: Using only nodes supported by >60% bootstrap. For k = 2 (doublets) and k = 3 (triplets), only the trees matching in the SNP/indel-based analysis were tested.
b: Count after removing the members (doublets or triplets) that contain any of the 12 singletons.
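The combination counts in the table above can be checked with a few lines of Python. The sketch below only reproduces the arithmetic; the helper name `percent_matching` is an illustrative assumption, not part of the published pipeline.

```python
from math import comb

N_GENES = 5850    # genes shared by all strains (theoretical column)
N_TESTED = 5809   # genes actually tested as singletons


def percent_matching(matching, tested):
    """Share of tested trees that match the reference clusters, in percent."""
    return 100.0 * matching / tested


# Theoretical number of singlets, doublets and triplets.
theoretical = {k: comb(N_GENES, k) for k in (1, 2, 3)}
# theoretical[2] is 17108325 and theoretical[3] is 33349828200

# The tested doublets correspond to all pairs of the 5809 tested genes.
tested_doublets = comb(N_TESTED, 2)   # 16869336

# Singletons: 12 matching trees (SNP/indel-based) and 3 (full-length).
snp_percent = percent_matching(12, N_TESTED)
full_percent = percent_matching(3, N_TESTED)
```

The triplet space (more than 33 billion combinations) explains why only a sample of 1,000,000 triplets was tested.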
Table 2. Description of the 13 genes that recapitulate the genome-wide analysis when used as single markers for phylogenetic analysis
| ORF | Name | Length | SNPs/indels | Description |
|---|---|---|---|---|
| | | 1429 | 80 | Putative protein of unknown function containing WW and FF domains; overexpression causes accumulation of cells in G1 phase |
| | | 2241 | 44 | Member of the ChAPs (Chs5p-Arf1p-binding proteins: Bch1p, Bch2p, Bud7p, Chs6p) family of proteins that forms the exomer complex with Chs5p to mediate export of specific cargo proteins, including Chs3p, from the Golgi to the plasma membrane |
| | | 2004 | 38 | Putative serine/threonine kinase; expression is induced during mild heat stress; deletion mutants are hypersensitive to copper sulfate and resistant to sorbate; interacts with an N-terminal fragment of Sst2p |
| YJL051W | | 2469 | 50 | Bud tip localized protein of unknown function; mRNA is targeted to the bud by a She2p dependent transport system; mRNA is cell cycle regulated via Fkh2p, peaking in G2/M phase; null mutant displays increased levels of spontaneous Rad52p foc |
| YKL068W | | 2994 | 204 | Subunit of the nuclear pore complex (NPC) that is localized to both sides of the pore; contains a repetitive GLFG motif that interacts with mRNA export factor Mex67p and with karyopherin Kap95p; homologous to Nup116p |
| YML080W | | 1272 | 20 | Dihydrouridine synthase, member of a widespread family of conserved proteins including Smm1p, Dus3p and Dus4p; modifies pre-tRNA(Phe) at U17 |
| YML056C | | 1575 | 72 | Inosine monophosphate dehydrogenase, catalyzes the first step of guanosine monophosphate (GMP) biosynthesis, member of a four-gene family in |
| YNL161W | | 2791 | 628 | Serine/threonine-protein kinase that regulates cell morphogenesis pathways, including cell wall biosynthesis, mating projection morphology, bipolar bud site selection; regulates SRL1 mRNA localization via phosphorylation of substrate Ssd1p |
| YNL125C | | 2022 | 52 | Protein with similarity to monocarboxylate permeases, appears not to be involved in transport of monocarboxylates such as lactate, pyruvate or acetate across the plasma membrane |
| YOR133W | | 2529 | 23 | Elongation factor 2 (EF-2), also encoded by |
| YAR042W | | 3570 | 130 | Protein similar to mammalian oxysterol-binding protein; contains ankyrin repeats; localizes to the Golgi and the nucleus–vacuole junction |
| YBL052C | | 2499 | 68 | Histone acetyltransferase catalytic subunit of NuA3 complex that acetylates histone H3, involved in transcriptional silencing; homolog of the mammalian MOZ proto-oncogene; mutant has aneuploidy tolerance; sas3gcn5 double mutation is lethal |
| YBR163W | | 1758 | 38 | Mitochondrial 5′–3′-exonuclease and sliding exonuclease, required for mitochondrial genome maintenance; distantly related to the RecB nuclease domain of bacterial RecBCD recombinases; may be regulated by the transcription factor Ace2p |
Length indicates the length of the alignment when the full genes in all strains are taken into account. SNPs/indels indicates the number of SNPs or indels (all gaps included) present in that alignment. The descriptions were taken from the SGD reference annotation file. Bold indicates the genes whose phylogeny was verified by full-length analysis with bootstrap support.
a: Genes able to mimic the phylogenetic relationships of the validation genomes.
Table 3. Summary of the results obtained from the combinatorial analysis of the 13 genes that, taken individually, recapitulate the genome-wide analysis
| k | Combinations C(13, k) | SNP/indel-based: matching trees | SNP/indel-based: strain resolution | Full-length: matching trees | Full-length: strain resolution |
|---|---|---|---|---|---|
| 1 | 13 | 12 | 0 | 3 | 0 |
| 2 | 78 | 76 | 7 | 56 | 1 |
| 3 | 286 | 286 | 50 | 211 | 21 |
Matching trees are those correctly mapping the five clusters observed with the genome-wide analysis by Liti and coworkers. The strain resolution columns indicate how many trees have the ability to resolve the phylogenesis of the single strains within the same cluster, in at least three out of five clusters. Full-length analyses were supported by bootstrap and a 60% cutoff was used to remove weak nodes.
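The notion of a "matching" tree can be made concrete with a small sketch. It treats a tree as nested tuples of strain names and calls it matching when every predefined cluster forms its own clade. Because neighbor-joining trees are unrooted, the sketch also accepts a cluster whose complement is a clade. The data layout, function names and strain labels are hypothetical, and the authors' actual screening code may apply different criteria.

```python
def collect_leaf_sets(tree, seen):
    """Return the leaves under `tree` and record the leaf set of every subtree in `seen`."""
    if isinstance(tree, str):
        leaves = frozenset([tree])
    else:
        leaves = frozenset().union(*(collect_leaf_sets(child, seen) for child in tree))
    seen.add(leaves)
    return leaves


def tree_matches(tree, clusters):
    """True if every cluster (name -> list of strains) is a clade of the tree.

    A cluster also counts as recovered when its complement is a clade, which
    makes the check independent of where the unrooted tree happens to be rooted.
    """
    seen = set()
    all_leaves = collect_leaf_sets(tree, seen)
    for members in clusters.values():
        group = frozenset(members)
        if group not in seen and (all_leaves - group) not in seen:
            return False
    return True


# Invented example with three clusters of hypothetical strains.
clusters = {"A": ["a1", "a2", "a3"], "B": ["b1", "b2"], "C": ["c1", "c2"]}
good_tree = ((("a1", "a2"), "a3"), (("b1", "b2"), ("c1", "c2")))
bad_tree = ((("a1", "b1"), "a3"), (("a2", "b2"), ("c1", "c2")))
```

Here `tree_matches(good_tree, clusters)` holds, while `bad_tree` mixes strains from clusters A and B and is rejected. Bootstrap filtering and the strain-resolution criterion are not covered by this sketch.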
Figure 3. Phylogenetic relationships of the eight validation strains with respect to the 39 learning strains. (A) Full SNPs/indels phylogenomic tree of the 47 strains. (B) Recapitulated tree obtained using the combination of SNPs/indels of the three genes YBR163W, YJL051W and YPR152C. Validation strains are marked in bold. Color scheme is the same as in Figure 1.
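One way to build the gene combinations used for doublets and triplets is to concatenate the per-gene alignments strain by strain. The sketch below assumes that "combination" means concatenation and that every gene alignment contains the same strains. Neither assumption is confirmed by the text, and the sequences and strain labels are invented.

```python
from itertools import combinations


def concatenate(alignments, genes):
    """Join the aligned sequences of the given genes, strain by strain.

    `alignments` maps gene -> {strain -> aligned sequence} (hypothetical layout).
    """
    strains = sorted(alignments[genes[0]])
    return {s: "".join(alignments[g][s] for g in genes) for s in strains}


def gene_combinations(alignments, k):
    """Yield (genes, concatenated alignment) for every combination of k genes."""
    for genes in combinations(sorted(alignments), k):
        yield genes, concatenate(alignments, genes)


# Toy SNP/indel-only alignments for the three genes named in Figure 3.
toy = {
    "YBR163W": {"s1": "AC", "s2": "AT"},
    "YJL051W": {"s1": "G", "s2": "C"},
    "YPR152C": {"s1": "TT", "s2": "TA"},
}
triplet = concatenate(toy, ("YBR163W", "YJL051W", "YPR152C"))
doublets = list(gene_combinations(toy, 2))
```

Each concatenated alignment would then go through the same distance and tree-building steps as a single gene.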
Table 4. Description of the 10 genes taken from the literature and evaluated in this work for their phylogenetic performance as singletons, doublets and triplets (see Table 2 for additional information)
| ORF | Name | Length | SNPs/indels | Description |
|---|---|---|---|---|
| YER168C | | 1641 | 25 | ATP (CTP): tRNA-specific tRNA nucleotidyltransferase; different forms targeted to the nucleus, cytosol and mitochondrion are generated via the use of multiple transcriptional and translational start sites |
| YOR065W | | 930 | 12 | Cytochrome c1, component of the mitochondrial respiratory chain; expression is regulated by the heme-activated, glucose-repressed Hap2p/3p/4p/5p CCAAT-binding complex |
| YNL117W | | 1665 | 19 | Malate synthase, enzyme of the glyoxylate cycle, involved in utilization of non-fermentable carbon sources; expression is subject to carbon catabolite repression; localizes in peroxisomes during growth in oleic acid medium |
| YOR328W | | 4695 | 71 | ATP-binding cassette (ABC) transporter, multidrug transporter involved in the pleiotropic drug resistance network; regulated by Pdr1p and Pdr3p |
| YML109W | | 2835 | 71 | Protein with a role in regulating Swe1p-dependent polarized growth; interacts with Cdc55p; interacts with silencing proteins at the telomere; implicated in the mitotic exit network through regulation of Cdc14p localization; paralog of Zds1p |
| YKL043W | | 1101 | 22 | Transcriptional activator that enhances pseudohyphal growth; physically interacts with the Tup1–Cyc8 complex and recruits Tup1p to its targets; regulates expression of |
| YGR044C | | 903 | 15 | Zinc finger protein involved in control of meiosis; prevents meiosis by repressing |
| YGL254W | | 900 | 16 | Transcription factor involved in sulfite metabolism, sole identified regulatory target is |
| YDR160W | | 2559 | 46 | Component of the SPS plasma membrane amino acid sensor system (Ssy1p-Ptr3p-Ssy5p), which senses external amino acid concentration and transmits intracellular signals that result in regulation of expression of amino acid permease genes |
| YKL109W | | 1665 | 41 | Subunit of the heme-activated, glucose-repressed Hap2p/3p/4p/5p CCAAT-binding complex, a transcriptional activator and global regulator of respiratory gene expression; provides the principal activation function of the complex |
Table 5. Summary of the results obtained from the combinatorial analysis of the 10 genes taken from the literature and proposed as phylogenetic probes (see Table 3 for a description of the columns and the data)
| k | Combinations C(10, k) | SNP/indel-based: matching trees | SNP/indel-based: strain resolution | Full-length: matching trees | Full-length: strain resolution |
|---|---|---|---|---|---|
| 1 | 10 | 0 | 0 | 0 | 0 |
| 2 | 45 | 6 | 0 | 2 | 0 |
| 3 | 120 | 25 | 0 | 11 | 0 |
Table 6. Average dN/dS, neutrality index (NI) and P-value of the McDonald–Kreitman test (MKT), and codon adaptation index (CAI) with the corresponding standard deviation, for the 13 genes under analysis
| Gene name | dN/dS (average) | MKT NI | MKT P-value | CAI | CAI SD |
|---|---|---|---|---|---|
| YJL099W | 0.2059 | 1.178 | NS | 0.172 | 0.014 |
| YJL057C | 0.2235 | | | 0.132 | 0.001 |
| YJL051W | 0.5575 | | | 0.136 | 0.001 |
| YKL068W | 0.2347 | | | 0.122 | 0.001 |
| YML080W | 0.1334 | 2.209 | NS | 0.136 | 0.003 |
| YML056C | 0.1236 | | | 0.117 | 0.001 |
| YNL161W | – | | | 0.107 | 0.001 |
| YNL125C | 0.1514 | 1.513 | NS | 0.088 | |
| YOR133W | 0.0424 | 1.339 | NS | 0.165 | 0.008 |
| YAR042W | 0.1818 | | | 0.154 | 0.003 |
| YBL052C | 0.2378 | 1.59 | NS | 0.137 | 0.024 |
| YBR163W | 0.5597 | 0.003 | | | |
| YPR152C | 0.4786 | 1.526 | NS | 0.112 | 0.006 |
a: These genes do not present orthologs in S. paradoxus. Significant results are presented in bold.
SD = standard deviation; NS = not significant.
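For readers unfamiliar with the MKT columns, the neutrality index is conventionally defined as NI = (Pn/Ps)/(Dn/Ds), where Pn and Ps are nonsynonymous and synonymous polymorphisms within a species and Dn and Ds are nonsynonymous and synonymous fixed differences from an outgroup. The sketch below computes NI and a two-sided Fisher exact P-value for the 2 × 2 table. The choice of Fisher's test is an assumption, since the text does not state which test was used. The counts are the classic Drosophila Adh values of McDonald and Kreitman, not data from this work.

```python
from math import comb


def neutrality_index(pn, ps, dn, ds):
    """NI = (Pn/Ps) / (Dn/Ds); undefined if Ps, Dn or Ds is zero."""
    return (pn / ps) / (dn / ds)


def fisher_exact_two_sided(pn, ps, dn, ds):
    """Two-sided Fisher exact P-value for the table [[pn, ps], [dn, ds]]."""
    polymorphic = pn + ps        # first row total
    nonsynonymous = pn + dn      # first column total
    total = pn + ps + dn + ds

    def prob(x):
        # Hypergeometric probability that x nonsynonymous changes are polymorphic.
        return (comb(nonsynonymous, x)
                * comb(total - nonsynonymous, polymorphic - x)
                / comb(total, polymorphic))

    lowest = max(0, polymorphic - (total - nonsynonymous))
    highest = min(polymorphic, nonsynonymous)
    observed = prob(pn)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= observed * (1 + 1e-9))


# Illustrative counts (Adh example): Pn=2, Ps=42, Dn=7, Ds=17.
ni = neutrality_index(2, 42, 7, 17)
p_value = fisher_exact_two_sided(2, 42, 7, 17)
```

An NI below 1 with a significant P-value indicates an excess of nonsynonymous divergence, while an NI above 1 indicates an excess of nonsynonymous polymorphism.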