Xing-Xing Shen, Leonidas Salichos, Antonis Rokas.
Abstract
Molecular phylogenetic inference is inherently dependent on choices in both methodology and data. Many insightful studies have shown how choices in methodology, such as the model of sequence evolution or optimality criterion used, can strongly influence inference. In contrast, much less is known about the impact of choices in the properties of the data, typically genes, on phylogenetic inference. We investigated the relationships between 52 gene properties (24 sequence-based, 19 function-based, and 9 tree-based) with each other and with three measures of phylogenetic signal in two assembled data sets of 2,832 yeast and 2,002 mammalian genes. We found that most gene properties, such as evolutionary rate (measured through the percent average of pairwise identity across taxa) and total tree length, were highly correlated with each other. Similarly, several gene properties, such as gene alignment length, guanine-cytosine (GC) content, and the proportion of tree distance on internal branches divided by relative composition variability (treeness/RCV), were strongly correlated with phylogenetic signal. Analysis of partial correlations between gene properties and phylogenetic signal, in which gene evolutionary rate and alignment length were simultaneously controlled, showed similar patterns of correlations, albeit weaker in strength. Examination of the relative importance of each gene property on phylogenetic signal identified gene alignment length, along with the number of parsimony-informative sites and the number of variable sites, as the most important predictors. Interestingly, the subsets of gene properties that optimally predicted phylogenetic signal differed considerably across our three phylogenetic measures and two data sets; however, gene alignment length and RCV were consistently included as predictors of all three phylogenetic measures in both yeasts and mammals. These results suggest that a handful of sequence-based gene properties are reliable predictors of phylogenetic signal and could be useful in guiding the choice of phylogenetic markers.
Keywords: correlation; gene function; gene tree; nuclear gene; phylogenetic signal; prediction
Year: 2016 PMID: 27492233 PMCID: PMC5010910 DOI: 10.1093/gbe/evw179
Source DB: PubMed Journal: Genome Biol Evol ISSN: 1759-6653 Impact factor: 3.416
Information on the 52 Gene Properties Used in This Study
| Property | Name | Description |
|---|---|---|
| Sequence-based | Aln_quality | Average of column confident scores (calculated using GUIDANCE2 from |
| | AlnLen | Alignment length |
| | AlnLen_nogaps | Alignment length after exclusion of all sites containing gaps |
| | CAM | Number of sites containing RGC-CAM substitutions (as defined by |
| | CAM_pct | Percentage of CAM substitutions |
| | Gap_pct_mean | Percent average of sites containing gaps across taxa |
| | Gap_pct_var | Variance of percentage of sites containing gaps across taxa |
| | GC_pct_mean | Percent average of GC content of all sites across taxa |
| | GC_pct_var | Variance of GC content percentage of all sites across taxa |
| | GC1_pct_mean | Percent average of GC content of first codon positions across taxa |
| | GC1_pct_var | Variance of GC content percentage of first codon positions across taxa |
| | GC2_pct_mean | Percent average of GC content of second codon positions across taxa |
| | GC2_pct_var | Variance of GC content percentage of second codon positions across taxa |
| | GC3_pct_mean | Percent average of GC content of third codon positions across taxa |
| | GC3_pct_var | Variance of GC content percentage of third codon positions across taxa |
| | nonCAM | Number of sites containing RGC_non-CAM substitutions (as defined by |
| | nonCAM_pct | Percentage of non-CAM substitutions |
| | PI_pct_mean | Percent average of pairwise identity across taxa |
| | PI_pct_var | Variance of percentage of pairwise identity across taxa |
| | PI_sites | Number of parsimony-informative sites |
| | PI_sites_pct | Percentage of parsimony-informative sites |
| | RCV | Relative nucleotide composition variability (as defined by |
| | Varsites | Number of variable sites |
| | Varsites_pct | Percentage of variable sites |
| Function-based | CAI | Codon adaptation index for a |
| | CBI | Codon bias index for a |
| | CC_regions | Number of coiled–coil regions for a |
| | Cen_distance | The physical distance between gene and centromere divided by chromosome length for a |
| | Exons | Number of exons in a |
| | Gen_interactions | Number of genetic interactions for a |
| | Gene_expression | Number of mapped reads per kilobase for a given gene from one million mapped reads (calculated using 2-replicate RNA-Seq data of |
| | GO_numbers | Number of Gene Ontology terms for a |
| | InterPros | Number of unique domains for a |
| | Paralogs | Number of paralogs of a |
| | Phy_interactions | Number of physical interactions for a |
| | Prot2Tran | Number of protein isoforms divided by number of transcripts for a |
| | Protein_abundance | Protein abundance levels for a |
| | Proteins | Number of protein isoforms for a |
| | Rel_distance | The physical position of a |
| | Repeats | Number of repeat elements for a |
| | Syn_codons_fre | Frequency of synonymous codons for a |
| | TFs | Number of transcription factors targeting a given gene (calculated using the Yeastract database of |
| | Transcripts | Number of transcripts for a |
| Tree-based | Inter_len_mean | Average length of internal branches across the maximum likelihood tree of a given alignment |
| | Inter_len_var | Variance of lengths of internal branches across the maximum likelihood tree of a given alignment |
| | Leaf_len_mean | Average length of external branches across the maximum likelihood tree of a given alignment |
| | Leaf_len_var | Variance of lengths of external branches across the maximum likelihood tree of a given alignment |
| | Leaf2node_mean | Average of the sum of all branch lengths that are between the outgroup node and each ingroup node across the maximum likelihood tree of a given alignment |
| | Leaf2node_var | Variance of the sum of all branch lengths that are between the outgroup node and each ingroup node across the maximum likelihood tree of a given alignment |
| | Total_treelen | Sum of all branch lengths across the maximum likelihood tree of a given alignment |
| | Treeness | Proportion of sum of internal branch lengths over sum of all branch lengths across the maximum likelihood tree of a given alignment (as defined by |
| | Treeness/RCV | Treeness divided by RCV (as defined by |
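
Several of the sequence-based and tree-based properties in the table can be computed directly from an alignment and a list of branch lengths. The following Python sketch is illustrative only and is not the authors' pipeline. The function names, the handling of gaps and ambiguous characters, and the RCV formula (sum over taxa of absolute deviations of each base count from its across-taxon mean, divided by number of taxa times alignment length) are assumptions, since the table's citations for these definitions are truncated in this record.

```python
from collections import Counter


def alignment_properties(seqs):
    """Compute a few sequence-based properties from aligned DNA sequences.

    seqs: dict mapping taxon name -> aligned sequence (all the same length).
    Returns a dict keyed by the property names used in the table.
    """
    rows = [s.upper() for s in seqs.values()]
    aln_len = len(rows[0])
    if any(len(s) != aln_len for s in rows):
        raise ValueError("sequences must be aligned to the same length")

    nogap = varsites = pi_sites = 0
    for col in zip(*rows):
        if "-" not in col:
            nogap += 1
        # Assumption: only unambiguous bases count as character states.
        counts = Counter(c for c in col if c in "ACGT")
        if len(counts) >= 2:
            varsites += 1
        # Parsimony-informative: at least two states, each in at least two taxa.
        if sum(1 for k in counts.values() if k >= 2) >= 2:
            pi_sites += 1

    # GC content per taxon, as a percentage of that taxon's unambiguous bases.
    gc = [
        100.0 * (s.count("G") + s.count("C")) / sum(s.count(b) for b in "ACGT")
        for s in rows
    ]

    # RCV under the assumed definition described in the lead-in.
    n_taxa = len(rows)
    deviation = 0.0
    for base in "ACGT":
        per_taxon = [s.count(base) for s in rows]
        mean = sum(per_taxon) / n_taxa
        deviation += sum(abs(c - mean) for c in per_taxon)
    rcv = deviation / (n_taxa * aln_len)

    return {
        "AlnLen": aln_len,
        "AlnLen_nogaps": nogap,
        "Varsites": varsites,
        "Varsites_pct": 100.0 * varsites / aln_len,
        "PI_sites": pi_sites,
        "PI_sites_pct": 100.0 * pi_sites / aln_len,
        "GC_pct_mean": sum(gc) / n_taxa,
        "RCV": rcv,
    }


def treeness(internal_lengths, external_lengths):
    """Sum of internal branch lengths divided by the sum of all branch lengths."""
    total = sum(internal_lengths) + sum(external_lengths)
    return sum(internal_lengths) / total


# Toy alignment (hypothetical data, four taxa, six sites).
toy = {"t1": "ACGTAC", "t2": "ACGTAT", "t3": "ACCTGT", "t4": "A-CTGC"}
props = alignment_properties(toy)
treeness_over_rcv = treeness([0.1, 0.2], [0.3, 0.3, 0.05, 0.05]) / props["RCV"]
```

Branch lengths are passed as plain lists here so that the sketch does not depend on any tree-parsing library; in practice they would be read from the maximum likelihood tree of each alignment.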
Fig. 1.—The eMRC phylogenies inferred from 2,832 yeast genes (A) and 2,002 mammalian genes (B). Branch support values near internodes are indicated in order of bootstrap support values (* represents 100%) using ASTRAL (Mirarab, Reaz, et al. 2014), GSF, and ICA. The branch lengths were estimated on the eMRC topology, as implemented in RAxML (Stamatakis 2014) (-f e option). Note that the eMRC topology is identical to the ASTRAL topology.
Fig. 2.—The correlation networks of 52 gene properties in yeasts (A) and mammals (B). Networks were explored and visualized with the interactive platform Gephi 0.8.2 (Bastian et al. 2009). The size of each node (nodes are depicted by circles) is proportional to the number of connections (edges) where Pearson’s coefficient r was ≥ 0.1. The full descriptions of the 52 gene properties are given in table 1. Values for the Pearson’s coefficients and the correlation networks are provided in supplementary tables S4–S6, Supplementary Material online.
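
The node sizes described in this caption amount to counting, for each gene property, how many other properties it correlates with at or above the threshold. A minimal Python sketch of that counting step follows. It is not the Gephi workflow used in the study: the function names and the toy values are hypothetical, and applying the threshold to the absolute value of r is an assumption (the caption states only r ≥ 0.1). Properties with zero standard deviation are skipped because their correlation is undefined.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson's r, or None when either variable has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


def network_degrees(table, threshold=0.1):
    """Count edges per property, where an edge means |r| >= threshold.

    table: dict mapping property name -> list of per-gene values.
    """
    degree = {name: 0 for name in table}
    for a, b in combinations(table, 2):
        r = pearson(table[a], table[b])
        # Assumption: the threshold is applied to the magnitude of r.
        if r is not None and abs(r) >= threshold:
            degree[a] += 1
            degree[b] += 1
    return degree


# Hypothetical per-gene values for four properties across four genes.
toy = {
    "AlnLen": [300, 600, 900, 1200],
    "PI_sites": [30, 70, 85, 130],
    "Transcripts": [1, 1, 1, 1],   # zero SD, so it gets no edges
    "Noise": [1, -1, -1, 1],
}
degrees = network_degrees(toy)
```

The resulting degree dictionary could then be exported as a node attribute for a network viewer.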
Fig. 3.—Heat maps representing all correlations between 52 gene properties and three phylogenetic measures (ABS; TCA all, RFD, Normalized RFD in recovering the eMRC phylogeny) before (A) and after (B) simultaneously controlling for gene alignment length and evolutionary rate in yeast and mammalian data sets. Only correlations having Pearson’s coefficient values ≥ 0.1 and P values < 0.05 are displayed in the heat map. Black cells represent cases in which the SD of a gene property is zero. The full descriptions of the 52 gene properties are given in table 1. Detailed values for the Pearson’s coefficients are provided in supplementary tables S7 and S8, Supplementary Material online.
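
Controlling simultaneously for alignment length and evolutionary rate corresponds to a second-order partial correlation. The Python sketch below uses the standard recursive formula, which removes one control variable at a time; it is an illustration of the general technique, not necessarily the software or procedure the authors used. The function names and the toy data are hypothetical.

```python
import math


def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, controls):
    """Correlation of x and y after controlling for every variable in `controls`.

    Uses r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2)),
    applied recursively so that any number of controls can be removed.
    """
    if not controls:
        return pearson(x, y)
    z, rest = controls[0], controls[1:]
    r_xy = partial_corr(x, y, rest)
    r_xz = partial_corr(x, z, rest)
    r_yz = partial_corr(y, z, rest)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical example: a gene property and a phylogenetic measure that are
# related only through two shared drivers (stand-ins for alignment length
# and evolutionary rate).
aln_len = [1, 1, 1, 1, -1, -1, -1, -1]
evo_rate = [1, 1, -1, -1, 1, 1, -1, -1]
gene_property = [3, 1, 1, -1, 1, -1, -1, -3]
signal = [3, 3, -1, -1, -1, -1, -1, -1]

raw_r = pearson(gene_property, signal)
controlled_r = partial_corr(gene_property, signal, [aln_len, evo_rate])
```

In this constructed example the raw correlation is clearly positive while the controlled correlation vanishes, which is the extreme case of the weakening that partial correlation is meant to detect.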
Fig. 4.—Relative importance of each of the gene properties to three phylogenetic measures in yeasts (A) and mammals (B). Full descriptions of the 52 gene properties are given in table 1. Note that three gene properties (Transcripts, Proteins, and Prot2Tran), whose SDs are zero, are not included in the analysis of yeast gene properties (A); similarly, the TFs gene property, which has substantial missing data, is not included in the analysis of mammal gene properties (B). The exact values of relative importance of each gene property to each phylogenetic measure can be found in supplementary table S9, Supplementary Material online.
Fig. 5.—The relative performance of optimal models composed of varying numbers of gene property predictors in predicting the values of each of three phylogenetic measures in yeasts (left panel) and mammals (right panel). For a given number of gene property predictors and the training data, the best regression model was determined by the subset selection technique. For each given best regression model, its MSE in predicting the accuracy in the testing data was calculated in yeasts (A) and mammals (B). The model with the lowest MSE value in each analysis is indicated by the red dot and was considered the best subset selection. The identity and overlap of gene property predictors of the three different phylogenetic measures in the two data sets are summarized in Venn diagrams (C and D). Full descriptions of the 52 gene properties are given in table 1. Note that three predictors (Transcripts, Proteins, and Prot2Tran), whose SDs are zero, and the Treeness predictor, which is too collinear with other predictors, are not included in yeasts (A); similarly, the TFs predictor, which has substantial missing data, is not included in mammals (B). Detailed values from the analysis of subset selections in yeasts and mammals are provided in supplementary tables S10 and S11, Supplementary Material online.
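
The procedure in this caption (pick the best model of each size on training data, then score it by MSE on testing data) can be sketched in plain Python. This is a simplified illustration under several assumptions: ordinary least squares with an intercept, exhaustive search over all subsets (feasible only for a handful of predictors), and lowest training error as the "best" criterion within each size. The function names and toy data are hypothetical, and the authors' actual software and search strategy are not given in this record.

```python
from itertools import combinations


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]  # fails if predictors are perfectly collinear
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def fit_ols(rows, y, cols):
    """Least-squares coefficients (intercept first) for predictor columns `cols`."""
    design = [[1.0] + [row[c] for c in cols] for row in rows]
    p = len(cols) + 1
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)


def mse(rows, y, cols, beta):
    """Mean squared error of the fitted model on the given rows."""
    preds = [beta[0] + sum(b * row[c] for b, c in zip(beta[1:], cols)) for row in rows]
    return sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y)


def best_subsets(train_x, train_y, test_x, test_y):
    """For each subset size k, return (best predictor columns, test MSE)."""
    n_pred = len(train_x[0])
    results = {}
    for k in range(1, n_pred + 1):
        # Best model of size k = lowest training MSE among all k-subsets.
        best = min(
            combinations(range(n_pred), k),
            key=lambda cols: mse(train_x, train_y, cols, fit_ols(train_x, train_y, cols)),
        )
        beta = fit_ols(train_x, train_y, best)
        results[k] = (best, mse(test_x, test_y, best, beta))
    return results


# Hypothetical data: column 0 drives the response, columns 1 and 2 do not.
train_x = [[1, 5, 2], [2, 3, 7], [3, 8, 1], [4, 1, 6], [5, 9, 3], [6, 2, 8]]
train_y = [3.1, 4.9, 7.05, 8.95, 11.1, 12.9]
test_x = [[7, 4, 5], [8, 6, 2], [9, 7, 9]]
test_y = [15.0, 17.1, 18.9]

results = best_subsets(train_x, train_y, test_x, test_y)
best_size = min(results, key=lambda k: results[k][1])  # lowest test MSE
```

The exhaustive search grows exponentially with the number of predictors, so with dozens of gene properties a dedicated subset-selection routine would be needed. The division inside `solve` also shows why a predictor that is too collinear with others has to be dropped before fitting.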