Literature DB >> 27222536

Controlling for Phylogenetic Relatedness and Evolutionary Rates Improves the Discovery of Associations Between Species' Phenotypic and Genomic Differences.

Xavier Prudent1, Genis Parra1, Peter Schwede1, Juliana G Roscito1, Michael Hiller2.   

Abstract

The growing number of sequenced genomes allows us now to address a key question in genetics and evolutionary biology: which genomic changes underlie particular phenotypic changes between species? Previously, we developed a computational framework called Forward Genomics that associates phenotypic to genomic differences by focusing on phenotypes that are independently lost in different lineages. However, our previous implementation had three main limitations. Here, we present two new Forward Genomics methods that overcome these limitations by (1) directly controlling for phylogenetic relatedness, (2) controlling for differences in evolutionary rates, and (3) computing a statistical significance. We demonstrate on large-scale simulated data and on real data that both new methods substantially improve the sensitivity to detect associations between phenotypic and genomic differences. We applied these new methods to detect genomic differences involved in the loss of vision in the blind mole rat and the cape golden mole, two independent subterranean mammals. Forward Genomics identified several genes that are enriched in functions related to eye development and the perception of light, as well as genes involved in the circadian rhythm. These new Forward Genomics methods represent a significant advance in our ability to discover the genomic basis underlying phenotypic differences between species. Source code: https://github.com/hillerlab/ForwardGenomics/.
© The Author 2016. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.

Entities:  

Keywords:  evolutionary and comparative genomics; gene loss.; phenotype–genotype associations

Mesh:

Year:  2016        PMID: 27222536      PMCID: PMC4948712          DOI: 10.1093/molbev/msw098

Source DB:  PubMed          Journal:  Mol Biol Evol        ISSN: 0737-4038            Impact factor:   16.240


Introduction

Evolution has led to a striking diversity of phenotypes between species. Many phenotypic differences between species are due to differences in their DNA. Today hundreds of animals have sequenced genomes and many more will be sequenced in future by projects like Genome-10K (Haussler et al. 2009), i5K (Robinson et al. 2011), and individual labs. These genomes provide an unprecedented opportunity to uncover the genomic changes that underlie phenotypic changes between species, which is a key question in genetics and evolutionary biology (Wray 2013; Dunn and Ryan 2015). Some examples of genomic changes that are involved in phenotypic changes between species have already been discovered. For example, amino acid and cis-regulatory differences in Foxp2 are likely involved in the evolution of human speech (Enard et al. 2009; Maricic et al. 2013). Several hearing genes show convergent amino acid changes in echolocating bats and cetaceans (Li et al. 2008; Liu et al. 2010; Shen et al. 2012), and experiments showed that convergent changes in the prestin gene underlie high-frequency hearing in these echolocating lineages (Liu et al. 2014). Furthermore, losses of ancestral genes have been associated with phenotypic changes (Stedman et al. 2004; Cheng and Detrich 3rd 2007; Meredith et al. 2014). Apart from changes affecting genes, comparative genomics uncovered prominent differences in the noncoding portion of the genome. For example, genomic screens detected noncoding regions that are conserved in chimpanzee and other mammals but have accelerated substitution rates in human or are completely deleted in the human lineage (Pollard et al. 2006a, b; Prabhakar et al. 2006; McLean et al. 2011; Hubisz and Pollard 2014). Experiments provided compelling evidence that some of these human-specific genomic changes are involved in human-specific phenotypes related to specific features of the brain and limb or the loss of penile spines (Pollard et al. 2006b; Prabhakar et al. 2008; McLean et al. 2011; Capra et al. 2013). However, despite this progress, we still know little about which genomic differences are associated with particular phenotypic differences between species. To detect such associations without relying on experiments, computational approaches are helpful, but are challenged by the fact that even closely related species exhibit numerous genomic and phenotypic differences. To predict associations between phenotypic and genomic differences, we previously developed a computational approach called Forward Genomics that focuses on phenotypes that are independently lost in different species (Hiller et al. 2012b). Forward Genomics relies on two main concepts. First, the genetic information for an ancestral phenotype is often conserved in the descendant species in which this phenotype is maintained due to purifying selection. If the phenotype is lost in a descendant species, this genetic information will evolve neutrally. Thus, genomic regions that only contain information for this phenotype will diverge faster and, over time, will be lost in the trait-loss lineages. This principle is exemplified by the loss of olfactory receptors in whales and dolphins (Kishida et al. 2007; McGowen et al. 2008), the loss of egg yolk genes in placental mammals (Brawand et al. 2008), or the loss of enamel-related genes in birds and other toothless species (Meredith et al. 2011, 2014). Second, the repeated loss of a phenotype will result in a specific genomic signature where those genomic regions that only contain information for this phenotype are conserved in the trait-preserving lineages and are diverged or completely lost in the trait-loss lineages. Forward Genomics uses this specific independent divergence signature to obtain specificity in a genome-wide search for genomic regions involved in the loss of a given phenotype. As a proof of concept, by applying Forward Genomics to the trait “synthesis of vitamin C”, which is independently lost in primates, guinea pigs, and many bats, we detected higher sequence divergence in the Gulo gene, which encodes the vitamin C synthesizing enzyme (Hiller et al. 2012b). Applied to another phenotype, the absence of phospholipids in the bile of guinea pigs and horses, Forward Genomics detected the loss of the Abcb4 gene, which encodes a phospholipid-secreting transporter (Hiller et al. 2012b). Forward Genomics has also been used together with transcriptional profiling to detect noncoding genes that might play a role in the evolution of folded brains by searching for noncoding RNAs that are conserved among gyrencephalic mammals and diverged in lissencephalic mammals (Johnson et al. 2015). Finally, the power of matching independent phenotypic changes with independent genomic divergence was further demonstrated in a recent study by Marcovitz, et al. (2016) who successfully detected numerous associations between conserved noncoding regions and morphological and physiological phenotypic changes. Our previous Forward Genomics implementation quantified sequence divergence for each species by reconstructing the likely ancestral DNA sequence of a given genomic region (Hiller et al. 2012b). We defined sequence divergence between an extant species and the common ancestor of all species of interest as the percent of identical bases: %id = id/(id + subs + ins + del) * 100, where id, subs, ins, and del are the numbers of identical bases, substitutions, inserted bases, and deleted bases, respectively. Here, we refer to this value as the global %id value, because divergence is measured between the common ancestor and an extant species (fig. 1). Although sequence divergence in a trait-loss lineage is likely the result of neutral evolution, %id values alone cannot distinguish between divergence caused by neutral or non-neutral processes (supplementary fig. 1, Supplementary Material online). To associate specific genomic regions to the given phenotype, our previous implementation searches for a region where all trait-loss species have a lower global %id value (higher sequence divergence) compared with all trait-preserving species (fig. 1). This method is called “perfect-match”.
Fig. 1.

Overview of three Forward Genomics methods. (A) Global %id values are computed by comparing the reconstructed sequence of the common ancestor of the species of interest (blue circle) to the sequence of an extant species. Local %id values are computed between the sequences at the start and end of each branch, which is either a reconstructed ancestral sequence (blue or green circle) or the sequence of an extant species. The branches in the phylogenetic tree are proportional to the number of substitutions per neutral site. Outgroup species are used to reconstruct the common ancestor. (B) The perfect-match method (Hiller et al. 2012b) assumes that the given phenotypic presence/absence (checkmark/cross) vector includes trait-losses in independent lineages and conducts a genome-wide search for genomic regions where all trait-loss species have a lower global %id value (higher sequence divergence) compared with all trait-preserving species. This is illustrated by a positive grey margin that separates the global %id values of both groups of species. (C) The GLS Forward Genomics method derives a covariance matrix that captures the phylogenetic relatedness between species. As illustrated for the first two species, the covariance between two species is the summed length of the branches that are shared between both species (highlighted in orange). The variance of a species is the summed length of all branches from the common ancestor to this species. Lower case letters indicate the length of the branches in the phylogenetic tree. A phylogenetic GLS approach (Grafen 1989) is used to compute a linear regression between the transformed normalized global %id values and the phenotypic pattern. The significance of a positive slope of the regression line is used as the significance of the association between phenotype and genotype. (D) The Branch method uses Dollo parsimony to estimate ancestral phenotypic states given the presence/absence pattern of the trait in the extant species. Each branch is then classified as trait-loss (red) or trait-preserving (blue). Local %id values are normalized by the expected value of a branch of the same length. If a genomic region is involved in the trait-loss, we expect that trait-loss branches are associated with lower normalized local %id values. The significance of a positive Pearson correlation coefficient is used as the significance for the association between phenotype and genotype.

Overview of three Forward Genomics methods. (A) Global %id values are computed by comparing the reconstructed sequence of the common ancestor of the species of interest (blue circle) to the sequence of an extant species. Local %id values are computed between the sequences at the start and end of each branch, which is either a reconstructed ancestral sequence (blue or green circle) or the sequence of an extant species. The branches in the phylogenetic tree are proportional to the number of substitutions per neutral site. Outgroup species are used to reconstruct the common ancestor. (B) The perfect-match method (Hiller et al. 2012b) assumes that the given phenotypic presence/absence (checkmark/cross) vector includes trait-losses in independent lineages and conducts a genome-wide search for genomic regions where all trait-loss species have a lower global %id value (higher sequence divergence) compared with all trait-preserving species. This is illustrated by a positive grey margin that separates the global %id values of both groups of species. (C) The GLS Forward Genomics method derives a covariance matrix that captures the phylogenetic relatedness between species. As illustrated for the first two species, the covariance between two species is the summed length of the branches that are shared between both species (highlighted in orange). The variance of a species is the summed length of all branches from the common ancestor to this species. Lower case letters indicate the length of the branches in the phylogenetic tree. A phylogenetic GLS approach (Grafen 1989) is used to compute a linear regression between the transformed normalized global %id values and the phenotypic pattern. The significance of a positive slope of the regression line is used as the significance of the association between phenotype and genotype. (D) The Branch method uses Dollo parsimony to estimate ancestral phenotypic states given the presence/absence pattern of the trait in the extant species. Each branch is then classified as trait-loss (red) or trait-preserving (blue). Local %id values are normalized by the expected value of a branch of the same length. If a genomic region is involved in the trait-loss, we expect that trait-loss branches are associated with lower normalized local %id values. The significance of a positive Pearson correlation coefficient is used as the significance for the association between phenotype and genotype. The perfect-match method has three main limitations, however. First, the phylogenetic relatedness between species is not taken into account. Instead, perfect-match simply matches a phenotypic presence/absence profile to a genomic divergence profile. Second, perfect-match does not control for differences in evolutionary rates (proportional to the branch lengths in fig. 1), which influence the global %id values. Third, perfect-match can only rank candidate loci but it cannot compute the significance of this association. Given that independent losses of phenotypes are frequent (McGhee 2011; Hiller et al. 2012b) and given the growing number of sequenced genomes, Forward Genomics has broad applicability to predict genomic loci involved in phenotypic changes between species. It is therefore worthwhile to increase the sensitivity to detect such phenotype–genotype associations. Here, we present two new Forward Genomics methods that overcome the three above-mentioned limitations of the perfect-match method by (1) directly controlling for phylogenetic relatedness, (2) controlling for differences in evolutionary rates, and (3) computing a statistical significance. We systematically compare these methods on 32 simulated datasets and on real data and show that both new methods provide a significant advance over the previous perfect-match method. We further used these new methods to detect genomic differences involved in the loss of vision in blind subterranean mammals. This genome-wide screen detected many genes with a function in eye development and the perception of light, as well as genes involved in the circadian rhythm. An implementation of the new methods is available at https://github.com/hillerlab/ForwardGenomics/.

Results

Two New Forward Genomics Methods Control for Phylogenetic Relatedness and Differences in Evolutionary Rates

Both new Forward Genomics methods require a phylogenetic tree with branch lengths proportional to the number of substitutions per neutral site. Since Forward Genomics only considers species with sequenced genomes, we can safely assume that this tree is available or can be inferred from genomic sequence data. Our first new Forward Genomics method makes use of the global %id values of a given genomic region, computed between all extant species of interest and their common ancestor (fig. 1). To address the three limitations described above, we first control for the phylogenetic relatedness between species by converting relatedness into a covariance matrix, which is a well-known method from phylogenetic comparisons (Grafen 1989; Martins and Hansen 1997; Pagel 1997, 1999). The covariance between two species is the sum of the length of all branches that are shared by these two species, as illustrated in figure 1. Second, since global %id values are influenced by the total branch length (L) from the common ancestor to an extant species, we control for differences in the evolutionary rates between species by normalizing global %id values as . As shown in supplementary figure 2, Supplementary Material online, this removes the influence of differences in evolutionary rates. Third, to compute the significance of the association between and the phenotype, we use a phylogenetic Generalized Least Square (GLS) approach (Grafen 1989; Martins and Hansen 1997; Pagel 1997, 1999) and compute the significance of a positive slope of the linear regression line. This Forward Genomics method is referred to as GLS. Unlike perfect-match and GLS that use global %id values, the second new method relies on sequence divergence measured for every branch, and is called “branch method”. This method first uses parsimony to classify branches into two groups: branches where the trait is lost and branches where the trait is preserved (fig. 1). Then, for a given genomic element, we test if the trait-loss branches tend to have higher sequence divergence compared with the trait-preserving branches. To this end, we compute a local %id value for every branch in the tree by comparing the sequence at the start and end of a branch (fig. 1). In contrast to the global %id values, local %id values are independent of each other, as every branch in the phylogenetic tree represents independent evolution. Thus, the branch method does not need to further control for phylogenetic relatedness. However, local %id values are influenced by the length of the respective branch, similar to global %id values that are influenced by the total distance to the common ancestor (supplementary fig. 2, Supplementary Material online). To compare local %id values between the trait-loss and trait-preserving branches, we remove the influence of the branch length by calculating the difference between a given local %id value and the expected value for a branch of the same length evolving under purifying selection. These expected values are pre-computed from simulated data (see Methods). Thus, normalized local %id values >0 or around 0 indicate conservation of ancestral sequence along this branch, while values <0 indicate neutral evolution. To test if trait-loss branches are associated with lower normalized local %id values, we compute the significance of a positive Pearson correlation coefficient (fig. 1). Like GLS, the branch method controls for phylogenetic relatedness and differences in evolutionary rates, and computes the significance of the association between phenotypic loss and sequence divergence.

Simulating Genome Evolution and Trait Loss Shows That Both New Methods Outperform the Perfect-Match Method

To compare the three Forward Genomics methods (fig. 1), one needs to know which genomic regions are involved in the given phenotypic change (positives) and which genomic regions are not involved (negatives). Since we lack comprehensive knowledge of real phenotypic changes, we created datasets by simulating whole genome evolution and independent trait loss with parameters as realistic as possible. In brief, we evolved an ancestral genome along the placental mammal phylogeny (fig. 2 left, supplementary fig. 3, Supplementary Material online) to obtain 30 simulated genomes of placental mammals. These genomes contain a total of 368,767 functional elements (147,776 coding exons, 220,991 nonexonic elements) that evolve under selection. To assure that simulated data are statistically similar to the real data, we annotated our ancestral genome such that the functional elements in the evolved simulated genomes match the length, nucleotide composition and %id distribution of real genomes (see Methods). In order to simulate independent trait loss, we randomly selected a small subset of 210 functional elements (111 coding exons from 10 genes, and 99 nonexonic elements) that are involved in the loss of this trait (positives) and let them evolve neutrally after trait loss. Thus, 99.94% of all functional elements evolve under selection in all lineages and are considered as negatives. Then, we computed global and local %id values for all functional elements and compared how well the three Forward Genomics methods can distinguish positives from negatives, based solely on sequence divergence.
Fig. 2.

Performance of the three Forward Genomics methods on simulated data shown by precision-sensitivity plots (right) for three of the 32 trait-loss scenarios (left). Trait losses occurred at the red crosses in the phylogeny, and following trait loss the 210 trait-involved genomic regions evolved neutrally along the parts of the branches shown in red. The red cross in the precision-sensitivity plot for the perfect-match method marks its performance when we consider only genomic regions where all trait-loss species have a lower global %id value compared with all trait-preserving species. The other 29 trait loss scenarios are shown in supplementary figures 5–33, Supplementary Material online.

Performance of the three Forward Genomics methods on simulated data shown by precision-sensitivity plots (right) for three of the 32 trait-loss scenarios (left). Trait losses occurred at the red crosses in the phylogeny, and following trait loss the 210 trait-involved genomic regions evolved neutrally along the parts of the branches shown in red. The red cross in the precision-sensitivity plot for the perfect-match method marks its performance when we consider only genomic regions where all trait-loss species have a lower global %id value compared with all trait-preserving species. The other 29 trait loss scenarios are shown in supplementary figures 5–33, Supplementary Material online. Performance of the three Forward Genomics methods on 32 trait-loss scenarios. (A) The sensitivity at 90% precision is plotted for 32 different trait-loss scenarios. Consistently, the GLS and branch method improve the sensitivity compared with the perfect-match method. (B–F) Properties of the trait-loss scenarios and properties of the trait-involved genomic regions influence the performance: (B) Age of the trait loss, measured by how long the trait-involved elements evolved neutrally; (C) number of independent trait-losses; (D) evolutionary rate in the trait-loss lineages; (E) length of trait-involved elements; and (F) strength of selection on trait-involved elements in the branches where they evolve under selection. Weak, medium, or strong refers to genomic regions that accept mutations with an average probability of > 0.66, 0.33–0.66, <0.33, respectively. To comprehensively compare the performance of the three methods, we tested a total of 32 trait-loss scenarios (supplementary table 1, Supplementary Material online) that differ in the age of the trait loss (0.025, 0.05, 0.075, or 0.1 substitutions per neutral site ago), the number of independent trait-loss lineages (between 2 and 4) and the evolutionary rate of the trait-loss lineages (low, medium, and high evolutionary rates, see supplementary fig. 3, Supplementary Material online). Given that the vast majority of functional elements are negatives, receiver operating characteristics plots that compare sensitivity and specificity are not suitable on such highly imbalanced datasets (Saito and Rehmsmeier 2015). The reason is that Forward Genomics achieves a high specificity (a high proportion of genomic regions that are not involved in the trait loss are correctly identified as such) irrespective of the total number of negatives (supplementary fig. 4, Supplementary Material online). However, the primary goal of Forward Genomics is not to achieve a high specificity; instead we aim at achieving a high precision (also called positive predictive value; Saito and Rehmsmeier 2015), which is defined as the proportion of elements that are involved in the loss of this trait (positives) out of all elements predicted to be involved in the trait loss. For this reason, we compare the sensitivity that is achieved at a certain precision, which depends on the total number of negatives (supplementary fig. 4, Supplementary Material online), and plotted the precision–sensitivity curve for each scenario. The results for three trait loss scenarios are shown in figure 2; the results for the other 29 scenarios are shown in supplementary figures 5–33, Supplementary Material online. To systematically compare the perfect-match, GLS, and branch method across all 32 scenarios, we used the sensitivity that was achieved for a high precision of 90%. Compared with the perfect-match method, the GLS method achieved the same sensitivity for 3 and a higher sensitivity for 27 scenarios (fig. 3, supplementary table 1, Supplementary Material online). The branch method achieved a higher sensitivity for all 32 scenarios compared with the perfect-match method. Furthermore, the branch method often outperformed GLS, especially if trait-loss occurred in species with low evolutionary rates.
Fig. 3.

Performance of the three Forward Genomics methods on 32 trait-loss scenarios. (A) The sensitivity at 90% precision is plotted for 32 different trait-loss scenarios. Consistently, the GLS and branch method improve the sensitivity compared with the perfect-match method. (B–F) Properties of the trait-loss scenarios and properties of the trait-involved genomic regions influence the performance: (B) Age of the trait loss, measured by how long the trait-involved elements evolved neutrally; (C) number of independent trait-losses; (D) evolutionary rate in the trait-loss lineages; (E) length of trait-involved elements; and (F) strength of selection on trait-involved elements in the branches where they evolve under selection. Weak, medium, or strong refers to genomic regions that accept mutations with an average probability of > 0.66, 0.33–0.66, <0.33, respectively.

The GLS and Branch Method Outperform the Perfect-Match Method Regardless of the Properties of the Trait-Loss Scenarios or the Properties of Trait-Involved Genomic Regions

Next, we compared how differences in the trait-loss scenarios affect the performance of the three Forward Genomics methods. To this end, we averaged the sensitivity at a precision of 90% considering only those scenarios (1) where trait-loss happened 0.025, 0.05, 0.075, or 0.1 substitutions per neutral site ago, (2) where 2, 3, or 4 independent trait-losses happened, and (3) where trait-loss happened in lineages with low, medium, or high evolutionary rates (supplementary fig. 3, Supplementary Material online). To assure a fair comparison, we averaged the performance of equivalent trait-loss scenarios (supplementary table 1, Supplementary Material online). First, all three methods performed substantially better if the trait-loss occurred long ago (fig. 3). The reason is that neutral evolution over longer evolutionary times will result in higher sequence divergence, which is easier to distinguish from purifying selection. Second, the GLS and branch method performed better for a higher number of independent losses (fig. 3), which was not the case for the perfect-match method. Third, while the sensitivity of the GLS method varies slightly with differences in the evolutionary rates in the trait loss lineages, the branch and perfect-match method have increased sensitivity if the trait loss includes lineages with low evolutionary rates (fig. 3). We further compared how differences in the trait-involved regions affect the performance of each method. For all scenarios where the trait-loss happened 0.05 substitutions per neutral site ago, we averaged the sensitivity at 90% precision separately considering trait-involved elements (1) that differ in length and (2) that are under weak, medium, or strong selection (supplementary table 1, Supplementary Material online). First, we found that the GLS and branch method have increased sensitivity to detect longer elements (fig. 3E). The likely reason is that longer, neutrally evolving elements accumulate mutations more evenly over time. In contrast, shorter, neutrally evolving elements may not accumulate mutations over short timescales simply by chance, which makes them harder to identify. Second, while the sensitivity of the branch method does not depend on the strength of selection, the GLS and perfect-match method have a higher sensitivity to detect trait-involved elements evolving under high constraint (fig. 3). Importantly, regardless of the differences in the trait-loss scenarios or differences in the trait-involved elements, the GLS and branch method consistently outperform the perfect-match method. This shows that controlling for phylogenetic relatedness and differences in evolutionary rates improves the sensitivity to detect genomic regions that are involved in phenotypic differences.

The GLS and Branch Method Are Robust to Uncertainties in the Phylogeny

Next, we investigated how robust are the two new methods with respect to precise knowledge of the topology of the phylogenetic tree. While the placental mammalian phylogeny is overall well resolved, there are three conflicting hypothesis for the base of the placental mammals (Morgan et al. 2013; Romiguier et al. 2013). Our simulated genomes evolved along a tree that represents the Atlantogenata hypothesis, where afrotherians and xenarthrans are sister lineages. To test robustness to the tree topology, we used the same simulated data but ran Forward Genomics with trees representing the Epitheria (afrotherians and boreoeutherians are sister lineages) and the Exafroplacentalia (xenarthrans and boreoeutherians are sister lineages) hypothesis (supplementary fig. 34A, Supplementary Material online). We found that the sensitivity that was achieved for a precision of 90% is typically only slightly affected by the uncertainty of the topology at the placental mammal base (fig. 4).
Fig. 4.

Robustness of the new Forward Genomics methods to uncertainties in the phylogenetic tree. The sensitivity at a precision of 90% of all 32 trait loss scenarios is shown for (A) the Epitheria and the Exafroplacentalia tree topology (supplementary fig. 34A, Supplementary Material online) and (B) three trees where random noise was added to each branch length (supplementary fig. 34B, Supplementary Material online). Solid lines show the results using the phylogeny that was used to produce the simulated data (reproduced from fig. 3 for comparison). Please note that the perfect-match method considers neither topology nor branch lengths, thus always gives the same results. The number of scenarios where the achieved sensitivity is higher than the sensitivity of the perfect-match method is shown in the legend.

Robustness of the new Forward Genomics methods to uncertainties in the phylogenetic tree. The sensitivity at a precision of 90% of all 32 trait loss scenarios is shown for (A) the Epitheria and the Exafroplacentalia tree topology (supplementary fig. 34A, Supplementary Material online) and (B) three trees where random noise was added to each branch length (supplementary fig. 34B, Supplementary Material online). Solid lines show the results using the phylogeny that was used to produce the simulated data (reproduced from fig. 3 for comparison). Please note that the perfect-match method considers neither topology nor branch lengths, thus always gives the same results. The number of scenarios where the achieved sensitivity is higher than the sensitivity of the perfect-match method is shown in the legend. Next, we tested how uncertainties in the branch lengths affect the results. To this end, we added random noise to each branch length by sampling from a normal distribution with a standard deviation of 0.025. Repeating this three times resulted in three trees where the branch lengths changed on average 0.019 (maximum 0.078) substitutions per neutral site (supplementary fig. 34B, Supplementary Material online). We found that the sensitivity at 90% precision varies depending on the scenario and the tree, however we observed both increases and decreases compared with the sensitivity obtained with the true branch lengths (fig. 4). Importantly, despite these changes in the topology and the branch lengths, the GLS and the branch method still achieve a higher sensitivity than the perfect-match method for the majority of the 32 trait loss scenarios (fig. 4).

Both the GLS and Branch Method Outperform Perfect-Match for the Trait “Loss of Vitamin C Synthesis”

To test if the GLS and the branch method also have increased sensitivity on real data, we used the independent loss of Vitamin C synthesis in Haplorrhini primates, in guinea pigs, in many microchiroptera families, and in the megachiroptera Pteropus vampyrus as a test case. Since it is well known that the loss of the Gulo gene is responsible for this trait loss (Hiller et al. 2012b), we considered all exons of the Gulo gene as trait-involved regions and all other 184,723 conserved coding regions as negatives. Previous results from the perfect-match method using a genome alignment with less species (Hiller et al. 2012b) found only one conserved coding region, which corresponded to Gulo exon 2, with lower (not-normalized) global %id values for all trait-loss species compared with all trait-preserving species (positive margin, as illustrated in fig. 1). Consistent with this, the perfect-match and both new methods detected Gulo exon 2 as the top hit (fig. 5) using a newer alignment with more species (Methods). To compare how well the three Forward Genomics methods can identify the conserved coding regions that correspond to the 11 coding exons of Gulo, we ranked each coding region. As shown in figure 5, the 11 Gulo exons are ranked much higher with the GLS and, in particular, with the branch method, where all 11 exons are within the top 164 hits. This shows that GLS and the branch method outperform perfect-match on real data and that both new methods have increased sensitivity to detect genomic regions involved in the loss of vitamin C synthesis phenotype.
Fig. 5.

The GLS and branch method outperform the perfect-match method on the trait “loss of vitamin C synthesis”. (A) Gulo exons are ranked higher with the GLS and the branch method than with perfect-match. For GLS and the branch method, each conserved coding region was ranked by its P-value. For perfect-match, we used the size of the margin for ranking, which is the difference between the lowest %id value of a trait-preserving species and the highest %id value of a trait-loss species. Gulo exon 2 is ranked first for all three methods. (B) The significance of most Gulo exons increases if the megabat P. vampyrus is excluded from the list of trait loss species. The trait loss in P. vampyrus happened more recently than in Haplorrhini primates, guinea pig, and the microbat M. lucifugus. We computed the difference between the margin (perfect-match) and the log P-value (GLS and branch method) between the screen that used all nonvitamin C synthesizing species and the screen where P. vampyrus was excluded. Positive differences indicate a better match to the trait loss. The significance of Gulo exons 9 and 10 decreases because both exons are deleted in P. vampyrus. Gulo exon 1, which only encodes the start codon, is excluded.

The GLS and branch method outperform the perfect-match method on the trait “loss of vitamin C synthesis”. (A) Gulo exons are ranked higher with the GLS and the branch method than with perfect-match. For GLS and the branch method, each conserved coding region was ranked by its P-value. For perfect-match, we used the size of the margin for ranking, which is the difference between the lowest %id value of a trait-preserving species and the highest %id value of a trait-loss species. Gulo exon 2 is ranked first for all three methods. (B) The significance of most Gulo exons increases if the megabat P. vampyrus is excluded from the list of trait loss species. The trait loss in P. vampyrus happened more recently than in Haplorrhini primates, guinea pig, and the microbat M. lucifugus. We computed the difference between the margin (perfect-match) and the log P-value (GLS and branch method) between the screen that used all nonvitamin C synthesizing species and the screen where P. vampyrus was excluded. Positive differences indicate a better match to the trait loss. The significance of Gulo exons 9 and 10 decreases because both exons are deleted in P. vampyrus. Gulo exon 1, which only encodes the start codon, is excluded. Among the four nonvitamin C synthesizing lineages, the megabat (P. vampyrus) has lost this trait most recently, as closely related bats are able to synthesize vitamin C and the Gulo gene is intact in other species of the Pteropus genus (Cui et al. 2011a, b). Since our simulations showed that there is lower sensitivity to detect genotype–phenotype associations for recent trait losses, we tested if excluding P. vampyrus from the list of trait-loss lineages would increase the sensitivity to detect Gulo exons. Indeed, the significance of most Gulo exons increased for all three Forward Genomics methods (fig. 5), which makes the Gulo genomic locus to stand out even more in a genome-wide screen (supplementary figure 35, Supplementary Material online). This supports our findings based on simulated data and suggests a strategy that combines results from two Forward Genomics searches: one search that includes all trait loss species and a second search that may detect additional associations by excluding lineages with a recent trait loss.

The GLS and Branch Method Detect Numerous Genes Associated with the Loss of Vision in Blind Mammals

We applied our new methods to detect genes involved in the loss of vision in two blind mammals, the blind mole rat and the cape golden mole (Fang et al. 2014). Both species live in a subterranean environment, have rudimentary eyes completely covered by skin, and a degenerated visual system (Cooper et al. 1993). We built a genome alignment of these two and 17 other mammals and three nonmammalian outgroup species, and computed local and global %id values of 184,412 conserved coding regions (see Methods). We ranked each gene by the P-value and number of exons, which resulted in a set of 208 genes detected by the GLS and/or branch method. We found that this set is enriched for genes known to be involved in eye development and perception of light, and genes expressed in the retina and lens (fig. 6, table 1). Furthermore, knockout or mutations in these genes are associated with abnormal eye phenotypes in mouse or eye diseases in human. Among these 208 genes are six crystalline genes that are involved in the formation of the lens and maintenance of its transparency and refractive index. Several other genes code for structural components of the lens, the retina and the cornea, or are involved in the transduction of light stimulus (table 1). The complete list of 36 genes with a function in eye development and vision is given in supplementary table 2, Supplementary Material online. Overall, this shows that the new Forward Genomics methods can detect many genes that are likely involved in the degeneration of the visual system in these two subterranean species.
Fig. 6.

The GLS and branch method detects several conserved coding regions that are diverged in two blind mammals, the blind mole rat, and the cape golden mole. Manhattan plots show the genomic location of 184,412 conserved coding regions and their associated P-values computed by the GLS (A) and branch method (B). All conserved coding regions that correspond to exons of the genes with a function in eye development and perception of light (supplementary table 2, Supplementary Material online) are shown in red.

Table 1

Functional Enrichments of the 208 Genes for Which the GLS and Branch Method Detected Increased Divergence in Blind Mammals.

OntologyAdjusted P-valueGenes
GO biological process
  Sensory perception of light stimulus (GO:0050953)2.3E−09CRYBB1;ABCA4;CRYBB3;CRYBA1;KRT12;CRYBB2;CRYBA4;CACNA1F;ARR3;GUCY2F;USH2A;GABRR2;BFSP2;OPN1MW;RDH5;GJA8;RGR;IMPG1
  Visual perception (GO:0007601)2.3E−09
  Sensory perception (GO:0007600)8.8E−05CRYBB1;ABCA4;CRYBB3;CRYBA1;CRYBB2;KRT12;CRYBA4;TAAR3;ARR3;CACNA1F;GUCY2F;USH2A;GABRR2;BFSP2;OPN1MW;RDH5;GJA8;RGR;IMPG1
  Lens development in camera-type eye (GO:0002088)0.008LIM2;CRYBA2;CRYBA1;GJA8;GJE1
  Detection of light stimulus (GO:0009583)0.011OPN1MW;GJA10;ABCA4;RDH5;CACNA2D4;CACNA1F;RGR;GUCY2F
  Detection of visible light (GO:0009584)0.026OPN1MW;GJA10;ABCA4;RDH5;CACNA2D4;CACNA1F;GUCY2F
GO molecular function
  Structural constituent of eye lens (GO:0005212)1.5E−09LIM2;BFSP2;BFSP1;CRYBB1;CRYBB3;CRYBA2;CRYBB2;CRYBA1;CRYBA4
MGI mammalian phenotype
  MP0005551_abnormal_eye_electrophysiology7.8E−08ABCA4;CACNA2D4;CACNA1F;ARR3;USH2A;GUCY2F;GABRR1;GJA10;RDH5;SLC16A8;GJA8;RGS11;RGR
  MP0002697_abnormal_eye_size0.020LIM2;HECTD1;HSF4;CRYBA1;CRYBB2;GJA8;GJE1
  MP0005193_abnormal_anterior_eye0.018LIM2;BFSP2;BFSP1;HSF4;KRT12;CRYBB2;CRYBA1;GJA8;LYST;GJE1
  MP0003787_abnormal_imprinting0.032SNRPN;ARID4A;ARID4B
  MP0008877_abnormal_DNA_methylation0.040
  MP0005253_abnormal_eye_physiology0.040BFSP2;ABCA4;RDH5;GJA8;RGR
Human phenotype ontology
  Zonular cataract (HP:0010920)2.6E−07BFSP1;CRYBB1;HSF4;CRYBB3;CRYBA1;CRYBB2;CRYBA4;GJA8
  Corneal dystrophy (HP:0001131)0.007OPN1MW;CRYBB1;CRYBB2;KRT12;CRYBA4;GJA8
  Nuclear cataract (HP:0100018)0.004CRYBB1;HSF4;CRYBB3;GJA8
OMIM
  Cataract4.6E−09LIM2;BFSP2;BFSP1;CRYBB1;CRYBB3;HSF4;CRYBA1;CRYBB2;CRYBA4;GJA8
Human gene atlas
  Retina0.009CH25H;ABCA4;RDH5;CRYBB2;SLC16A8;RGR;ARR3;IMPG1;TNS1
Mouse gene atlas
  Lens0.044TKTL1;LIM2;UHRF2;CRYBB1;CYP3A44;CRYBB3;CRYBA2;CRYBA1;CRYBB2;CRYBA4;GJE1;BFSP2;BFSP1;EWSR1;GJA10;CAPRIN2;HSF4;PCNX;GJA8;BIRC7;AGFG1
  Retina0.044PPM1N;ABCA4;BRAF;CACNA2D4;ARR3;USH2A;GABRR2;GABRR1;PIK3CA;OPN1MW;BC030499;UBN2;FBXL5;IMPG1;DRD4;SERINC4
Pfam InterPro domains
  Crystallin5.8E−05CRYBB1;CRYBA2;CRYBA1;CRYBB2;CRYBA4

Enrichments were computed by Enrichr (Chen et al. 2013).

The GLS and branch method detects several conserved coding regions that are diverged in two blind mammals, the blind mole rat, and the cape golden mole. Manhattan plots show the genomic location of 184,412 conserved coding regions and their associated P-values computed by the GLS (A) and branch method (B). All conserved coding regions that correspond to exons of the genes with a function in eye development and perception of light (supplementary table 2, Supplementary Material online) are shown in red. Functional Enrichments of the 208 Genes for Which the GLS and Branch Method Detected Increased Divergence in Blind Mammals. Enrichments were computed by Enrichr (Chen et al. 2013). Apart from genes related to vision, Forward Genomics detected divergence in genes involved in the transduction of light stimulus and regulation of the circadian clock: three gamma-aminobutyric acid (GABA) C receptors (Gabrr1, Gabrr2, Gabrr3) and the melatonin receptor 1b (Mtnr1b). These three GABA receptors are primarily found in the retina and are involved in inhibition of light-evoked synaptic currents from retinal neurons (Lukasiewicz et al. 2004). Removal of this inhibition is potentially an adaptation to the low light levels in the burrows, resulting in an increased sensitivity to light which, in turn, might allow for regulating the circadian clock. Mtnr1b is also expressed primarily in the retina. This gene is inactivated in the blind mole rat (as shown previously by Fang et al. 2014) as well as in the cape golden mole due to stop codon and frameshift mutations in exon 1 and the deletion of exon 2 (supplementary fig. 36, Supplementary Material online). Furthermore, the naked mole rat, another independent subterranean lineage, has also lost Mtnr1b (Kim et al. 2011). Melatonin is a hormone that is produced at night, in response to the absence of light stimulus from the retina, and is the major regulator of the daily biological rhythm of animals. The two blind species analyzed here live in a subterranean, constantly dark environment where minimal light propagates into the burrows (Kott et al. 2014), however, they still have an intrinsic rhythmicity, which can be measured by their thermoregulatory capacity (Haim et al. 1983; Pevet et al. 1984). Melatonin is also known for its role as a potent antioxidant and free radical scavenger (Hardeland et al. 1995). Blind mole rats produce melatonin in the greatly developed Harderian gland, which has high levels of the melatonin-synthesizing enzyme (Balemans et al. 1980) and this was suggested to contribute to protection against higher levels of reactive oxygen species produced by oxidative stress that the animals are subjected to in their hypoxic subterranean environment (Caballero et al. 2006). Given that sustained melatonin levels interfere with the circadian clock (Pevet et al. 1984), the loss of one melatonin receptor might be a way of overcoming potentially deleterious effects on the circadian clock.

Discussion

The growing number of sequenced genomes provides an unprecedented opportunity to use computational approaches to discover which genomic changes underlie particular phenotypic changes between species. While this is challenging due to the large number of genomic and phenotypic differences, computational methods can detect statistical signals for the subset of phenotypic differences that comprise independent losses (Hiller et al. 2012b; Marcovitz et al. 2016). Our previously developed Forward Genomics method searched for matches between a phenotypic presence/absence pattern and a genomic divergence profile, and thus did not account for the phylogenetic relatedness between species. Here, we present two new methods that directly control for phylogenetic relatedness and compute the statistical significance of the associations. Both methods also control for differences in evolutionary rates, which influences sequence divergence. We showed that both methods have substantially improved sensitivity in detecting associations between phenotypic and genomic differences. Forward Genomics is more effective the older the independent trait losses are, because old losses give enough time for neutral evolution to leave a clearly detectable sequence divergence signal in those genomic regions that are involved in the trait. If trait-loss is known to be recent in some lineages, but older in at least two other independent lineages, it can be advantageous to run a second Forward Genomics search that excludes those lineages where the trait was lost more recently. This second search may detect additional associations that were missed in a first search where all trait loss species were included (fig. 5). Two sources can be used to estimate when a trait was lost. First, dated fossils can provide information when a trait was still present and when it was lost along one lineage. Second, species divergence times give upper and lower boundaries for when a trait was lost. If a sister lineage to a trait-loss species has also lost the trait, we would assume by parsimony that the loss is as old as the common ancestor. In contrast, if sister species possess this trait, the loss is younger than the split of these species. The latter is illustrated for vitamin C synthesis: it is known that R. leschenaultia, a sister species to the nonvitamin C synthesizing bat P. vampyrus can synthesize this vitamin (Cui et al. 2011a, b). Both species split around 23 Ma (Hedges et al. 2006), which gives an upper bound for the age of this trait loss. Ideally, one would like to estimate how much sequence divergence can be expected in a lineage after trait-loss. This requires a phylogenetic tree where branch lengths correspond to substitutions per neutral site. The increasing availability of species with sequenced genomes will facilitate reliably estimating such molecular phylogenies. While more and more genomes are sequenced, there is still a substantial mismatch between the number of species with sequenced genomes and the number of species for which phenotypic data are available. For example, out of 46 mammals included in a very large phenotype dataset (O’Leary et al. 2013), only 20 have sequenced genomes (Marcovitz et al. 2016). To effectively utilize the present and future genomes to associate phenotypic differences and genomic differences, comprehensive phenotypic knowledge of the sequenced species must be accessible to computers (Deans et al. 2012, 2015), as opposed to free text descriptions in which the majority of our phenotypic knowledge is still described. Phenotype databases such as Morphobank (O’Leary and Kaufman 2011) that provide phenotypic character matrices have been successfully used to associate phenotypic and genomic differences (Marcovitz et al. 2016). However, in order to reuse, search, and compare phenotypic data, it is necessary to use ontologies that provide a defined vocabulary of anatomical features and relationships between them. The Phenoscape Knowledgebase uses ontologies to describe phenotypes of natural species and phenotypes of model organism mutants (Dahdul et al. 2010), which allows deriving hypotheses which genes may be involved in phenotypic differences between natural species (Manda et al. 2015). For example, candidate genes for the loss of the tongue and scales in the catfish lineage were detected by comparing natural and mutant phenotypes, and indeed, these genes have different expression patterns in the channel catfish that are consistent with their involvement in these phenotypic losses (Edmunds et al. 2016). Another advantage of computer-interpretable phenotype data is that missing data, which is common in phenotypic character matrices, can be drastically reduced by using machine reasoning to infer presence/absence states (Dececchi et al. 2015), which in turn will broaden the applicability of approaches like Forward Genomics. The two new methods presented here represent a significant advance in our ability to discover the genomic basis underlying phenotypic differences between species. With the increasing number of sequenced genomes and with an increasing accessibility of phenotypic knowledge, these Forward Genomics methods will contribute to our understanding of how nature’s phenotypic diversity has evolved.

Methods

GLS

In the GLS method, we control for the phylogenetic relatedness between species by computing the covariance matrix based on a phylogenetic tree. Given a tree with n species, the elements in the n × n covariance matrix are defined as and where is the total branch length from the common ancestor to species i and is the total branch length shared by species i and j (fig. 1). Then, we use a phylogenetic GLSs approach (Grafen 1989), implemented in the R package caper (https://cran.r-project.org/web/packages/caper/), to compute a linear regression between the normalized %id values and the phenotype. The P-value of a positive slope of the regression line is used as the significance of the association between the genomic and phenotypic difference.

Branch Method

The branch method computes a local %id value, which corresponds to the divergence between the sequence at the start and end of a given branch (fig. 1). To remove the influence of the branch length on the local %id value, we pre-computed the expected local %id value of a branch of length b that evolves under selection. To this end, we simulated genome evolution for b substitutions per neutral site (see below) and averaged the local %id value of all functional elements. Supplementary table 3, Supplementary Material online, shows the expected local %id values for branch lengths varying from 0.01 to 1.0 substitutions per neutral site in steps of 0.01. Then, we obtained a normalized local %id value by calculating the difference between the given local %id value and the local %id value that is expected for a branch of the same length evolving under selection. To classify branches as trait-loss or trait-preserving, we used Dollo parsimony, which allows for an unambiguous reconstruction of ancestral character states by assuming that lost traits cannot be regained. Alternatively, one could use maximum likelihood to reconstruct ancestral character states, which allows lost traits to be regained. Branches where a trait was likely regained should then be excluded, as they might not fully preserve the ancestral trait information, which would confound the analysis. It should be noted that for all our simulated scenarios, for the loss of vitamin C synthesis, and for the loss of vision, both maximum likelihood and Dollo parsimony lead to the same branch classification. Individual branches in a tree can be very short. On such short branches, it is more likely that not enough random mutations occur such that the sequence divergence of a neutrally evolving element can be distinguished from an element that evolves under selection. Therefore, we assigned a branch-length-dependent weight to each branch that is proportional to the power to detect neutral evolution along this branch. To compute the weight for a branch of length b, we first obtained the local %id distribution for elements evolving neutrally and elements evolving under selection by simulating genome evolution for b substitutions per site. Then, we define and as the fraction of the neutral and selection distribution below the %id value The weight of a branch of length b is then . Thus, if the two distributions are nonoverlapping, the weight will be ∼1. If the two distributions are similar as it is expected for short branches, the weight will be <1. Because the average constraint differs between coding regions and nonexonic elements, weights and expected local %id values were computed separately for these two groups (supplementary fig. 37 and supplementary table 4, Supplementary Material online). These weights are used to calculate a weighted Pearson correlation between the normalized local %id values of the trait-loss and the trait-preserving branches.

Annotating the Ancestral Genome for Simulating Genome Evolution

We used Evolver (http://www.drive5.com/evolver/) to simulate the evolution of an entire ancestral genome along a phylogeny. Evolver models genome evolution including substitutions, insertions, and deletions, transposon insertions, and tandem repeat expansion and contraction. Important for our purpose is that Evolver uses an ancestral genome, where genes (untranslated and coding regions, start and stop codons, and splice sites) and nonexonic functional elements are explicitly annotated and the bases in functional regions evolve under a specified level of constraint. Evolver has an explicit model of protein evolution and maintains the gene structure for genes under selection. Evolver requires an ancestral genome, where the type, position, and constraint of functional elements are annotated, and a phylogeny along which the genome will evolve. To assure that we compare the different Forward Genomics methods on realistic data, we created an ancestral genome annotation such that the functional elements in the simulated evolved genomes match the length, nucleotide composition and %id distribution of real genomes. We chose the mouse genome as our ancestral genome because it is well assembled and well annotated. First, we replaced N’s by random bases in chromosome 1–19 of the mouse mm10 assembly using Evolver’s “evo –findns”. Second, we assigned the position of functional elements to this genome. To annotate the position of 5′ and 3′ untranslated regions and coding regions of genes, we used the longest isoform of knownGenes from the UCSC genome browser (Rosenbloom et al. 2015). To annotate the position of conserved nonexonic elements in the ancestral genome, we used PhastCons elements (Siepel et al. 2005) from the UCSC mouse mm10 60way alignment (Rosenbloom et al. 2015) that are longer than 70 bp and that do not overlap exons. This results in a length and nucleotide composition distribution that is comparable to real data. Third, we needed to adjust the evolutionary constraint in the ancestral genome such that simulating genome evolution produces a distribution of global %id values that matches the real %id values. While we use the global %id value as a proxy for constraint, Evolver uses an “acceptance probability” that specifies the probability of a base to accept a mutation. Therefore, we created a map between global %id values and mean acceptance probability. To this end, in a first pass, we evolved a genome with randomly assigned mean acceptance probabilities for 0.19 substitutions per neutral site, which is the distance of human to the placental mammal ancestor, and measured the global %id value of all functional elements. Then we grouped all elements according to their %id value into bins of width 2%. Thus, for each %id value bin, we get the distribution of mean acceptance probabilities that resulted in evolved elements with this %id value. Second, we used this map to assign mean acceptance probabilities to each functional element in the ancestral genome based on the real human %id value. Specifically, we iterated over each functional element, obtained the real human %id value and then sampled from the mean acceptance probability distribution of the respective bin. Given the mean acceptance probability of an element, we used Evolver’s “evo –assprobs” to assign base-wise acceptance probabilities to each base in this functional element. We excluded functional elements that are shorter than 70 bp, which results 147,776 coding exons and 220,991 nonexonic elements (368,767 elements in total) in the final ancestral genome.

Simulating Genome Evolution and 32 Different Trait-Loss Scenarios

Simulating the entire phylogeny for all 19 chromosomes requires 1.1 TB disk space and over 65 CPU days, which is not feasible for testing many different trait-loss scenarios. To reduce runtime and disk space, we selected the 210 trait-involved elements only from chromosome 1. Then we simulated the evolution of chromosome 1 for each trait-loss scenario and extracted the %id values of the 210 trait-involved elements (positives). To obtain the %id values of all negative elements, we simulated the evolution of chromosome 1–19 only once, evolving all 368,767 functional elements under selection in all lineages. The %id values of these negatives were used in all trait-loss scenarios. To assure comparability between scenarios, we used a fixed set of trait-involved elements for all scenarios. This set comprises a total of 210 randomly selected elements (111 coding exons, 99 nonexonic elements) that are a representative sample as they closely match the length and mean acceptance probability distribution of all functional elements (supplementary fig. 38, Supplementary Material online). For each scenario, we evolved chromosome 1 of the ancestral genome along the placental mammal phylogeny. The phylogenetic tree of placental mammals with branch length values corresponding to substitutions per neutral site was downloaded from http://hgdownload.soe.ucsc.edu/goldenPath/mm10/multiz60way/. To simulate independent trait-loss, we stopped the simulation 0.025, 0.05, 0.075, or 0.1 substitutions per site before reaching the endpoint of a branch leading to a trait-loss species. Then, we removed any constraint on the 210 trait-involved regions by setting the probability of accepting a mutation to 1 and restarted the simulation. Global and local %id values were computed for all elements and used for Forward Genomics.

Applying Forward Genomics to the Loss of Vitamin C Synthesis

We used the UCSC 60way genome alignment where 59 vertebrates are aligned to the mouse mm10 assembly (Rosenbloom et al. 2015). To detect conserved regions in this alignment, we used PhastCons (Siepel et al. 2005) with the parameters expected-length = 45, target-coverage = 0.3 and rho = 0.3, and GERP (Davydov et al. 2010) with default parameters. Then, we extracted conserved coding regions by intersecting conserved elements with coding exons from the Ensembl, RefSeq, and UCSC knownGenes annotation. We required that each conserved coding region is at least 30 bp. This resulted in 184,723 conserved coding regions covering 27.1 Mb (1.03% of the mm10 assembly). Percent identity values were computed by reconstructing ancestral sequences as described in Hiller et al. (2012b). Bases with a low-quality score (>1% error rate) for the assemblies where quality scores are available were ignored in the %id calculation. Similarly, unaligning regions that map to assembly gaps were ignored and not counted as lost ancestral sequence (Hiller et al. 2012a). To screen for regions associated with the loss of vitamin C, we applied our R implementation of the three Forward Genomics methods (https://github.com/hillerlab/ForwardGenomics/) using the following species as trait-loss species: microbat (myoLuc2), megabat (pteVam1), tarsier (tarSyr1), squirrel monkey (saiBol1), marmoset (calJac3), baboon (papHam1), rhesus macaque (rheMac3), gibbon (nomLeu2), orangutan (ponAbe2), gorilla (gorGor3), chimpanzee (panTro4), human (hg19), and guinea pig (cavPor3).

Applying Forward Genomics to the Loss of Vision in Blind Mammals

To analyze the loss of vision in the blind mole rat and cape golden mole, we first build a genome alignment with mouse as the reference species that included both blind species. Specifically, we used the UCSC lastz/chain/net pipeline (Kent et al. 2003) to build pairwise genome alignments between mouse (mm10 assembly) and the following species: rat (rn5), guinea pig (cavPor3), pika (ochPri3), rabbit (oryCun2), prairie vole (micOch1), blind mole rat (nanGal1), squirrel (speTri2), human (hg19), crab-eating macaque (macFas5), bushbaby (otoGar3), cow (bosTau7), dog (canFam3), horse (equCab2), cat (felCat5), elephant (loxAfr3), manatee (triMan1), cape golden mole (chrAsi1), opossum (monDom5), Anolis lizard (anoCar2), chicken (galGal4), and frog (xenTro7). For all species, we used lastz (Schwartz et al. 2003) version 1.03.54 with the parameters H = 2,000 Y = 3,000 L = 3,000 K = 2,400, and the HoxD55 scoring matrix, and kept all local alignment that have at least one ≥30 bp region with ≥60% sequence identity and ≥1.8 bits entropy as described in Hiller et al. (2013). For all nonmammalian species, we additionally used highly-sensitive local alignments (Hiller et al. 2013) with lastz parameters W = 5, L = 2,700, and K = 2,000. For mammals, we kept only alignment chains with a score of ≥70,000 that span ≥9,000 bp in both genomes. In order to keep also chains with very strong alignments spanning only a shorter region, we also kept chains with a score of  ≥150,000 that span ≥6,000 bp in both genomes. For nonmammals, we kept only alignment chains with a score of ≥15,000. All other chains are discarded as they typically do not represent strong syntenic alignments. Chains were ‘netted’ using chainNet (Kent et al. 2003). The pairwise syntenic alignment nets are the input to MULTIZ (Blanchette et al. 2004) to build a multiple alignment. The neutral distances between all species were determined using phyloFit (Siepel et al. 2005) and 4-fold degenerate sites. The tree with branch lengths measuring substitutions per neutral site is given in supplementary figure 39, Supplementary Material online. As above, we used PhastCons and GERP to obtain 184,412 conserved coding regions covering (27.4 Mb, 1.04% of the mm10 assembly). After applying the GLS and branch method to all conserved coding regions, we selected those multi-exon genes where at least two exons are in the top 1,000 of the most significant hits and selected those single exon genes that are in the same top 1,000 hits. This resulted in a list of 141 (124 multi-exon and 17 single exon) genes for the GLS method and 164 (132 multi-exon and 32 single exon) genes for the branch method. The union of both lists comprises 208 genes. We used Enrichr (Chen et al. 2013) to detect functional enrichments of these 208 genes (table 1). Similar enrichments related to eye and vision were also found for the individual sets of 141 and 164 genes, however the 164 genes detected by the branch method have additional functional enrichments (supplementary table 5, Supplementary Material online).

Data Availability

The following data are available at http://bds.mpi-cbg.de/hillerlab/ForwardGenomics/: data of all 32 simulated trait loss scenarios (%id values, output of the Forward Genomics implementation, all scripts to reproduce the results), phastCons and GERP conservation scores (bigWig format) and the conserved elements (bed format), the conserved coding regions (bed format) and their local and global %id values that we used for the loss of vitamin C and loss of vision phenotype, and the genome alignment (maf format, 9.5 GB) that includes the blind mammals and the associated phylogenetic tree. An R implementation of the three Forward Genomics methods is available at https://github.com/hillerlab/ForwardGenomics/.

Supplementary Material

Supplementary tables 1–5 and figures 1–39 are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).
  59 in total

1.  TimeTree: a public knowledge-base of divergence times among organisms.

Authors:  S Blair Hedges; Joel Dudley; Sudhir Kumar
Journal:  Bioinformatics       Date:  2006-10-04       Impact factor: 6.937

2.  Parallel sites implicate functional convergence of the hearing gene prestin among echolocating mammals.

Authors:  Zhen Liu; Fei-Yan Qi; Xin Zhou; Hai-Qing Ren; Peng Shi
Journal:  Mol Biol Evol       Date:  2014-06-19       Impact factor: 16.240

3.  Genome-wide adaptive complexes to underground stresses in blind mole rats Spalax.

Authors:  Xiaodong Fang; Eviatar Nevo; Lijuan Han; Erez Y Levanon; Jing Zhao; Aaron Avivi; Denis Larkin; Xuanting Jiang; Sergey Feranchuk; Yabing Zhu; Alla Fishman; Yue Feng; Noa Sher; Zhiqiang Xiong; Thomas Hankeln; Zhiyong Huang; Vera Gorbunova; Lu Zhang; Wei Zhao; Derek E Wildman; Yingqi Xiong; Andrei Gudkov; Qiumei Zheng; Gideon Rechavi; Sanyang Liu; Lily Bazak; Jie Chen; Binyamin A Knisbacher; Yao Lu; Imad Shams; Krzysztof Gajda; Marta Farré; Jaebum Kim; Harris A Lewin; Jian Ma; Mark Band; Anne Bicker; Angela Kranz; Tobias Mattheus; Hanno Schmidt; Andrei Seluanov; Jorge Azpurua; Michael R McGowen; Eshel Ben Jacob; Kexin Li; Shaoliang Peng; Xiaoqian Zhu; Xiangke Liao; Shuaicheng Li; Anders Krogh; Xin Zhou; Leonid Brodsky; Jun Wang
Journal:  Nat Commun       Date:  2014-06-03       Impact factor: 14.919

4.  Antioxidant activity in Spalax ehrenbergi: a possible adaptation to underground stress.

Authors:  Beatriz Caballero; Cristina Tomás-Zapico; Ignacio Vega-Naredo; Verónica Sierra; Delio Tolivia; Rüdiger Hardeland; María Josefa Rodríguez-Colunga; Alma Joel; Eviatar Nevo; Aaron Avivi; Ana Coto-Montes
Journal:  J Comp Physiol A Neuroethol Sens Neural Behav Physiol       Date:  2006-02-15       Impact factor: 1.836

5.  Human-mouse alignments with BLASTZ.

Authors:  Scott Schwartz; W James Kent; Arian Smit; Zheng Zhang; Robert Baertsch; Ross C Hardison; David Haussler; Webb Miller
Journal:  Genome Res       Date:  2003-01       Impact factor: 9.043

6.  Myosin gene mutation correlates with anatomical changes in the human lineage.

Authors:  Hansell H Stedman; Benjamin W Kozyak; Anthony Nelson; Danielle M Thesier; Leonard T Su; David W Low; Charles R Bridges; Joseph B Shrager; Nancy Minugh-Purvis; Marilyn A Mitchell
Journal:  Nature       Date:  2004-03-25       Impact factor: 49.962

7.  Human-specific loss of regulatory DNA and the evolution of human-specific traits.

Authors:  Cory Y McLean; Philip L Reno; Alex A Pollen; Abraham I Bassan; Terence D Capellini; Catherine Guenther; Vahan B Indjeian; Xinhong Lim; Douglas B Menke; Bruce T Schaar; Aaron M Wenger; Gill Bejerano; David M Kingsley
Journal:  Nature       Date:  2011-03-10       Impact factor: 49.962

8.  Forces shaping the fastest evolving regions in the human genome.

Authors:  Katherine S Pollard; Sofie R Salama; Bryan King; Andrew D Kern; Tim Dreszer; Sol Katzman; Adam Siepel; Jakob S Pedersen; Gill Bejerano; Robert Baertsch; Kate R Rosenbloom; Jim Kent; David Haussler
Journal:  PLoS Genet       Date:  2006-08-23       Impact factor: 5.917

9.  The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets.

Authors:  Takaya Saito; Marc Rehmsmeier
Journal:  PLoS One       Date:  2015-03-04       Impact factor: 3.240

10.  Finding our way through phenotypes.

Authors:  Andrew R Deans; Suzanna E Lewis; Eva Huala; Salvatore S Anzaldo; Michael Ashburner; James P Balhoff; David C Blackburn; Judith A Blake; J Gordon Burleigh; Bruno Chanet; Laurel D Cooper; Mélanie Courtot; Sándor Csösz; Hong Cui; Wasila Dahdul; Sandip Das; T Alexander Dececchi; Agnes Dettai; Rui Diogo; Robert E Druzinsky; Michel Dumontier; Nico M Franz; Frank Friedrich; George V Gkoutos; Melissa Haendel; Luke J Harmon; Terry F Hayamizu; Yongqun He; Heather M Hines; Nizar Ibrahim; Laura M Jackson; Pankaj Jaiswal; Christina James-Zorn; Sebastian Köhler; Guillaume Lecointre; Hilmar Lapp; Carolyn J Lawrence; Nicolas Le Novère; John G Lundberg; James Macklin; Austin R Mast; Peter E Midford; István Mikó; Christopher J Mungall; Anika Oellrich; David Osumi-Sutherland; Helen Parkinson; Martín J Ramírez; Stefan Richter; Peter N Robinson; Alan Ruttenberg; Katja S Schulz; Erik Segerdell; Katja C Seltmann; Michael J Sharkey; Aaron D Smith; Barry Smith; Chelsea D Specht; R Burke Squires; Robert W Thacker; Anne Thessen; Jose Fernandez-Triana; Mauno Vihinen; Peter D Vize; Lars Vogt; Christine E Wall; Ramona L Walls; Monte Westerfeld; Robert A Wharton; Christian S Wirkner; James B Woolley; Matthew J Yoder; Aaron M Zorn; Paula Mabee
Journal:  PLoS Biol       Date:  2015-01-06       Impact factor: 8.029

View more
  26 in total

1.  RERconverge: an R package for associating evolutionary rates with convergent traits.

Authors:  Amanda Kowalczyk; Wynn K Meyer; Raghavendran Partha; Weiguang Mao; Nathan L Clark; Maria Chikina
Journal:  Bioinformatics       Date:  2019-11-01       Impact factor: 6.937

2.  Robust Method for Detecting Convergent Shifts in Evolutionary Rates.

Authors:  Raghavendran Partha; Amanda Kowalczyk; Nathan L Clark; Maria Chikina
Journal:  Mol Biol Evol       Date:  2019-08-01       Impact factor: 16.240

3.  A functional enrichment test for molecular convergent evolution finds a clear protein-coding signal in echolocating bats and whales.

Authors:  Amir Marcovitz; Yatish Turakhia; Heidi I Chen; Michael Gloudemans; Benjamin A Braun; Haoqing Wang; Gill Bejerano
Journal:  Proc Natl Acad Sci U S A       Date:  2019-09-30       Impact factor: 11.205

4.  Phylogenetic Permulations: A Statistically Rigorous Approach to Measure Confidence in Associations in a Phylogenetic Context.

Authors:  Elysia Saputra; Amanda Kowalczyk; Luisa Cusick; Nathan Clark; Maria Chikina
Journal:  Mol Biol Evol       Date:  2021-06-25       Impact factor: 16.240

5.  Subterranean mammals show convergent regression in ocular genes and enhancers, along with adaptation to tunneling.

Authors:  Raghavendran Partha; Bharesh K Chauhan; Zelia Ferreira; Joseph D Robinson; Kira Lathrop; Ken K Nischal; Maria Chikina; Nathan L Clark
Journal:  Elife       Date:  2017-10-16       Impact factor: 8.140

6.  A genomics approach reveals insights into the importance of gene losses for mammalian adaptations.

Authors:  Virag Sharma; Nikolai Hecker; Juliana G Roscito; Leo Foerster; Bjoern E Langer; Michael Hiller
Journal:  Nat Commun       Date:  2018-03-23       Impact factor: 14.919

7.  Independent erosion of conserved transcription factor binding sites points to shared hindlimb, vision and external testes loss in different mammals.

Authors:  Mark J Berger; Aaron M Wenger; Harendra Guturu; Gill Bejerano
Journal:  Nucleic Acids Res       Date:  2018-10-12       Impact factor: 16.971

8.  Loss of RXFP2 and INSL3 genes in Afrotheria shows that testicular descent is the ancestral condition in placental mammals.

Authors:  Virag Sharma; Thomas Lehmann; Heiko Stuckas; Liane Funke; Michael Hiller
Journal:  PLoS Biol       Date:  2018-06-28       Impact factor: 8.029

9.  REforge Associates Transcription Factor Binding Site Divergence in Regulatory Elements with Phenotypic Differences between Species.

Authors:  Björn E Langer; Juliana G Roscito; Michael Hiller
Journal:  Mol Biol Evol       Date:  2018-12-01       Impact factor: 16.240

10.  Limited Evidence for Parallel Molecular Adaptations Associated with the Subterranean Niche in Mammals: A Comparative Study of Three Superorders.

Authors:  Kalina T J Davies; Nigel C Bennett; Chris G Faulkes; Stephen J Rossiter
Journal:  Mol Biol Evol       Date:  2018-10-01       Impact factor: 16.240

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.