
Fast, scalable prediction of deleterious noncoding variants from functional and population genomic data.

Yi-Fei Huang¹, Brad Gulko¹,², Adam Siepel¹.

Abstract

Many genetic variants that influence phenotypes of interest are located outside of protein-coding genes, yet existing methods for identifying such variants have poor predictive power. Here we introduce a new computational method, called LINSIGHT, that substantially improves the prediction of noncoding nucleotide sites at which mutations are likely to have deleterious fitness consequences, and which, therefore, are likely to be phenotypically important. LINSIGHT combines a generalized linear model for functional genomic data with a probabilistic model of molecular evolution. The method is fast and highly scalable, enabling it to exploit the 'big data' available in modern genomics. We show that LINSIGHT outperforms the best available methods in identifying human noncoding variants associated with inherited diseases. In addition, we apply LINSIGHT to an atlas of human enhancers and show that the fitness consequences at enhancers depend on cell type, tissue specificity, and constraints at associated promoters.

Year: 2017 PMID: 28288115 PMCID: PMC5395419 DOI: 10.1038/ng.3810

Source DB: PubMed Journal: Nat Genet ISSN: 1061-4036 Impact factor: 38.330

Introduction

In the human genome, most nucleotides that are associated with diseases or other phenotypes, or that show signatures of natural selection, fall outside of protein-coding genes[1-3]. Many of these nucleotides appear to fall in cis-regulatory elements, including promoters, enhancers, and insulators. Similar observations hold across most animals and plants[4-7]. Recent efforts to characterize noncoding sequences using high-throughput biochemical assays have produced a wealth of data, identified many regulatory elements, and clarified general aspects of gene regulation[8-12]. Nevertheless, a substantial gap remains between the outcomes of these experiments and a detailed understanding of noncoding function, for several reasons. First, these assays generally measure genomic and epigenomic features roughly correlated with, but not directly indicative of, regulatory function. Second, they generally have relatively low resolution along the genome, identifying regions hundreds of nucleotides long, rather than pinpointing single nucleotides. Third, these measures are highly condition-specific, and data have only been generated for a small subset of cell types and conditions. As a consequence, there is a pressing need for computational methods that more precisely predict regulatory function by jointly considering the results of numerous such assays together with complementary data, such as annotations of protein-coding genes and measures of evolutionary conservation across species. The development of statistical and machine-learning methods that attempt to address this integrative prediction challenge has emerged as an active, fast-moving area of research. 
Recently published methods in this area can be roughly divided into three categories: (1) machine-learning classifiers that attempt to separate known disease variants from putatively benign variants using a variety of genomic features (e.g., GWAVA[13] and FATHMM-MKL[14]); (2) sequence- and motif-based predictors for the impact of noncoding variants on cell-type-specific molecular phenotypes, such as chromatin accessibility or histone modifications (e.g., DeepBind[15], DeepSEA[16] and Basset[17]); and (3) evolutionary methods that consider data on genetic variation together with functional genomic data and aim to predict the effects of noncoding variants on fitness (e.g., CADD[18], DANN[19], FunSeq2[20], and fitCons[3]). A limitation of methods of the first class is that they depend strongly on the available training data, which may be limited and may not be representative of the broader class of regulatory sequences of interest. Methods of the second class have the limitation that the significance of molecular phenotypes at the organismal level is often unclear. Evolutionary methods, by contrast, obtain their signal not primarily from previously assigned class labels, but instead from signatures of natural selection over many generations. They are therefore both less data-limited and more focused on phenotypes that truly influence fitness than the other methods. This approach is likely to be particularly powerful for detecting regulatory variants that tend to be under strong purifying selection, such as rare variants associated with severe diseases. Evolution-based methods also naturally integrate over cell types, an important strength when the relevant tissues or cell types for a condition of interest are unknown.
Among the available evolution-based methods, fitCons[3] is unique in explicitly characterizing the influence of natural selection at each genomic site of interest using a full probabilistic evolutionary model and patterns of genetic variation within and between species. FitCons makes a distinction between functional genomic and comparative genomic data, first defining several hundred clusters of genomic positions with distinct functional genomic “signatures,” and then estimating the fraction of nucleotides under natural selection within each cluster from polymorphism and divergence data. These estimates are obtained using the INSIGHT evolutionary model[21,22], and are interpreted as the probabilities that mutations in each cluster of genomic sites will have fitness consequences (fitCons scores). In this manner, fitCons aggregates information about natural selection from large numbers of sites with similar functional profiles based on evolutionary first principles. A major limitation of the method, however, is that it scales poorly with the available functional genomic data. In particular, the number of clusters considered by the method increases exponentially with the number of functional genomic annotations, which keeps it from taking advantage of the growing body of functional genomic data. A related problem is that the restriction to small numbers of genomic features leads to a relatively coarse-grained, blocky pattern of scores along the genome, which does not allow for fine distinctions among nearby nucleotide sites. In this paper, we describe a new method, LINSIGHT (pronounced lin-site), that is based on the existing INSIGHT/fitCons framework but has vastly improved speed, scalability, genomic resolution, and prediction power. The main idea behind LINSIGHT is to bypass the clustering step of fitCons and instead couple the probabilistic INSIGHT model directly to a generalized linear model for genomic features.
This strategy results in a more streamlined model that scales linearly, rather than exponentially, with the available data, and can make direct use of the input data, with no need for discretization. By integrating a large number of genomic features, LINSIGHT provides a precise, high-resolution description of the fitness consequences of noncoding mutations in the human genome. We demonstrate that LINSIGHT outperforms state-of-the-art prediction methods in the task of prioritizing noncoding disease variants from the Human Gene Mutation database (HGMD)[23] and the NCBI ClinVar database[24]. Furthermore, we use LINSIGHT to show that the evolutionary constraints on human enhancers depend on their associated tissue types, degree of tissue specificity, and associated promoters, which has important implications for understanding the evolution of cis-regulatory elements and for improving variant prioritization methods. Our LINSIGHT scores are available as a track on the Cold Spring Harbor Laboratory mirror of the UCSC Genome Browser (hg19 assembly). The LINSIGHT software is available from our public laboratory GitHub repository.

Results

LINSIGHT combines INSIGHT with a scalable linear model

The original INSIGHT and fitCons methods[3,21,22] infer the selective pressure on noncoding sites, and hence the likely fitness consequences of noncoding mutations, by contrasting patterns of genetic variation at each focal site with the patterns at nearby genomic regions that are likely to be free from the influence of selection (“neutrally evolving sites”). To address the problem that genetic variation within species and between closely related species (such as human and chimpanzee) is sparse across the genome, fitCons pools information across the thousands of genomic sites assigned to each discrete cluster. The key idea behind LINSIGHT is instead to accomplish this pooling of information across sites indirectly, using a generalized linear model (Figure 1 and Table 1; see Supplementary Note and Supplementary Tables 1 and 2 for complete details). In particular, the parameters of the INSIGHT model that describe natural selection (ρ and γ) are determined as linear-sigmoid functions of the genomic features local to each site. (The third selection parameter from INSIGHT, η, is omitted because positive selection has a negligible effect in this setting; see Supplementary Note.) Thus, the probability of fitness consequences of mutations at each site i, denoted ρ_i, is assumed to depend on genomic features at that site, such as its RNA expression level (RNA-seq read depth), chromatin accessibility (DNase-I hypersensitive sites), and histone modifications or bound transcription factors (ChIP-seq peaks), as well as on features based on annotations (e.g., distance to nearest transcription start site, match to known TFBS motif) and comparative genomics (e.g., phyloP[25] or phastCons[4] scores). We refer to ρ_i as the LINSIGHT score at site i. This scoring strategy has several major advantages: it requires no clustering and no discretization, and it scales linearly with the available genomic features, allowing hundreds of features to be considered.
In contrast to fitCons, the scalability of the method enables data to be pooled across cell types, and it allows the scores to reach single-nucleotide resolution along the genome. Nevertheless, LINSIGHT continues to benefit from the advantages of the probabilistic INSIGHT model of molecular evolution.
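
The linear-sigmoid mapping described above can be illustrated with a minimal Python sketch. It assumes the standard Gompertz form exp(-exp(-x)) for g() and the standard logistic form for h(); the exact parameterization used by LINSIGHT is given in the Supplementary Note and may differ. The function names, the three toy features, and the weight values are all hypothetical.

```python
import math


def gompertz(x: float) -> float:
    """Gompertz sigmoid exp(-exp(-x)), mapping the real line to (0, 1).

    Assumed standard form; the input is clamped to avoid overflow.
    """
    x = max(-30.0, min(30.0, x))
    return math.exp(-math.exp(-x))


def logistic(x: float) -> float:
    """Logistic sigmoid 1 / (1 + exp(-x)), with the input clamped."""
    x = max(-30.0, min(30.0, x))
    return 1.0 / (1.0 + math.exp(-x))


def selection_parameters(features, w_rho, w_gamma):
    """Return (rho_i, gamma_i) for one site from its feature vector D_i.

    Each parameter is a weighted sum of the features passed through a sigmoid.
    """
    z_rho = sum(w * d for w, d in zip(w_rho, features))
    z_gamma = sum(w * d for w, d in zip(w_gamma, features))
    return gompertz(z_rho), logistic(z_gamma)


# Hypothetical site: constant (bias) term, a scaled conservation score,
# and an indicator for overlap with a DNase-I hypersensitive site.
features = [1.0, 0.8, 1.0]
w_rho = [-1.5, 2.0, 0.5]    # illustrative weights, not fitted values
w_gamma = [0.0, -0.5, 0.2]  # illustrative weights, not fitted values
rho_i, gamma_i = selection_parameters(features, w_rho, w_gamma)
```

Because each selection parameter needs only one weight per feature, adding a feature adds two weights in total, which is the linear scaling contrasted with the exponential growth in fitCons clusters.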

Fig. 1

Conceptual overview of LINSIGHT. (a) Like the fitCons method[3], LINSIGHT estimates probabilities that mutations at each genomic site will have fitness consequences, based on patterns of genetic polymorphism within a species (here, humans) and patterns of divergence from closely related outgroup species (chimpanzee, orangutan, and rhesus macaque). Patterns of genetic variation at the focal site and other sites like it are contrasted with those in neutrally evolving regions nearby. Red circles indicate human single nucleotide polymorphisms and blue circles indicate nucleotide substitutions between species. (b) LINSIGHT combines the probabilistic graphical model from INSIGHT[21,22] with a generalized linear model. The selection parameters from INSIGHT, ρ and γ, are defined in a sitewise manner by linear combinations of local genomic features, followed by sigmoid transformations. The figure summarizes the behavior at a particular focal site i. The matching shaded regions in the left of (a) and the left of (b) indicate corresponding portions of the INSIGHT model and the phylogeny and sequence data. See Table 1 for definitions of all parameters and variables.

Table 1

Summary of key model parameters and variables

	Parameters inherited from INSIGHT^a
ρ_i	Probability that site i is under selection. Interpreted as the LINSIGHT score for site i
γ_i	Expected relative rate of low-frequency derived alleles at site i given that it is under selection
λ_i	Neutral substitution rate at site i
θ_i	Neutral polymorphism rate at site i
β = (β₁,β₂,β₃)	Fractions of neutral polymorphisms with low-, intermediate-, and high-frequency derived alleles
	Variables inherited from INSIGHT^a
X_i = (X_i^maj,X_i^min,Y_i)	Observed polymorphism data at site i, including major allele, minor allele, and minor-allele frequency class
Z_i	Human-chimpanzee ancestral allele at site i
A_i	Human ancestral allele at site i
S_i	Indicator for whether or not site i is under selection
	Components of LINSIGHT’s generalized linear model^a
D_i = (d_i,1,…,d_i,m)	Genomic feature vector at site i
W_ρ = (w_ρ,1,…,w_ρ,m)	Weight vector for ρ (free parameters)
W_γ = (w_γ,1,…,w_γ,m)	Weight vector for γ (free parameters)
g()	Sigmoid function for ρ (Gompertz)
h()	Sigmoid function for γ (logistic)

^a See Supplementary Note and Supplementary Table 1 for full details.

All parameters of the LINSIGHT model are estimated simultaneously from genome-wide data by maximum likelihood using an online stochastic gradient descent algorithm (Methods). The gradients for the feature weights are efficiently computed by the back-propagation method widely used in neural network training[26]. Indeed, the model can be considered a type of neural network, albeit one without hidden layers. Its main disadvantage relative to fitCons, the assumption of an additive, linear relationship between features and selection parameters, could be addressed by adding hidden layers to the neural network, although we have found its performance to be excellent without this extension. Notably, the amount of data available for training is large in comparison to the number of free parameters, and we have not yet found regularization to be necessary, but it could easily be added if needed.
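
As a rough illustration of online stochastic gradient training with a chain-rule (back-propagation) gradient, the following Python sketch fits the weights of a single linear-Gompertz unit. It is not the LINSIGHT likelihood: a simple Bernoulli likelihood on a simulated per-site indicator stands in for the INSIGHT model, and the function names, learning rate, gradient clipping, and toy data are all assumptions made for this example.

```python
import math
import random


def gompertz(z):
    """Gompertz sigmoid exp(-exp(-z)); z is clamped to avoid overflow."""
    z = max(-30.0, min(30.0, z))
    return math.exp(-math.exp(-z))


def mean_log_likelihood(sites, w):
    """Average Bernoulli log-likelihood (toy stand-in for the INSIGHT likelihood)."""
    total = 0.0
    for d, y in sites:
        rho = gompertz(sum(wj * dj for wj, dj in zip(w, d)))
        rho = min(max(rho, 1e-9), 1.0 - 1e-9)
        total += math.log(rho) if y else math.log(1.0 - rho)
    return total / len(sites)


def sgd_fit(sites, n_features, lr=0.02, epochs=10, seed=0):
    """Online stochastic gradient ascent on the toy log-likelihood."""
    rng = random.Random(seed)
    w = [0.0] * n_features
    order = list(range(len(sites)))
    for _ in range(epochs):
        rng.shuffle(order)
        for idx in order:  # one weight update per site
            d, y = sites[idx]
            z = max(-30.0, min(30.0, sum(wj * dj for wj, dj in zip(w, d))))
            rho = min(max(math.exp(-math.exp(-z)), 1e-9), 1.0 - 1e-9)
            # Chain rule: dL/dw_j = (dL/drho) * (drho/dz) * d_j
            dl_drho = 1.0 / rho if y else -1.0 / (1.0 - rho)
            drho_dz = rho * math.exp(-z)  # derivative of the Gompertz sigmoid
            delta = dl_drho * drho_dz
            # Clip for numerical safety (an added safeguard, not from the paper).
            delta = max(-5.0, min(5.0, delta))
            w = [wj + lr * delta * dj for wj, dj in zip(w, d)]
    return w


# Simulated toy data: a bias term plus one feature that raises the
# probability of the indicator being 1.
data_rng = random.Random(1)
true_w = [-1.0, 2.0]
sites = []
for _ in range(2000):
    d = [1.0, data_rng.random()]
    y = 1 if data_rng.random() < gompertz(true_w[0] + true_w[1] * d[1]) else 0
    sites.append((d, y))

w_hat = sgd_fit(sites, n_features=2)
```

In the real model the per-site log-likelihood comes from the INSIGHT graphical model over polymorphism and divergence data, so only the outermost derivative changes; the chain-rule step through the sigmoid and the linear combination has the same shape as in this sketch.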

LINSIGHT scores across the human genome are generally consistent with, but often improve on, previous measures of evolutionary conservation

We applied LINSIGHT to a large public data set consisting of complete genome sequences for multiple human individuals and nonhuman primates, comparative genomic data for mammals and vertebrates, and a wide variety of functional genomic data, and we generated LINSIGHT scores for all positions across the human reference genome (Methods). We considered a total of 48 genomic features, falling in three general classes: conservation scores, predicted binding sites, and regional annotations (Table 2 and Supplementary Table 3).

Table 2

Summary of genomic features used for LINSIGHT scores

Class	Genomic feature^a	Spatial resolution
Conservation	phyloP score	High
	phastCons element	High
	SiPhy element	High
	CEGA element	High
Binding site	Conserved TFBS	High
	rVISTA TFBS	High
	SwissRegulon TFBS	High
	Predicted TFBS within ChIP-seq peak	High
	Conserved miRNA binding site	High
	Splicing site predicted by SPIDEX	High
Regional annotation	ChIP-seq peak of transcription factor	Low
	DNase-I hypersensitive site	Low
	UCSC FAIRE peak	Low
	RNA-seq signal	Low
	Histone modification peak	Low
	FANTOM5 enhancer	Low
	Predicted distal regulatory module	Low
	Distance to nearest TSS	Low

^a Each “genomic feature” listed here may actually correspond to multiple features in the model. For example, four features are derived from phyloP scores: two from the mammalian phyloP scores and two from the vertebrate phyloP scores. See Supplementary Table 3 for complete details.

The distributions of LINSIGHT scores in annotated regions of the noncoding genome are generally consistent with previous observations based on conservation scores[1,4,25]. For example, splice sites are very highly constrained (median LINSIGHT score of 0.956, indicating a 95.6% probability of fitness consequences due to mutations at these nucleotide sites), whereas annotated TFBSs show reduced, but still substantial, constraint (median score of 0.240 for TFBSs shared across species, median score of 0.106 for all TFBSs from the Ensembl Regulatory Build[27]; Figure 2a). Other promoter regions (median score of 0.073) and untranslated regions (UTRs; median scores of 0.128 and 0.076 for 5’ and 3’ UTRs, respectively) are somewhat less constrained, and unannotated intronic and intergenic regions exhibit the least constraint (median scores of 0.044–0.048). As observed previously, 5’ UTRs show somewhat more constraint than 3’ UTRs, although both types of UTRs contain subsets of sites subject to strong selection (LINSIGHT score > 0.8)[4,25]. The estimate for the more conserved TFBSs (0.240) is roughly similar to, if slightly lower than, previous estimates directly obtained from experimentally defined TFBSs (~30–40% of sites under selection[22,28]), even though it was obtained indirectly in this case via the generalized linear model. The genome-wide average of the LINSIGHT scores is about 0.07, suggesting that about 7% of noncoding sites are under evolutionary constraint, consistent with numerous previous studies[3,4,29-31].

Fig. 2

Summary of LINSIGHT scores across the noncoding human genome (3.001 billion nucleotide sites). (a) Distributions of LINSIGHT scores for various genomic regions. Intergenic, intronic, UTRs, and 1-kb promoters were defined based on GENCODE annotations (version 19); TFBSs were predicted from ChIP-seq peaks (Ensembl Regulatory Build); and conserved TFBSs were obtained from the UCSC Genome Browser. Within each violin plot, width represents density and black dot represents median LINSIGHT score. Note the logarithmic vertical scale. (b) UCSC Genome Browser display showing LINSIGHT scores alongside those from fitCons, phastCons, phyloP, and GERP++. LINSIGHT integrates functional genomic data together with conservation scores and other features to provide a high-powered, high-resolution measure of potential function. In this example, it is the only method to highlight a variant from HGMD (CR065653) that is associated with up-regulation of the telomerase reverse transcriptase (TERT) gene. See Supplementary Figure 3 for additional examples.

Across all noncoding positions in the genome, the LINSIGHT scores are fairly well correlated with those from other recently published methods, particularly within conserved elements, which are enriched for regulatory function (see Supplementary Note and Supplementary Figure 1). On the task of identifying likely regulatory elements, the methods that make use of functional genomic data generally perform better than pure conservation methods, and LINSIGHT is among the best at this task (see Supplementary Note). For example, LINSIGHT has good power to identify transcription factor binding sites from the ORegAnno database[32] (AUC = 0.926), outperformed only by the DeepSEA functional significance score (AUC = 0.965) and FunSeq2 (AUC = 0.950) (Supplementary Figure 2). Thus, although it relies on an evolutionary objective function, LINSIGHT maintains good performance in the prediction of regulatory elements. Consistent with these general trends, LINSIGHT highlights many of the regions identified by conservation methods such as phastCons[4], phyloP[25], and GERP++[33], but also identifies some regions that have relatively low conservation scores yet are likely to have important biological functions. An example is HGMD variant CR065653 in a putative enhancer, associated with upregulation of the telomerase reverse transcriptase (TERT) gene, which obtains an elevated LINSIGHT score, but is not identified by phastCons, phyloP, or GERP++ as being under constraint (Figure 2b). This example also demonstrates that the genomic resolution of the LINSIGHT scores is dramatically better than that of fitCons, and approaches the nucleotide resolution of phyloP and GERP++. LINSIGHT can identify functional variants not only in enhancers but also in promoter regions (Supplementary Figure 3a) and associated with splicing (Supplementary Figure 3b). Thus, it is useful as a general predictor of functional noncoding sites under evolutionary constraint.

LINSIGHT accurately identifies disease-associated variants in noncoding regions

We tested the ability of LINSIGHT to identify noncoding nucleotide positions that are associated with inherited human diseases, using the HGMD[23] and ClinVar[24] databases to define positive examples, and common polymorphisms (MAF > 1%), which are unlikely to be functionally important, to define negative examples. For comparison, we evaluated the CADD[18], Eigen[34], DeepSEA[16], FunSeq2[20], GWAVA[13], and phyloP[25] methods on the same task. For each scoring method, we computed false positive vs. true positive rates for the complete range of score thresholds, displaying the results as receiver operating characteristic (ROC) curves and measuring prediction power by the area-under-the-curve (AUC) statistic. Because the results of these tests can be highly sensitive to the criteria for selecting negative examples, we considered three schemes of increasing stringency (following ref. [13]): a random sample of negative examples (unmatched), negative examples matched by distance to the nearest transcription start site (matched TSS), and negative examples matched by specific genomic region (matched region; see Methods for details). In all cases, equal numbers of positive and negative examples were considered. Overall, LINSIGHT outperformed all other methods in all comparisons (Figure 3). Its absolute prediction power varied across matching schemes in a predictable manner, being highest in the unmatched comparison (e.g., AUC = 0.897 for HGMD) and decreasing in the matched TSS (AUC = 0.759) and matched region (AUC = 0.661) comparisons. The same effect also occurred for most other methods, but the methods that make heavier use of regional information, such as FunSeq2, suffered more as the matching stringency increased. These observations highlight the difficulty of distinguishing functional sites from nearby nonfunctional sites, which is considerably harder than separating regions enriched in functional sites from the genomic background. 
Nevertheless, LINSIGHT has some power for this challenging task. In almost all cases, the AUCs were considerably higher for ClinVar than for HGMD, apparently because ClinVar is heavily enriched for variants in splice sites, which are relatively easy to identify (Supplementary Figure 4). An exception to this rule was GWAVA, which performs exceptionally well on HGMD (cross-validation AUCs of 0.71–0.97)[13] and much more poorly on ClinVar (AUCs of 0.734–0.884), but GWAVA was trained using HGMD[13] and its performance on that data set appears to reflect overfitting (it is not shown in the HGMD ROC plots for this reason). This dependency on the training set for GWAVA demonstrates one of the pitfalls of pure classification strategies, and highlights a strength of the evolution-based strategy, which does not require a training set. Nevertheless, phyloP performs quite poorly on the HGMD data set, showing that scores based exclusively on evolution are of limited usefulness in this task.
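
The AUC evaluation with balanced positive and negative sets can be sketched in Python as follows. The sketch computes AUC as the probability that a randomly chosen positive example outscores a randomly chosen negative one, which equals the area under the ROC curve, and it averages the AUC over repeated balanced subsamples. Averaging AUC values directly is a simplification of averaging the true and false positive rates; the function names and toy scores are hypothetical.

```python
import random


def auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison (ties count 1/2)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def balanced_auc(pos_scores, neg_scores, n_rounds=100, seed=0):
    """Mean AUC over subsamples with equal numbers of positives and negatives."""
    rng = random.Random(seed)
    k = min(len(pos_scores), len(neg_scores))
    total = 0.0
    for _ in range(n_rounds):
        total += auc(rng.sample(pos_scores, k), rng.sample(neg_scores, k))
    return total / n_rounds


# Toy scores: three disease variants against six common polymorphisms.
disease = [0.92, 0.40, 0.75]
common = [0.05, 0.10, 0.45, 0.08, 0.30, 0.02]
mean_auc = balanced_auc(disease, common)
```

The pairwise loop is quadratic in the number of examples, which is acceptable for a toy case; a rank-based formula gives the same value far more cheaply on data sets of realistic size.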

Fig. 3

Prediction power of various computational methods for distinguishing disease-associated noncoding variants from variants not likely to have phenotypic effects. True positive and false positive rates are proportions of disease and neutral variants, respectively, having scores that exceed each threshold, as the threshold is varied. Power is quantified using the Area Under the Curve (AUC) statistic. Results are shown for positive examples from the HGMD[23] (1495 variants) and ClinVar[24] (101 variants not in HGMD) databases. Only autosomal variants were included and duplicated variants were removed. Common SNPs (MAF > 1%) were used as negative examples and were either randomly selected (unmatched), matched to positive examples by distance to nearest transcription start site (matched TSS), or matched to positive examples within 1 kb along the genome (matched region). The numbers of positive and negative examples were balanced by subsampling, which was performed 100 times to obtain average true positive and false positive rates. LINSIGHT is compared with CADD[18], phyloP[25], FunSeq2[20], DeepSEA[16], Eigen[34], and GWAVA[13]. FitCons is not included because it performs poorly on this task due to its low genomic resolution and cell-type specificity. GWAVA results are not shown for the HGMD data set because GWAVA was trained on this data set.

The performance advantage of LINSIGHT was maintained when performance was measured using precision-recall curves in place of standard ROC curves (Supplementary Figure 5) and when rare variants were used in place of common variants as negative examples (Supplementary Figures 6 & 7). These performance advantages are statistically significant in most cases, with a few exceptions mostly stemming from the small size of the ClinVar data set (Supplementary Tables 4 and 5). In addition, a more detailed comparison with CADD showed that training CADD’s logistic regression model using LINSIGHT’s features resulted in improved performance, but not enough to make it competitive with LINSIGHT (Supplementary Table 6). Thus, the excellent performance of LINSIGHT in these tests appears to derive both from its use of a broad collection of informative features along the genome and its probabilistic model of evolution. To gain insight into which genomic features were most informative, we systematically omitted groups of related features and reassessed the prediction performance of LINSIGHT (Supplementary Note). Briefly, we found that regional features, such as ChIP-seq peaks and DNase-I hypersensitive sites (Table 2), were broadly useful in distinguishing genomic regions enriched for functional variants from the genomic background, but conservation scores were most important in separating functional sites from nearby nonfunctional sites (Supplementary Figure 8). Predicted binding sites were most informative in promoter regions.

The evolutionary constraints on enhancers are context-dependent

LINSIGHT is also potentially useful for studying the influence of natural selection on noncoding sequences. Compared with other measures of selection, LINSIGHT has the advantages of considering both functional genomic and population genomic data, of detecting the influence of selection on relatively recent time scales (e.g., since the human/chimpanzee divergence), and of providing a model-based, easily interpretable measure of fitness consequences. With these advantages in mind, we used LINSIGHT to gauge the degree of evolutionary constraint on enhancers in the human genome, considering in particular the relationships of constraint with the number and type of active cell types, and with constraint at the target promoter of each enhancer. We analyzed nearly 30,000 enhancers (median length 293 bp) from a recently published atlas of active enhancers in dozens of human cell types and tissues, which were identified based on their transcriptional signatures[35]. This approach of annotating enhancers based on enhancer-associated RNAs (eRNAs) has been shown to identify elements having active roles in gene regulation in a cell-type-specific fashion with high genomic resolution[35-37]. First, we examined the relationship between the LINSIGHT scores and the number of cell types in which each enhancer is active. We found that the LINSIGHT scores were significantly positively correlated with the number of active cell types (Spearman’s ρ = 0.284, p < 10^−15; Figure 4a), indicating that a broader spectrum of activity across cell types is associated with stronger purifying selection. To ensure that this observation reflected real differences in selective pressure and not simply correlations with the epigenomic features considered by LINSIGHT, we retrained LINSIGHT using only conservation scores and predicted binding sites and obtained essentially identical results (Supplementary Figure 9a).
Furthermore, a partial correlation test indicated that the LINSIGHT scores were still strongly correlated with the number of cell types when controlling for eRNA expression level averaged across all FANTOM5 libraries (partial Spearman’s ρ = 0.24; p < 10^−15). These findings parallel similar findings for protein-coding genes[38-40] and TFBSs[22] and likely reflect a general correlation between pleiotropy and constraint (see Discussion).
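
For readers who want to reproduce this kind of analysis, the following Python sketch computes Spearman's rank correlation (Pearson correlation of average ranks) and a first-order partial correlation from three pairwise coefficients. The text does not specify how the partial Spearman correlation was computed, so the standard first-order formula used here is an assumption, as are the function names and the toy data.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # mean of positions i..j, converted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0.0 or vy == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def partial_corr(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))


# Toy data: mean score, number of active cell types, and expression level
# for five hypothetical enhancers.
scores = [0.05, 0.12, 0.08, 0.30, 0.22]
n_cell_types = [1, 3, 2, 10, 3]
expression = [2.0, 1.0, 4.0, 3.0, 5.0]
rho_sc = spearman(scores, n_cell_types)
rho_partial = partial_corr(
    rho_sc, spearman(scores, expression), spearman(n_cell_types, expression)
)
```

Applying the partial-correlation formula to rank correlations asks whether the association between score and breadth of activity persists once the rank association each shares with expression level is removed.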

Fig. 4

Evolutionary constraints on enhancers. (a) Probability of fitness consequences for mutations in enhancers (measured by average LINSIGHT score) is positively correlated with the number of cell types in which each enhancer is active (Spearman’s rank correlation coefficient ρ = 0.284; two-tailed p-value < 10^−15). Results are shown for 29,303 enhancers in 69 cell types. (b) Probability of fitness consequences for mutations in enhancers is positively correlated with probability of fitness consequences for mutations in associated promoters (Spearman’s rank correlation coefficient ρ = 0.150; two-tailed p-value < 10^−15). Results are shown for 25,067 enhancer-promoter pairs.

Second, we examined the relationship between the LINSIGHT score and the tissue type in which each enhancer is active, focusing on enhancers active in a single tissue type. We found that tissue-specific enhancers associated with sensory perception (olfactory region and parotid gland), the immune system (lymph node), digestion (stomach), and male reproduction (penis and testis) had the lowest LINSIGHT scores, whereas tissue-specific enhancers associated with tissues such as smooth muscle, the skin, and the urinary tract and bladder had the highest LINSIGHT scores (Supplementary Figure 10). These findings are also broadly consistent with findings for protein-coding genes, which have indicated that sensory, immune, dietary, and male reproductive genes are associated with relaxation of constraint and/or positive selection[40,41]. Interestingly, enhancers active in tissues associated with female reproduction (e.g., uterus, female gonad, and vagina) appeared to be under substantially more constraint than those active in tissues associated with male reproduction. Finally, we compared the LINSIGHT scores at enhancer/promoter pairs predicted from co-expression across tissues[35]. The LINSIGHT scores for these paired enhancers and promoters are weakly but significantly correlated (Figure 4b and Supplementary Figure 9b), indicating that the same types of evolutionary pressures tend to act at both members of each pair. Together, these results indicate that the evolutionary constraints on enhancers are dependent on several factors, including their degree of tissue specificity, the particular tissues in which they are active, and the evolutionary constraints associated with their target promoters.

Discussion

As sequencing costs fall and appreciation for regulatory variation grows, whole genome sequencing is rapidly supplanting exome sequencing as the primary technique for identifying and characterizing genetic variants that have phenotypic consequences. Hence, there is an increasing need for computational methods that can effectively prioritize noncoding variants based on their likelihood of phenotypic importance. In this paper, we address this problem with a new computational method, called LINSIGHT, that combines the evolutionary model of our previously developed INSIGHT method with a generalized linear model for functional genomic data and genome annotations, resulting in substantially improved scalability, resolution, and power. We have generated LINSIGHT scores across the human genome, making use of a large collection of publicly available population, comparative, and functional genomic data, and we find the scores to be consistent with previously available scores in many respects, but to improve on them in others. In particular, on the task of identifying human disease-associated variants from the HGMD and ClinVar databases, LINSIGHT offered the best performance of several methods we tested, across a range of types of variants and test designs. Importantly, LINSIGHT requires no training set of known regulatory or disease variants and therefore is expected to have better generalization properties than “supervised” machine-learning classifiers (see Introduction).

In conceptual terms, the new LINSIGHT method is closely related to our previous fitCons method[3], with the primary difference being that LINSIGHT pools data across sites implicitly through the use of its generalized linear model, whereas fitCons pools data by explicitly clustering sites according to discretized functional genomic signatures. In effect, LINSIGHT trades the restrictions of a linearity assumption for the benefits of computational speed, a reduced parameterization, and scalability to very large numbers of genomic features. Notably, the new model design also has a number of important side benefits. First, it avoids the need for discretization of the genomic features. In addition, as the number of features grows larger, the genomic resolution of the scores naturally becomes much finer, approaching the nucleotide-level resolution of conservation scores. Finally, the generalized linear model can readily be extended to a “deep” neural network through the addition of hidden layers. While it remains to be seen how much this extension will help in practice, in principle it can capture the types of nonlinearity and interactions between features that have been observed in this setting (for examples, see references [3] and [42]).

Our approach to characterizing noncoding variants is based on the premise that natural selection in the past, at individual nucleotide sites, provides useful information about phenotypic importance in the present. This assumption clearly will not hold in all cases. For example, variants that increase the risk for post-reproductive diseases or that influence phenotypes dependent on the modern human environment will not necessarily show signs of historical purifying selection. In addition, traits dependent on highly epistatic loci or on the aggregate contributions of large numbers of loci may have difficult-to-detect marginal contributions to fitness at individual nucleotides. Nevertheless, our results indicate that the evolution-based approach is useful for many phenotypes of interest. Furthermore, in comparison to the available high-throughput experimental methods, evolution-based methods have the crucial advantage of measuring the importance of genetic variants in real organisms in their natural environments over many generations.

Using LINSIGHT, we examined the influence of negative selection on enhancers, considering the relationships between constraint on enhancers and numbers of active cell types, tissue of activity, and constraint at associated promoters. LINSIGHT is potentially useful for addressing these questions because it should be much more robust to evolutionary turnover than conventional conservation-based methods, and some classes of enhancers are known to turn over more quickly than others[43]. We found that, in general, the trends in constraint at enhancers parallel those previously reported for protein-coding genes. For example, constraint increases with breadth of activity across cell types and decreases in tissues associated with rapid evolution, such as olfactory regions, the immune system, and male reproduction. Constraint also appears to be correlated at enhancer/promoter pairs. These observations about the specific ways in which evolutionary constraints on enhancers depend on genomic context may be useful in improving the prediction power for the fitness consequences of noncoding mutations. As has been suggested for protein-coding genes[38], it seems plausible that the positive correlation between the strength of constraint and the number of active cell types can be explained by pleiotropy: enhancers active in more cell types are more likely to participate in multiple regulatory networks, perhaps with distinct roles involving the binding of different factors and/or the use of different binding sites within each enhancer. As a result, they may be subject to greater constraint. Nevertheless, many open questions remain about the influences of constraint on enhancers, and it will be important to examine these questions further in light of rapidly improving enhancer annotations, data describing enhancer-promoter interactions[44-46], and observations of complex evolutionary behavior at enhancers[47].

Online Methods

Genomic features

The genomic features used by LINSIGHT can be divided into three categories: conservation scores, predicted binding sites, and regional annotations (Table 2 and Supplementary Table 3). Conservation scores included phyloP scores[25], phastCons elements[4], SiPhy omega elements[48,49], and CEGA elements[50]. Except for SiPhy, each score type was represented by multiple data tracks (for example, phastCons tracks for vertebrate, mammalian, and primate alignments; Supplementary Table 3). Predicted binding sites included transcription factor binding sites (TFBSs) and RNA binding sites. Predicted TFBSs were obtained from the conserved TFBS track in the UCSC Genome Browser[51], the rVISTA database[52], SwissRegulon[53], FunSeq2[20], and the Ensembl Regulatory Build[27]. RNA binding sites included splice sites predicted by SPIDEX[54] and miRNA target sites predicted by TarBase[55]. The regional annotations were based on a variety of sources, including ChIP-seq and RNA-seq data from the ENCODE[11] and Roadmap Epigenomics[12] projects, enhancers from FANTOM5[35], predicted distal regulatory modules from FunSeq2[20], and the distances to nearest TSSs based on GENCODE gene models[56]. All features and the resulting LINSIGHT scores were expressed in genomic coordinates for the hg19 assembly of the human genome.

Polymorphism and divergence data

The polymorphism and divergence data used by the INSIGHT component of the LINSIGHT model were borrowed from previous analyses[3,21,22]. Briefly, we obtained human single nucleotide polymorphisms from high-coverage genome sequences for 54 unrelated individuals from the “69 Genomes” data set from Complete Genomics, eliminating nucleotide sites with more than two alleles. Outgroup alleles were defined by the aligned chimpanzee, orangutan, and rhesus macaque reference genomes from UCSC. Several filters were applied to these data to reduce technical errors from alignment, sequencing, and genotype inference; for example, we removed simple repeats, recent transposable elements, recent segmental duplications, and regions not in syntenic alignment across primates[22]. Putatively neutral regions were identified by starting with all aligned regions, then removing coding sequences, conserved noncoding sequences, and their proximal flanking regions. These regions were used to estimate neutral divergence and polymorphism rates in the human lineage in a block-wise manner across the genome, to account for local variation in mutation rates[21]. To allow for uncertainty in the human-chimpanzee most recent common ancestor (MRCA), we integrated over a distribution of ancestral alleles inferred after fitting a standard phylogenetic model to the outgroup sequences[21].

Generalized linear model

The selection parameters in the INSIGHT model, ρ and γ, were defined as linear-sigmoid functions of the local genomic features at each nucleotide site i. Specifically, if D_i is a column vector of genomic features at site i, then

ρ_i = g(W_ρ D_i) and γ_i = h(W_γ D_i),

where the row vectors W_ρ and W_γ consist of feature weights (free parameters in the model) and g() and h() are sigmoid functions that map all input values to the range (0,1). For h(), we used the standard logistic function, h(x) = 1/(1 + exp(–x)). For g(), however, we used the asymmetric Gompertz sigmoid function[57], g(x) = exp[–3exp(–x)], which ensured that gradients were not too small when ρ is close to zero and accelerated convergence during model fitting.
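As an illustration of the linear-sigmoid mapping described above, the following minimal Python sketch computes ρ and γ for a single site. The feature values and weights are made-up numbers chosen only for demonstration, and the function names (selection_parameters, gompertz, logistic) are hypothetical; this is not taken from the LINSIGHT code base.

```python
import math


def logistic(x):
    """Standard logistic sigmoid h(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def gompertz(x):
    """Asymmetric Gompertz sigmoid g(x) = exp(-3 * exp(-x))."""
    return math.exp(-3.0 * math.exp(-x))


def dot(weights, features):
    """Inner product of a weight row vector and a feature column vector."""
    return sum(w * d for w, d in zip(weights, features))


def selection_parameters(w_rho, w_gamma, features):
    """Map one site's features to (rho, gamma), both in the open interval (0, 1)."""
    rho = gompertz(dot(w_rho, features))
    gamma = logistic(dot(w_gamma, features))
    return rho, gamma


# Hypothetical site: an intercept term, one conservation score, one binary annotation.
features = [1.0, 0.8, 0.0]
w_rho = [-1.0, 2.0, 0.5]      # illustrative weights, not fitted values
w_gamma = [0.0, -0.5, 1.0]    # illustrative weights, not fitted values
rho, gamma = selection_parameters(w_rho, w_gamma, features)
print(rho, gamma)
```

Note that g(0) = exp(–3), roughly 0.05, whereas h(0) = 0.5, which makes the asymmetry of the Gompertz function easy to see.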

Fitting the LINSIGHT model to the data

The weights for all genomic features were estimated by approximately maximizing the log likelihood of the INSIGHT model with respect to our genome-wide data set. We began by considering all genomic positions not excluded by our data-quality filters. Because our focus was on noncoding regions, we additionally excluded coding regions annotated by GENCODE (release 19). Instead of a traditional “batch” learning algorithm, which would require either storing all data in memory or reading it from disk many times, we used an “online” stochastic gradient descent algorithm[58]. The algorithm processed the genome sequentially, in “minibatches” of 100 successive nucleotides, each time updating the parameter vector in the direction of the gradient of the log likelihood function, with learning rates of 0.001 and 0.01 for ρ and γ, respectively. Gradients were computed analytically, by propagating partial derivatives through the linear-sigmoid component of the model using the chain rule (back-propagation). The learning procedure was truncated after 20 passes through the entire data set. The entire process took less than one day on a desktop computer. The genome-wide LINSIGHT scores are available from the Cold Spring Harbor Laboratory mirror of the UCSC Genome Browser (hg19 assembly).
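The update scheme described above can be sketched in a few lines of Python. Because the INSIGHT likelihood is not reproduced in this section, the sketch substitutes a toy Bernoulli likelihood with a single parameter ρ per site; only the minibatch size (100), the learning rate for ρ (0.001), the 20 passes, and the chain-rule gradient through the Gompertz sigmoid follow the text. The simulated data, the weights in true_w, and all function names are assumptions made for illustration.

```python
import math
import random

EPS = 1e-12  # keeps rho away from exactly 0 or 1 before logs and divisions


def gompertz(x):
    """Gompertz sigmoid g(x) = exp(-3 * exp(-x))."""
    return math.exp(-3.0 * math.exp(-x))


def dot(w, d):
    return sum(wj * dj for wj, dj in zip(w, d))


def clipped_rho(w, d):
    return min(max(gompertz(dot(w, d)), EPS), 1.0 - EPS)


def log_likelihood(w, sites):
    """Toy Bernoulli log likelihood (a stand-in, NOT the INSIGHT likelihood)."""
    total = 0.0
    for d, y in sites:
        rho = clipped_rho(w, d)
        total += math.log(rho) if y else math.log(1.0 - rho)
    return total


def fit(sites, n_features, learning_rate=0.001, batch_size=100, passes=20):
    """Minibatch stochastic gradient ascent over sites in their given order."""
    w = [0.0] * n_features
    for _ in range(passes):
        for start in range(0, len(sites), batch_size):
            grad = [0.0] * n_features
            for d, y in sites[start:start + batch_size]:
                x = dot(w, d)
                rho = clipped_rho(w, d)
                dl_drho = 1.0 / rho if y else -1.0 / (1.0 - rho)  # dL/d(rho)
                drho_dx = rho * 3.0 * math.exp(-x)                # g'(x)
                for j in range(n_features):
                    grad[j] += dl_drho * drho_dx * d[j]           # dx/dw_j = d_j
            # Step in the direction of the gradient of the log likelihood.
            w = [wj + learning_rate * gj for wj, gj in zip(w, grad)]
    return w


# Simulated sites: an intercept plus one feature in [0, 1]; labels drawn from rho.
rng = random.Random(0)
true_w = [-0.5, 2.0]  # hypothetical generating weights
sites = []
for _ in range(2000):
    d = [1.0, rng.random()]
    sites.append((d, rng.random() < gompertz(dot(true_w, d))))

w_fit = fit(sites, n_features=2)
print(log_likelihood([0.0, 0.0], sites), log_likelihood(w_fit, sites))
```

In the full model a second weight vector for γ would be updated in the same loop with its own learning rate (0.01 in the text), and each minibatch would be read from disk rather than held in memory.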

Comparison with other methods

Our benchmarking scheme for prioritization of disease-associated variants closely followed the one introduced in ref. [13]. The HGMD and ClinVar noncoding disease variants and three sets of negative controls were obtained directly from this study. The negative controls consisted of: (1) a randomly selected subset of human common variants which is 100-fold larger than the set of HGMD variants (unmatched); (2) a subset of human common variants matched to the disease variants by exact distance-to-nearest-TSS (matched TSS) (although each negative example is not necessarily near the same TSS as the matched disease variant); and (3) a subset of human common variants required to be within 1 kb of the matched disease variants (matched region). The two matched sets account for the enrichment of known disease variants near coding genes. We later defined three additional sets of negative controls by the same strategy but using singleton variants from the 1000 Genomes Project phase 3 data[59] instead of common variants, to ensure that our results were not driven by differences in allele frequency between the disease variants and negative controls. In all cases, we subsampled the negative sets to balance the numbers of positive and negative examples. To reduce stochasticity, subsampling was performed 100 times and average performance statistics were reported.

For comparison, we downloaded precomputed CADD[18] (v1.3), GWAVA[13] (v1.0), FunSeq2[20] (v2.1.0), and Eigen[34] (Oct. 11, 2015) scores from the source websites. In all cases, we used GWAVA scores based on training with variants matched by distance-to-nearest-TSS[13]. In addition, we obtained mammalian phyloP[25] scores based on the 46-way whole-genome alignment for hg19 from the UCSC Genome Browser[51], and we computed DeepSEA functional significance scores for both disease variants and negative controls using the online DeepSEA web service[16] (computed on Nov 3, 2016). The DeepSEA functional significance scores integrate individual tissue-specific DeepSEA scores based on polymorphism data; these were used in all comparisons because the tissue types associated with disease variants and ORegAnno TFBSs are typically unknown. Note that two of the methods considered, CADD and DeepSEA, provide allele-specific predictions, whereas the other methods assign identical scores to all alternative variants. When evaluating CADD and DeepSEA on the ClinVar data set, we used the score corresponding to the annotated disease-associated allele. When evaluating these methods on the HGMD data set, however, no disease-associated allele was provided, so we used the maximum score for the three alternative alleles.
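The balanced-subsampling evaluation and the maximum-over-alleles rule described above can be sketched as follows. This is an illustrative Python sketch, not the benchmarking code used in the study: the function names are hypothetical, the example scores are invented, the AUC is computed by direct pairwise comparison, and the negative set is assumed to be at least as large as the positive set.

```python
import random


def auc(pos_scores, neg_scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def mean_balanced_auc(pos_scores, neg_scores, n_repeats=100, seed=0):
    """Average AUC over repeated subsamples of negatives matched in size to positives.

    Assumes len(neg_scores) >= len(pos_scores).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_repeats):
        neg_sample = rng.sample(neg_scores, len(pos_scores))
        total += auc(pos_scores, neg_sample)
    return total / n_repeats


def max_allele_score(allele_scores):
    """Score for a site with no annotated disease allele: maximum over alternative alleles."""
    return max(allele_scores.values())


# Invented scores for demonstration only.
disease = [0.9, 0.7, 0.8]
controls = [0.1, 0.2, 0.75, 0.3, 0.05, 0.4]
print(mean_balanced_auc(disease, controls))
print(max_allele_score({"A": 0.1, "C": 0.7, "T": 0.3}))
```

The pairwise AUC is quadratic in the number of variants, which is acceptable for a sketch; a rank-based computation would be preferable for large sets.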

Classification of disease-associated variants by genomic location

For analyses that considered the genomic locations of disease-associated variants, we divided the variants in the HGMD and ClinVar databases into four categories based on their locations relative to gene models from GENCODE (release 19). These categories were: (1) “promoter” variants, located within 1 kb upstream of the 5’-most annotated transcription start site of any protein-coding gene; (2) “splicing” variants, located within 20 bp of any annotated splice junction; (3) “UTR” variants, located within the annotated 5’ or 3’ UTR of any protein-coding gene; and (4) all “other” variants. Each variant was assigned to the first category whose criteria it fulfilled in the order splicing > UTR > promoter > other.
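The priority rule above (splicing > UTR > promoter > other) amounts to returning the first category whose criterion is met. A minimal Python sketch, assuming the three location tests have already been evaluated as booleans against the gene models; the function name and argument names are hypothetical:

```python
def classify_variant(near_splice_junction, in_utr, in_promoter):
    """Return the first matching category in the order splicing > UTR > promoter > other."""
    checks = (
        ("splicing", near_splice_junction),  # within 20 bp of an annotated splice junction
        ("UTR", in_utr),                     # inside an annotated 5' or 3' UTR
        ("promoter", in_promoter),           # within 1 kb upstream of the 5'-most TSS
    )
    for category, matched in checks:
        if matched:
            return category
    return "other"


# A variant that is both in a UTR and in a promoter window is labeled "UTR".
print(classify_variant(near_splice_junction=False, in_utr=True, in_promoter=True))
```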

Quantification of the contributions of genomic feature classes

We measured the relative contributions of the conservation scores, predicted binding sites, and regional annotations by removing all features of each class (see Table 2), retraining the LINSIGHT model without those features, and evaluating the AUC of the reduced model. The contribution of each class of features was defined as the AUC for the full model minus the AUC for the reduced model, averaged across 100 independent subsamples of negative controls described above. Notice that, while this difference in AUCs is generally positive, it may be negative due to stochastic effects. This analysis was performed on a merged set of HGMD and ClinVar variants, separately for promoter, splicing, UTR, and other regions.

Analysis of evolutionary constraints on enhancers

To study evolutionary constraints on enhancers, we used the comprehensive atlas of human enhancers based on enhancer RNAs (eRNAs) that was recently provided by the FANTOM5 project[35]. The evolutionary constraint for each enhancer was quantified by taking the average LINSIGHT score across all nucleotide sites in the enhancer. We examined the relationship between this measure of constraint and the number of cell types in which each enhancer was active (according to a detectable eRNA signature). We also defined a subset of enhancers as tissue-specific, based on apparent activity in only a single tissue type, and examined the relationship between tissue of activity and degree of constraint. Finally, we obtained putative enhancer-TSS pairs (based on correlated patterns of expression across tissues) from the FANTOM5 website, and examined the correlation in constraint at the enhancer and promoter in each pair, defining the promoter as the 1 kb region upstream of the TSS. In cases where an enhancer was associated with multiple TSSs, the TSS with highest correlation coefficient was selected.
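The three per-enhancer computations described above can be sketched in Python as follows. The coordinate convention (0-based, half-open intervals), the strand handling, and all names are assumptions made for illustration; the section does not specify them.

```python
def mean_score(site_scores):
    """Constraint of an element: average per-site score across its nucleotides."""
    return sum(site_scores) / len(site_scores)


def best_tss(tss_correlations):
    """Pick the TSS with the highest expression correlation for one enhancer."""
    return max(tss_correlations, key=tss_correlations.get)


def promoter_interval(tss, strand, length=1000):
    """Half-open, 0-based interval covering `length` bases upstream of the TSS.

    Assumed convention: `tss` is the 0-based position of the TSS itself, and
    'upstream' means lower coordinates on '+' and higher coordinates on '-'.
    """
    if strand == "+":
        return max(0, tss - length), tss
    return tss + 1, tss + 1 + length


# Invented values for demonstration only.
print(mean_score([0.10, 0.25, 0.40]))
print(best_tss({"tss_a": 0.31, "tss_b": 0.58, "tss_c": 0.12}))
print(promoter_interval(5000, "+"), promoter_interval(5000, "-"))
```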

Statistical analysis

To examine the relationship between evolutionary constraints on enhancers and tissue specificity, Spearman’s rank correlation coefficient was calculated between the average LINSIGHT score for each enhancer and its number of active cell types. To quantify the statistical significance of the correlation, a two-tailed p-value was computed using the standard asymptotic t approximation implemented in the “cor.test” function in R (p < 10^-15; n = 29,303). The same method was used to quantify the statistical significance of the correlation between the average LINSIGHT scores at enhancer/promoter pairs (p < 10^-15; n = 25,067). Furthermore, to investigate the relationship between the average LINSIGHT score in an enhancer and the number of active cell types when controlling for average eRNA expression level, the partial Spearman’s ρ and a two-tailed p-value were computed using the ppcor package[60] (p < 10^-15; n = 29,303). To investigate whether the difference between two AUCs is statistically significant, the DeLong test was used to compute two-tailed p-values[61].
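For readers who want to see what the rank correlation involves, the sketch below computes Spearman's coefficient in plain Python as the Pearson correlation of average ranks, together with the t statistic on which the asymptotic approximation is based, t = r·sqrt((n – 2)/(1 – r²)). It is an illustration only and does not reproduce R's cor.test or the ppcor package; the p-value step (a t distribution with n – 2 degrees of freedom) is omitted, and the example data are invented.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def t_statistic(r, n):
    """t statistic for a correlation r from n pairs; assumes |r| < 1."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Invented example: mean enhancer score versus number of active cell types.
scores = [0.05, 0.07, 0.06, 0.12, 0.20, 0.15]
n_cell_types = [1, 1, 2, 5, 12, 8]
r = spearman(scores, n_cell_types)
print(r, t_statistic(r, len(scores)))
```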

Code availability

The LINSIGHT code is available at https://github.com/CshlSiepelLab/LINSIGHT.

Data availability

The training data and pre-computed LINSIGHT scores are available at http://compgen.cshl.edu/~yihuang/LINSIGHT/.

58 in total

1. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits.

Authors: Lucia A Hindorff; Praveen Sethupathy; Heather A Junkins; Erin M Ramos; Jayashri P Mehta; Francis S Collins; Teri A Manolio
Journal: Proc Natl Acad Sci U S A Date: 2009-05-27 Impact factor: 11.205

2. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project.

Authors: Ewan Birney; John A Stamatoyannopoulos; Anindya Dutta; Roderic Guigó; Thomas R Gingeras; Elliott H Margulies; Zhiping Weng; Michael Snyder; Emmanouil T Dermitzakis; Robert E Thurman; Michael S Kuehn; Christopher M Taylor; Shane Neph; Christoph M Koch; Saurabh Asthana; Ankit Malhotra; Ivan Adzhubei; Jason A Greenbaum; Robert M Andrews; Paul Flicek; Patrick J Boyle; Hua Cao; Nigel P Carter; Gayle K Clelland; Sean Davis; Nathan Day; Pawandeep Dhami; Shane C Dillon; Michael O Dorschner; Heike Fiegler; Paul G Giresi; Jeff Goldy; Michael Hawrylycz; Andrew Haydock; Richard Humbert; Keith D James; Brett E Johnson; Ericka M Johnson; Tristan T Frum; Elizabeth R Rosenzweig; Neerja Karnani; Kirsten Lee; Gregory C Lefebvre; Patrick A Navas; Fidencio Neri; Stephen C J Parker; Peter J Sabo; Richard Sandstrom; Anthony Shafer; David Vetrie; Molly Weaver; Sarah Wilcox; Man Yu; Francis S Collins; Job Dekker; Jason D Lieb; Thomas D Tullius; Gregory E Crawford; Shamil Sunyaev; William S Noble; Ian Dunham; France Denoeud; Alexandre Reymond; Philipp Kapranov; Joel Rozowsky; Deyou Zheng; Robert Castelo; Adam Frankish; Jennifer Harrow; Srinka Ghosh; Albin Sandelin; Ivo L Hofacker; Robert Baertsch; Damian Keefe; Sujit Dike; Jill Cheng; Heather A Hirsch; Edward A Sekinger; Julien Lagarde; Josep F Abril; Atif Shahab; Christoph Flamm; Claudia Fried; Jörg Hackermüller; Jana Hertel; Manja Lindemeyer; Kristin Missal; Andrea Tanzer; Stefan Washietl; Jan Korbel; Olof Emanuelsson; Jakob S Pedersen; Nancy Holroyd; Ruth Taylor; David Swarbreck; Nicholas Matthews; Mark C Dickson; Daryl J Thomas; Matthew T Weirauch; James Gilbert; Jorg Drenkow; Ian Bell; XiaoDong Zhao; K G Srinivasan; Wing-Kin Sung; Hong Sain Ooi; Kuo Ping Chiu; Sylvain Foissac; Tyler Alioto; Michael Brent; Lior Pachter; Michael L Tress; Alfonso Valencia; Siew Woh Choo; Chiou Yu Choo; Catherine Ucla; Caroline Manzano; Carine Wyss; Evelyn Cheung; Taane G Clark; James B Brown; Madhavan Ganesh; Sandeep Patel; Hari Tammana; 
Jacqueline Chrast; Charlotte N Henrichsen; Chikatoshi Kai; Jun Kawai; Ugrappa Nagalakshmi; Jiaqian Wu; Zheng Lian; Jin Lian; Peter Newburger; Xueqing Zhang; Peter Bickel; John S Mattick; Piero Carninci; Yoshihide Hayashizaki; Sherman Weissman; Tim Hubbard; Richard M Myers; Jane Rogers; Peter F Stadler; Todd M Lowe; Chia-Lin Wei; Yijun Ruan; Kevin Struhl; Mark Gerstein; Stylianos E Antonarakis; Yutao Fu; Eric D Green; Ulaş Karaöz; Adam Siepel; James Taylor; Laura A Liefer; Kris A Wetterstrand; Peter J Good; Elise A Feingold; Mark S Guyer; Gregory M Cooper; George Asimenos; Colin N Dewey; Minmei Hou; Sergey Nikolaev; Juan I Montoya-Burgos; Ari Löytynoja; Simon Whelan; Fabio Pardi; Tim Massingham; Haiyan Huang; Nancy R Zhang; Ian Holmes; James C Mullikin; Abel Ureta-Vidal; Benedict Paten; Michael Seringhaus; Deanna Church; Kate Rosenbloom; W James Kent; Eric A Stone; Serafim Batzoglou; Nick Goldman; Ross C Hardison; David Haussler; Webb Miller; Arend Sidow; Nathan D Trinklein; Zhengdong D Zhang; Leah Barrera; Rhona Stuart; David C King; Adam Ameur; Stefan Enroth; Mark C Bieda; Jonghwan Kim; Akshay A Bhinge; Nan Jiang; Jun Liu; Fei Yao; Vinsensius B Vega; Charlie W H Lee; Patrick Ng; Atif Shahab; Annie Yang; Zarmik Moqtaderi; Zhou Zhu; Xiaoqin Xu; Sharon Squazzo; Matthew J Oberley; David Inman; Michael A Singer; Todd A Richmond; Kyle J Munn; Alvaro Rada-Iglesias; Ola Wallerman; Jan Komorowski; Joanna C Fowler; Phillippe Couttet; Alexander W Bruce; Oliver M Dovey; Peter D Ellis; Cordelia F Langford; David A Nix; Ghia Euskirchen; Stephen Hartman; Alexander E Urban; Peter Kraus; Sara Van Calcar; Nate Heintzman; Tae Hoon Kim; Kun Wang; Chunxu Qu; Gary Hon; Rosa Luna; Christopher K Glass; M Geoff Rosenfeld; Shelley Force Aldred; Sara J Cooper; Anason Halees; Jane M Lin; Hennady P Shulha; Xiaoling Zhang; Mousheng Xu; Jaafar N S Haidar; Yong Yu; Yijun Ruan; Vishwanath R Iyer; Roland D Green; Claes Wadelius; Peggy J Farnham; Bing Ren; Rachel A Harte; Angie S Hinrichs; Heather 
Trumbower; Hiram Clawson; Jennifer Hillman-Jackson; Ann S Zweig; Kayla Smith; Archana Thakkapallayil; Galt Barber; Robert M Kuhn; Donna Karolchik; Lluis Armengol; Christine P Bird; Paul I W de Bakker; Andrew D Kern; Nuria Lopez-Bigas; Joel D Martin; Barbara E Stranger; Abigail Woodroffe; Eugene Davydov; Antigone Dimas; Eduardo Eyras; Ingileif B Hallgrímsdóttir; Julian Huppert; Michael C Zody; Gonçalo R Abecasis; Xavier Estivill; Gerard G Bouffard; Xiaobin Guan; Nancy F Hansen; Jacquelyn R Idol; Valerie V B Maduro; Baishali Maskeri; Jennifer C McDowell; Morgan Park; Pamela J Thomas; Alice C Young; Robert W Blakesley; Donna M Muzny; Erica Sodergren; David A Wheeler; Kim C Worley; Huaiyang Jiang; George M Weinstock; Richard A Gibbs; Tina Graves; Robert Fulton; Elaine R Mardis; Richard K Wilson; Michele Clamp; James Cuff; Sante Gnerre; David B Jaffe; Jean L Chang; Kerstin Lindblad-Toh; Eric S Lander; Maxim Koriabine; Mikhail Nefedov; Kazutoyo Osoegawa; Yuko Yoshinaga; Baoli Zhu; Pieter J de Jong
Journal: Nature Date: 2007-06-14 Impact factor: 49.962

3. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping.

Authors: Suhas S P Rao; Miriam H Huntley; Neva C Durand; Elena K Stamenova; Ivan D Bochkov; James T Robinson; Adrian L Sanborn; Ido Machol; Arina D Omer; Eric S Lander; Erez Lieberman Aiden
Journal: Cell Date: 2014-12-11 Impact factor: 41.582

4. Krüppel Expression Levels Are Maintained through Compensatory Evolution of Shadow Enhancers.

Authors: Zeba Wunderlich; Meghan D J Bragdon; Ben J Vincent; Jonathan A White; Javier Estrada; Angela H DePace
Journal: Cell Rep Date: 2015-09-03 Impact factor: 9.423

5. An atlas of active enhancers across human cell types and tissues.

Authors: Robin Andersson; Claudia Gebhard; Michael Rehli; Albin Sandelin; Irene Miguel-Escalada; Ilka Hoof; Jette Bornholdt; Mette Boyd; Yun Chen; Xiaobei Zhao; Christian Schmidl; Takahiro Suzuki; Evgenia Ntini; Erik Arner; Eivind Valen; Kang Li; Lucia Schwarzfischer; Dagmar Glatz; Johanna Raithel; Berit Lilje; Nicolas Rapin; Frederik Otzen Bagger; Mette Jørgensen; Peter Refsing Andersen; Nicolas Bertin; Owen Rackham; A Maxwell Burroughs; J Kenneth Baillie; Yuri Ishizu; Yuri Shimizu; Erina Furuhata; Shiori Maeda; Yutaka Negishi; Christopher J Mungall; Terrence F Meehan; Timo Lassmann; Masayoshi Itoh; Hideya Kawaji; Naoto Kondo; Jun Kawai; Andreas Lennartsson; Carsten O Daub; Peter Heutink; David A Hume; Torben Heick Jensen; Harukazu Suzuki; Yoshihide Hayashizaki; Ferenc Müller; Alistair R R Forrest; Piero Carninci
Journal: Nature Date: 2014-03-27 Impact factor: 49.962

6. Identifying a high fraction of the human genome to be under selective constraint using GERP++.

Authors: Eugene V Davydov; David L Goode; Marina Sirota; Gregory M Cooper; Arend Sidow; Serafim Batzoglou
Journal: PLoS Comput Biol Date: 2010-12-02 Impact factor: 4.475

7. The ensembl regulatory build.

Authors: Daniel R Zerbino; Steven P Wilder; Nathan Johnson; Thomas Juettemann; Paul R Flicek
Journal: Genome Biol Date: 2015-03-24 Impact factor: 13.583

Review 8. Deciphering death: a commentary on Gompertz (1825) 'On the nature of the function expressive of the law of human mortality, and on a new mode of determining the value of life contingencies'.

Authors: Thomas B L Kirkwood
Journal: Philos Trans R Soc Lond B Biol Sci Date: 2015-04-19 Impact factor: 6.237

9. A spectral approach integrating functional genomic annotations for coding and noncoding variants.

Authors: Iuliana Ionita-Laza; Kenneth McCallum; Bin Xu; Joseph D Buxbaum
Journal: Nat Genet Date: 2016-01-04 Impact factor: 38.330

10. Selective constraints in experimentally defined primate regulatory regions.

Authors: Daniel J Gaffney; Ran Blekhman; Jacek Majewski
Journal: PLoS Genet Date: 2008-08-15 Impact factor: 5.917
