Yao Yao1,2, Zheng Liu1,2, Qi Wei1,2, Stephen A Ramsey3,4.
Abstract
BACKGROUND: We previously reported on CERENKOV, an approach for identifying regulatory single nucleotide polymorphisms (rSNPs) that is based on 246 annotation features. CERENKOV uses the xgboost classifier and is designed to be used to find causal noncoding SNPs in loci identified by genome-wide association studies (GWAS). We reported that CERENKOV has state-of-the-art performance (by two traditional measures and a novel GWAS-oriented measure, AVGRANK) in a comparison to nine other tools for identifying functional noncoding SNPs, using a comprehensive reference SNP set (OSU17, 15,331 SNPs). Given that SNPs are grouped within loci in the reference SNP set and given the importance of the data-space manifold geometry for machine-learning model selection, we hypothesized that within-locus inter-SNP distances would have class-based distributional biases that could be exploited to improve rSNP recognition accuracy. We thus defined an intralocus SNP "radius" as the average data-space distance from a SNP to the other intralocus neighbors, and explored radius likelihoods for five distance measures.
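The radius definition above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `pairwise_distances` and `intralocus_radii` are hypothetical, the cosine and Pearson distances are assumed to be one minus the similarity or correlation, and feature scaling is left to the caller.

```python
import numpy as np

def pairwise_distances(X, metric):
    """Return the n x n distance matrix between the rows of X."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    if metric == "euclidean":
        return np.sqrt((diff ** 2).sum(axis=2))
    if metric == "manhattan":
        return np.abs(diff).sum(axis=2)
    if metric == "canberra":
        denom = np.abs(X)[:, None, :] + np.abs(X)[None, :, :]
        # Coordinates where both values are zero contribute 0 to the sum.
        terms = np.divide(np.abs(diff), denom,
                          out=np.zeros_like(diff), where=denom > 0)
        return terms.sum(axis=2)
    if metric in ("cosine", "pearson"):
        # Assumption: distance = 1 - similarity; Pearson centers each row first.
        # Assumes no all-zero (cosine) or constant (Pearson) rows.
        Z = X - X.mean(axis=1, keepdims=True) if metric == "pearson" else X
        norms = np.linalg.norm(Z, axis=1)
        return 1.0 - (Z @ Z.T) / np.outer(norms, norms)
    raise ValueError(f"unknown metric: {metric}")

def intralocus_radii(X, locus_ids, metric="euclidean"):
    """Average distance from each SNP to the other SNPs in its locus.

    SNPs that are alone in their locus have no neighbors and get NaN.
    """
    X = np.asarray(X, dtype=float)
    locus_ids = np.asarray(locus_ids)
    radii = np.full(len(X), np.nan)
    for locus in np.unique(locus_ids):
        idx = np.flatnonzero(locus_ids == locus)
        if len(idx) < 2:
            continue
        D = pairwise_distances(X[idx], metric)
        np.fill_diagonal(D, 0.0)  # remove floating-point residue on the diagonal
        # Divide by n - 1 so the SNP itself is excluded from the average.
        radii[idx] = D.sum(axis=1) / (len(idx) - 1)
    return radii
```

Calling `intralocus_radii(X, locus_ids, "canberra")` on a feature matrix `X` with one row per SNP returns one radius per SNP. Repeating the call for each metric, on scaled and unscaled copies of `X`, would give the ten radius variants that the figure captions refer to.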
Keywords: Data space; GWAS; Machine learning; SNP; noncoding; rSNP
Year: 2019 PMID: 30727967 PMCID: PMC6364436 DOI: 10.1186/s12859-019-2637-4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1. The geometric idea behind the intralocus distance features that are used in CERENKOV2. Top panel, SNPs from the same locus form a data-space “cloud.” Triangles and circles, SNPs; black lines, distances between a central SNP and the other SNPs within the locus. Bottom panel, SNPs shown in their chromosomal context
Fig. 2. Distributions of intralocus radii computed using five different distance measures (Canberra, Euclidean, Manhattan, cosine, and Pearson) applied to scaled and unscaled feature data, conditioned on the type of reference SNP (rSNP or cSNP) for the intralocus radius calculation. Results shown are for all OSU18 SNPs (see “The OSU18 reference SNP set” section). Significant differences in the rSNP likelihoods vs. cSNP likelihoods are evident for Canberra, Canberra (scaled), Euclidean (scaled), Manhattan (scaled), cosine, and Pearson methods for computing intralocus radii. Modest differences in rSNP vs. cSNP likelihoods were evident for the cases of Euclidean and Manhattan methods for computing intralocus radii
Fig. 3. Empirically estimated log-likelihood ratios (rSNP/cSNP) based on intralocus radii computed using ten methods. Results shown are for all OSU18 SNPs (see “The OSU18 reference SNP set” section). LLR, log-likelihood ratio (natural logarithm); ln, natural logarithm
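As a rough illustration of how a log-likelihood ratio feature could be derived from intralocus radii, the sketch below estimates the two class-conditional likelihoods with shared histogram bins and takes the natural log of their ratio. The histogram estimator, the pseudocount, and the names `fit_llr` and `apply_llr` are assumptions made for this example; the source does not state which density estimator was used.

```python
import numpy as np

def fit_llr(radii, is_rsnp, n_bins=20, pseudocount=1.0):
    """Estimate ln P(radius | rSNP) - ln P(radius | cSNP) per histogram bin.

    Should be fitted on training SNPs only; NaN radii must be dropped first.
    """
    radii = np.asarray(radii, dtype=float)
    is_rsnp = np.asarray(is_rsnp, dtype=bool)
    # Shared bin edges so the two class histograms are comparable.
    edges = np.histogram_bin_edges(radii, bins=n_bins)
    r_counts, _ = np.histogram(radii[is_rsnp], bins=edges)
    c_counts, _ = np.histogram(radii[~is_rsnp], bins=edges)
    # The pseudocount keeps empty bins from producing infinite ratios.
    p_r = (r_counts + pseudocount) / (r_counts.sum() + pseudocount * n_bins)
    p_c = (c_counts + pseudocount) / (c_counts.sum() + pseudocount * n_bins)
    return edges, np.log(p_r) - np.log(p_c)

def apply_llr(radii, edges, llr):
    """Look up the fitted LLR for each radius.

    Radii outside the fitted range fall into the first or last bin.
    """
    radii = np.asarray(radii, dtype=float)
    bins = np.searchsorted(edges[1:-1], radii, side="right")
    return llr[bins]
```

Fitting with `fit_llr` on the training fold and then calling `apply_llr` on both the training and validation radii keeps label information from the validation SNPs out of the feature, in line with the "fitted using training data only" note in the Fig. 4 caption.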
Fig. 4. Performance of GWAVA, CERENKOV and CERENKOV2 on the OSU18 reference SNP set, by three performance measures. Marks, sample arithmetic mean of validation-set performance; bars, estimated 95% confidence intervals (see “Gradient boosted decision trees” section); GWAVA, based on GWAVA’s Random Forest model with 174 features [24]; CERENKOV, our previous model with the base 248-column feature matrix; CERENKOV2, our current model consisting of the base feature matrix plus ten log-likelihood features derived from intralocus radii and fitted using training data only; AUPVR, area under the precision-vs-recall curve (higher is better); AUROC, area under the receiver operating characteristic curve (higher is better); AVGRANK, intralocus average score rank (lower is better [22])
Fig. 5. Gini and permutation importance values of 258 features in 14 categories (colored marks). Feature category labels as follows: “LLR”, log-likelihood ratio (the new data-space geometric features); “repliseq”, replication timing; “geneannot”, gene-model annotation-based; “epigenome”, epigenomic segmentation [67, 68]; “featdist”, SNP location-related; “chrom”, the chromosome; “eigen”, based on the Eigen [21] score; “phylogenetic”, phylogenetic interspecies local sequence conservation [6, 80, 81]; “allelism”, allele and MAF-related; “DHS”, DNase I hypersensitive site; “DNAContent”, local nucleotide frequencies; “eQTL”, expression quantitative trait locus [75]; “repeats”, genomic repeat annotation; “TFBS”, transcription factor binding site (see “Extracting the nongeometric features” section and Ref. [22] for details)
The 248 SNP features used in CERENKOV

| Feature name | Type | Source | Description |
|---|---|---|---|
| normChromCoord | continuous | UCSC | the SNP coordinate (normalized to chrom. length) |
| majorAlleleFreq | continuous | UCSC/1KG | the major allele frequency (1KG) |
| minorAlleleFreq | continuous | UCSC/1KG | the next-to-major allele frequency (1KG) |
| phastCons | continuous | UCSC | 46-way placental mammal phastCons score [ |
| GERP ++ | continuous | UCSC | bp-level GERP ++ [ |
| avg_GERP | continuous | UCSC | avg. GERP score [ |
| avg_daf | continuous | 1KG | average derived allele frequency in ±1 kbp region |
| avg_het | continuous | 1KG | average heterozygosity rate in ±1 kbp region |
| maf1kb | continuous | UCSC/1KG | average of the MAF values for all SNPs in ±1 kbp window |
| eqtlPvalue | continuous | GTEx | -log10 min( |
| GC5Content | integer (0-5) | UCSC | GC content in a 5 bp window |
| GC7Content | integer (0-7) | UCSC | GC content in a 7 bp window |
| GC11Content | integer (0-11) | UCSC | GC content in an 11 bp window |
| local_purine | integer (0-11) | UCSC | number of purine bases in local 11 bp window |
| local_CpG | integer (0-10) | UCSC | number of CpG dinucleotides in 11 bp window |
| ss_dist | integer | UCSC | signed distance to nearest exon boundary |
| tssDistance | integer | Ensembl75 | signed distance to nearest Ensembl TSS |
| gencode_tss | integer | GENCODE | signed distance to nearest GENCODE TSS |
| tfCount | integer | UCSC | sqrt(count) of ENCODE ChIP-seq TFBS overlap. SNP |
| uniformDhsScore | integer | UCSC | sum scores of ENCODE uniform DHS peaks overlap. SNP |
| uniformDhsCount | integer | UCSC | count of ENCODE uniform DHS peaks overlap. SNP |
| masterDhsScore | integer | UCSC | sum scores of ENCODE master DHS peaks overlap. SNP |
| masterDhsCount | integer | UCSC | count of ENCODE master DHS peaks overlap. SNP |
| chrom | categorical (23) | UCSC | the chromosome to which the SNP maps |
| nestedrepeat | categorical (2) | UCSC | SNP is in a RepeatMasker [ |
| simplerepeat | categorical (2) | UCSC | SNP is in a Tandem Repeats Finder [ |
| cpg_island | categorical (2) | UCSC | SNP is in an epigenome-predicted CpG island [ |
| geneannot | categorical (4) | UCSC | classifies SNP location as CDS, intergenic, UTR, or intron |
| majorAllele | categorical (4) | UCSC/1KG | the major allele for the SNP |
| minorAllele | categorical (4) | UCSC/1KG | the next-to-major allele for the SNP |
| pwm | categorical (22) | Ensembl75 | ID of the Jaspar 2014 [ |
| chromhmm | 6 ×categ. (26) | UCSC | ChromHMM label in Gm12878, H1hesc, HeLaS3, HepG2, HUVEC and K562 cells |
| segway | 6 ×categ. (26) | UCSC | Segway label in Gm12878, H1hesc, HeLaS3, HepG2, HUVEC and K562 cells |
| ch_comb_WEAKENH | categorical (4) | Ensembl75 | ChromHMM label in Ensembl Reg. Seg. build |
| ch_comb_ENH | categorical (6) | Ensembl75 | ChromHMM label in Ensembl Reg. Seg. build |
| ch_comb_REP | categorical (7) | Ensembl75 | ChromHMM label in Ensembl Reg. Seg. build |
| ch_comb_TSSFLANK | categorical (5) | Ensembl75 | ChromHMM label in Ensembl Reg. Seg. build |
| ch_comb_TRAN | categorical (7) | Ensembl75 | ChromHMM label in Ensembl Reg. Seg. build |
| ch_comb_TSS | categorical (7) | Ensembl75 | ChromHMM label in Ensembl Reg. Seg. build |
| ch_comb_CTCFREG | categorical (7) | Ensembl75 | ChromHMM label in Ensembl Reg. Seg. build |
| ENCODE_TFBS | 160 ×categ. (2) | UCSC | 160 features for SNP being in an ENCODE TFBS [ |
| FsuRepliSeq | 16 ×continuous | UCSC | Replication Timing by Repli-chip [ |
| UwRepliSeq | 16 ×continuous | UCSC | Replication Timing by Repli-seq [ |
| SangerTfbsSummary50kb | continuous | Ensembl75 | Summary of Ensembl TFBS peaks from 18 human cell types |
| NkiLad | categorical (2) | UCSC | SNP is in a Lamina Associated Domain (NKI study [ |
| vistaEnhancerCnt | categorical (2) | UCSC | count of VISTA [ |
| vistaEnhancerTotalScore | categorical (2) | UCSC | sum scores of VISTA [ |
| eigen | continuous (2) | Eigen | Eigen & Eigen-PC v1.1 raw score [ |
Abbreviations are as follows: UCSC, UC Santa Cruz Genome Browser portal; 1KG, 1,000 Genomes Project; Ensembl75, Ensembl Release 75 [82]; GENCODE, the GENCODE project release 19 [83]; ENCODE, Encyclopedia of DNA Elements [30]; FSU, Florida State University; UW, University of Washington; NKI, Netherlands Cancer Institute; GTEx, the genotype tissue-expression project; GERP, the Genomic Evolutionary Rate Profiling score; CDS, coding DNA sequence; UTR, untranslated region; MAF, minor allele frequency; HMR, human-mouse-rat; TSS, transcription start site