Ajay Anand Kumar1,2, Lut Van Laer1, Maaike Alaerts1, Amin Ardeshirdavani3,4, Yves Moreau3,4, Kris Laukens2,5, Bart Loeys1, Geert Vandeweyer1,2.
Abstract
Motivation: Computational gene prioritization can aid in disease gene identification. Here, we propose pBRIT (prioritization using Bayesian Ridge regression and Information Theoretic model), a novel adaptive and scalable prioritization tool, integrating PubMed abstracts, Gene Ontology, Sequence similarities, Mammalian and Human Phenotype Ontology, Pathway, Interactions, Disease Ontology, Gene Association database and Human Genome Epidemiology database, into the prediction model. We explore and address effects of sparsity and inter-feature dependencies within annotation sources, and the impact of bias towards specific annotations.
Year: 2018 PMID: 29452392 PMCID: PMC6022555 DOI: 10.1093/bioinformatics/bty079
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Schematic workflow of pBRIT. (A) Categorization of annotation sources as functional or phenotypic. (B) Gene-by-gene proximity profile computation using TF-IDF and TF-IDF→SVD, followed by intermediate data fusion. (C) Bayesian ridge regression based candidate gene prioritization.
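The proximity step in panel (B) can be sketched as follows. This is a minimal illustration and not the pBRIT implementation: the toy count matrix, the smoothed IDF formula, the choice of cosine similarity as the proximity measure and the function names (tfidf, svd_reduce, cosine_similarity) are all assumptions made for the example.

```python
import numpy as np

def tfidf(counts):
    """Weight a gene-by-term count matrix (rows = genes) with TF-IDF."""
    counts = np.asarray(counts, dtype=float)
    row_sums = counts.sum(axis=1, keepdims=True)
    # Term frequency: counts normalized per gene; all-zero rows stay zero.
    tf = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
    # Smoothed inverse document frequency (an assumed variant).
    doc_freq = (counts > 0).sum(axis=0)
    idf = np.log((1 + counts.shape[0]) / (1 + doc_freq)) + 1.0
    return tf * idf

def svd_reduce(matrix, k):
    """Project genes onto the top-k singular directions (truncated SVD)."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]

def cosine_similarity(features):
    """Gene-by-gene cosine similarity between feature rows."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = np.divide(features, norms, out=np.zeros_like(features), where=norms > 0)
    return unit @ unit.T

# Hypothetical toy data: 3 genes annotated with 4 terms.
counts = np.array([[2, 1, 0, 0],
                   [1, 2, 0, 0],
                   [0, 0, 3, 1]])
weights = tfidf(counts)
sim_tfidf = cosine_similarity(weights)                    # TF-IDF profile
sim_svd = cosine_similarity(svd_reduce(weights, k=2))     # TF-IDF -> SVD profile
```

In this toy example the first two genes share annotation terms and the third shares none, so both profiles place the first two genes close together and the third far from them.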
Fig. 2. Bayesian ridge regression. The design matrix (X) contains similarity scores of both training and test genes to training genes. The phenotypic concordance score vector is indicated by Y. For leave-one-out cross-validation (LOOCV), the summed phenotypic score of the nth query gene (A. Test.N.Na) or all test genes (B. Test.ALL.NA), corresponding to prior phenotypic knowledge, is removed (colored box) during regression parameter estimation. (Color version of this figure is available at Bioinformatics online.)
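The regression setup in this caption can be illustrated with a small sketch. It assumes a Gaussian prior on the weights with fixed precision hyperparameters alpha (prior) and beta (noise), whereas a full Bayesian ridge model would normally estimate them from the data. The toy similarity values, the score vector and the function names are hypothetical and are not taken from pBRIT.

```python
import numpy as np

def bayesian_ridge_posterior(X, y, alpha=1.0, beta=1.0):
    """Posterior mean and covariance of the weights for a Gaussian prior
    N(0, I / alpha) and Gaussian noise with precision beta (both fixed here)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    precision = alpha * np.eye(X.shape[1]) + beta * X.T @ X
    cov = np.linalg.inv(precision)
    mean = beta * cov @ X.T @ y
    return mean, cov

def predict(X_new, mean, cov, beta=1.0):
    """Predictive mean and variance for new rows of the design matrix."""
    X_new = np.asarray(X_new, dtype=float)
    pred_mean = X_new @ mean
    pred_var = 1.0 / beta + np.einsum("ij,jk,ik->i", X_new, cov, X_new)
    return pred_mean, pred_var

# Hypothetical design matrix: similarity of each gene to the 3 training genes.
X_train = np.array([[1.0, 0.8, 0.1],
                    [0.8, 1.0, 0.2],
                    [0.1, 0.2, 1.0]])
y_train = np.array([0.9, 0.8, 0.1])   # assumed phenotypic concordance scores
X_test = np.array([[0.7, 0.9, 0.1],   # test gene resembling training genes 1 and 2
                   [0.1, 0.1, 0.9]])  # test gene resembling training gene 3

w_mean, w_cov = bayesian_ridge_posterior(X_train, y_train)
scores, variances = predict(X_test, w_mean, w_cov)
ranking = np.argsort(-scores)         # candidate indices, best first
```

Here the test rows are scored from their similarity to the training genes only, which loosely mirrors the setting in which the phenotypic scores of the test genes are withheld during parameter estimation.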
Fig. 3. ROC plot of pBRIT benchmark performance. (A) 779 UMLS-coded disease classes obtained from DisGeNET and (B) 2025 time-stamped HPO terms. The four vertical lines indicate the top 1%, top 10%, top 20% and top 30% of query genes which were prioritized.
Fig. 4. Impact of training set size. Main: Mean rank ratio (MRR) versus number of training genes. Incorporation of test gene phenotypic information (N.NA) in the regression model results in a low and stable MRR, irrespective of feature extraction methodology. Without phenotypic information (All.Na), MRR decreases with increasing number of training genes. Inset: Distribution of training sizes per disease class.
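The mean rank ratio plotted here can be illustrated as below. The caption does not define the metric, so the sketch assumes a common convention: the rank of the held-out gene among all candidates divided by the number of candidates, averaged over validation runs (lower is better). The gene labels and scores are hypothetical.

```python
def rank_ratio(scores, target):
    """Rank of `target` (1 = best) divided by the number of candidates."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return (ranked.index(target) + 1) / len(ranked)

def mean_rank_ratio(runs):
    """Average rank ratio over (scores, held-out gene) validation runs."""
    ratios = [rank_ratio(scores, target) for scores, target in runs]
    return sum(ratios) / len(ratios)

# Two hypothetical validation runs with four candidates each.
runs = [
    ({"GENE_A": 0.9, "GENE_B": 0.5, "GENE_C": 0.1, "GENE_D": 0.0}, "GENE_A"),
    ({"GENE_A": 0.7, "GENE_B": 0.6, "GENE_C": 0.2, "GENE_D": 0.1}, "GENE_B"),
]
mrr = mean_rank_ratio(runs)
```

In the first run the held-out gene is ranked first of four and in the second run second of four, so the two rank ratios are 0.25 and 0.5.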
Fig. 5. Exploring prioritization results using heatmap plots. (A) The functional annotation matrix X, illustrating the contribution of individual training genes (Y-axis) during regression, using the full TF-IDF->SVD_Test.Pheno.Include model. Darker shades indicate higher contributions. In this example, the gene to be prioritized was KCNA2. (B) Contribution of individual annotation sources for each training gene to the ranking of KCNA2.