Warren A Cheung, BF Francis Ouellette, Wyeth W Wasserman.
Abstract
BACKGROUND: MEDLINE®/PubMed® currently indexes over 18 million biomedical articles, providing unprecedented opportunities and challenges for text analysis. Using Medical Subject Heading Over-representation Profiles (MeSHOPs), an entity of interest can be robustly summarized, quantitatively identifying associated biomedical terms and predicting novel indirect associations.
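As an illustration of what one entry of an over-representation profile might look like, the sketch below scores a single MeSH term for a single entity with a hypergeometric upper-tail P-value, the statistic named in the table notes of this record. The function names, argument names and toy counts are hypothetical, and no multiple-testing correction is applied.

```python
from math import exp, lgamma


def log_choose(n: int, k: int) -> float:
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def overrepresentation_p(total_articles: int, term_articles: int,
                         entity_articles: int, both: int) -> float:
    """P(X >= both) when entity_articles are drawn without replacement from
    total_articles, of which term_articles are annotated with the MeSH term."""
    others = total_articles - term_articles
    lo = max(both, entity_articles - others)   # smallest feasible overlap in the tail
    hi = min(term_articles, entity_articles)   # largest feasible overlap
    log_denom = log_choose(total_articles, entity_articles)
    tail = 0.0
    for x in range(lo, hi + 1):
        tail += exp(log_choose(term_articles, x)
                    + log_choose(others, entity_articles - x)
                    - log_denom)
    return min(1.0, tail)


# Hypothetical toy corpus: 1,000 articles, 50 carry the term;
# the entity has 20 articles and 8 of them carry the term.
p = overrepresentation_p(1000, 50, 20, 8)
print(f"{p:.3e}")
```

Working in log space keeps the binomial coefficients finite for corpus sizes in the millions, although the loop over the tail would still benefit from a library routine at that scale.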
Year: 2012 PMID: 23021552 PMCID: PMC3580445 DOI: 10.1186/gm376
Source DB: PubMed Journal: Genome Med ISSN: 1756-994X Impact factor: 11.117
Figure 1. Comparing gene and disease MeSHOPs. A graphical representation of the comparison of the MeSHOPs for the human gene PAX6 and the disease aniridia. The most strongly associated terms for each profile are presented as a word cloud, with the size of each term scaled by its degree of association. Blue lines link terms shared between the profiles; the similarity scores quantify the difference between the profiles by comparing all shared terms.
Figure 2. Comparison of performance of gene characteristics. ROC curves are shown comparing predictive gene characteristics. Characteristics are computed from a 2007 Entrez Gene dataset and the MEDLINE® Baseline 2007, predicting against all new disease terms associated with gene MeSHOPs between February 2007 and April 2010.
Performance of gene characteristics at predicting association with disease
| GeneRIF | |||||
|---|---|---|---|---|---|---|
| Scoring method | Validation (02/2007-01/2009) | Validation (02/2007-04/2010) | CTD validation (11/2008) | Validation (02/2007-01/2009) | Validation (02/2007-04/2010) | CTD validation (11/2008) |
| Percentage GC content | 0.50 | 0.50 | 0.51 | 0.50 | 0.50 | 0.51 |
| Number of transcripts | 0.53 | 0.53 | 0.55 | 0.51 | 0.51 | 0.53 |
| Transcript length | 0.51 | 0.52 | 0.50 | 0.52 | 0.52 | 0.53 |
| Genomic length | 0.52 | 0.52 | 0.50 | 0.51 | 0.51 | 0.52 |
| Gene ID | 0.73 | 0.71 | 0.78 | 0.64 | 0.63 | 0.69 |
Characteristics were compared against the 02/2007-11/2008 validation sets using gene2pubmed and GeneRIF gene references, as well as the 11/2008 Comparative Toxicogenomics Database (CTD) validation set. Gene characteristics were extracted from EnsEMBL. We compare the performance of these characteristics at predicting new gene-disease relationships in our validation sets (for the genes with mapped characteristics).
Comparison of the performance of Entrez Gene ID to gene-related literature measures in MEDLINE®
| GeneRIF | |||||
|---|---|---|---|---|---|---|
| Feature | Validation AUC (02/2007-01/2009) | Validation AUC (02/2007-04/2010) | CTD validation (11/2008) | Validation AUC (02/2007-01/2008) | Validation AUC (02/2007-04/2010) | CTD validation (11/2008) |
| Number of MeSH terms | 0.74 | 0.73 | 0.81 | 0.80 | 0.85 | 0.82 |
| Number of publications | 0.75 | 0.73 | 0.80 | 0.80 | 0.85 | 0.82 |
| Oldest publication (year) | 0.67 | 0.66 | 0.73 | 0.73 | 0.76 | 0.73 |
| Gene ID | 0.64 | 0.64 | 0.66 | 0.69 | 0.75 | 0.73 |
The oldest publication for a gene has comparable performance to Entrez Gene ID, as measured by the AUC; however, the number of publications for a gene proves to be even more predictive than the Entrez Gene ID.
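For readers unfamiliar with the AUC values in these tables, the sketch below computes the area under the ROC curve through its rank interpretation: the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counted as half. The function name and the publication counts are hypothetical and are not taken from the study.

```python
def auc(positive_scores, negative_scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical publication counts for genes that later gained a disease link
# (positives) and for genes that did not (negatives).
later_linked = [120, 45, 300, 8]
not_linked = [10, 3, 45, 1]
print(auc(later_linked, not_linked))  # 0.84375
```

Under this reading, a characteristic with an AUC near 0.50, such as percentage GC content above, ranks positives no better than chance.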
Explanation of the scoring functions evaluated
| Scoring method | Description |
|---|---|
| Cosine distance of term frequency-inverse document frequency | |
| Cosine distance of | |
| Cosine distance of term fractions | |
| Sum of the log of combined | |
| Sum of the differences of log | |
| L2 of log-p of overlapping terms only | |
| L2 of term fractions of overlapping terms only | |
| L2 of log of | |
| L2 of | |
| L2 of term fractions | |
| L2 of term frequency | |
| Term coverage | |
| Term overlap | |
| Number of gene MeSH terms | |
| Number of disease MeSH terms | |
| Gene ID | Entrez Gene ID of the gene |
M refers to the set of all MeSH terms, and G and D to the sets of MeSH terms in the gene and disease profiles, respectively. For each MeSH term i, the gene profile records a frequency, a term fraction, a hypergeometric P-value and a term frequency-inverse document frequency weight (written as variants of g(i)); the disease profile records the same four quantities (written as variants of d(i)).
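The formulas for the scoring functions are not reproduced in this record, so the sketch below shows only one plausible reading of three of the names: a cosine distance between sparse term-weight profiles, an L2 distance between log P-values restricted to shared terms, and a simple term overlap count. The function names, the dictionary representation and the `floor` parameter are assumptions; the P-values in the usage example are taken from the PAX6 and aniridia table in this record.

```python
from math import log, sqrt


def cosine_distance(gene, disease):
    """1 - cosine similarity of two sparse term -> weight profiles."""
    shared = gene.keys() & disease.keys()
    dot = sum(gene[t] * disease[t] for t in shared)
    norm_g = sqrt(sum(v * v for v in gene.values()))
    norm_d = sqrt(sum(v * v for v in disease.values()))
    if norm_g == 0.0 or norm_d == 0.0:
        return 1.0  # convention chosen here for empty profiles
    return 1.0 - dot / (norm_g * norm_d)


def l2_log_p_overlap(gene_p, disease_p, floor=1e-300):
    """Euclidean distance between log P-values over shared terms only.
    `floor` is an assumption that guards against log(0) for P-values reported as 0."""
    shared = gene_p.keys() & disease_p.keys()
    return sqrt(sum((log(max(gene_p[t], floor)) - log(max(disease_p[t], floor))) ** 2
                    for t in shared))


def term_overlap(gene, disease):
    """Number of MeSH terms present in both profiles."""
    return len(gene.keys() & disease.keys())


# P-values as listed for PAX6 (gene) and aniridia (disease) in this record.
gene_p = {"Exons": 8.53e-24, "Introns": 5.01e-10}
disease_p = {"Exons": 2.98e-23, "Introns": 2.15e-13, "Iris": 0.0}
print(term_overlap(gene_p, disease_p), l2_log_p_overlap(gene_p, disease_p))
```

Whether a larger or smaller value indicates a stronger predicted association depends on the original definitions, which this sketch does not attempt to settle.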
Performance using GeneRIF as the gene-literature data source
| Scoring method | Novel MEDLINE validation AUC (02/2007-01/2009) | Novel MEDLINE validation AUC (02/2007-04/2010) | Pre-existing CTD validation AUC (11/2008) | Novel CTD validation AUC (11/2008-04/2010) | Pre-existing MEDLINE validation AUC (02/2007) | Mean AUC | Rank |
|---|---|---|---|---|---|---|---|
| Cosine distance of term frequency-inverse document frequency | 0.90 | 0.89 | 0.93 | 0.91 | 0.98 | 0.92 | 2 |
| Cosine distance of | 0.56 | 0.57 | 0.60 | 0.56 | 0.53 | 0.56 | 15 |
| Cosine distance of term fractions | 0.86 | 0.84 | 0.91 | 0.88 | 0.96 | 0.89 | 4 |
| Sum of the log of combined | 0.86 | 0.85 | 0.92 | 0.90 | 0.94 | 0.90 | 3 |
| Sum of the differences of log | 0.91 | 0.91 | 0.77 | 0.83 | 0.93 | 0.87 | 6 |
| L2 of log-p of overlapping terms only | 0.94 | 0.93 | 0.91 | 0.92 | 0.98 | 0.94 | 1 |
| L2 of term fractions of overlapping terms only | 0.56 | 0.55 | 0.55 | 0.56 | 0.51 | 0.55 | 16 |
| L2 of log of | 0.90 | 0.90 | 0.76 | 0.83 | 0.93 | 0.86 | 9 |
| L2 of | 0.90 | 0.90 | 0.76 | 0.81 | 0.92 | 0.86 | 11 |
| L2 of term fractions | 0.86 | 0.85 | 0.89 | 0.88 | 0.94 | 0.88 | 5 |
| L2 of term frequency | 0.90 | 0.90 | 0.76 | 0.83 | 0.93 | 0.86 | 10 |
| Term coverage | 0.91 | 0.90 | 0.77 | 0.83 | 0.93 | 0.87 | 7 |
| Term overlap | 0.82 | 0.82 | 0.86 | 0.86 | 0.87 | 0.85 | 12 |
| Number of gene MeSH terms | 0.74 | 0.73 | 0.80 | 0.80 | 0.81 | 0.78 | 13 |
| Number of disease MeSH terms | 0.90 | 0.90 | 0.77 | 0.83 | 0.93 | 0.87 | 8 |
| Gene ID | 0.64 | 0.64 | 0.69 | 0.69 | 0.66 | 0.66 | 14 |
AUCs of the described scoring methods were compared and tested on the validation sets. CTD, Comparative Toxicogenomics Database.
Performance using gene2pubmed as the gene-literature data source
| Scoring method | Novel MEDLINE validation AUC (02/2007-01/2009) | Novel MEDLINE validation AUC (02/2007-04/2010) | Pre-existing CTD validation AUC (11/2008) | Novel CTD validation AUC (11/2008-04/2010) | Pre-existing MEDLINE validation AUC (02/2007) | Mean AUC | Rank |
|---|---|---|---|---|---|---|---|
| Cosine distance of term frequency-inverse document frequency | 0.92 | 0.91 | 0.95 | 0.93 | 0.98 | 0.94 | 2 |
| Cosine distance of | 0.53 | 0.51 | 0.65 | 0.63 | 0.53 | 0.57 | 16 |
| Cosine distance of term fractions | 0.90 | 0.89 | 0.93 | 0.91 | 0.96 | 0.92 | 5 |
| Sum of the log of combined | 0.91 | 0.89 | 0.94 | 0.94 | 0.94 | 0.92 | 3 |
| Sum of the differences of log | 0.91 | 0.91 | 0.77 | 0.83 | 0.93 | 0.87 | 7 |
| L2 of log-p of overlapping terms only | 0.96 | 0.95 | 0.92 | 0.94 | 0.99 | 0.95 | 1 |
| L2 of term fractions of overlapping terms only | 0.64 | 0.62 | 0.57 | 0.60 | 0.53 | 0.59 | 15 |
| L2 of log of | 0.90 | 0.90 | 0.76 | 0.83 | 0.93 | 0.86 | 10 |
| L2 of | 0.89 | 0.89 | 0.75 | 0.81 | 0.92 | 0.86 | 12 |
| L2 of term fractions | 0.92 | 0.90 | 0.91 | 0.92 | 0.95 | 0.92 | 4 |
| L2 of term frequency | 0.90 | 0.90 | 0.76 | 0.82 | 0.93 | 0.86 | 11 |
| Term coverage | 0.90 | 0.91 | 0.77 | 0.83 | 0.93 | 0.87 | 8 |
| Term overlap | 0.91 | 0.89 | 0.90 | 0.92 | 0.90 | 0.90 | 6 |
| Number of gene MeSH terms | 0.85 | 0.82 | 0.85 | 0.88 | 0.83 | 0.85 | 13 |
| Number of disease MeSH terms | 0.90 | 0.90 | 0.76 | 0.83 | 0.93 | 0.86 | 9 |
| Gene ID | 0.75 | 0.73 | 0.78 | 0.79 | 0.74 | 0.76 | 14 |
AUCs of the described scoring methods were compared and tested on the validation sets. CTD, Comparative Toxicogenomics Database.
Summary of MeSHOP performance
| Scoring method | Mean AUC | AUC standard error | Mean test rank ( | Overall rank |
|---|---|---|---|---|
| Cosine distance of term frequency-inverse document frequency | 0.93 | 0.03 | 15.03 | 2 |
| Cosine distance of | 0.57 | 0.05 | 87.25 | 16 |
| Cosine distance of term fractions | 0.90 | 0.04 | 20.21 | 4 |
| Sum of the log of combined | 0.91 | 0.03 | 18.88 | 3 |
| Sum of the differences of log | 0.87 | 0.06 | 26.97 | 7 |
| L2 of log-p of overlapping terms only | 0.94 | 0.03 | 12.06 | 1 |
| L2 of term fractions of overlapping terms only | 0.57 | 0.04 | 86.70 | 15 |
| L2 of log of | 0.86 | 0.07 | 28.05 | 10 |
| L2 of | 0.86 | 0.07 | 29.62 | 12 |
| L2 of term fractions | 0.90 | 0.03 | 20.39 | 5 |
| L2 of term frequency | 0.86 | 0.06 | 28.31 | 11 |
| Term coverage | 0.87 | 0.06 | 27.14 | 8 |
| Term overlap | 0.87 | 0.03 | 26.17 | 6 |
| Number of gene MeSH terms | 0.81 | 0.05 | 38.69 | 13 |
| Number of disease MeSH terms | 0.86 | 0.06 | 27.87 | 9 |
| Gene ID | 0.71 | 0.06 | 58.78 | 14 |
The AUC mean, standard deviation and ranking for the MeSHOP scores and the gene and disease baselines are described, over all validation sets and both GeneRIF and gene2pubmed reference sets.
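As a worked check of how the summary row for "L2 of log-p of overlapping terms only" can be recovered, the sketch below pools the five GeneRIF and five gene2pubmed AUCs listed for that method in the two per-source tables. It assumes that the summary pools exactly these ten values and that the reported spread is a sample standard deviation; the inputs are already rounded to two decimals, so the result is approximate.

```python
from statistics import mean, stdev

# AUCs for "L2 of log-p of overlapping terms only" from the two per-source tables.
generif = [0.94, 0.93, 0.91, 0.92, 0.98]
gene2pubmed = [0.96, 0.95, 0.92, 0.94, 0.99]

aucs = generif + gene2pubmed
print(round(mean(aucs), 2), round(stdev(aucs), 2))  # 0.94 0.03
```

Both values agree with the summary table entry for this method (0.94 and 0.03).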
Figure 3. Comparing the performance of similarity scores to gene characteristics. ROC curves for the L2 of log-p of overlapping terms gene-disease profile comparison score, compared against curves for Gene ID, the number of terms in the gene MeSHOP and the number of terms in the disease MeSHOP.
Mean average precision MeSHOP performance
| Scoring method | Novel MEDLINE validation MAP (02/2007-04/2010) | Rank | Novel CTD validation MAP (11/2008-04/2010) | Rank |
|---|---|---|---|---|
| Cosine distance of term frequency-inverse document frequency | 0.87 | 11 | 0.92 | 4 |
| Cosine distance of | 0.55 | 15 | 0.66 | 15 |
| Cosine distance of term fractions | 0.87 | 12 | 0.90 | 6 |
| Sum of the log of combined | 0.88 | 9 | 0.94 | 2 |
| Sum of the differences of log | 0.90 | 3 | 0.79 | 9 |
| L2 of log-p of overlapping terms only | 0.94 | 1 | 0.95 | 1 |
| L2 of term fractions of overlapping terms only | 0.54 | 16 | 0.52 | 16 |
| L2 of log of | 0.89 | 7 | 0.78 | 13 |
| L2 of | 0.89 | 5 | 0.79 | 8 |
| L2 of term fractions | 0.90 | 2 | 0.92 | 5 |
| L2 of term frequency | 0.89 | 8 | 0.79 | 10 |
| Term coverage | 0.90 | 4 | 0.79 | 11 |
| Term overlap | 0.88 | 10 | 0.93 | 3 |
| Number of gene MeSH terms | 0.81 | 13 | 0.88 | 7 |
| Number of disease MeSH terms | 0.89 | 6 | 0.78 | 12 |
| Gene ID | 0.69 | 14 | 0.74 | 14 |
The mean average precision for the novel MEDLINE relationships (02/2007 to 04/2010) and the novel CTD relationships (11/2008 to 04/2010). In each trial, 100 positive relationships and 100 negative relationships were chosen uniformly at random, and the average precision was computed for each scoring method. The mean average precision presented here is calculated over 100 random trials for each validation set.
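The sampling procedure described above can be sketched as follows. The function names, the `score` callable and the default seed are hypothetical; the sketch also assumes that sampling is without replacement and that a higher score ranks a candidate earlier, neither of which is stated in this record.

```python
import random


def average_precision(ranked_labels):
    """Mean of precision@k over the positions k that hold a positive (label 1)."""
    hits = 0
    total = 0.0
    for k, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0


def mean_average_precision(score, positives, negatives,
                           trials=100, sample=100, seed=0):
    """Average the AP over `trials` random draws of `sample` positives and negatives."""
    rng = random.Random(seed)
    aps = []
    for _ in range(trials):
        pos = rng.sample(positives, sample)  # without replacement (assumption)
        neg = rng.sample(negatives, sample)
        pairs = [(score(x), 1) for x in pos] + [(score(x), 0) for x in neg]
        # Higher score ranks first (assumption); the stable sort keeps
        # positives ahead of negatives on tied scores.
        pairs.sort(key=lambda pair: pair[0], reverse=True)
        aps.append(average_precision([label for _, label in pairs]))
    return sum(aps) / len(aps)


print(average_precision([1, 0, 1, 0]))  # (1/1 + 2/3) / 2
```

With balanced samples of 100 positives and 100 negatives, a scoring method that ranks at random would be expected to score near 0.5, which gives a baseline for reading the table.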
Figure 4. Comparing the performance of similarity scores. ROC curves are shown with AUC, computed for the top five similarity metrics and the disease number of MeSH terms baseline. These scores demonstrate predictions of gene-disease relationships using February 2007 data validated against the Comparative Toxicogenomics Database (11/2008) dataset.
Figure 5. Comparison of the top 500 gene predictions for Alzheimer disease from Génie and MeSHOP similarity. The 215 genes ranked in the top 500 gene predictions for both Génie and MeSHOP Similarity are compared, showing a correlation of 0.38. Of the genes ranked in the top 500 by Génie, 79 did not have MeSHOPs and therefore did not have a computed MeSHOP similarity score to rank.
Top 50 Alzheimer disease candidate genes by MeSHOP similarity
| Rank | Gene ID | Gene name | Score | Génie rank | Alzheimer disease articles |
|---|---|---|---|---|---|
| 1 | 348 | | 1.18E+04 | 4 | 812 |
| 2 | 351 | | 1.22E+04 | 2 | 576 |
| 3 | 4137 | | 1.23E+04 | 1 | 211 |
| 4 | 5663 | | 1.27E+04 | 3 | 249 |
| 5 | 6622 | | 1.27E+04 | 6 | 30 |
| 6 | 627 | | 1.28E+04 | 9 | 47 |
| 7 | 1312 | | 1.29E+04 | 87 | 10 |
| 8 | 1401 | | 1.29E+04 | 210 | 5 |
| 9 | 6532 | | 1.30E+04 | 43 | 23 |
| 10 | | | | | 0 |
| 11 | 5444 | | 1.30E+04 | 204 | 16 |
| 12 | 1813 | | 1.30E+04 | 114 | 1 |
| 13 | 4846 | | 1.30E+04 | 118 | 18 |
| 14 | 23621 | | 1.30E+04 | 5 | 86 |
| 15 | 2950 | | 1.30E+04 | 470 | 4 |
| 16 | 5621 | | 1.31E+04 | 12 | 28 |
| 17 | 5054 | | 1.31E+04 | NA | 3 |
| 18 | 1636 | | 1.31E+04 | 32 | 45 |
| 19 | 2952 | | 1.31E+04 | NA | 3 |
| 20 | 5071 | | 1.31E+04 | 13 | 6 |
| 21 | 120892 | | 1.31E+04 | 18 | 6 |
| 22 | 3553 | | 1.31E+04 | 39 | 32 |
| 23 | 4023 | | 1.31E+04 | 172 | 7 |
| 24 | 6647 | | 1.31E+04 | 36 | 6 |
| 25 | 3356 | | 1.31E+04 | 121 | 16 |
| 26 | 10 | | 1.31E+04 | 333 | 4 |
| 27 | 7515 | | 1.31E+04 | NA | 2 |
| 28 | 2944 | | 1.31E+04 | NA | 3 |
| 29 | 3552 | | 1.31E+04 | 30 | 36 |
| 30 | 3569 | | 1.32E+04 | 60 | 28 |
| 31 | 5664 | | 1.32E+04 | 7 | 78 |
| 32 | 6648 | | 1.32E+04 | 131 | 4 |
| 33 | | | | | 0 |
| 34 | 338 | | 1.32E+04 | NA | 1 |
| 35 | 7421 | | 1.32E+04 | NA | 2 |
| 36 | | | | | 0 |
| 37 | 183 | | 1.32E+04 | NA | 2 |
| 38 | 1543 | | 1.32E+04 | NA | 1 |
| 39 | 154 | | 1.32E+04 | NA | 1 |
| 40 | 4524 | | 1.32E+04 | 57 | 30 |
| 41 | 1071 | | 1.32E+04 | 197 | 8 |
| 42 | 3557 | | 1.32E+04 | 278 | 7 |
| 43 | 4318 | | 1.32E+04 | 219 | 5 |
| 44 | 1565 | | 1.32E+04 | 238 | 9 |
| 45 | 335 | | 1.32E+04 | 135 | 7 |
| 46 | | | | | 0 |
| 47 | 3990 | | 1.32E+04 | NA | 2 |
| 48 | 4153 | | 1.32E+04 | NA | 1 |
| 49 | 23435 | | 1.32E+04 | 10 | 10 |
| 50 | 345 | | 1.32E+04 | NA | 2 |
Genes are ranked by MeSHOP similarity score, and compared against the ranked list of Génie candidate genes for Alzheimer disease (a full analysis considering all possible orthologs). Also provided is a list of the number of articles related to Alzheimer disease in the gene2pubmed references for the gene, when present. Rows in bold indicate high-ranking predictions that have no prior association with Alzheimer disease in the literature. NA: gene not among the 566 genes ranked by Génie.
Summary of diabetes loci ranked by MeSHOP similarity
| Locus | Entrez Gene ID | Predicted similarity score | Rank | Percentile | Direct association |
|---|---|---|---|---|---|
| | 3416 | 7.59E+07 | 186 | 0.01 | 7.93E-02 |
| | 6934 | 5.91E+07 | 421 | 0.02 | 3.30E-03 |
| | 2132 | 2.96E+07 | 2616 | 0.10 | NA |
| | 3087 | 2.18E+07 | 4631 | 0.18 | NA |
| | 3832 | 1.87E+07 | 5985 | 0.24 | NA |
| | 60529 | 1.55E+07 | 8313 | 0.33 | NA |
| | 169026 | 1.55E+07 | 8352 | 0.33 | NA |
| | 387761 | NA | NA | NA | NA |
Loci identified by Sladek et al. [42] were ranked by MeSHOP similarity (L2 of log-p of overlapping terms only). Direct association scores are the Bonferroni-corrected P-values generated using the February 2007 datasets. NA
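For reference, a Bonferroni correction multiplies each raw P-value by the number of tests performed and caps the result at 1. The sketch below uses a hypothetical function name and hypothetical inputs; the number of tests behind the values in the table is not given in this record.

```python
def bonferroni(p_value: float, n_tests: int) -> float:
    """Bonferroni-corrected P-value: raw P-value times the number of tests, capped at 1."""
    return min(1.0, p_value * n_tests)


# Hypothetical example: a raw P-value of 1e-6 corrected for 25,000 tested terms.
print(bonferroni(1e-6, 25_000))
```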
Comparison of MeSHOP results for pancreatic cancer candidate genes
| Gene | Entrez gene | Predicted similarity | Rank | Percentile | Mutations | Deletions | Passenger probability: low rates | Passenger probability: mid rates | Passenger probability: high rates |
|---|---|---|---|---|---|---|---|---|---|
| | 7157 | 1.24E+08 | 11 | 100 | 18 | 2 | < 0.001 | < 0.001 | < 0.001 |
| | 1029 | 8.29E+07 | 135 | 99 | 2 | 16 | < 0.001 | < 0.001 | < 0.001 |
| | 3845 | 6.95E+07 | 266 | 99 | 24 | 0 | < 0.001 | < 0.001 | < 0.001 |
| | 7048 | 6.76E+07 | 288 | 99 | 3 | 1 | < 0.001 | 0.001 | 0.003 |
| | 2033 | 6.37E+07 | 351 | 99 | 2 | 0 | 0.176 | 0.482 | 0.984 |
| | 4089 | 6.14E+07 | 386 | 98 | 8 | 6 | < 0.001 | < 0.001 | < 0.001 |
| | 2006 | 5.57E+07 | 509 | 98 | 2 | 0 | 0.115 | 0.372 | 0.413 |
| | 2157 | 5.51E+07 | 530 | 98 | 2 | 0 | 0.165 | 0.482 | 0.853 |
| | 6331 | 5.18E+07 | 629 | 98 | 2 | 0 | 0.176 | 0.482 | 1.000 |
| | 5582 | 4.77E+07 | 798 | 97 | 2 | 0 | 0.115 | 0.372 | 0.413 |
| | 7173 | 4.71E+07 | 831 | 97 | 2 | 0 | 0.115 | 0.375 | 0.694 |
| | 5506 | 4.50E+07 | 946 | 96 | 2 | 0 | 0.115 | 0.477 | 0.694 |
| | 6597 | 4.04E+07 | 1243 | 95 | 2 | 0 | 0.062 | 0.183 | 0.413 |
| | 1289 | 3.72E+07 | 1518 | 94 | 2 | 0 | 0.176 | 0.482 | 0.984 |
| | 4224 | 3.38E+07 | 1895 | 92 | 2 | 0 | 0.062 | 0.183 | 0.413 |
| | 3561 | 2.95E+07 | 2652 | 89 | 1 | 0 | 0.004 | 0.016 | 0.997 |
| | 57194 | 2.77E+07 | 2974 | 88 | 2 | 0 | 0.176 | 0.482 | 1.000 |
| | 4620 | 2.71E+07 | 3063 | 88 | 2 | 0 | 0.165 | 0.477 | 0.853 |
| | 2892 | 2.62E+07 | 3281 | 87 | 1 | 1 | 0.017 | 0.069 | 0.999 |
| | 10347 | 2.56E+07 | 3426 | 86 | 2 | 0 | 0.033 | 0.139 | 0.201 |
| | 1741 | 2.51E+07 | 3540 | 86 | 1 | 0 | 0.003 | 0.015 | 0.997 |
| | 10395 | 2.47E+07 | 3645 | 86 | 2 | 0 | 0.176 | 0.482 | 1.000 |
| | 29998 | 2.06E+07 | 5082 | 80 | 2 | 0 | 0.062 | 0.183 | 0.405 |
| | 5046 | 2.02E+07 | 5240 | 79 | 2 | 0 | 0.176 | 0.482 | 0.911 |
| | 2125 | 2.00E+07 | 5329 | 79 | 2 | 0 | 0.176 | 0.482 | 0.942 |
| | 9542 | 1.95E+07 | 5537 | 78 | 2 | 0 | 0.165 | 0.477 | 0.853 |
| | 26050 | 1.93E+07 | 5655 | 78 | 2 | 0 | 0.165 | 0.477 | 0.853 |
| | 54437 | 1.92E+07 | 5713 | 77 | 2 | 0 | 0.062 | 0.183 | 0.413 |
| | 1804 | 1.86E+07 | 6025 | 76 | 3 | 0 | 0.009 | 0.079 | 0.201 |
| | 65217 | 1.84E+07 | 6162 | 76 | 4 | 0 | < 0.001 | 0.017 | 0.048 |
| | 56776 | 1.82E+07 | 6266 | 75 | 2 | 0 | 0.176 | 0.482 | 0.911 |
| | 781 | 1.77E+07 | 6597 | 74 | 1 | 0 | 0.001 | 0.004 | 0.989 |
| | 9940 | 1.70E+07 | 7039 | 72 | 2 | 0 | 0.176 | 0.482 | 0.911 |
| | 58508 | 1.69E+07 | 7090 | 72 | 6 | 0 | < 0.001 | < 0.001 | < 0.001 |
| | 55193 | 1.63E+07 | 7597 | 70 | 2 | 0 | 0.165 | 0.477 | 0.853 |
| | 54674 | 1.60E+07 | 7856 | 69 | 2 | 0 | 0.062 | 0.183 | 0.405 |
| | 23191 | 1.56E+07 | 8225 | 67 | 3 | 0 | 0.009 | 0.079 | 0.201 |
| | 23451 | 1.55E+07 | 8290 | 67 | 3 | 0 | 0.009 | 0.079 | 0.201 |
| | 7837 | 1.55E+07 | 8302 | 67 | 2 | 0 | 0.176 | 0.482 | 1.000 |
| | 7143 | 1.54E+07 | 8453 | 66 | 2 | 0 | 0.176 | 0.482 | 0.911 |
| | 6614 | 1.53E+07 | 8484 | 66 | 2 | 0 | 0.176 | 0.482 | 1.000 |
| | 55117 | 1.53E+07 | 8488 | 66 | 2 | 0 | 0.062 | 0.183 | 0.405 |
| | 8289 | 1.51E+07 | 8688 | 66 | 2 | 0 | 0.176 | 0.482 | 0.984 |
| | 6511 | 1.48E+07 | 8908 | 65 | 2 | 0 | 0.115 | 0.477 | 0.694 |
| | 80059 | 1.46E+07 | 9064 | 64 | 2 | 0 | 0.062 | 0.183 | 0.413 |
| | 114805 | 1.42E+07 | 9651 | 62 | 2 | 0 | 0.062 | 0.183 | 0.405 |
| | 2977 | 1.39E+07 | 9964 | 60 | 2 | 0 | 0.062 | 0.183 | 0.405 |
| | 27332 | 1.37E+07 | 10174 | 60 | 2 | 0 | 0.115 | 0.375 | 0.694 |
| | 23024 | 1.33E+07 | 10522 | 58 | 2 | 0 | 0.033 | 0.082 | 0.201 |
| | 1794 | 1.33E+07 | 10612 | 58 | 2 | 0 | 0.062 | 0.183 | 0.405 |
| | 25988 | 1.32E+07 | 10714 | 58 | 2 | 0 | 0.062 | 0.183 | 0.405 |
| | 117154 | 1.30E+07 | 10883 | 57 | 1 | 1 | 0.022 | 0.088 | 1.000 |
| | 84620 | 1.26E+07 | 11302 | 55 | 2 | 0 | 0.115 | 0.375 | 0.694 |
| | 9920 | 1.19E+07 | 12083 | 52 | 1 | 1 | 0.006 | 0.025 | 0.998 |
| | 53942 | 1.18E+07 | 12231 | 51 | 2 | 0 | 0.115 | 0.375 | 0.694 |
| | 84448 | 1.17E+07 | 12471 | 51 | 2 | 0 | 0.062 | 0.183 | 0.405 |
| | 54510 | 1.14E+07 | 12864 | 49 | 2 | 0 | 0.115 | 0.375 | 0.694 |
| | 80070 | 1.09E+07 | 13668 | 46 | 2 | 0 | 0.176 | 0.482 | 0.911 |
| | 1008 | 1.09E+07 | 13703 | 46 | 3 | 0 | < 0.001 | 0.017 | 0.048 |
| | 23251 | 1.09E+07 | 13715 | 46 | 2 | 0 | 0.115 | 0.375 | 0.694 |
| | 9096 | 1.08E+07 | 13821 | 45 | 2 | 0 | 0.062 | 0.183 | 0.413 |
| | 145581 | 1.07E+07 | 13894 | 45 | 2 | 0 | 0.062 | 0.183 | 0.405 |
| | 80243 | 1.07E+07 | 13953 | 45 | 3 | 0 | 0.055 | 0.183 | 0.405 |
| | 91010 | 1.05E+07 | 14376 | 43 | 2 | 0 | 0.055 | 0.179 | 0.405 |
| | 81501 | 1.03E+07 | 14681 | 42 | 2 | 0 | 0.055 | 0.179 | 0.405 |
| | 343406 | 1.02E+07 | 15126 | 40 | 2 | 0 | 0.033 | 0.139 | 0.317 |
| | 283383 | 1.02E+07 | 15188 | 40 | 2 | 0 | 0.062 | 0.183 | 0.405 |
| | 27253 | 1.01E+07 | 15355 | 39 | 2 | 0 | 0.062 | 0.183 | 0.405 |
| | 577 | 9.58E+06 | 16457 | 35 | 3 | 0 | 0.033 | 0.082 | 0.201 |
| | 23281 | 9.49E+06 | 16660 | 34 | 2 | 0 | 0.176 | 0.482 | 0.984 |
| | 1496 | 9.42E+06 | 16781 | 33 | 3 | 0 | 0.033 | 0.179 | 0.405 |
| | 54758 | 8.66E+06 | 18571 | 26 | 2 | 0 | 0.033 | 0.082 | 0.201 |
| | 7455 | 8.45E+06 | 19030 | 25 | 2 | 0 | 0.176 | 0.482 | 0.984 |
| | 26005 | 7.38E+06 | 20579 | 18 | 2 | 0 | 0.165 | 0.477 | 0.853 |
| | 440279 | 7.38E+06 | 20835 | 17 | 2 | 0 | 0.115 | 0.372 | 0.694 |
| | 133584 | 7.38E+06 | 21333 | 15 | 2 | 0 | 0.176 | 0.482 | 0.942 |
| | 166824 | 7.38E+06 | 21543 | 15 | 2 | 0 | 0.062 | 0.183 | 0.405 |
| | 341350 | 5.79E+06 | 24923 | 1 | 2 | 0 | 0.165 | 0.477 | 0.853 |
| | | NA | NA | NA | 3 | 0 | < 0.001 | 0.004 | 0.009 |
| | | NA | NA | NA | 2 | 0 | 0.165 | 0.477 | 0.853 |
| | 389197 | NA | NA | NA | 2 | 0 | 0.062 | 0.183 | 0.405 |
| | | NA | NA | NA | 2 | 0 | 0.062 | 0.183 | 0.405 |
| | 441136 | NA | NA | NA | 2 | 0 | 0.009 | 0.079 | 0.201 |
This table shows all genes from Supplementary Table S7 of Jones et al. [43] listed by strength of MeSHOP similarity score (via the L2 of log-p of overlapping terms only metric). NA, no MeSHOP available.
MeSHOP similarity analysis of known breast cancer genes with seven or more observed mutations
| Chromosomal location/gene(s) | Mutations observed | MeSHOP similarity rank |
|---|---|---|
| PIK3CA | 33 | 55 |
| GATA3 | 15 | 773 |
| chr8:37353781-37489508/FGFR1/ZNF703 | 15 | 191/8,493 |
| chr8:128504497-128948225/MYC | 15 | 19 |
| MAP3K1 | 9 | 417 |
| chr20:52065876-52723895/ZNF217 | 9 | 2,102 |
| NCOR1 | 7 | 360 |
Genes known to be implicated in breast cancer from Supplementary Table 4 of [45] are compared to the MeSHOP similarity ranking of human genes for breast neoplasms. The rows in italics highlight genes with a MeSHOP similarity rank less than 13.
Highest ranked breast cancer gene candidates with MeSHOP similarity analysis
| Rank | Gene | Probability of oncogenicity | MeSHOP similarity rank |
|---|---|---|---|
| 1 | MAP3K1 | | |
| 2 | TBX3 | 0.99996 | 2513 |
| 3 | TTN | 0.99952 | 1842 |
| 4 | NCOR1 | | |
| 5 | MTMR4 | 0.98201 | 10529 |
| 6 | MAP3K13 | 0.97383 | 9545 |
| 7 | CDKN1B | | |
| 8 | DIDO1 | 0.90433 | 4931 |
| 9 | SMARCD1 | 0.88007 | 3098 |
| 10 | CASP8 | | |
We list the top ten breast cancer candidate genes from Supplementary Table 5 of [45], ranked by probability of oncogenicity. Genes in bold are in the top 500 by MeSHOP similarity rank.
Datasets used in the analysis with details on size and relevant contents
| Dataset | February 2007 | January 2009 | April 2010 |
|---|---|---|---|
| Entrez Gene (including | | | |
| Total genes | 2,460,748 | 4,710,910 | 5,999,558 |
| Human genes | 38,604 | 40,183 | 45,423 |
| MEDLINE® | Baseline 2007 (Nov 2006) | Baseline 2009 (Nov 2008) | Baseline 2010 (Nov 2009) |
| Total articles | 16,120,073 | 17,764,232 | 18,502,915 |
| Total links | 3,081,413 | 12,960,489 | 5,979,167 |
| Total human gene links | 272,123 | 445,650 | 527,821 |
Although the number of human genes has not increased much over the years, the number of non-human links has increased substantially since 2007, while the human gene links have increased at a more moderate rate. Previously, MEDLINE®/PubMed® links from genomic sequence were propagated to all related genes. This practice was discontinued in March 2009, resulting (at the time) in a 60% decrease in links and the disparity in the number of overall links from 2009 to 2010.
Top 100 terms shared by the MeSHOPs of PAX6 and aniridia
| Common MeSH term | Gene MeSHOP | Disease MeSHOP | Score |
|---|---|---|---|
| DNA mutational analysis | 0.00E+00 | 0.00E+00 | 0.00e+0 |
| Pedigree | 0.00E+00 | 0.00E+00 | 0.00e+0 |
| Polymorphism, single-stranded conformational | 1.40E-44 | 1.67E-42 | 1.66e-42 |
| Humans | 6.82E-24 | 0.00E+00 | 6.82e-24 |
| Exons | 8.53E-24 | 2.98E-23 | 2.13e-23 |
| Mutation, missense | 1.45E-23 | 9.72E-21 | 9.70e-21 |
| Chromosomes, human, pair 11 | 2.73E-20 | 0.00E+00 | 2.73e-20 |
| Codon, nonsense | 1.15E-18 | 1.96E-21 | 1.15e-18 |
| Cataract | 2.37E-17 | 0.00E+00 | 2.37e-17 |
| Point mutation | 6.94E-17 | 7.02E-18 | 6.24e-17 |
| Frameshift mutation | 9.77E-15 | 2.11E-21 | 9.77e-15 |
| DNA primers | 5.27E-12 | 2.69E-15 | 5.27e-12 |
| Fovea centralis | 2.41E-16 | 6.03E-11 | 6.03e-11 |
| Introns | 5.01E-10 | 2.15E-13 | 5.01e-10 |
| Nystagmus, congenital | 9.55E-10 | 3.29E-11 | 9.22e-10 |
| Genes, dominant | 7.39E-09 | 2.45E-14 | 7.39e-9 |
| Asian continental ancestry group | 2.23E-16 | 1.07E-08 | 1.07e-8 |
| Lens, crystalline | 2.40E-08 | 5.62E-24 | 2.40e-8 |
| Alternative splicing | 6.14E-13 | 7.97E-08 | 7.97e-8 |
| Corneal opacity | 2.45E-06 | 1.47E-16 | 2.45e-6 |
| Child, preschool | 3.63E-06 | 3.08E-44 | 3.63e-6 |
| Family health | 1.03E-05 | 1.69E-07 | 1.01e-5 |
| Gene expression regulation, developmental | 2.46E-15 | 1.04E-05 | 1.04e-5 |
| Genes, homeobox | 1.40E-05 | 6.32E-09 | 1.40e-5 |
| Adolescent | 1.92E-05 | 1.71E-18 | 1.92e-5 |
| Conserved sequence | 6.20E-06 | 1.18E-04 | 1.12e-4 |
| Heterozygote | 1.15E-04 | 6.94E-08 | 1.15e-4 |
| Radiation hybrid mapping | 1.72E-05 | 1.50E-04 | 1.33e-4 |
| Alleles | 2.29E-04 | 4.17E-05 | 1.87e-4 |
| Abnormalities, multiple | 2.91E-04 | 0.00E+00 | 2.91e-4 |
| Iris | 3.45E-04 | 0.00E+00 | 3.45e-4 |
| Blepharoptosis | 4.54E-04 | 4.55E-08 | 4.53e-4 |
| WAGR syndrome | 5.13E-04 | 0.00E+00 | 5.13e-4 |
| Tomography, optical coherence | 1.14E-03 | 4.47E-04 | 6.97e-4 |
| Corpus callosum | 6.03E-07 | 9.38E-04 | 9.38e-4 |
| Pregnancy | 9.62E-01 | 9.60E-01 | 1.09e-3 |
| Open reading frames | 2.56E-10 | 1.12E-03 | 1.12e-3 |
| Forkhead transcription factors | 1.27E-03 | 1.95E-05 | 1.25e-3 |
| Face | 1.42E-03 | 3.94E-05 | 1.38e-3 |
| Nucleic acid heteroduplexes | 2.02E-04 | 1.73E-03 | 1.53e-3 |
| | 1.61E-03 | 3.22E-29 | 1.61e-3 |
| Gene deletion | 1.65E-03 | 1.77E-21 | 1.65e-3 |
| | 8.46E-04 | 2.50E-03 | 1.66e-3 |
| Proprotein convertase 1 | 1.71E-03 | 1.27E-05 | 1.69e-3 |
| Ectopia lentis | 1.81E-03 | 1.42E-05 | 1.79e-3 |
| Albinism, ocular | 1.86E-03 | 3.98E-14 | 1.86e-3 |
| Databases, nucleic acid | 3.09E-04 | 2.63E-03 | 2.32e-3 |
| India | 1.13E-04 | 2.79E-03 | 2.68e-3 |
| Amino acid substitution | 2.03E-06 | 3.07E-03 | 3.06e-3 |
| Transcriptional activation | 1.10E-23 | 3.22E-03 | 3.22e-3 |
| Genetic markers | 3.43E-03 | 1.71E-10 | 3.43e-3 |
| Anophthalmos | 5.77E-03 | 1.45E-04 | 5.63e-3 |
| 3' Untranslated regions | 8.26E-06 | 5.66E-03 | 5.65e-3 |
| Young adult | 1.18E-02 | 4.91E-03 | 6.84e-3 |
| Limbus corneae | 7.58E-03 | 1.48E-18 | 7.58e-3 |
| RNA, transfer, Lys | 4.19E-03 | 1.23E-02 | 8.16e-3 |
| DNA transposable elements | 9.49E-03 | 9.94E-04 | 8.50e-3 |
| Heteroduplex analysis | 4.54E-03 | 1.34E-02 | 8.84e-3 |
| Chromosome deletion | 9.12E-03 | 0.00E+00 | 9.12e-3 |
| Homozygote | 1.09E-02 | 1.29E-03 | 9.57e-3 |
| Otx transcription factors | 5.70E-06 | 1.00E-02 | 9.99e-3 |
| Genetic predisposition to disease | 1.16E-02 | 8.84E-05 | 1.15e-2 |
| Microphthalmos | 1.17E-02 | 2.34E-12 | 1.17e-2 |
| Vision, low | 1.24E-02 | 6.65E-04 | 1.17e-2 |
| Optic nerve | 1.26E-02 | 5.64E-08 | 1.26e-2 |
| Exotropia | 7.21E-03 | 2.12E-02 | 1.40e-2 |
| Cytosine | 2.70E-03 | 2.15E-02 | 1.88e-2 |
| Magnetic resonance imaging | 4.92E-04 | 1.96E-02 | 1.92e-2 |
| United States | 9.81E-01 | 1.00E+00 | 1.93e-2 |
| Trabecular meshwork | 2.22E-02 | 2.11E-03 | 2.01e-2 |
| Polymorphism, restriction fragment length | 2.30E-02 | 7.12E-04 | 2.23e-2 |
| Body patterning | 3.32E-03 | 2.62E-02 | 2.29e-2 |
| Dichotic listening tests | 1.21E-02 | 3.55E-02 | 2.34e-2 |
| Multigene family | 2.46E-02 | 8.37E-04 | 2.38e-2 |
| 3T3 Cells | 3.00E-06 | 2.67E-02 | 2.67e-2 |
| Cognition disorders | 4.77E-02 | 2.05E-02 | 2.72e-2 |
| Esotropia | 1.42E-02 | 4.14E-02 | 2.72e-2 |
| Mutagenesis, insertional | 4.08E-03 | 3.18E-02 | 2.77e-2 |
| Endothelium, corneal | 2.94E-02 | 1.06E-04 | 2.93e-2 |
| Restriction mapping | 3.25E-02 | 2.53E-06 | 3.25e-2 |
| Thymine | 4.15E-02 | 7.24E-03 | 3.43e-2 |
| Sequence homology, amino acid | 3.52E-02 | 3.31E-04 | 3.49e-2 |
| Glutamine | 5.38E-03 | 4.13E-02 | 3.59e-2 |
| Chromosomes, human, pair 10 | 1.97E-02 | 5.72E-02 | 3.75e-2 |
| Cytogenetics | 3.86E-02 | 2.41E-04 | 3.84e-2 |
| Nervous system malformations | 5.11E-02 | 9.30E-02 | 4.18e-2 |
| Organ specificity | 2.36E-01 | 1.90E-01 | 4.62e-2 |
| Catenins | 6.22E-02 | 1.59E-02 | 4.63e-2 |
| Genetic heterogeneity | 2.45E-02 | 7.09E-02 | 4.64e-2 |
| Brain-derived neurotrophic factor | 5.07E-02 | 6.24E-07 | 5.07e-2 |
| Chromosomes, human, pair 12 | 2.70E-02 | 7.77E-02 | 5.08e-2 |
| Leucine zippers | 1.64E-04 | 5.28E-02 | 5.26e-2 |
| Verbal behavior | 2.09E-01 | 1.53E-01 | 5.57e-2 |
| Mice, transgenic | 6.20E-02 | 5.89E-03 | 5.61e-2 |
| Visual acuity | 5.65E-02 | 0.00E+00 | 5.65e-2 |
| DNA fingerprinting | 9.30E-02 | 3.44E-02 | 5.85e-2 |
| Sequence alignment | 1.69E-02 | 7.59E-02 | 5.90e-2 |
| Autistic disorder | 9.48E-02 | 3.57E-02 | 5.91e-2 |
| Fluorescent antibody technique, indirect | 9.84E-02 | 3.83E-02 | 6.00e-2 |
| Age factors | 9.37E-01 | 9.97E-01 | 6.03e-2 |
The top 100 most similar MeSH terms of the 235 MeSH terms shared by both the MeSHOP for aniridia and the MeSHOP for PAX6 are presented here. The P-values of the term in the gene MeSHOP and in the disease MeSHOP are presented, and rows are ordered by the difference between the two P-values.
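The Score column can be reproduced from the two P-value columns as an absolute difference, with rows sorted in ascending order. The sketch below checks this on three rows copied from the table; the variable names are hypothetical.

```python
# (term, gene MeSHOP P-value, disease MeSHOP P-value), copied from the table.
rows = [
    ("Point mutation", 6.94e-17, 7.02e-18),
    ("Humans", 6.82e-24, 0.0),
    ("Exons", 8.53e-24, 2.98e-23),
]

# Score each shared term by the absolute difference of its two P-values,
# then order the terms from most to least similar.
scored = sorted(((term, abs(g - d)) for term, g, d in rows),
                key=lambda item: item[1])

for term, score in scored:
    print(f"{term}\t{score:.2e}")
# Humans          6.82e-24
# Exons           2.13e-23
# Point mutation  6.24e-17
```

The resulting order and values match the corresponding rows of the table.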