Shi Yu, Leon-Charles Tranchevent, Bart De Moor, Yves Moreau.
Abstract
BACKGROUND: Text mining has become a useful tool for biologists trying to understand the genetics of diseases. In particular, it can help identify the most interesting candidate genes for a disease for further experimental analysis. Many text mining approaches have been introduced, but the effectiveness of disease-gene identification varies across text mining models. Incorporating several text mining models may therefore yield more refined and accurate knowledge. However, how to combine these models effectively remains a challenging question in machine learning. In particular, it is non-trivial to guarantee that the integrated model performs better than the best individual model.
Year: 2010 PMID: 20074336 PMCID: PMC3098068 DOI: 10.1186/1471-2105-11-28
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Overview of the controlled vocabularies applied in the multi-view approach.
| No. | CV | Number of terms in CV | Number of indexed terms |
|---|---|---|---|
| 1 | eVOC | 1659 | 1286 |
| 2 | eVOC anatomical system | 518 | 401 |
| 3 | eVOC cell type | 191 | 82 |
| 4 | eVOC human development | 658 | 469 |
| 5 | eVOC mouse development | 369 | 298 |
| 6 | eVOC pathology | 199 | 166 |
| 7 | eVOC treatment | 62 | 46 |
| 8 | GO | 37069 | 7403 |
| 9 | GO biological process | 20470 | 4400 |
| 10 | GO cellular component | 3724 | 1571 |
| 11 | GO molecular function | 15282 | 3323 |
| 12 | KO | 1514 | 554 |
| 13 | LDDB | 935 | 890 |
| 14 | MeSH | 29709 | 15569 |
| 15 | MeSH analytical | 3967 | 2404 |
| 16 | MeSH anatomy | 2467 | 1884 |
| 17 | MeSH biological | 2781 | 2079 |
| 18 | MeSH chemical | 11824 | 6401 |
| 19 | MeSH disease | 6717 | 4001 |
| 20 | MeSH organisms | 4586 | 1575 |
| 21 | MeSH psychiatry | 1463 | 907 |
| 22 | MPO | 9232 | 3446 |
| 23 | OMIM | 5021 | 3402 |
| 24 | SNOMED | 311839 | 27381 |
| 25 | SNOMED assessment scale | 1881 | 810 |
| 26 | SNOMED body structure | 30156 | 2865 |
| 27 | SNOMED cell | 1224 | 346 |
| 28 | SNOMED cell structure | 890 | 498 |
| 29 | SNOMED disorder | 97956 | 13059 |
| 30 | SNOMED finding | 51159 | 3967 |
| 31 | SNOMED morphologic abnormality | 6903 | 2806 |
| 32 | SNOMED observable entity | 11927 | 3119 |
| 33 | SNOMED procedure | 69976 | 9575 |
| 34 | SNOMED product | 23054 | 1542 |
| 35 | SNOMED regime therapy | 5362 | 1814 |
| 36 | SNOMED situation | 9303 | 2833 |
| 37 | SNOMED specimen | 1948 | 742 |
| 38 | SNOMED substance | 33065 | 8948 |
| 39 | Uniprot | 1618 | 520 |
| 40 | Merge-9 | 372527 | 50687 |
| 41 | Merge-4 | 363321 | 48326 |
| 42 | Concept-4 | 1420118 | 44714 |
| 43 | No-voc | - | 259815 |
The versions of the bio-ontologies and the MEDLINE repository used in the indexing process are given in the text. The numbers of indexed terms reported in this table are counted on the indexing results of human-related publications only, so they are smaller than those in our earlier work [2], which were counted on all species appearing in GeneRIF. The numbers of terms in each CV are counted on the vocabularies independently of the indexing process. The numbers of terms of Merge-9, Merge-4 and Concept-4 are counted on the text mining results of all species occurring in GeneRIF.
The number of overlapping terms in different vocabularies and indexed terms
| | Indexed terms | eVOC | GO | KO | LDDB | MeSH | MPO | OMIM | SNOMED | Uniprot |
|---|---|---|---|---|---|---|---|---|---|---|
| Terms in CV | | 1659 | 37069 | 1514 | 935 | 29709 | 9232 | 5021 | 311839 | 1618 |
| eVOC | 1286 | - | 370 | 16 | 118 | 827 | 566 | 325 | 876 | 46 |
| GO | 7403 | 358 | - | 404 | 74 | 3380 | 1234 | 659 | 4772 | 325 |
| KO | 554 | 16 | 344 | - | 1 | 383 | 72 | 120 | 489 | 44 |
| LDDB | 890 | 118 | 74 | 1 | - | 346 | 275 | 205 | 498 | 16 |
| MeSH | 15569 | 784 | 2875 | 344 | 343 | - | 2118 | 1683 | 12483 | 373 |
| MPO | 3446 | 554 | 1177 | 72 | 271 | 2007 | - | 823 | 2729 | 146 |
| OMIM | 3402 | 322 | 655 | 119 | 205 | 1644 | 816 | - | 2275 | 161 |
| SNOMED | 27381 | 814 | 3144 | 380 | 492 | 8900 | 2508 | 2170 | - | 593 |
| Uniprot | 520 | 46 | 301 | 42 | 16 | 361 | 146 | 157 | 371 | - |
The upper triangle of the matrix shows the numbers of overlapping terms among the vocabularies, independent of indexing. The lower triangle shows the numbers of overlapping indexed terms. The second row (from the top) gives the numbers of terms in the vocabularies, independent of indexing. The second column (from the left) gives the numbers of indexed terms.
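The layout described in this note can be sketched as follows. This is a minimal illustration, assuming that each vocabulary is available as a plain set of term strings and that overlap means exact string match; the function name `overlap_matrix` and the toy terms are hypothetical and are not taken from the paper.

```python
def overlap_matrix(vocab_terms, indexed_terms):
    """Build an asymmetric overlap table keyed by (row, column) vocabulary names.

    Cells above the diagonal count terms shared by the full vocabularies;
    cells below the diagonal count terms shared by the indexed subsets.
    Exact string matching is an assumption of this sketch.
    """
    names = list(vocab_terms)
    matrix = {}
    for i, row in enumerate(names):
        for j, col in enumerate(names):
            if i < j:
                matrix[row, col] = len(vocab_terms[row] & vocab_terms[col])
            elif i > j:
                matrix[row, col] = len(indexed_terms[row] & indexed_terms[col])
    return matrix


# Toy vocabularies (hypothetical terms, not the real CVs).
vocab = {"A": {"muscle", "heart", "cell"}, "B": {"muscle", "cell", "gene"}, "C": {"gene"}}
indexed = {"A": {"muscle"}, "B": {"muscle", "gene"}, "C": {"gene"}}
m = overlap_matrix(vocab, indexed)
```

Because the two triangles are computed from different term sets, the table is not symmetric, which matches the differing values on either side of the diagonal above.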
Figure 1. Conceptual scheme of clustering disease-relevant genes. Using these gene-by-term profiles, we evaluate clustering performance on a benchmark data set consisting of 620 disease-relevant genes categorized into 29 genetic diseases. The numbers of genes per disease are very imbalanced; moreover, some genes are related to several diseases simultaneously. To obtain meaningful clusters and evaluations, we enumerate all pairwise combinations of the 29 diseases (406 combinations). For each pair, the relevant genes of the two diseases are selected and clustered into two groups, and the performance is then evaluated using the disease labels. Genes relevant to both diseases in a pair are removed before clustering (in total, fewer than 5% of the genes are removed). Finally, the average performance over all 406 pairs is used as the overall clustering performance.
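The pairwise evaluation scheme in this caption can be sketched roughly as below. The helper names (`paired_disease_tasks`, `average_pairwise_score`) and the toy gene sets are hypothetical; the clustering and scoring step is passed in as a callable because the caption does not fix a particular algorithm.

```python
from itertools import combinations


def paired_disease_tasks(disease_genes):
    """Yield (disease_a, disease_b, genes_a, genes_b) for every unordered pair.

    Genes annotated to both diseases of a pair are dropped before clustering.
    """
    for a, b in combinations(sorted(disease_genes), 2):
        shared = disease_genes[a] & disease_genes[b]
        yield a, b, disease_genes[a] - shared, disease_genes[b] - shared


def average_pairwise_score(disease_genes, cluster_and_score):
    """Average a user-supplied two-group evaluation over all disease pairs."""
    scores = [cluster_and_score(genes_a, genes_b)
              for _, _, genes_a, genes_b in paired_disease_tasks(disease_genes)]
    return sum(scores) / len(scores)


# Toy input (hypothetical gene names, not the benchmark data).
toy = {"d1": {"g1", "g2", "g3"}, "d2": {"g3", "g4"}, "d3": {"g5", "g6"}}
tasks = list(paired_disease_tasks(toy))

# 29 diseases give 29 * 28 / 2 unordered pairs.
n_pairs_29 = sum(1 for _ in combinations(range(29), 2))
```

In the toy input, gene `g3` belongs to both `d1` and `d2`, so it is excluded from that pair only and is still used when `d1` is paired with `d3`.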
Prioritization performance obtained by the single controlled vocabularies and the multi-view approach
| Single CV | Error of AUC | Integration of 9 complete CVs | Error of AUC |
|---|---|---|---|
| LDDB | | Order statistics | 0.0990 |
| eVOC | 0.0852 | Average score | 0.0782 |
| MPO | 0.0974 | Maximum score | 0.0957 |
| GO | 0.1027 | 1-SVM | 0.0620 |
| MeSH | 0.1043 | 1-SVM | 0.0583 |
| SNOMED | 0.1129 | 1-SVM | |
| OMIM | 0.1214 | ||
| Uniprot | 0.1345 | ||
| KO | 0.1999 | ||
| Integration of 9 LSI profiles | Error of AUC | Integration of 35 subset CVs | Error of AUC |
| Order statistics | 0.0645 | Order statistics | 0.0870 |
| Average score | 0.0382 | Average score | 0.0674 |
| Maximum score | 0.0437 | Maximum score | 0.0883 |
| 1-SVM | 0.0540 | 1-SVM | 0.1036 |
| 1-SVM | 0.0454 | 1-SVM | 0.0851 |
| 1-SVM | 1-SVM | ||
The experiments are repeated 20 times on random candidate gene sets and the standard deviations are all smaller than 0.01.
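As a rough illustration of how score-level integration could be compared with single views, the sketch below combines per-view gene scores by averaging or by taking the maximum and reports one minus the AUC. Several points are assumptions rather than details stated here: that "Error of AUC" means 1 - AUC, that a higher score means a higher priority, and that scores from different views are directly comparable. The order statistics and 1-SVM methods in the table are not covered, and all names and numbers are hypothetical.

```python
def auc_error(scores, positives):
    """Return 1 - AUC for one ranking (assumed meaning of 'Error of AUC').

    AUC is the fraction of (positive, negative) gene pairs in which the
    positive gene has the higher score; ties count as half.
    """
    pos = [s for g, s in scores.items() if g in positives]
    neg = [s for g, s in scores.items() if g not in positives]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return 1.0 - wins / (len(pos) * len(neg))


def integrate(views, combine):
    """Combine per-view score dicts gene by gene with `combine` (e.g. max)."""
    return {gene: combine([view[gene] for view in views]) for gene in views[0]}


def average(values):
    return sum(values) / len(values)


# Toy scores for three candidate genes in two views (hypothetical numbers).
view_a = {"target": 0.9, "x": 0.2, "y": 0.8}
view_b = {"target": 0.6, "x": 0.7, "y": 0.1}
truth = {"target"}
errors = {
    "view_a": auc_error(view_a, truth),
    "view_b": auc_error(view_b, truth),
    "average": auc_error(integrate([view_a, view_b], average), truth),
    "maximum": auc_error(integrate([view_a, view_b], max), truth),
}
```

In this toy case the second view ranks one unrelated gene above the target, while both combined rankings place the target first.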
Clustering performance obtained by the single controlled vocabulary and the multi-view approach
| Single CV | RI | NMI | Integration (9 CVs) | RI | NMI |
|---|---|---|---|---|---|
| LDDB | | | Ward linkage | | |
| OMIM | 0.7216 ± 0.0009 | 0.4606 ± 0.0028 | EACAL | 0.7741 ± 0.0041 | 0.5542 ± 0.0068 |
| Uniprot | 0.7130 ± 0.0013 | 0.4333 ± 0.0091 | OKKC( | 0.7641 ± 0.0078 | 0.5395 ± 0.0147 |
| eVOC | 0.7015 ± 0.0043 | 0.4280 ± 0.0079 | MCLA | 0.7596 ± 0.0021 | 0.5268 ± 0.0087 |
| MPO | 0.7064 ± 0.0016 | 0.4301 ± 0.0049 | QMI | 0.7458 ± 0.0039 | 0.5084 ± 0.0063 |
| MeSH | 0.6673 ± 0.0055 | 0.3547 ± 0.0097 | OKKC( | 0.7314 ± 0.0054 | 0.4723 ± 0.0097 |
| SNOMED | 0.6539 ± 0.0063 | 0.3259 ± 0.0096 | AdacVote | 0.7300 ± 0.0045 | 0.4093 ± 0.0100 |
| GO | 0.6525 ± 0.0063 | 0.3254 ± 0.0092 | CSPA | 0.7011 ± 0.0065 | 0.4479 ± 0.0097 |
| KO | 0.5900 ± 0.0014 | 0.1928 ± 0.0042 | Complete linkage | 0.6874 ± 0 | 0.3028 ± 0 |
| | | | Average linkage | 0.6722 ± 0 | 0.2590 ± 0 |
| | | | HGPA | 0.6245 ± 0.0035 | 0.3015 ± 0.0071 |
| | | | Single linkage | 0.5960 ± 0 | 0.1078 ± 0 |
| Integration (9 LSI) | RI | NMI | Integration (35 subset CVs) | RI | NMI |
| Ward linkage | 0.7991 ± 0 | 0.5997 ± 0 | Ward linkage | 0.8172 ± 0 | 0.5890 ± 0 |
| OKKC( | 0.7501 ± 0.0071 | 0.5220 ± 0.0104 | OKKC( | 0.7947 ± 0.0052 | 0.5732 ± 0.0096 |
| EACAL | 0.7511 ± 0.0037 | 0.5232 ± 0.0075 | EACAL | 0.7815 ± 0.0064 | 0.5701 ± 0.0082 |
For integration of 9 LSI and 35 subset CVs, only the best three results are shown.
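The two evaluation measures in this table, the Rand index (RI) and normalized mutual information (NMI), can both be computed from a class-by-cluster contingency table. The sketch below is one possible implementation: the NMI normalization by the geometric mean of the two entropies is an assumption, since the normalization is not stated here, and the function names are hypothetical.

```python
from math import comb, log, sqrt


def rand_index(table):
    """Rand index from a contingency table (rows: true classes, cols: clusters)."""
    n = sum(map(sum, table))
    same_both = sum(comb(cell, 2) for row in table for cell in row)
    same_class = sum(comb(sum(row), 2) for row in table)
    same_cluster = sum(comb(sum(col), 2) for col in zip(*table))
    total = comb(n, 2)
    # Agreements = pairs together in both partitions + pairs separated in both.
    return (total + 2 * same_both - same_class - same_cluster) / total


def nmi(table):
    """Mutual information divided by sqrt(H(classes) * H(clusters)).

    The geometric-mean normalization is an assumption of this sketch; it is
    undefined when either partition has a single group (zero entropy).
    """
    n = sum(map(sum, table))
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    mi = sum(cell / n * log(cell * n / (rows[i] * cols[j]))
             for i, row in enumerate(table)
             for j, cell in enumerate(row) if cell)

    def entropy(marginals):
        return -sum(m / n * log(m / n) for m in marginals if m)

    return mi / sqrt(entropy(rows) * entropy(cols))


# Two classes of 24 genes each with one gene misplaced per class, the same
# counts as the MPO row of the breast cancer / muscular dystrophy table.
ri_example = rand_index([[23, 1], [1, 23]])
```

For that table, 1036 of the 1128 gene pairs are treated consistently by the two partitions, so the Rand index is 1036/1128, roughly 0.918.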
Figure 2. Prioritization results obtained by complete CV profiles and LSI profiles.
Figure 3. Prioritization results obtained by complete CV profiles and subset CV profiles.
Figure 4. Prioritization results obtained by multi-view data integration. The first and second dotted horizontal lines represent the errors of the best single complete CV profile and the best single LSI profile, respectively. To test whether the difference between the two closest results is statistically significant, we used a paired t-test to compare the Error values of 1-SVM (1) with LSI profile integration against those of the MeSH LSI profile over 20 repetitions; the p-value was 2.67e-004.
Figure 5. Clustering results obtained by complete CV and LSI profiles.
Figure 6. Clustering results obtained by complete CV and subset CV profiles.
Figure 7. Clustering results obtained by multi-view data integration.
Figure 8. ROC curves of prioritization obtained by various integration methods. The light grey curves represent the single CV performance. The near-diagonal curve is obtained by the prioritization of random genes.
Clustering performance obtained by merging controlled vocabularies, concept mapping and no vocabulary indexing
| Merging vocabulary | RI | NMI |
|---|---|---|
| merge-9 | 0.6321 ± 0.0038 | 0.2830 ± 0.0079 |
| merge-4 | 0.6333 ± 0.0053 | 0.2867 ± 0.0085 |
| concept-4 | 0.6241 ± 0.0056 | 0.2644 ± 0.0111 |
| novoc | 0.5630 | 0.0892 |
The merge-9, merge-4 and concept-4 profiles were clustered by K-means in 20 random repetitions and the mean values and deviations of evaluations are shown in the table. The novoc profile was only evaluated once by K-means because of the extremely high dimension and the computational burden.
Figure 9. Multi-view prioritization and clustering by various numbers of views.
Prioritization of the myopathy disease relevant gene MTM1 by different CVs and multi-view approach
| CV | Ranking position | false positive genes | correlated terms |
|---|---|---|---|
| LDDB | 1 | C3orf1 | muscle, heart, skeletal |
| | 2 | HDAC4 | muscle, heart, calcium, growth |
| | 3 | CNTFR | muscle, heart, muscle weak, growth |
| | | | muscle, muscle weak, skeletal, hypotonia, growth, lipid |
| eVOC | | | muscle, sever, disorder, affect, human, recess |
| MPO | 1 | HDAC4 | muscle, interact, protein, domain, complex |
| | 2 | HYAL1 | sequence, human, protein, gener |
| | 3 | WTAP | protein, human, sequence, specif |
| | 4 | FUT3 | sequence, alpha, human |
| | ... | | |
| | | | myopathy, muscle, link, sequence, disease, sever |
| GO | 1 | | muscle, mutate, family, gene, link, sequence, sever |
| MeSH | 1 | HYAL1 | human, protein, clone, sequence |
| | 2 | LUC7L2 | protein, large, human, function |
| | | | myopathy, muscle, mutate, family, gene, missens |
| SNOMED | 1 | S100A8 | protein, large, human, function |
| | 2 | LUC7L2 | protein, large, human, function |
| | 3 | LGALS3 | human, protein, express, bind |
| | ... | | |
| | | | muscle, mutate, family, gene, link |
| OMIM | 1 | HDAC4 | muscle, interact, protein, bind |
| | 2 | MAFK | sequence, protein, gene, asthma relat trait |
| | 3 | LUC7L2 | protein, large, function, sequence |
| | 4 | SRP9L1 | sequence, protein, length, function |
| | ... | | |
| | | | muscle, family, gene, link, sequence, disease, sever, weak |
| Uniprot | 1 | | gene, protein, function |
| KO | 1 | S100A8 | protein, bind, complex, specif, associ, relat |
| | 2 | PRF1 | specif, protein, contain, activ |
| | ... | | |
| | | | protein, large, specif, contain |
| Multi-view | | | |
| | 2 | HDAC4 | |
| | 3 | CNTFR | |
Genes relevant to breast cancer and muscular dystrophy
| Disease | Breast Cancer | Muscular Dystrophy |
|---|---|---|
| relevant genes | AR | CAPN3 |
| | ATM | CAV3 |
| | BCAR1 | COL6A1 |
| | BRCA1 | COL6A3 |
| | BRCA2 | DMD |
| | BRIP1 | DYSF |
| | BRMS1 | EMD |
| | CDH1 | FKRP |
| | CHEK2 | FKTN |
| | CTTN | FRG1 |
| | DBC1 | LAMA2 |
| | ESR1 | LMNA |
| | NCOA3 | MYF6 |
| | PHB | MYOT |
| | PPM1D | PABPN1 |
| | RAD51 | PLEC1 |
| | RAD54L | SEPN1 |
| | RB1CC1 | SGCA |
| | RP11-49G10.8 | SGCB |
| | SLC22A18 | SGCD |
| | SNCG | SGCG |
| | TFF1 | TCAP |
| | TP53 | TRIM32 |
| | TSG101 | TTN |
Clustering breast cancer and muscular dystrophy relevant genes by different CVs and the multi-view approach
| CV | Breast Cancer | Muscular Dystrophy | mis-partitioned genes |
|---|---|---|---|
| LDDB | 22 | 2 | RP11-49G10.8, FKTN |
| | 0 | 24 | |
| eVOC | 22 | 2 | RP11-49G10.8, FKTN |
| | 7 | 17 | LMNA, COL6A1, MYF6, CHEK2, SGCD, FKRP, DMD |
| MPO | 23 | 1 | RP11-49G10.8 |
| | 1 | 23 | SGCD |
| GO | 23 | 1 | RP11-49G10.8 |
| | 7 | 17 | LMNA, COL6A1, MYF6, CHEK2, SGCD, FKRP, DMD |
| MeSH | 23 | 1 | RP11-49G10.8 |
| | 2 | 22 | SGCD, COL6A3 |
| SNOMED | 24 | 0 | |
| | 6 | 18 | LMNA, COL6A1, MYF6, TRIM32, SGCD, DMD |
| OMIM | 24 | 0 | |
| | 1 | 23 | SGCD |
| Uniprot | 24 | 0 | |
| | 4 | 20 | MYF6, CHEK2, SGCD, FKRP |
| KO | 19 | 5 | SLC22A18, RP11-49G10.8, FKTN, PABPN1, CAPN3 |
| | 6 | 18 | PPM1D, MYF6, SGCD, FKRP, COL6A3, DYSF |
| Multi-view (WL) | 24 | 0 | |
| | 0 | 24 | |