Euan A Adie, Richard R Adams, Kathryn L Evans, David J Porteous, Ben S Pickard.
Abstract
BACKGROUND: Regions of interest identified through genetic linkage studies regularly exceed 30 centimorgans in size and can contain hundreds of genes. Traditionally, this number is reduced by matching functional annotation to knowledge of the disease or phenotype in question. However, here we show that disease genes share patterns of sequence-based features that can provide a good basis for automatic prioritization of candidates by machine learning.
Year: 2005 PMID: 15766383 PMCID: PMC1274252 DOI: 10.1186/1471-2105-6-55
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
The feature set. The list of features which were made available to the machine learning application (Weka) to build the alternating decision tree.
| Feature | Source | Description |
| --- | --- | --- |
| Gene length | EnsemblMart 22.1 | Length of gene in bp. |
| CDS length | EnsemblMart 22.1 | Length of coding sequence in bp. |
| cDNA length | EnsemblMart 22.1 | Length of complementary DNA in bp. |
| Protein length | EnsemblMart 22.1 | Length of protein in aa. |
| Length of 3' UTR | EnsemblMart 22.1 | The length of the 3' untranslated region (UTR) in bp |
| Length of 5' UTR | EnsemblMart 22.1 | The length of the 5' untranslated region (UTR) in bp |
| Distance to nearest neighbouring gene | EnsemblMart 22.1 | Distance to the next known gene on the same chromosome on either strand in bp. |
| Number of exons | EnsemblMart 22.1 | Number of exons in the gene. |
| GC | EnsemblMart 22.1 | GC content (as a %) of gene |
| Transmembrane | EnsemblMart 22.1 | Prediction of transmembrane domains (1 for yes or 0 for no) |
| Signal peptide | EnsemblMart 22.1 | Prediction of signal peptide (1 for yes or 0 for no) |
| Paralog | EnsemblMart 22.1 | If the gene has a paralog in the human genome (1 for yes or 0 for no) |
| Paralog % identity | EnsemblMart 22.1 | % protein identity of best paralog in the human genome. Genes without paralogs have "unknown" entered here. |
| Mouse homolog % identity | Homologene | % protein identity of mouse homolog. Genes without a mouse homolog have "0" entered here. |
| Rat homolog % identity | Homologene | % protein identity of rat homolog. Genes without a rat homolog have "0" entered here. |
| Worm homolog % identity | Homologene | % protein identity of worm homolog (potentially 0, see above) |
| Fly homolog % identity | Homologene | % protein identity of fly homolog (potentially 0, see above) |
| Yeast homolog % identity | Homologene | % protein identity of yeast homolog (potentially 0, see above) |
| Arabidopsis homolog % identity | Homologene | % protein identity of Arabidopsis homolog (potentially 0, see above) |
| Mouse homolog Ka | Homologene | Measure of non-synonymous changes between human and mouse homolog. |
| Mouse homolog Ks | Homologene | Measure of synonymous changes between human and mouse homolog. |
| Mouse homolog Ka / Ks | Homologene | Ratio of above two fields. |
| CpG island at 3' end of gene | EnsemblMart 22.1 | If a CpG island exists at the 3' end of the gene (1 or 0) |
| CpG island at 5' end of gene | EnsemblMart 22.1 | If a CpG island exists at the 5' end of the gene (1 or 0) |
Significant differences between the control set and disease set of genes. The features found to be significantly different between Ensembl genes found in OMIM and those not in OMIM. Significance was calculated using the Mann-Whitney U test unless otherwise noted.
| Feature | Control genes (not in OMIM) | Disease genes (in OMIM) | Significance |
| --- | --- | --- | --- |
| Gene length | 19 kb | 27 kb | P < 0.001 |
| cDNA length | 2,126 bp | 2,442 bp | P < 0.001 |
| Protein length | 383 aa | 494 aa | P < 0.001 |
| 3' UTR length | 446 bp | 488 bp | P < 0.01 |
| Exon number | 8 | 10 | P < 0.001 |
| Distance to neighbouring gene | 46 kb | 52 kb | P < 0.01 |
| Protein identity with BRH in mouse | 80% | 87% | P < 0.001 |
| Gene encodes signal peptide | 17% | 35% | P < 0.0001 (calculated using the chi squared test) |
| 5' CpG islands | 12% | 16% | P < 0.028 (calculated using the chi squared test) |
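The Mann-Whitney U test named in the caption above can be sketched in a few lines. This is a minimal illustration only: it uses a normal approximation without tie correction, and the gene lengths in the usage lines are invented values, not data from the study. The function name `mann_whitney_u` is an assumption made for this sketch.

```python
from math import erfc, sqrt

def mann_whitney_u(xs, ys):
    """Two-sided Mann-Whitney U test (normal approximation, no tie correction)."""
    # Pool both samples, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in xs] + [(v, 1) for v in ys])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        # Find the run of tied values and give each the average rank.
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n1, n2 = len(xs), len(ys)
    rank_sum_x = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u1 = rank_sum_x - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p_two_sided = erfc(abs(z) / sqrt(2))
    return u, p_two_sided

# Invented gene lengths in kb, for illustration only.
control_lengths = [12, 15, 19, 22, 18]
disease_lengths = [25, 27, 31, 24, 40]
u_stat, p_value = mann_whitney_u(control_lengths, disease_lengths)
```

With samples as small as these the normal approximation is crude; a statistics package with exact or tie-corrected p-values would be preferable for real data.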
Figure 1. Histograms of selected features. Histograms showing distributions of selected features in both "disease genes" (those listed in OMIM) and control genes (those not). Data was binned for graphing purposes. Distributions are shown for (A) gene length in kilobases; (B) protein length in amino acids; (C) % identity of the best reciprocal hit (BRH) homolog in mouse; (D) Ka (a measure of non-synonymous change between species) of the BRH homolog in mouse; (E) number of exons and (F) 3' UTR length in basepairs.
Figure 2. The alternating decision tree. The alternating decision tree used to classify instances. A gene is classified with the tree by beginning at the node marked "Start" and then following each branch in turn. Upon reaching a node which contains an assumption the "yes" or "no" branch is followed as appropriate. If the relevant feature is "unknown", neither branch is followed. Adding up each of the numbers in rectangles that are encountered along the way results in a final score which reflects the relative confidence of the classification. The classification itself is based on the sign of the score.
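The scoring procedure described in this caption can be sketched as a short recursive walk. The tree below is a hypothetical miniature: its features, thresholds and prediction values are invented for illustration and are not taken from Figure 2. The names `TREE`, `adt_score` and `classify` are likewise assumptions of this sketch, as is the choice to treat a positive score as "disease".

```python
# Hypothetical miniature tree; every threshold and score here is invented.
TREE = {
    "score": -0.1,  # prediction value at the "Start" node
    "tests": [
        {
            "feature": "signal_peptide",
            "test": lambda v: v == 1,
            "yes": {"score": 0.4, "tests": []},
            "no": {"score": -0.2, "tests": [
                {
                    "feature": "mouse_identity",
                    "test": lambda v: v >= 85,
                    "yes": {"score": 0.3, "tests": []},
                    "no": {"score": -0.3, "tests": []},
                },
            ]},
        },
        {
            "feature": "utr3_length",
            "test": lambda v: v >= 450,
            "yes": {"score": 0.2, "tests": []},
            "no": {"score": -0.1, "tests": []},
        },
    ],
}

def adt_score(node, gene):
    """Sum the prediction values met while following every applicable branch."""
    total = node["score"]
    for decision in node["tests"]:
        value = gene.get(decision["feature"])
        if value is None:
            # Feature is "unknown": follow neither branch.
            continue
        branch = decision["yes"] if decision["test"](value) else decision["no"]
        total += adt_score(branch, gene)
    return total

def classify(gene):
    """Classify by the sign of the score; its magnitude reflects confidence."""
    score = adt_score(TREE, gene)
    return ("disease" if score > 0 else "control"), score

label, score = classify({"signal_peptide": 1, "mouse_identity": 90, "utr3_length": 500})
```

Unlike an ordinary decision tree, every decision node under a visited prediction node is evaluated, which is why the loop visits all entries in `tests` rather than choosing one path.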
More detailed classifier performance statistics. For each set of genes tested, five statistics that reflected performance were calculated. Accuracy is the overall accuracy of the classifier; precision reflects the classifier's specificity and recall reflects classifier sensitivity. The area under curve (AUC) is the area underneath the ROC curve drawn for each set of genes (see Figure 3) and represents classifier performance across all combinations of sensitivity and specificity. It ranges from 0 to 1, where 1 represents 100% accuracy, 0.5 represents performance no better than random and 0 represents 0% accuracy. The Kappa statistic is a measurement of agreement between predicted and actual classifications and takes false positive rates into account. It is a number between 1 (symbolising perfect agreement between predicted and actual classifications) and 0 (symbolising no agreement).
| Gene set | | Accuracy | Precision | Recall | AUC | Kappa |
| --- | --- | --- | --- | --- | --- | --- |
| Training (OMIM) set | 15 | 67% | 65% | 77% | 0.75 | 0.35 |
| 10-fold cross-validation | 15 | 63% | 62% | 70% | 0.70 | 0.27 |
| HGMD set | 15 | 64.5% | 63% | 71% | 0.69 | 0.29 |
| Oligogenic set | 15 | 65% | 63% | 72% | 0.76 | 0.31 |
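As a rough illustration of how accuracy, precision, recall and the Kappa statistic relate to one another, the sketch below derives them from a 2 × 2 confusion matrix. It assumes the standard definitions of these statistics (with Cohen's kappa as the Kappa statistic); the counts are invented and do not come from the study, and `classifier_stats` is a hypothetical helper name.

```python
def classifier_stats(tp, fp, fn, tn):
    """Accuracy, precision, recall and Cohen's kappa from confusion-matrix counts."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp)  # fraction of predicted disease genes that are correct
    recall = tp / (tp + fn)     # fraction of true disease genes that are recovered
    # Agreement expected by chance, from the row and column totals.
    chance_yes = ((tp + fp) / n) * ((tp + fn) / n)
    chance_no = ((fn + tn) / n) * ((fp + tn) / n)
    chance = chance_yes + chance_no
    kappa = (accuracy - chance) / (1 - chance)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "kappa": kappa}

# Invented counts for a balanced set of 100 disease and 100 control genes.
stats = classifier_stats(tp=70, fp=40, fn=30, tn=60)
```

On a balanced set, chance agreement is 0.5, so an accuracy in the mid-60% range corresponds to a kappa of roughly 0.3, which is the same order as the values in the table.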
Figure 3. Receiver Operating Characteristic (ROC) curves. Receiver Operating Characteristic (ROC) curves for the training set (A) and the two test sets (B and C). The true positive rate is measured along the y-axis and the false positive rate along the x-axis. The area under the resulting curve is a measure of classifier performance.
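The area under a ROC curve can be computed without drawing the curve, because it equals the probability that a randomly chosen positive instance scores higher than a randomly chosen negative one (ties counted as half). The sketch below uses that equivalence; the scores and labels are invented, and `roc_auc` is a hypothetical helper name, not a function used in the study.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    # Each pair contributes 1 if the positive outranks the negative, 0.5 on a tie.
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Invented classifier scores; label 1 marks a disease gene, 0 a control gene.
auc = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

The pairwise loop is quadratic in the number of genes, which is fine for a sketch; a rank-based formulation would scale better for large sets.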
Relative contribution of each feature to classification as disease gene. An estimate of the relative contribution of each sequence feature in the final score used by the alternating decision tree for classifying genes as being involved in disease. The percentages are based on the average absolute contribution to the cumulative absolute score of each disease gene in the training set.
| Feature | Relative contribution |
| --- | --- |
| Signal peptide | 23% |
| Mouse homolog % identity | 21% |
| Length of 3' UTR | 12% |
| Number of exons | 7% |
| Rat homolog % identity | 7% |
| Worm homolog % identity | 6% |
| GC | 6% |
| CDS length | 5% |
| Gene length | 4% |
| Mouse homolog Ka | 3% |
| Paralog % identity | 2% |
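One possible reading of the calculation described in the caption above is sketched here: for each gene, each feature's absolute contribution is divided by the sum of absolute contributions across all features, and these shares are then averaged over genes. This interpretation, the function name `relative_contributions` and the input values are all assumptions made for illustration.

```python
def relative_contributions(per_gene_contributions):
    """Average each feature's share of the cumulative absolute score across genes."""
    share_totals = {}
    genes_used = 0
    for contributions in per_gene_contributions:
        cumulative = sum(abs(v) for v in contributions.values())
        if cumulative == 0:
            continue  # gene received no score from any feature
        genes_used += 1
        for feature, value in contributions.items():
            share = abs(value) / cumulative
            share_totals[feature] = share_totals.get(feature, 0.0) + share
    if genes_used == 0:
        return {}
    return {f: total / genes_used for f, total in share_totals.items()}

# Invented per-gene score contributions, keyed by feature.
example = [
    {"signal_peptide": 0.3, "mouse_identity": -0.1},
    {"signal_peptide": -0.2, "mouse_identity": 0.2},
]
shares = relative_contributions(example)
```

Using absolute values means that a feature counts as influential whether it pushes the score towards or away from the disease class.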
Figure 4. Performance over artificial loci. Relative performance on the sets of artificial loci created from the training set (the yellow line), HGMD test set (the blue line) and oligogenic test set (the green line). The gray line represents the value expected if there had been no enrichment. The x axis represents the % of the ranked list in which the target gene was found; the y axis represents how frequently that occurred. For example, in the training set (the yellow line) the target gene was in the top 30% of the ranked list around 56% of the time.
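The quantities plotted on the two axes can be illustrated with a small sketch: rank the genes in one locus by classifier score, record the percentage of the list needed to reach the target gene, and then count how often that percentage falls within a cutoff across many loci. The gene names, scores and helper names (`target_percentile`, `fraction_within`) are hypothetical and only serve the example.

```python
def target_percentile(scores, target):
    """Percentage of the score-ranked list that must be read to reach the target."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return 100.0 * (ranked.index(target) + 1) / len(ranked)

def fraction_within(percentiles, cutoff):
    """Fraction of loci whose target gene lies within the top `cutoff` percent."""
    return sum(p <= cutoff for p in percentiles) / len(percentiles)

# One invented artificial locus: four background genes plus the target "T".
locus = {"g1": 0.5, "g2": -0.2, "g3": 0.1, "g4": -0.4, "T": 0.3}
pct = target_percentile(locus, "T")

# Invented percentiles for four loci, evaluated at the 30% cutoff.
hit_rate = fraction_within([40.0, 10.0, 80.0, 25.0], 30)
```

Under no enrichment the target would fall in the top 30% about 30% of the time, which is what the gray reference line in the figure represents.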