Santosh Philips, Heng-Yi Wu, Lang Li.
Abstract
BACKGROUND: The explosion of data brings a proportional opportunity to identify novel knowledge with potential application in targeted therapies. Despite these huge amounts of data, solutions for treating complex diseases remain elusive. One reason is that these diseases are driven by a network of genes that must be targeted in order to understand and treat them effectively. Part of the solution lies in mining and integrating information from various disciplines. Here we propose a machine learning method for mining publicly available literature on RNA interference, with the goal of identifying genes essential for cell survival.
Keywords: Gene essentiality; Literature mining; Machine learning
Year: 2017 PMID: 28984184 PMCID: PMC5629548 DOI: 10.1186/s12859-017-1799-1
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Composition of the training and testing sets used to test the various Weka classifiers
| Set | Training positive | Training negative | Testing positive | Testing negative | Data |
|---|---|---|---|---|---|
| 1 | 100 | 300 | 100 | 300 | r,d,dd,g |
| 2 | 100 | 100 | 100 | 100 | r,d,dd,na |
| 3 | 100 | 300 | 100 | 300 | r,d,dd,na |
| 4 | 100 | 400 | 100 | 400 | r,d,dd,na,g |
| 5 | 120 | 1700 | 101 | 1700 | r,d,dd,na,rns |
[r: RNAi abstracts, d: drug-only abstracts, dd: drug interaction abstracts, na: not applicable, rns: random negative set]
The % accuracy of classification for each classifier on a given dataset, evaluated using 10-fold stratified cross-validation
| Classifiers | Set 1 | Set 2 | Set 3 | Set 4 | Set 5 |
|---|---|---|---|---|---|
| ZeroR | 75.00 | 50.00 | 75.00 | 80.00 | 93.41 |
| NaiveBayes | 93.00 | 89.00 | 93.25 | 92.40 | 95.00 |
| KNN | 77.00 | 74.00 | 81.00 | 83.20 | 94.23 |
| J48 | 95.00 | 95.00 | 94.50 | 96.60 | 98.46 |
| RandomForest | 91.00 | 95.00 | 84.75 | 82.80 | 93.41 |
| SMO | 94.25 | 94.50 | 94.50 | 96.00 | 98.35 |
| OneR | 88.75 | 78.00 | 88.75 | 91.00 | 96.09 |
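The ZeroR row can be read as a majority-class baseline: ZeroR always predicts the most frequent class, so its accuracy equals the share of that class in the data. The following is a minimal sketch, assuming the training-set composition listed in the composition table above; the names `TRAINING_SETS` and `zero_r_accuracy` are illustrative and do not come from the paper.

```python
# Training-set composition (positive, negative), copied from the composition table.
TRAINING_SETS = {
    1: (100, 300),
    2: (100, 100),
    3: (100, 300),
    4: (100, 400),
    5: (120, 1700),
}


def zero_r_accuracy(positive, negative):
    """Percent accuracy of a classifier that always predicts the majority class."""
    return 100.0 * max(positive, negative) / (positive + negative)


# Baseline accuracy per set, rounded to two decimals like the table.
baselines = {
    set_id: round(zero_r_accuracy(pos, neg), 2)
    for set_id, (pos, neg) in TRAINING_SETS.items()
}
print(baselines)
```

Under these assumptions the computed baselines agree with the ZeroR row, which suggests that accuracy on the imbalanced Set 5 is best judged against 93.41% rather than against 50%.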
The % accurately classified by the top three models after training and testing
| Set | Classifier | Train | Test |
|---|---|---|---|
| 1 | J48 | 99.50 | 94.50 |
| 1 | SMO | 100.00 | 96.25 |
| 1 | NaiveBayes | 98.00 | 86.50 |
| 2 | J48 | 99.00 | 93.00 |
| 2 | RandomForest | 100.00 | 97.50 |
| 2 | SMO | 100.00 | 93.00 |
| 3 | J48 | 99.50 | 94.50 |
| 3 | SMO | 100.00 | 94.50 |
| 3 | NaiveBayes | 98.50 | 84.25 |
| 4 | J48 | 99.00 | 92.40 |
| 4 | SMO | 100.00 | 93.00 |
| 4 | NaiveBayes | 94.20 | 89.00 |
| 5 | J48 | 99.50 | 99.20 |
| 5 | SMO | 100.00 | 98.50 |
| 5 | OneR | 96.60 | 97.10 |
These models were previously evaluated using 10-fold cross-validation.
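Stratified cross-validation keeps the class ratio roughly constant across folds, which matters when positives are rare. The sketch below is a generic illustration of the idea and not Weka's own implementation; the function `stratified_folds`, the seed, and the label encoding (1 for RNAi, 0 for other) are assumptions made for the example, with class sizes taken from the Set 5 training data.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split example indices into k folds that preserve the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class out round-robin so every fold gets an equal share.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Hypothetical labels matching the Set 5 training composition: 120 positive, 1700 negative.
labels = [1] * 120 + [0] * 1700
folds = stratified_folds(labels, k=10)

for fold in folds:
    positives = sum(labels[i] for i in fold)
    print(len(fold), positives)
```

Each fold then serves once as the held-out test fold while the remaining nine are used for training, and the reported accuracy is aggregated over the ten runs.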
Classifier errors for the classifiers tested on dataset 5
| Classifier Error | ZeroR | NaiveBayes | KNN | J48 | RandomForest | SMO | OneR |
|---|---|---|---|---|---|---|---|
| Kappa statistic | 0.00 | 0.66 | 0.23 | 0.87 | 0.00 | 0.86 | 0.57 |
| Mean absolute error | 0.12 | 0.05 | 0.06 | 0.02 | 0.09 | 0.02 | 0.04 |
| Root mean squared error | 0.25 | 0.22 | 0.24 | 0.12 | 0.20 | 0.13 | 0.20 |
| Relative absolute error | 100% | 40.55% | 47.10% | 15.63% | 73.11% | 13.33% | 31.55% |
| Root relative squared error | 100% | 90.13% | 96.73% | 48.74% | 79.35% | 51.73% | 79.59% |
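The kappa statistic corrects observed agreement for the agreement expected by chance, which is why a classifier that only predicts the majority class scores 0.00 despite a high raw accuracy. Below is a minimal sketch of Cohen's kappa for a two-class confusion matrix. The ZeroR confusion matrix used in the example is an assumption: it is derived from the Set 5 training composition (120 positive, 1700 negative) with every abstract predicted negative, and is not reported in the paper.

```python
def cohen_kappa(tp, fn, fp, tn):
    """Cohen's kappa for a binary confusion matrix."""
    total = tp + fn + fp + tn
    observed = (tp + tn) / total
    # Chance agreement: product of actual and predicted marginals, per class.
    expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / total ** 2
    if expected == 1.0:
        return 0.0  # Degenerate case: only one class is present and predicted.
    return (observed - expected) / (1.0 - expected)


# Assumed ZeroR outcome on Set 5: all 1820 abstracts predicted negative.
zero_r_kappa = cohen_kappa(tp=0, fn=120, fp=0, tn=1700)
# A hypothetical perfect classifier on the same data, for contrast.
perfect_kappa = cohen_kappa(tp=120, fn=0, fp=0, tn=1700)
print(zero_r_kappa, perfect_kappa)
```

Read this way, the 0.00 kappa for RandomForest on dataset 5 is consistent with a model that predicted almost everything as negative, while J48 and SMO show agreement well beyond chance.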
Performance metrics across the various classifiers tested on dataset 5 for abstracts classified as RNAi
| Classifiers | Time (sec) | TPR | FPR | Precision | Recall | F-Measure |
|---|---|---|---|---|---|---|
| ZeroR | 2.45 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| NaiveBayes | 28.82 | 0.83 | 0.04 | 0.59 | 0.83 | 0.69 |
| KNN | 3.22 | 0.14 | 0.00 | 0.90 | 0.14 | 0.25 |
| J48 | 116.14 | 0.83 | 0.00 | 0.93 | 0.83 | 0.88 |
| RandomForest | 70.66 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| SMO | 6.46 | 0.82 | 0.01 | 0.93 | 0.82 | 0.87 |
| OneR | 12.94 | 0.42 | 0.00 | 0.98 | 0.42 | 0.59 |
[Time in seconds to build the model, True Positive Rate (TPR), False Positive Rate (FPR)]
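The F-measure column is the harmonic mean of precision and recall. As a quick consistency check, the sketch below recomputes it from the rounded precision and recall values in the table; the helper `f_measure` and the dictionary `REPORTED` are illustrative names introduced for this example.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table above.
REPORTED = {
    "ZeroR": (0.00, 0.00),
    "NaiveBayes": (0.59, 0.83),
    "J48": (0.93, 0.83),
    "SMO": (0.93, 0.82),
    "OneR": (0.98, 0.42),
}

recomputed = {name: round(f_measure(p, r), 2) for name, (p, r) in REPORTED.items()}
print(recomputed)
```

Because the inputs are already rounded to two decimals, a recomputed value can differ from the published one in the last digit; for KNN, for example, the rounded inputs 0.90 and 0.14 give 0.24 rather than the reported 0.25.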
Fig. 1 AUC Receiver Operator Characteristics for the SMO-5 model
The number of abstracts that were processed per year and the number of abstracts that were identified as relevant to RNA interference studies
| Year | Medline | RNAi |
|---|---|---|
| 2001 | 424,042 | 101 |
| 2002 | 435,427 | 180 |
| 2003 | 472,745 | 425 |
| 2004 | 514,910 | 745 |
| 2005 | 575,403 | 1101 |
| 2006 | 620,688 | 1503 |
| 2007 | 652,232 | 1724 |
| 2008 | 701,623 | 1996 |
| 2009 | 742,510 | 2308 |
| 2010 | 801,061 | 2707 |
| 2011 | 862,838 | 3070 |
| 2012 | 931,619 | 3923 |
| 2013 | 978,796 | 4048 |
| 2014 | 1,018,012 | 4498 |
| 2015 | 796,876 | 3835 |
| Total | 10,528,782 | 32,164 |
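To make the yearly counts easier to compare, the share of RNAi abstracts can be expressed per 10,000 Medline abstracts processed. The sketch below only rearranges the numbers from the table above; the variable names and the per-10,000 scaling are choices made for this example, not part of the paper.

```python
# year: (Medline abstracts processed, abstracts classified as RNAi), from the table.
COUNTS = {
    2001: (424042, 101),
    2002: (435427, 180),
    2003: (472745, 425),
    2004: (514910, 745),
    2005: (575403, 1101),
    2006: (620688, 1503),
    2007: (652232, 1724),
    2008: (701623, 1996),
    2009: (742510, 2308),
    2010: (801061, 2707),
    2011: (862838, 3070),
    2012: (931619, 3923),
    2013: (978796, 4048),
    2014: (1018012, 4498),
    2015: (796876, 3835),
}

total_medline = sum(medline for medline, _ in COUNTS.values())
total_rnai = sum(rnai for _, rnai in COUNTS.values())

# RNAi abstracts per 10,000 Medline abstracts, by year.
rate_per_10k = {
    year: round(10000 * rnai / medline, 1)
    for year, (medline, rnai) in COUNTS.items()
}

print(total_medline, total_rnai)
for year, rate in rate_per_10k.items():
    print(year, rate)
```

The column sums computed this way agree with the Total row, and the normalized rate separates growth in RNAi literature from growth in Medline as a whole.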
The number of times a given gene and cell line were studied together
| No. of Cell Gene Associations | Frequency |
|---|---|
| 5 | 25,198 |
| 10 | 461 |
| 15 | 99 |
| 20 | 52 |
| 25 | 25 |
| 30 | 15 |
| 35 | 8 |
| 40 | 4 |
| 45 | 10 |
| 50 | 1 |
| 55 | 1 |
| 60 | 6 |
| 65 | 5 |
| 70 | 1 |
| 75 | 2 |
| 80 | 0 |
| 85 | 1 |
| 90 | 2 |
Frequency of the number of genes studied in a given cell line
| Genes | Frequency |
|---|---|
| 25 | 1291 |
| 50 | 73 |
| 100 | 54 |
| 200 | 30 |
| 300 | 10 |
| 400 | 3 |
| 500 | 1 |
| 600 | 2 |
| 700 | 0 |
| 800 | 1 |
| 900 | 1 |
| 1000 | 0 |
| 1100 | 1 |
Frequency of the number of cell lines used to study a given gene
| Cell Lines | Frequency |
|---|---|
| 25 | 4209 |
| 50 | 96 |
| 100 | 46 |
| 150 | 10 |
| 200 | 5 |
| 250 | 3 |
| 300 | 1 |
Fig. 2 Top 10 most studied cell lines
Fig. 3 The top 20 genes predicted to be essential for cell survival
The genes amongst the top 20 that are known to be cancer genes and their roles in the various processes required for cellular function
| Functional Class | AKT1 | TP53 | CDH1 | CCND1 | BCL2 | CDKN1A | MYC | EGFR | JUN |
|---|---|---|---|---|---|---|---|---|---|
| Cell cycle | X | X | X | X | X | X | X | ||
| Cell motility and interactions | X | X | |||||||
| Cell response to stimuli | X | X | X | X | X | ||||
| Cellular metabolism | X | X | X | X | X | X | X | ||
| Cellular processes | X | X | X | X | X | X | X | ||
| Development | X | X | X | X | X | X | X | ||
| DNA/RNA metabolism and transcription | X | X | X | ||||||
| Immune system response | X | X | X | ||||||
| Multicellular activities | X | X | |||||||
| Regulation of intracellular processes and metabolism | X | X | X | X | X | X | X | X | X |
| Regulation of transcription | X | X | X | X | |||||
| Signal transduction | X | X | X | X | X | X |
X: genes involved in that particular functional process of the cell