Predicting neurological Adverse Drug Reactions based on biological, chemical and phenotypic properties of drugs using machine learning models.

Salma Jamal1, Sukriti Goyal1, Asheesh Shanker1,2, Abhinav Grover3.   

Abstract

Adverse drug reactions (ADRs) have become one of the primary reasons for the failure of drugs and a leading cause of death. Owing to the severe effects of ADRs, there is an urgent need for effective models that can accurately predict ADRs during the early stages of drug development by integrating various features of drugs. In the current study, we focused on neurological ADRs and used various properties of drugs: biological properties (targets, transporters and enzymes), chemical properties (substructure fingerprints), phenotypic properties (side effects (SE) and therapeutic indications), and combinations of two and three levels of features. We employed a relief-based feature selection technique to identify relevant properties and used a machine learning approach to generate learned model systems that would predict neurological ADRs prior to preclinical testing. Additionally, to demonstrate the efficiency and applicability of the models, we tested them by predicting ADRs for existing anti-Alzheimer drugs and for uncharacterized drugs in the side effect resource (SIDER) database, respectively. Our results showed that the models based on chemical (accuracy 93.20%), phenotypic (accuracy 92.41%) and the combination of the three properties (accuracy 94.18%) were highly accurate, while the models based on biological properties (accuracy 82.11%) were highly informative.

Year:  2017        PMID: 28408735      PMCID: PMC5429831          DOI: 10.1038/s41598-017-00908-z

Source DB:  PubMed          Journal:  Sci Rep        ISSN: 2045-2322            Impact factor:   4.379


Introduction

Adverse drug reactions (ADRs) are unwanted phenotypic responses caused by alterations in biological pathways in response to drug treatments[1]. Studies on ADRs have become more significant owing to the increasing morbidity and mortality due to severe ADRs. ADRs have been estimated to be the fourth leading cause of death in the United States, with about 100 000 fatalities per year[2]. In the conventional drug discovery process, only a few of the thousands of lead compounds reach clinical trials and actually make it to the market, at a cost of billions of dollars and a huge amount of time and labour. However, even then most of the drugs fail in phase IV clinical trials and in post-marketing surveillance, and a drug may be withdrawn due to ADRs[3]. These facts underline the need for prediction of ADRs in the early stages of the drug discovery and development process. In recent years, prediction of potential ADRs has become a research focus of utmost importance for many pharmaceutical companies, and a large number of studies have been conducted in this regard. The traditional method of ADR prediction employed by these companies involves testing the compounds in biological assays, which is extremely challenging in terms of time, effort, money and efficiency[4]. Recently, many studies have reported preclinical prediction of ADRs associated with drugs by integrating side effect information[5], protein target, transporter and enzyme information[6], chemical structure information[7] and drug therapeutic indications[2]. Kanji et al.[8] proposed a new strategy and generated a canonical correlation model for predicting side effects of drugs by combining their chemical properties with their target profiles. 
Zhang et al.[9] used ensemble methods and devised a feature selection based multi-label k-nearest neighbour method (FS-MLKNN) with which essential features for ADR prediction can be identified. Huang et al. integrated drug information (drug target data and clinical observation data) with network information (protein-protein interaction networks and gene ontology information) and built in silico models for computer-aided ADR prediction of drugs[10]. Although various methods have been proposed for prior prediction of ADRs for drugs, there still remains room for improvement. An enormous amount of side effect data is now publicly available. These data could be of significant value if integrated with chemical structure information, protein binding and therapeutic indication data. In this study, we propose a computational method that integrates three levels of information, namely biological features (targets, transporters and enzymes), chemical information (PubChem substructure fingerprints) and phenotypic information (side effects and therapeutic indications), for the prediction of neurological ADRs. We measured chemical similarity among the drugs and employed a relief-based feature selection technique to identify features relevant for ADR prediction. To handle imbalance in the data, we used the Synthetic Minority Oversampling Technique (SMOTE)[11] on the training sets. These balanced training sets were used to generate in silico models that could predict neurological ADRs associated with drugs. Using the SMOTE-balanced training datasets, machine learning models were generated for each of the biological, chemical and phenotypic features, as well as for the combination of all the features, for 22 neurological ADRs. Furthermore, the models were employed to predict neurological side effects for uncharacterized drugs for which no ADR information was available in SIDER.

Results and Discussion

The computational methodology followed in the present study has been shown in Fig. 1.
Figure 1

Computational methodology followed in the present study.

Feature analysis

To remove less important features that contribute little to classification, and to reduce the dimensionality of the data and the processing time, we used a relief-based feature selection technique. The list of features obtained after applying relief-based feature selection is provided as Supplementary Table 1. Table 1 lists the number of features obtained after applying the RemoveUseless filter and relief-based feature selection, and the types of features used to generate the models.
Table 1

Lists the number of features obtained after applying RemoveUseless filter and relief-based feature selection and types of features used to generate the models.

| Type of feature | Feature | Initial number of features | RemoveUseless filter | Relief-based selection | Total final features |
|---|---|---|---|---|---|
| Biological | Targets | 954 | 945 | 52 | 78 |
| | Transporters | 87 | 86 | 13 | |
| | Enzymes | 168 | 165 | 13 | |
| Chemical | Substructures | 881 | 619 | 319 | 319 |
| Phenotypic | Other ADRs | 5462 | 5411 | 272 | 281 |
| | Therapeutic indications | 3046 | 1962 | 9 | |

Models assessment

The performance of the models was evaluated using the testing dataset. Table 2 provides the list of the 22 neurological ADRs along with their SIDER ids for which the SMO models were generated.
Table 2

Provides the list of the 22 neurological ADRs along with their SIDER ids for which the SMO models were generated.

| Neurological ADR | SIDER id |
|---|---|
| Arteritic anterior ischaemic optic neuropathy | C2242711 |
| Autonomic neuropathy | C0259749 |
| Nervous system disorder | C0007682 |
| Neuralgia | C0040997 |
| Neuritis | C0027813 |
| Neuritis retrobulbar | C0085582 |
| Neuroleptic malignant syndrome | C0027849 |
| Neurologic reaction | C0235030 |
| Neurological impairment | C0521654 |
| Neurological symptom | C0235031 |
| Neuromuscular block prolonged | C0520758 |
| Neuromyopathy | C0027868 |
| Neuropathy | C0442874 |
| Neuropathy peripheral | C0031117 |
| Neurosis | C0027932 |
| Neurotoxicity | C0235032 |
| Optic neuritis | C0029134 |
| Peripheral motor neuropathy | C0235025 |
| Peripheral sensorimotor neuropathy | C1112256 |
| Peripheral sensory neuropathy | C0151313 |
| Polyneuropathy | C0152025 |
| Post herpetic neuralgia | C0032768 |

Modelling using biological features

A total of 22 models were generated on a training set using a combination of 52 targets, 13 transporters and 13 enzymes, totalling 78 biological properties, for 913 approved drugs. Averaged over the 22 models, the accuracy was 82.11%, with a very high precision of 0.94, a recall of 0.85 and an F-score of 0.89. The model for the ADR autonomic neuropathy came out to be the best predictive model, having the highest accuracy value (98.9%), the highest precision (0.99) and an F-score of 0.99. Table 3 provides the performances of the models generated using the biological features.
Table 3

Provides the performances of the models generated using the biological features.

| ADR event | Accuracy (%) | Precision | Recall | F-score | AUC |
|---|---|---|---|---|---|
| Arteritic anterior ischaemic optic neuropathy | 71.58 | 0.99 | 0.72 | 0.83 | 0.36 |
| Autonomic neuropathy | 98.9 | 0.99 | 0.99 | 0.99 | 0.49 |
| Nervous system disorder | 59.56 | 0.6 | 0.86 | 0.7 | 0.55 |
| Neuralgia | 72.67 | 0.92 | 0.76 | 0.83 | 0.54 |
| Neuritis | 93.98 | 0.93 | 1 | 0.96 | 0.607 |
| Neuritis retrobulbar | 96.17 | 0.99 | 0.96 | 0.98 | 0.48 |
| Neuroleptic malignant syndrome | 57.37 | 0.97 | 0.56 | 0.71 | 0.63 |
| Neurologic reaction | 88.52 | 0.98 | 0.89 | 0.93 | 0.44 |
| Neurological impairment | 99.45 | 0.99 | 1 | 0.99 | 0.50 |
| Neurological symptom | 59.01 | 0.98 | 0.59 | 0.73 | 0.54 |
| Neuromuscular block prolonged | 92.89 | 0.99 | 0.93 | 0.96 | 0.46 |
| Neuromyopathy | 85.79 | 0.87 | 0.97 | 0.92 | 0.52 |
| Neuropathy | 69.39 | 0.88 | 0.74 | 0.80 | 0.559 |
| Neuropathy peripheral | 88.52 | 0.88 | 0.99 | 0.93 | 0.66 |
| Neurosis | 78.68 | 0.96 | 0.80 | 0.87 | 0.54 |
| Neurotoxicity | 79.23 | 0.95 | 0.82 | 0.88 | 0.41 |
| Optic neuritis | 80.87 | 0.94 | 0.84 | 0.89 | 0.42 |
| Peripheral motor neuropathy | 84.15 | 0.98 | 0.85 | 0.91 | 0.42 |
| Peripheral sensorimotor neuropathy | 97.81 | 0.99 | 0.98 | 0.98 | 0.49 |
| Peripheral sensory neuropathy | 73.77 | 0.97 | 0.74 | 0.84 | 0.57 |
| Polyneuropathy | 89.01 | 0.99 | 0.89 | 0.94 | 0.44 |
| Post herpetic neuralgia | 89.07 | 0.99 | 0.89 | 0.94 | 0.44 |
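The evaluation measures reported in the performance tables can be related to one another with a short sketch. It assumes the F-score is the usual harmonic mean of precision and recall (F1) and that the metrics derive from plain confusion-matrix counts; the paper does not state how the tool averaged per-class values, so this is a consistency check rather than a reproduction. The function names and the counts in the example are hypothetical.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1); 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy (%), precision, recall and F-score from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": 100.0 * (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f_score": f_score(precision, recall),
    }


# Hypothetical counts for one ADR test set (not taken from the paper).
example = classification_metrics(tp=85, fp=5, tn=3, fn=15)

# Consistency check against the mean precision (0.94) and recall (0.85)
# reported for the biological-feature models.
mean_f = f_score(0.94, 0.85)
print(round(mean_f, 2))
```

With precision 0.94 and recall 0.85 the harmonic mean rounds to 0.89, which agrees with the F-score quoted for the biological models.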

Modelling using chemical features

The 22 machine learning models were generated using 319 PubChem chemical substructure fingerprints for 913 drugs. The models were highly informative, having an accuracy of 93.20%, precision and recall values of 0.96 and 0.95 respectively, and an F-score of 0.95. Compared with the models trained using biological properties, these models were more predictive, with a greater mean value for all the parameters, indicating that chemical structure played a significant role in the prediction of drug ADRs. Table 4 provides the performances of the models generated using the chemical features.
Table 4

Provides the performances of the models generated using the chemical features.

| ADR event | Accuracy (%) | Precision | Recall | F-score | AUC |
|---|---|---|---|---|---|
| Arteritic anterior ischaemic optic neuropathy | 98.90 | 0.99 | 0.99 | 0.99 | 0.49 |
| Autonomic neuropathy | 99.45 | 0.99 | 1 | 0.99 | 0.50 |
| Nervous system disorder | 64.48 | 0.67 | 0.73 | 0.70 | 0.63 |
| Neuralgia | 94.58 | 0.97 | 0.92 | 0.94 | 0.95 |
| Neuritis | 93.81 | 0.95 | 0.92 | 0.94 | 0.93 |
| Neuritis retrobulbar | 99.45 | 0.99 | 1 | 0.99 | 0.5 |
| Neuroleptic malignant syndrome | 91.25 | 0.95 | 0.94 | 0.95 | 0.62 |
| Neurologic reaction | 97.81 | 0.99 | 0.98 | 0.98 | 0.74 |
| Neurological impairment | 95.62 | 0.96 | 0.99 | 0.97 | 0.49 |
| Neurological symptom | 96.17 | 0.97 | 0.98 | 0.98 | 0.49 |
| Neuromuscular block prolonged | 100 | 1 | 1 | 1 | 1 |
| Neuromyopathy | 99.45 | 0.99 | 1 | 0.99 | 0.50 |
| Neuropathy | 69.64 | 0.86 | 0.77 | 0.81 | 0.49 |
| Neuropathy peripheral | 91.92 | 0.94 | 0.89 | 0.91 | 0.92 |
| Neurosis | 92.89 | 0.97 | 0.95 | 0.96 | 0.62 |
| Neurotoxicity | 92.34 | 0.97 | 0.94 | 0.96 | 0.68 |
| Optic neuritis | 89.07 | 0.95 | 0.92 | 0.94 | 0.52 |
| Peripheral motor neuropathy | 95.62 | 0.98 | 0.96 | 0.97 | 0.48 |
| Peripheral sensorimotor neuropathy | 99.45 | 0.99 | 1 | 0.99 | 0.50 |
| Peripheral sensory neuropathy | 95.02 | 0.97 | 0.97 | 0.97 | 0.48 |
| Polyneuropathy | 95.08 | 0.98 | 0.96 | 0.97 | 0.6 |
| Post herpetic neuralgia | 98.36 | 1 | 0.98 | 0.99 | 0.99 |

Modelling using phenotypic features

Using 281 phenotypic properties, comprising 272 other SE and 9 indications, 22 SMO models were generated for the 22 neurological ADRs. The models were very informative, having an accuracy of 92.41%, precision of 0.97, recall of 0.93 and an F-score of 0.95. The models performed similarly to those built on chemical features but had significantly higher values (around a 10% increase in accuracy) than the models built on biological properties. Table 5 provides the performances of the models generated using the phenotypic features.
Table 5

Provides the performances of the models generated using the phenotypic features.

| ADR event | Accuracy (%) | Precision | Recall | F-score | AUC |
|---|---|---|---|---|---|
| Arteritic anterior ischaemic optic neuropathy | 99.45 | 0.99 | 1 | 0.99 | 0.50 |
| Autonomic neuropathy | 98.90 | 0.99 | 0.99 | 0.99 | 0.47 |
| Nervous system disorder | 87.43 | 0.92 | 0.84 | 0.88 | 0.87 |
| Neuralgia | 84.15 | 0.96 | 0.85 | 0.90 | 0.76 |
| Neuritis | 94.04 | 0.95 | 0.95 | 0.94 | 0.94 |
| Neuritis retrobulbar | 97.81 | 0.97 | 1 | 0.98 | 0.50 |
| Neuroleptic malignant syndrome | 90.71 | 0.96 | 0.93 | 0.95 | 0.66 |
| Neurologic reaction | 98.36 | 0.99 | 0.98 | 0.99 | 0.74 |
| Neurological impairment | 99.45 | 0.99 | 1 | 0.99 | 0.50 |
| Neurological symptom | 95.30 | 0.95 | 0.95 | 0.95 | 0.95 |
| Neuromuscular block prolonged | 98.90 | 0.99 | 0.99 | 0.99 | 0.49 |
| Neuromyopathy | 99.45 | 0.99 | 1 | 0.99 | 0.50 |
| Neuropathy | 74.31 | 0.92 | 0.76 | 0.83 | 0.67 |
| Neuropathy peripheral | 78.14 | 0.91 | 0.81 | 0.86 | 0.70 |
| Neurosis | 89.61 | 0.97 | 0.91 | 0.94 | 0.67 |
| Neurotoxicity | 85.24 | 0.98 | 0.86 | 0.91 | 0.71 |
| Optic neuritis | 83.06 | 0.93 | 0.88 | 0.9 | 0.54 |
| Peripheral motor neuropathy | 96.17 | 0.99 | 0.96 | 0.98 | 0.73 |
| Peripheral sensorimotor neuropathy | 99.45 | 0.99 | 1 | 0.99 | 0.50 |
| Peripheral sensory neuropathy | 90.71 | 0.98 | 0.91 | 0.95 | 0.75 |
| Polyneuropathy | 92.34 | 0.97 | 0.94 | 0.96 | 0.47 |
| Post herpetic neuralgia | 100 | 1 | 1 | 1 | 1 |

Modelling using the combination of two levels of biological, chemical and phenotypic properties

We generated the models by combining two levels of features: chemical + phenotypic, biological + chemical, and phenotypic + biological. We observed that the combination of two levels of features resulted in more accurate models, with the chemical + phenotypic combination models being the most accurate and extremely informative. The combined chemical + phenotypic models had an accuracy of 94.59%, precision of 0.96, recall of 0.95 and an F-score of 0.96 (Table 6). The phenotypic + biological models also performed well, having an accuracy of 92.96%, precision and recall values of 0.96 and 0.94 respectively, and an F-score of 0.95 (Table 7). The combined biological + chemical models were the least accurate among the three sets, with an accuracy of 91.47% and precision, recall and F-score values of 0.95, 0.93 and 0.94, respectively (Table 8).
Table 6

Provides the performances of the models generated using the Chemical + Phenotypic features.

| ADR event | Accuracy (%) | Precision | Recall | F-score | AUC |
|---|---|---|---|---|---|
| Arteritic anterior ischaemic optic neuropathy | 99.45 | 0.99 | 1 | 0.99 | 0.50 |
| Autonomic neuropathy | 99.45 | 0.99 | 1 | 0.99 | 0.50 |
| Nervous system disorder | 81.96 | 0.85 | 0.82 | 0.83 | 0.81 |
| Neuralgia | 86.88 | 0.93 | 0.91 | 0.92 | 0.62 |
| Neuritis | 88.52 | 0.94 | 0.93 | 0.93 | 0.61 |
| Neuritis retrobulbar | 99.45 | 0.99 | 1 | 0.99 | 0.50 |
| Neuroleptic malignant syndrome | 93.44 | 0.96 | 0.96 | 0.96 | 0.68 |
| Neurologic reaction | 98.9 | 0.99 | 0.99 | 0.99 | 0.74 |
| Neurological impairment | 99.45 | 0.99 | 1 | 0.99 | 0.50 |
| Neurological symptom | 96.72 | 0.97 | 0.98 | 0.98 | 0.49 |
| Neuromuscular block prolonged | 99.45 | 0.99 | 1 | 0.99 | 0.50 |
| Neuromyopathy | 99.45 | 0.99 | 1 | 0.99 | 0.50 |
| Neuropathy | 98.18 | 0.98 | 0.97 | 0.98 | 0.98 |
| Neuropathy peripheral | 76.5 | 0.87 | 0.83 | 0.85 | 0.61 |
| Neurosis | 92.89 | 0.97 | 0.94 | 0.96 | 0.68 |
| Neurotoxicity | 89.61 | 0.97 | 0.91 | 0.94 | 0.67 |
| Optic neuritis | 91.25 | 0.97 | 0.93 | 0.95 | 0.65 |
| Peripheral motor neuropathy | 98.36 | 0.98 | 0.99 | 0.99 | 0.49 |
| Peripheral sensorimotor neuropathy | 99.45 | 0.99 | 1 | 0.99 | 0.50 |
| Peripheral sensory neuropathy | 95.08 | 0.97 | 0.97 | 0.97 | 0.48 |
| Polyneuropathy | 97.26 | 0.97 | 0.99 | 0.98 | 0.49 |
| Post herpetic neuralgia | 99.45 | 1 | 0.99 | 0.99 | 0.99 |

Table 7

Provides the performances of the models generated using the Biological + Phenotypic features.

| ADR event | Accuracy (%) | Precision | Recall | F-score | AUC |
|---|---|---|---|---|---|
| Arteritic anterior ischaemic optic neuropathy | 99.45 | 0.99 | 1 | 0.99 | 0.50 |
| Autonomic neuropathy | 98.9 | 0.99 | 0.99 | 0.99 | 0.49 |
| Nervous system disorder | 84.69 | 0.89 | 0.82 | 0.86 | 0.85 |
| Neuralgia | 86.33 | 0.96 | 0.88 | 0.92 | 0.74 |
| Neuritis | 86.33 | 0.93 | 0.91 | 0.92 | 0.59 |
| Neuritis retrobulbar | 99.45 | 0.99 | 1 | 0.99 | 0.50 |
| Neuroleptic malignant syndrome | 94.53 | 0.97 | 0.97 | 0.97 | 0.75 |
| Neurologic reaction | 99.45 | 0.99 | 1 | 0.99 | 0.5 |
| Neurological impairment | 99.45 | 0.99 | 1 | 0.99 | 0.50 |
| Neurological symptom | 92.34 | 0.97 | 0.94 | 0.96 | 0.47 |
| Neuromuscular block prolonged | 98.9 | 0.99 | 1 | 0.99 | 0.49 |
| Neuromyopathy | 99.45 | 0.99 | 1 | 0.99 | 0.50 |
| Neuropathy | 75.4 | 0.91 | 0.78 | 0.84 | 0.66 |
| Neuropathy peripheral | 80.32 | 0.91 | 0.84 | 0.87 | 0.72 |
| Neurosis | 92.89 | 0.97 | 0.95 | 0.96 | 0.62 |
| Neurotoxicity | 88.52 | 0.97 | 0.9 | 0.93 | 0.66 |
| Optic neuritis | 90.16 | 0.96 | 0.93 | 0.94 | 0.59 |
| Peripheral motor neuropathy | 96.17 | 0.99 | 0.96 | 0.98 | 0.73 |
| Peripheral sensorimotor neuropathy | 91.8 | 0.98 | 0.92 | 0.95 | 0.76 |
| Peripheral sensory neuropathy | 99.45 | 0.99 | 1 | 0.99 | 0.5 |
| Polyneuropathy | 91.8 | 0.98 | 0.93 | 0.95 | 0.59 |
| Post herpetic neuralgia | 99.45 | 1 | 0.99 | 0.99 | 0.5 |

Table 8

Provides the performances of the models generated using the Biological + Chemical features.

| ADR event | Accuracy (%) | Precision | Recall | F-score | AUC |
|---|---|---|---|---|---|
| Arteritic anterior ischaemic optic neuropathy | 98.9 | 0.99 | 0.99 | 0.99 | 0.49 |
| Autonomic neuropathy | 99.45 | 0.99 | 1 | 0.99 | 0.50 |
| Nervous system disorder | 62.84 | 0.64 | 0.76 | 0.69 | 0.60 |
| Neuralgia | 83.6 | 0.92 | 0.89 | 0.90 | 0.54 |
| Neuritis | 85.79 | 0.94 | 0.89 | 0.92 | 0.62 |
| Neuritis retrobulbar | 99.45 | 0.99 | 1 | 0.99 | 0.50 |
| Neuroleptic malignant syndrome | 91.25 | 0.95 | 0.94 | 0.95 | 0.62 |
| Neurologic reaction | 98.36 | 0.99 | 0.98 | 0.99 | 0.74 |
| Neurological impairment | 99.45 | 0.99 | 1 | 0.99 | 0.50 |
| Neurological symptom | 94.53 | 0.97 | 0.96 | 0.97 | 0.48 |
| Neuromuscular block prolonged | 100 | 1.00 | 1.00 | 1.00 | 1.00 |
| Neuromyopathy | 99.45 | 0.99 | 1 | 0.99 | 0.50 |
| Neuropathy | 71.58 | 0.86 | 0.79 | 0.82 | 0.5 |
| Neuropathy peripheral | 71.58 | 0.88 | 0.75 | 0.81 | 0.62 |
| Neurosis | 91.25 | 0.96 | 0.94 | 0.95 | 0.54 |
| Neurotoxicity | 91.8 | 0.97 | 0.94 | 0.95 | 0.61 |
| Optic neuritis | 87.97 | 0.96 | 0.9 | 0.93 | 0.57 |
| Peripheral motor neuropathy | 97.26 | 0.98 | 0.98 | 0.98 | 0.49 |
| Peripheral sensorimotor neuropathy | 99.45 | 0.99 | 1 | 0.99 | 0.50 |
| Peripheral sensory neuropathy | 95.62 | 0.97 | 0.98 | 0.97 | 0.49 |
| Polyneuropathy | 94.53 | 0.97 | 0.96 | 0.97 | 0.48 |
| Post herpetic neuralgia | 98.36 | 1 | 0.98 | 0.99 | 0.99 |

Modelling using the combined biological, chemical and phenotypic properties

The three levels of features, biological (78), chemical (319) and phenotypic (281), were combined to create a dataset of 678 properties in total. The learned model systems generated had an accuracy of 94.18%, precision and recall of 0.96 and 0.96 respectively, and an F-score that was also 0.96. Table 9 provides the performances of the models generated using the combination of the three levels of features, biological, chemical, and phenotypic.
Table 9

Provides the performances of the models generated using the combination of the three levels of features, biological, chemical, and phenotypic.

| ADR event | Accuracy (%) | Precision | Recall | F-score | AUC |
|---|---|---|---|---|---|
| Arteritic anterior ischaemic optic neuropathy | 100 | 1 | 1 | 1 | 1 |
| Autonomic neuropathy | 99.45 | 0.99 | 1 | 0.99 | 0.50 |
| Nervous system disorder | 79.78 | 0.83 | 0.80 | 0.82 | 0.79 |
| Neuralgia | 85.24 | 0.93 | 0.89 | 0.91 | 0.61 |
| Neuritis | 90.71 | 0.94 | 0.95 | 0.95 | 0.62 |
| Neuritis retrobulbar | 99.45 | 0.99 | 1 | 0.99 | 0.50 |
| Neuroleptic malignant syndrome | 93.98 | 0.96 | 0.97 | 0.96 | 0.68 |
| Neurologic reaction | 99.45 | 0.99 | 1 | 0.99 | 0.75 |
| Neurological impairment | 99.45 | 0.99 | 1 | 0.99 | 0.50 |
| Neurological symptom | 96.17 | 0.97 | 0.98 | 0.98 | 0.49 |
| Neuromuscular block prolonged | 99.45 | 0.99 | 1 | 0.99 | 0.50 |
| Neuromyopathy | 99.45 | 0.99 | 1 | 0.99 | 0.50 |
| Neuropathy | 79.23 | 0.89 | 0.86 | 0.87 | 0.58 |
| Neuropathy peripheral | 81.42 | 0.89 | 0.88 | 0.88 | 0.67 |
| Neurosis | 93.98 | 0.97 | 0.96 | 0.96 | 0.69 |
| Neurotoxicity | 92.89 | 0.97 | 0.94 | 0.96 | 0.68 |
| Optic neuritis | 93.44 | 0.97 | 0.96 | 0.96 | 0.66 |
| Peripheral motor neuropathy | 97.81 | 0.98 | 0.98 | 0.98 | 0.49 |
| Peripheral sensorimotor neuropathy | 99.45 | 0.99 | 1 | 0.99 | 0.50 |
| Peripheral sensory neuropathy | 95.02 | 0.97 | 0.97 | 0.97 | 0.48 |
| Polyneuropathy | 96.72 | 0.98 | 0.98 | 0.98 | 0.61 |
| Post herpetic neuralgia | 99.45 | 1 | 0.99 | 0.99 | 0.99 |

Case study on anti-Alzheimer drugs

In the present study, the three FDA-approved drugs against Alzheimer's disease, namely Donepezil (DrugBank ID: DB00843), Galantamine (DrugBank ID: DB00674) and Memantine (DrugBank ID: DB01043), were removed before the generation of the models. The data for these three drugs were used as a control to assess the predictive capacity and performance of the models, in addition to the statistical analysis. According to the SIDER database, Donepezil is associated with the ADRs Neuralgia and Nervous system disorder (NSD). The Neuralgia and NSD models generated using chemical features predicted both ADRs to be associated with Donepezil. SIDER lists Neuropathy peripheral (NP) and NSD as side effects of Galantamine, and the same was predicted by the NP and NSD models generated using the chemical, phenotypic and combined three-level features. Memantine has been linked to all three ADRs (Neuralgia, NSD and NP) according to the SIDER database. However, the NSD models built using phenotypic and combined features predicted NSD to be related to Memantine. The ADRs neuritis and optic neuritis were predicted to be associated with Donepezil by the optic neuritis model generated using biological, chemical and combined features. Various studies have reported a correlation between neuritis, optic neuritis and Alzheimer's disease[12, 13]. The above results are a clear indication of the accuracy and predictive ability of the generated models for the 22 neurological ADRs.

Prediction on drugs having no information in SIDER

To extend the applicability of the generated SMO models for neurological ADRs, we predicted the ADRs for 103 DrugBank drugs having no information in SIDER. We found that all the models predicted NSD as one of the ADRs associated with most of the drugs. The top ADRs associated with the drugs included NSD, neuralgia, neurotoxicity, neuroleptic malignant syndrome, peripheral sensory neuropathy and neuropathy. The biological-properties NSD model predicted NSD to be linked to 45 drugs, the chemical-properties NSD model predicted it to be associated with 44 drugs and the combined-feature NSD model found NSD to be connected with 15 drugs. No drugs were predicted to have neurological impairment (NI) as an ADR, except for 1 drug predicted by the chemical-features NI model. To add relevance to our preliminary findings, we conducted an extensive literature search for associations between the drugs and the side effects predicted by our models. According to a report from the WHO library, Mefloquine was found to be related to various central nervous system adverse events, which include major psychiatric disorders and symptoms, neurosis, neuropathies and various other neurological disorders[14]. High doses of cyanocobalamin are known to have possible associations with adverse neurological disorders[15]. Administration of quinolones might result in central nervous system events such as neurotoxicity, and neurological ADRs have been ranked as the second most common group of ADRs associated with drugs of this class[16]. Serious central nervous system adverse events were found to be related to the drug Sulindac[17]. Tetracyclines have been associated with neurotoxicity and neuromuscular blockage in addition to other neurotoxic events[18]. 
Irinotecan in combination with oxaliplatin induced various neurologic complications[19], treatment with amiodarone induced polyneuropathy and other neurological complications[20], severe axonal neuropathy and sensorimotor neuropathy were observed following treatment with arsenic trioxide[21], and a 14.3% incidence of serious neurological side effects was observed on administration of bromocriptine[22]. Mild neurologic adverse events were detected on treatment with docetaxel[23], severe neuropsychiatric manifestations were found to be associated with azithromycin[24], nitrofurantoin was reported to cause sensorimotor polyneuropathy when used in children[25], cases of neurosensory adverse effects were observed on treatment with phenylbutazone[26], and use of cocaine[27], paclitaxel[28] and tacrolimus[29] is associated with severe neurotoxicity. Adverse neurological side effects and nervous system disorders were observed in mice on treatment with lopinavir[30]. A major life-threatening neurological adverse event was observed following administration of vilazodone[31].

External dataset validation

Considering the applicability domain as well as the performance of the generated models, the machine learning models were evaluated on 16383 MyriaScreen compounds obtained from Sigma-Aldrich. The most common side effects predicted included neuropathy peripheral, NSD, neuralgia, neuritis, neuropathy and neuroleptic malignant syndrome (NMS). NSD was predicted for 1280 compounds by the combined properties model and for 6843 compounds by the chemical properties model. NMS was predicted for all the compounds by the biological features model, for 344 compounds by the combined features model and for 953 compounds by the chemical features model. The ADRs that were not predicted to be associated with any of the compounds were autonomic neuropathy, neuromuscular block prolonged and neurological impairment. The results were very similar to those obtained on testing the models on the uncharacterized drugs having no side effect information in SIDER.

Discussion

The present study proposes a rigorous, exhaustive and integrative computational protocol to generate machine learning models using biological, chemical and phenotypic properties of drugs for the prediction of neurological ADRs. In this study, a total of 176 machine learning SMO models were generated using biological (targets, transporters and enzymes), chemical (substructures) and phenotypic (SE and indications) properties for 22 neurological ADRs. To find the most important, high-quality attributes, we employed a relief-based feature selection algorithm, which reduced both the complexity of the dataset and the computational time involved. We further employed the SMOTE method on the training set to handle the imbalance in the dataset; SMOTE works by generating synthetic examples of the minority class. Among the three types of features and their combination, the phenotypic features appeared to be the most informative, followed by the chemical features, as compared to the biological features. Upon addition of the chemical and phenotypic data to the biological data, the performance of the models improved significantly, with accuracy rising from 82.11% to 94.18%, recall from 0.85 to 0.96 and F-score from 0.89 to 0.96. However, the overall performance of the models generated using the three levels of features was similar to that of the chemical and phenotypic features alone. This indicates that the chemical and phenotypic data of drugs were the most predictive for ADR prediction. We also generated models using combinations of two levels of features: chemical + phenotypic, biological + chemical, and phenotypic + biological. We observed that the combination models performed better than the models generated using one type of feature, with the chemical + phenotypic models being the most accurate. 
Furthermore, to demonstrate the predictive power and validate the accuracy of the generated models, the models were tested on anti-Alzheimer drugs and on drugs with no SE information available in the SIDER database. We found that the generated models were highly accurate and predictive. Overall, the present study clearly delineates the potential of data integration approaches in predicting clinically important ADRs prior to clinical trials.

Methodology

Data extraction and dataset construction

The present study was performed on the approved drugs obtained from DrugBank[32] database which is a freely accessible comprehensive bioinformatics resource of drugs, their targets, structure and pathways.

Side-effect datasets

The information about the drug side-effects was obtained from SIDER[4] database version 4.1. SIDER (side effect resource) is a publicly available resource that contains information about the medicines existing in the market place and their recorded ADRs. As of October 2015, SIDER includes information about 1430 drugs and 5868 side effect keywords. In the present study, the entire SIDER database was downloaded and information about side effects was extracted. SIDER employs STITCH compound ids from which PubChem compound IDs (CID) can be obtained as mentioned in this rule (ftp://xi.embl.de/SIDER/2015-10-21/, Accessed April 2, 2016). The 1991 approved drugs obtained from DrugBank were mapped to the SIDER database using PubChem CIDs and the corresponding side-effects and therapeutic indications were obtained directly. A total of 933 drugs were successfully mapped to their respective DrugBank Ids which constituted the final dataset of 933 drugs, 5462 SE and 3046 therapeutic indications. Finally, each of the 933 drugs was represented as a binary matrix, the elements of which encoded the presence or absence of each of the 5462 SE and 3046 therapeutic indications. In each of 5462 and 3046 dimensional binary matrix, the entry 1 indicated the presence of the SE or therapeutic indication whereas the entry 0 indicated their absence.
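The binary encoding described above can be sketched in a few lines. This is a minimal illustration and not the authors' pipeline: the drug names, the term sets and the function name `build_binary_matrix` are hypothetical, and the real matrices have 5462 SE columns and 3046 indication columns.

```python
def build_binary_matrix(drug_terms: dict) -> tuple:
    """Encode each drug as a 0/1 row over the sorted vocabulary of terms.

    drug_terms maps a drug identifier to the set of side effects (or
    therapeutic indications) recorded for it.
    """
    columns = sorted({term for terms in drug_terms.values() for term in terms})
    rows = {
        drug: [1 if column in terms else 0 for column in columns]
        for drug, terms in drug_terms.items()
    }
    return columns, rows


# Hypothetical toy input: three drugs, two side-effect terms.
side_effects = {
    "drug_A": {"Neuralgia", "Nausea"},
    "drug_B": {"Nausea"},
    "drug_C": set(),
}
columns, rows = build_binary_matrix(side_effects)
for drug, row in rows.items():
    print(drug, dict(zip(columns, row)))
```

A 1 in a row marks the presence of the term for that drug and a 0 its absence, matching the encoding stated in the text.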

Chemical structure dataset

After mapping to the SIDER database, we obtained the chemical structure information for 933 drugs and used PaDEL[33] software to generate the PubChem[34] substructure fingerprints resulting in 881 chemical substructure fingerprints for 928 drugs. To this end, we had an 881 dimensional binary matrix, the elements, 1 or 0, of which corresponded to the presence or absence of the corresponding fingerprint respectively, for each of the 928 drugs.

DrugBank data

The final 928 approved drugs were mapped to the DrugBank database from which information about the protein targets, transporters and enzymes was directly retrieved. To obtain such information, the DrugBank provided UniProt[35] IDs were used and we extracted information about 954 protein targets, 87 transporters and 168 enzymes. As mentioned for the chemical structure dataset, we had a binary matrix the elements of which were either 1 or 0 indicating the presence or absence of a particular target (954), transporter (87) or enzyme (168) respectively, for each of the 928 approved drugs. In conclusion, the phenotypic properties of the 928 drugs consisted of SE and therapeutic indications obtained from SIDER, the chemical properties were denoted by the PubChem fingerprints and the biological properties were constituted by drug protein targets, transporters and enzymes. Finally, in the resulting comma separated value (csv) files consisting of biological, chemical, phenotypic and the combination of the three features, a column named Outcome was appended which had a ‘Yes’ or ‘No’ value if a particular SE was associated with a drug or not.
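As a rough sketch of the final dataset layout, the feature blocks can be concatenated column-wise and a Yes/No `Outcome` column appended per ADR. Everything named here is an assumption for illustration: the column names, the toy rows and the helper `write_dataset` are hypothetical, and the paper does not describe the exact file layout beyond the `Outcome` column.

```python
import csv
import io


def write_dataset(feature_blocks: list, drugs_with_adr: set, handle) -> None:
    """Write one csv row per drug: all feature blocks, then the Outcome label.

    feature_blocks is a list of (column_names, {drug: values}) pairs that
    share the same drug identifiers.
    """
    header = [name for names, _ in feature_blocks for name in names]
    writer = csv.writer(handle)
    writer.writerow(header + ["Outcome"])
    for drug in sorted(feature_blocks[0][1]):
        values = [v for _, rows in feature_blocks for v in rows[drug]]
        writer.writerow(values + ["Yes" if drug in drugs_with_adr else "No"])


# Hypothetical toy blocks: one biological column and two chemical columns.
biological = (["target_T1"], {"drug_A": [1], "drug_B": [0]})
chemical = (["fp_1", "fp_2"], {"drug_A": [0, 1], "drug_B": [1, 1]})

buffer = io.StringIO()
write_dataset([biological, chemical], {"drug_A"}, buffer)
lines = buffer.getvalue().splitlines()
print("\n".join(lines))
```

One such file would be written per ADR, since the `Outcome` label depends on which side effect is being modelled.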

Chemical structure similarity measurement

We computed the Tanimoto coefficient (TC) between the drugs using the ChemmineR package for the R language[36]. ChemmineR converts chemical structures in the structure-data file (SDF) format to atom pair fingerprints, and the obtained fingerprints are used for the similarity calculation. Drugs whose chemical structures had a Tanimoto similarity coefficient greater than the 0.75 cut-off were considered structurally similar and were removed from the dataset, resulting in the final set of 926 drugs.
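The similarity filter can be illustrated with a small sketch. It is written in plain Python rather than ChemmineR, treats each fingerprint as a set of feature identifiers, and uses the Tanimoto coefficient TC = |A ∩ B| / |A ∪ B|. The paper does not say which member of a similar pair was discarded, so the greedy keep-the-first rule below, the function names and the toy fingerprints are all assumptions.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of features."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def drop_similar(fingerprints: dict, cutoff: float = 0.75) -> list:
    """Greedily keep a drug only if its TC to every kept drug is <= cutoff."""
    kept = []
    for name, fp in fingerprints.items():
        if all(tanimoto(fp, fingerprints[other]) <= cutoff for other in kept):
            kept.append(name)
    return kept


# Hypothetical fingerprints: d1 and d2 share 4 of 5 features (TC = 0.8).
fingerprints = {
    "d1": {1, 2, 3, 4},
    "d2": {1, 2, 3, 4, 5},
    "d3": {7, 8},
}
kept = drop_similar(fingerprints)
print(kept)
```

In this toy case `d2` exceeds the 0.75 cut-off against `d1` and is dropped, while `d3` shares no features with either and is retained.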

Relief-based features extraction

The drug molecules having uniform values for all the features, biological, chemical and phenotypic, were removed using the RemoveUseless filter available in Weka[37], which is a machine learning platform. The resultant dataset was then split into an 80% training set and a 20% test set using a custom Perl script; the training data were used to generate the predictive models and the test set was used for model evaluation. The test set was kept as completely held-out data during feature selection: to avoid bias, feature selection was performed on the training sets only, after which the models were generated using the training sets and evaluated on the test sets. Further, the relief-based feature selection technique from Weka, in combination with ranker search, was employed to identify the features contributing significantly towards the ADR prediction task. The feature selection process also reduces the complexity of the dataset and the processing time required. ReliefAttributeEval is one of the most successful and widely used techniques for evaluating features based on their quality[38]. The algorithm assesses the effectiveness of a feature by repeatedly sampling an instance and considering the value of the given feature based on the one-nearest-neighbour classifier[39]. The basic idea of the relief feature selection algorithm is that it repeatedly estimates the weights of the features of an instance on the basis of their ability to discriminate between neighbouring instances. The weight of a feature decreases if it differs from the same feature in neighbouring instances of the same class more than in neighbouring instances of the other class. After several iterations, the features with relevance greater than the threshold are selected[38]. The Ranker search method, which ranks the features based on their individual evaluations, was used along with ReliefAttributeEval. 
We also investigated other feature selection algorithms to select the important attributes, including gain-ratio based attribute evaluation, the oneR algorithm, chi-square based selection, the filtered attribute evaluator, information gain-based attribute evaluation and best first attribute selection. However, most of these feature selection algorithms gave the same ranking to all the attributes as we obtained with relief-based selection. A few of the selection algorithms did not give any ranking to the features. The BestFirst method gave 9 biological, 12 phenotypic and 4 chemical features as significantly relevant, which was a very small number of attributes and would have meant discarding almost all of the features. Thus the feature selection in the present study was carried out at two levels, initially using the RemoveUseless algorithm followed by relief-based feature selection.

SMOTE for handling data imbalance

A dataset is considered imbalanced if one class is over-represented while the other is under-represented. Since not all the drugs were associated with many SEs, the dataset was highly imbalanced; to balance the majority and minority classes, the SMOTE[11] method available in Weka was applied to the training sets. SMOTE is an oversampling technique in which the under-represented (minority) class is balanced by creating synthetic examples, after which the data is resampled. The minority class is over-sampled by taking each instance of this class, computing Euclidean distances to find its k nearest neighbours within the minority class, and then introducing synthetic instances. Neighbours are chosen at random from the k nearest neighbours, depending on the amount of over-sampling required. In the present study the number of nearest neighbours was kept at its default value of 5. To generate a synthetic example, the difference between the input vector under consideration and its nearest neighbour is multiplied by a random number and added to the input vector under consideration[40]. Table 10 gives the number of instances obtained after applying SMOTE for each of the 22 neurological ADRs. Supplementary Table 2 lists the percentages by which the under-represented class was over-sampled using SMOTE.
Table 10

Number of training data instances before and after applying SMOTE for each of the 22 neurological ADRs.

| Neurological ADR | After SMOTE: positive outcome (Yes) | After SMOTE: negative outcome (No) | Before SMOTE: positive outcome (Yes) | Before SMOTE: negative outcome (No) |
|---|---|---|---|---|
| Arteritic anterior ischaemic optic neuropathy | 628 | 730 | 3 | 730 |
| Autonomic neuropathy | 628 | 731 | 2 | 731 |
| Nervous system disorder | 612 | 731 | 2 | 731 |
| Neuralgia | 627 | 675 | 57 | 676 |
| Neuritis | 616 | 676 | 56 | 677 |
| Neuritis retrobulbar | 628 | 731 | 2 | 731 |
| Neuroleptic malignant syndrome | 640 | 693 | 40 | 693 |
| Neurologic reaction | 660 | 728 | 2 | 731 |
| Neurological impairment | 628 | 731 | 2 | 731 |
| Neurological symptom | 644 | 718 | 14 | 719 |
| Neuromuscular block prolonged | 612 | 731 | 731 | 2 |
| Neuromyopathy | 628 | 731 | 731 | 2 |
| Neuropathy | 576 | 637 | 96 | 637 |
| Neuropathy peripheral | 585 | 616 | 117 | 616 |
| Neurosis | 702 | 706 | 27 | 706 |
| Neurotoxicity | 702 | 706 | 27 | 706 |
| Optic neuritis | 609 | 705 | 29 | 704 |
| Peripheral motor neuropathy | 594 | 724 | 3 | 730 |
| Peripheral sensorimotor neuropathy | 628 | 731 | 2 | 731 |
| Peripheral sensory neuropathy | 620 | 713 | 20 | 713 |
| Polyneuropathy | 656 | 717 | 16 | 717 |
| Post herpetic neuralgia | 628 | 730 | 3 | 730 |

Additionally, we generated models using the imbalanced data as input, without applying SMOTE. The results are provided in Supplementary Table 3. They were very similar for all the generated models across all feature types: biological, chemical, phenotypic and merged.
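The SMOTE interpolation step described in this section can be sketched as follows. This is a minimal pure-Python illustration, not Weka's SMOTE filter: the function name, the `per_instance` parameter, the fixed random seed and the toy minority points are all assumptions introduced for the example.

```python
import math
import random


def smote(minority, per_instance, k=5, seed=0):
    """Generate `per_instance` synthetic points for each minority instance.

    Each synthetic point lies on the line segment between an instance and
    one of its k nearest minority-class neighbours (Euclidean distance).
    """
    rng = random.Random(seed)
    synthetic = []
    for i, x in enumerate(minority):
        others = [m for j, m in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda m: math.dist(x, m))[:k]
        for _ in range(per_instance):
            nn = rng.choice(neighbours)   # random neighbour among the k nearest
            gap = rng.random()            # random number in [0, 1)
            # x + gap * (neighbour - x), applied feature by feature
            synthetic.append([a + gap * (b - a) for a, b in zip(x, nn)])
    return synthetic


# Hypothetical minority class with three instances and two features.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
synthetic = smote(minority, per_instance=2)
print(len(synthetic), synthetic[0])
```

Because every synthetic point is an interpolation between two real minority instances, the new points stay inside the region already occupied by the minority class.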

Predictive modelling

During the generation of predictive models, the neurological ADR prediction task was treated as a binary classification problem in which each drug molecule was considered either to cause a particular ADR (labelled Yes) or not (labelled No). For the biological, chemical, phenotypic and combined features of the 22 neurological ADRs, a total of 176 predictive classifier models were generated using Sequential Minimal Optimization (SMO), an algorithm for training Support Vector Machines (SVM), available in Weka. SVMs have been widely used for classic binary classification problems owing to their ability to handle large training sets and their generally fast computation time[41-43]. SMO operates iteratively by breaking the large quadratic programming (QP) problem into a series of smaller sub-problems that are solved systematically[44]. An SVM is a discriminative classifier that finds a hyperplane separating the positive instances from the negative ones with the maximum margin, that is, with the widest possible gap between the two classes; new instances are then categorized according to the side of the hyperplane on which they fall. This describes the linear classification problem; for non-linear classification, SVM additionally uses the kernel method, which implicitly maps the input into a feature space where a linear separation is possible[44]. Default parameters were used for SMO, namely PolyKernel as the kernel type and a complexity parameter (c) of 1.0. The predictive models were generated using the SMOTE-balanced training sets, and 10-fold cross validation was used in the present study.
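To make the role of the kernel concrete, the sketch below evaluates an SVM decision function with a polynomial kernel on an XOR-like toy problem that no straight line can separate. It does not reproduce Weka's SMO or the settings used in the study: the degree-2 kernel with an added constant, the four toy points and the hand-picked multipliers (chosen so that every training point has a decision value of exactly +1 or -1) are assumptions for illustration only.

```python
def poly_kernel(x, z, degree=2):
    """Polynomial kernel (1 + <x, z>) ** degree."""
    return (1.0 + sum(a * b for a, b in zip(x, z))) ** degree


def decision(x, support_vectors, labels, alphas, bias=0.0):
    """SVM decision value: sum_i alpha_i * y_i * K(x_i, x) + b."""
    return sum(
        alpha * label * poly_kernel(sv, x)
        for sv, label, alpha in zip(support_vectors, labels, alphas)
    ) + bias


# XOR-like toy problem: points on the same diagonal share a class.
support_vectors = [(1, 1), (-1, -1), (1, -1), (-1, 1)]
labels = [-1, -1, 1, 1]
alphas = [0.125] * 4  # hand-picked multipliers (illustrative assumption)

scores = [decision(x, support_vectors, labels, alphas) for x in support_vectors]
predictions = [1 if s > 0 else -1 for s in scores]
print(scores, predictions)
```

With a purely linear kernel no choice of multipliers classifies all four points correctly, whereas the degree-2 kernel separates them cleanly.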

Evaluation measures for predictive models

A total of 176 machine learning models were generated for the 22 neurological ADRs and evaluated using the receiver operating characteristic (ROC), accuracy, precision, recall and F-measure. The ROC curve is a plot of the true positive rate (also called sensitivity or recall) against the false positive rate (1 - specificity). The true positive rate (TPR = TP/(TP + FN)) is the proportion of actual positives that are correctly identified, while the false positive rate (FPR = FP/(FP + TN)) is the proportion of actual negatives that are incorrectly identified as positives. Accuracy (Q) is the proportion of correctly identified instances (Q = (TP + TN)/(TP + TN + FP + FN)). Precision (P) is the fraction of correctly identified positives among all predicted positives (P = TP/(TP + FP)). The performance of the 176 models for the 22 neurological ADRs was averaged for each class of properties: biological, chemical, phenotypic and a combination of all three.
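The measures above follow directly from the four confusion-matrix counts, as the short Python sketch below shows. The counts are hypothetical and are not results from the study; the F-measure, which the text names without defining, is computed here under the standard definition as the harmonic mean of precision and recall.

```python
def evaluate(tp, fp, tn, fn):
    """Compute the evaluation measures from confusion-matrix counts."""
    recall = tp / (tp + fn)                     # true positive rate (sensitivity)
    fpr = fp / (fp + tn)                        # false positive rate = 1 - specificity
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    # Standard F-measure: harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return {
        "recall": recall,
        "fpr": fpr,
        "accuracy": accuracy,
        "precision": precision,
        "f_measure": f_measure,
    }


# Hypothetical counts for one model on one test set.
metrics = evaluate(tp=40, fp=10, tn=40, fn=10)
print(metrics)
```

A ROC curve is obtained by recomputing the (FPR, TPR) pair at each decision threshold of the classifier and plotting the resulting points.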