Frank P Y Lin, Stephen Anthony, Thomas M Polasek, Guy Tsafnat, Matthew P Doogue.
Abstract
BACKGROUND: The identification of drug characteristics is a clinically important task, but it requires much expert knowledge and consumes substantial resources. We have developed a statistical text-mining approach (BInary Characteristics Extractor and biomedical Properties Predictor: BICEPP) to help experts screen drugs that may have important clinical characteristics of interest.
Year: 2011 PMID: 21510898 PMCID: PMC3110144 DOI: 10.1186/1471-2105-12-112
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
The drug characteristics and their data sources
| Data source | Characteristic category | Examples | Number of characteristics | Median number of drugs associated with the characteristic |
|---|---|---|---|---|
| Australian Medicines Handbook (AMH) | Major drug classes | Gastrointestinal drugs | 20 | 37 |
| | Minor drug classes | 5HT3 antagonists, Benzodiazepines, Carbapenems | 197 | 3 |
| | Adverse events | Anorexia, EPSE, Hyperuricaemia, Increased liver enzymes, Nephrotoxicity | 238 | 22 |
| Pharmacokinetic Interaction Screening tool (PKIS) | Perpetrators* | CYP1A2 inducers, CYP3A inducers, CYP3A inhibitors (moderate), CYP3A inhibitors (strong) | 15 | 5 |
| | Narrow therapeutic index drugs | Alkylating agents, Anticonvulsants, Immunosuppressants | 14 | 5.5 |
| Total | | | 484 | |
Abbreviations: CYP = cytochromes P450. EPSE = extra-pyramidal side-effects. Note: *) Perpetrators: perpetrators are drugs that are capable of altering the concentration of another drug with ≥ 2-fold change via the hepatic CYP enzyme system.
Figure 1. The workflow of BICEPP and the evaluation procedure. A. The procedures of feature derivation and feature selection. The inputs to the machine learning classifiers are the CDFs of the 20 most predictive tokens. The CDF of a token, given a drug, is defined as the proportion of abstracts containing the token within the list of abstracts retrieved by using the drug name as the query to search MEDLINE. B. Cross-validation was performed to estimate the generalisation performance of BICEPP. The feature selection described in (A) was performed on the training set (which contains k-1 folds of data), and machine learning models were built to predict the test set data. This figure illustrates the 5 × stratified up-to-10-fold cross-validation procedure used throughout the evaluation experiments in this paper. Abbreviations: AMH: Australian Medicines Handbook; AWT: abstract with title; AUC: area under ROC curve; CDF: conditional document frequency using the drug name as query to search MEDLINE; ROC: receiver operating characteristics.
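The CDF definition in the caption can be illustrated with a minimal sketch. The whitespace tokenizer, the function name `conditional_document_frequency`, and the toy abstracts are assumptions made for illustration only; the actual tokenization and retrieval steps used by BICEPP are not described here.

```python
from collections import Counter


def conditional_document_frequency(abstracts):
    """Return {token: proportion of abstracts that contain the token}.

    `abstracts` stands in for the list of abstracts retrieved by querying
    MEDLINE with one drug name (hypothetical input for this sketch).
    """
    if not abstracts:
        return {}
    counts = Counter()
    for text in abstracts:
        # A token is counted at most once per abstract (document frequency).
        counts.update(set(text.lower().split()))
    n = len(abstracts)
    return {token: c / n for token, c in counts.items()}


# Toy abstracts, invented for illustration.
toy_abstracts = [
    "warfarin anticoagulation and bleeding risk",
    "warfarin dosing guided by inr",
    "bleeding events under heparin",
    "warfarin warfarin interactions",
]
cdf = conditional_document_frequency(toy_abstracts)
print(cdf["warfarin"], cdf["bleeding"], cdf["inr"])
```

Repeating a token inside one abstract does not raise its CDF, which is what separates a document frequency from a term frequency.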
Examples of tokens eliminated and retained during the feature selection process on drug "warfarin"
| r | Examples of tokens |
|---|---|
| ≈ 1 | only, when, however, well, another, same, results, other, observed, possible, different, since, even, could, though, occurring, therefore, high, although, also, both, so, result, appeared |
| ≈ 0.90 | restricted, controls, implicated, followed, diverse, stable, display, rate, plays, indicative, inhibit, typically, describe, excluded, terminal, excessive, largest, knowledge, employing, se |
| ≈ 0.80 | life, mature, loading, preincubation, problem, failure, binds, resolved, physiology, shock, signs, molecule, bind, elevations, chinese, usual, surface, aid, unit, accurate |
| ≈ 0.70 | intervention, stimulus, transition, closed, enable, bands, requiring, ester, nervous, sizes, electrophoresis, polymorphonuclear, aging, associations, accounts, practical, selective, choice, routine, attached |
| ≈ 0.60 | subset, undergoes, success, antagonist, artery, mr, depolarization, fields, suppression, precipitation, temperatures, records, mg2, adjustment, oxygen, picture, assembly, transcripts, encoded, organic |
| ≈ 0.50 | hydrogen, coated, glycol, antisense, coronary, adsorbed, histology, scan, formulation, foods, holding, resorption, gestational, filling, locus, memory, atrophy, ringer, prospectively, recruitment |
| ≈ 0.40 | diuretics, atrial, lysis, spinal, camp, bmax, vein, proteases, chelator, arachidonic, alzheimer, ascorbic, histamine, rhythm, ouabain, gas, preoperative, bladder, menopause, pertussis |
| ≈ 0.33 | chromatographic, endothelin, relaxed, acceptable, stenosis, withdrawal, january, trypsin, oxidized, infiltration, forearm, et-1, enrolled, electrochemical, peroxidation, mothers, phosphodiesterase, cystic, compression, countries |
r = Pearson's correlation coefficient
Examples of tokens retained by feature selection (in decreasing order of document frequencies)
Warfarin (8060), anticoagulation (2508), anticoagulant (1953), heparin (1699), thrombosis (1651), bleeding (1633), international (1324), venous (1238), aspirin (1231), fibrillation (1191), inr (1106), prothrombin (1035), thromboembolism (1017), anticoagulants (864), thromboembolic (860), coagulation (790), embolism (706), deep (698), prophylaxis (636), antithrombotic (606)
Examples of rare tokens eliminated by feature selection
vestige, bacteroides, ca-laurell, gd2, idaho, i475s, h2-blocker, depots, viic/viiam, left-hemispheric, p = .37, laboratory-developed, cardio, frames, thistle, thy1, homolog, videotapes, u-105665, five-years, cold-labeled, workups, fviiic
Examples of common tokens eliminated by feature selection
The predictive performance of BICEPP by characteristics categories
| Category | Best AUC | Algorithm | | | | Best of 4 |
|---|---|---|---|---|---|---|
| AMH major classes | > 0.80 | 20 (100) | 19 (95) | 19 (95) | 20 (100) | **20 (100)** |
| | > 0.90 | 15 (75) | 16 (80) | 16 (80) | 16 (80) | 19 (95) |
| | > 0.95 | 10 (50) | 12 (60) | 11 (55) | 11 (55) | 12 (60) |
| AMH minor classes | > 0.80 | 98 (50) | 133 (68) | 130 (66) | 134 (68) | **135 (69)** |
| | > 0.90 | 86 (44) | 121 (61) | 120 (61) | 117 (59) | 123 (62) |
| | > 0.95* | 73 (37) | 114 (58) | 102 (52) | 106 (54) | 114 (58) |
| AMH adverse events | > 0.80 | 134 (56) | 145 (61) | 114 (48) | 119 (50) | **159 (67)** |
| | > 0.90 | 65 (27) | 76 (32) | 56 (24) | 63 (26) | 86 (36) |
| | > 0.95 | 30 (13) | 38 (16) | 30 (13) | 35 (15) | 41 (17) |
| PKIS perpetrator | > 0.80 | 3 (20) | 7 (47) | 3 (20) | 4 (27) | **7 (47)** |
| | > 0.90 | 1 (7) | 4 (27) | 2 (13) | 3 (20) | 5 (33) |
| | > 0.95 | 0 (0) | 2 (13) | 2 (13) | 2 (13) | 2 (13) |
| Narrow therapeutic index drugs | > 0.80 | 8 (57) | 9 (64) | 8 (57) | 8 (57) | **9 (64)** |
| | > 0.90 | 7 (50) | 8 (57) | 5 (36) | 7 (50) | 8 (57) |
| | > 0.95 | 3 (21) | 5 (36) | 3 (21) | 2 (14) | 5 (36) |
| Total | > 0.80 | 263 (54) | 313 (65) | 274 (57) | 285 (59) | **330 (68)** |
| | > 0.90 | 174 (36) | 225 (46) | 199 (41) | 206 (43) | 241 (50) |
| | > 0.95* | 116 (24) | 171 (35) | 148 (31) | 156 (32) | 174 (36) |
The numbers in this table indicate the number of characteristics (percentage) that achieved an AUC above the given threshold in stratified cross-validation evaluations. Performance is measured by AUC and can be interpreted as good (> 0.80), very good (> 0.90), or excellent (> 0.95). Overall, 68% of drug characteristics can be predicted with good AUC (numbers in boldface) and 36% of characteristics can be predicted very accurately (AUC > 0.95) with at least one classifier. The last column (best of 4) shows how many characteristics achieved an AUC above the given threshold with any of the four algorithms. Pearson's chi-square test was applied to examine the homogeneity between algorithms. *) indicates categories that were statistically significant at α = 0.05 (analysed as 4 × 1 tables with 3 d.f.). However, no category remained statistically significant after adjusting for the family-wise error rate using the Bonferroni method (n = 18). Abbreviations: AE: adverse events; AMH: Australian Medicines Handbook; IBk: k-nearest neighbour algorithm; NB: Naive Bayes; SVM: support vector machine; SVM/L: linear SVM; SVM/RBF: support vector machine with radial basis function kernel; PKIS: PharmacoKinetic Interaction Screening database.
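The homogeneity test in the footnote can be sketched as follows. This assumes the 4 × 1 analysis is a chi-square goodness-of-fit test of the four per-algorithm counts against equal expected counts; the source does not spell out the exact construction, so treat it as one plausible reading. The counts are the four per-algorithm values from the last row of the table (AUC > 0.95, all categories), and the helper names are hypothetical.

```python
import math


def chi_square_homogeneity(counts):
    """Chi-square statistic of observed counts against equal expected counts."""
    expected = sum(counts) / len(counts)
    return sum((o - expected) ** 2 / expected for o in counts)


def chi2_sf_3df(x):
    """Upper-tail probability of a chi-square variable with 3 d.f. (closed form)."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


# Characteristics with AUC > 0.95 for each of the four algorithms (last table row).
counts = [116, 171, 148, 156]
stat = chi_square_homogeneity(counts)
p_value = chi2_sf_3df(stat)

alpha = 0.05
bonferroni_alpha = alpha / 18  # family of n = 18 tests, as in the footnote

print(f"chi-square = {stat:.2f}, p = {p_value:.4f}")
print("significant at alpha:", p_value < alpha)
print("significant after Bonferroni:", p_value < bonferroni_alpha)
```

Under this reading the row is significant at α = 0.05 but not at the Bonferroni-adjusted level, which agrees with the footnote.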
Figure 2. The predictive performance of BICEPP (AUC) by drug characteristics. The predictive power of BICEPP was evaluated by using stratified cross-validation experiments performed on each of the 484 drug characteristics listed in Table 1. In this figure, each data point denotes the best AUC (out of the 4 machine learning algorithms) evaluated on the dataset of a drug characteristic that contains more than 10 positive examples (O), 2-9 positive examples (*), or fewer than 2 positive examples (X). The dotted lines indicate AUCs of 0.8 and 0.9 respectively. Abbreviations: TI: therapeutic index.
Figure 3. The predictive performance versus the number of positive examples and the position of index keywords. This figure illustrates how the performance of BICEPP is related to the number of positive examples (A) and the position of index keywords in the respective feature ranks (B). Each data point represents the best AUC (out of the 4 machine learning algorithms studied in this paper) achieved on one of the 484 drug characteristics listed in Table 1. As illustrated by the shaded area in Figure 3(A), the predictive performance of BICEPP was more variable in datasets with fewer than 10 positive examples. The boxed area (*) in Figure 3(B) represents a list of "surprising characteristics", for which the predictive power was high although the index keywords were not discriminative. These are listed in more detail in Table 3. Refer to the main text for details.
Drug characteristics with poorly discriminative index keywords that nonetheless achieved good overall predictive performance
| Category | Characteristic | Position of IK(s) | n(pos) | Best AUC* | Top-20 predictive tokens/words (AUC†) |
|---|---|---|---|---|---|
| AE | Cystitis | 10.7 pct | 13 | 1.000 | nsaids, cyclooxygenase, nimesulide, meloxicam (0.999); nsaid, diclofenac, naproxen, antiinflammatory, non-steroidal (0.998); ibuprofen, anti-inflammatory, nonsteroidal (0.997); ketoprofen, antipyretic (0.996); indomethacin (0.993); osteoarthritis (0.991); pge2 (0.991), prostanoid (0.988), thromboxane (0.986), prostaglandin (0.985) |
| AE | Dyslipidaemia | 14.0 pct | 13 | 0.900 | Aldosterone (0.95); acetazolamide, mineralocorticoid (0.94); deoxycorticosterone (0.93), pge2, indomethacin, hearing (0.88); spironolactone, mineralocorticoids, hyponatremia, renin, adh, ace (0.87); furosemide (0.86); insipidus, asthmatic, prostaglandins, fev1, pra, phenylbutazone (0.85) |
| AE | Migraine | 76.0 pct | 14 | 0.906 | angiotensin (0.89), plasminogen, dbp, insulin (0.85); infarction, low-density, losartan (0.84); hormonal, brachial, run-in, fixed-dose (0.83); lipoprotein, valsartan, endothelium-dependent, renin (0.82); pravastatin, hba1c (0.81); angiotensinogen, chd, smoking (0.80) |
| AE | Oesophagitis | 10.0 pct | 19 | 0.919 | Metastases (0.92); marrow (0.90); weekly (0.88); metastatic, antitumor (0.87); cancer (0.86); 3-year, toxicity, regimen, nadir, breast, metastasis (0.85); myeloma, survival, cancers, prostate (0.84); regimens, remission, cytotoxic, melphalan (0.83) |
| AE | Paralytic ileus | 59.4 pct | 12 | 0.961 | amitriptyline (0.96); tricyclic (0.95); antidepressants, anticholinergic, antidepressant (0.94); neuroleptics, chlorpromazine, depressive, overdose (0.93); tca, serotonin, diazepam (0.92); intoxication (0.91); thioridazine, clonidine (0.90); psychotropic, affective, psychological, antinociceptive (0.89); constipation (0.87) |
| AE | Thrombocytopenic purpura | 10.6 pct | 11 | 1 | infarction, ejection (0.95); intra-arterial (0.94); echocardiography (0.93); st-segment (0.93); echocardiographic (0.90); beta-blocking (0.89); beta-blocker (0.87); cardiology, diacetolol, bopindolol, bucindolol, beta-adrenoceptor-blocking, adp, beta-ars, beta1-selective, beta-adrenoblockers, cardioselectivity, atenolol, non-fatal (0.86) |
| MC | 5HT3 antagonists | 10.0 pct | 4 | 1 | 5-ht3ra, 5-ht3ra/dexamethasone, granisetron, ondansetron, 1966-september, 5-ht3, tropisetron, 5-hydroxytryptamine3, anti-emetic (1); dolasetron, emetic, ramosetron, 5-ht3-receptor, cinv, 5ht3, emetogenic, type-3, emesis, setrons, pov (0.999) |
| MC | Antibacterials (ear) | 14.7 pct | 4 | 0.98 | enrofloxacin, chloramphenicol, gentamycin (0.99); oxytetracycline, kanamycin, polymyxin, colistin, gentamicin, bacitracin, neomycin, povidone-iodine, fusidic, streptomycin, bacterial, septicemia, swabs (0.98); anaerobic, peru, tetracycline, aminoglycoside (0.97) |
*) Refers to the best area under the ROC curve achieved by the four machine learning algorithms as evaluated by stratified CVs. †) Refers to the AUCs used during the feature selection process. Abbreviations: AE: adverse events; IK: index keywords; MC: minor drug classes; pct: percentile; n(pos): number of positive examples in each drug characteristic dataset.
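The per-token AUCs in the last column can be read as a ranking score for feature selection. The sketch below is one way to compute such a score: it treats a token's CDF across drugs as a classifier score and computes the AUC as the probability that a positive drug outranks a negative one. The function names `auc` and `top_tokens`, the tie handling, and the toy data are assumptions for illustration, not the authors' implementation.

```python
def auc(scores, labels):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def top_tokens(cdf_by_drug, labels, k=20):
    """Rank tokens by the AUC of their CDF values and keep the k best."""
    tokens = set().union(*(d.keys() for d in cdf_by_drug))
    scored = {
        # A token missing from a drug's abstracts has a CDF of 0 for that drug.
        t: auc([d.get(t, 0.0) for d in cdf_by_drug], labels)
        for t in tokens
    }
    return sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Toy CDF vectors for four hypothetical drugs; the first two are positives.
cdf_by_drug = [
    {"nsaid": 0.6, "pain": 0.3},
    {"nsaid": 0.4, "pain": 0.1},
    {"nsaid": 0.0, "pain": 0.2},
    {"pain": 0.3},
]
labels = [1, 1, 0, 0]
ranking = top_tokens(cdf_by_drug, labels, k=20)
print(ranking)
```

In a cross-validation setting this ranking would be computed on the training folds only, as the Figure 1 caption describes.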
Figure 4. Comparative performance of cdf with other commonly employed IR methods. This figure illustrates the predictive performance, assessed by the cross-validation results (in AUC), versus the cumulative number of characteristics (out of 484) for each of the commonly employed methods. Notes: (1) ctf: an average of 1,520 tokens per drug (IQR: 598-3,781 tokens) were retained after elimination. (2) Stemming: applying the stemming algorithm reduced the number of tokens by 20% (median: 973 tokens per drug, IQR: 455-1,144 tokens). In-depth investigation revealed, however, that stemming did not always group concepts consistently (see additional file 2 for further discussion). (3) Drug synonyms: compared with MEDLINE searches using only generic names, a mean of 3.4% more results were retrieved when trade names were used (IQR: 0-0.54% more abstracts were retrieved). The dotted lines indicate AUCs of 0.8 and 0.9 respectively. Abbreviations of the method names: cdf: conditional document frequency; ctf: conditional term frequency; ctf-icdf: conditional term frequency-inverse conditional document frequency; Stemming: cdf of tokens reduced by Porter's stemming algorithm; Synonyms: cdf of tokens generated by retrieving abstracts with both generic and trade names for a given drug.
Comparative evaluation of cdf-based predictions with other commonly used IR methods
| Category | Best AUC | Method | | | | |
|---|---|---|---|---|---|---|
| AMH major classes | > 0.80 | 19 (95) | ||||
| | > 0.90 | 19 (95) | 17 (85) | 18 (90) | 17 (85) | 18 (90) |
| | > 0.95 | 12 (60) | 12 (60) | 15 (75) | 12 (60) | 11 (55) |
| AMH minor classes | > 0.80 | 135 (69) | 106 (54)* | 152 (77) | 156 (79) | |
| | > 0.90 | 123 (62) | 100 (51) | 150 (76)* | 145 (74) | 151 (77)* |
| | > 0.95 | 114 (58) | 92 (47) | 144 (73)* | 142 (72)* | 143 (73)* |
| AMH adverse events | > 0.80 | 159 (67) | 148 (62) | 153 (64) | 155 (65) | |
| | > 0.90 | 86 (36) | 88 (37) | 100 (42) | 84 (35) | 84 (35) |
| | > 0.95 | 41 (17) | 42 (18) | 55 (23) | 44 (18) | 44 (18) |
| PKIS perpetrator | > 0.80 | 7 (47) | 6 (40) | 7 (47) | 10 (67) | |
| | > 0.90 | 5 (33) | 5 (33) | 4 (27) | 3 (20) | 4 (27) |
| | > 0.95 | 2 (13) | 2 (13) | 2 (13) | 3 (20) | 3 (20) |
| Narrow therapeutic index drugs | > 0.80 | 9 (64) | 9 (64) | 10 (71) | 10 (71) | |
| | > 0.90 | 8 (57) | 7 (50) | 10 (71) | 9 (64) | 10 (71) |
| | > 0.95 | 5 (36) | 6 (43) | 9 (64) | 9 (64) | 10 (71) |
| Total | > 0.80 | 330 (68) | 289 (60)* | 346 (71) | 351 (73) | |
| | > 0.90 | 241 (50) | 217 (45) | 282 (58)* | 258 (53) | 267 (55) |
| | > 0.95 | 174 (36) | 154 (32) | 225 (46)* | 210 (43) | 211 (44) |
The numbers in this table indicate the number of characteristics (percentage) that achieved an AUC above the given thresholds in stratified cross-validation evaluations. For each method, the results from the best of 4 algorithms were compared. The thresholds of AUC can be interpreted as good (> 0.8), very good (> 0.9), and excellent (> 0.95) respectively. The entries labelled (*) indicate a significantly better or worse performance than cdf for predicting drug characteristics. Fisher's exact tests were applied as 2 × 2 tables with α = 0.05 adjusted for a family of four comparisons by using the Bonferroni method. The numbers in boldface indicate the best performing method(s) for each characteristic category above the AUC = 0.8 threshold. Abbreviations of the method names: cdf: conditional document frequency; ctf: conditional term frequency; ctf-icdf: conditional term frequency-inverse conditional document frequency; Stemming: cdf of tokens reduced by Porter's stemming algorithm; Synonyms: cdf of tokens calculated by retrieving abstracts with both generic and trade names for a given drug.
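The comparison described in the footnote can be sketched with a small two-sided Fisher's exact test. The 2 × 2 layout used here (method by above/below threshold, out of 484 characteristics each) is an assumption about how the tables were built. The counts come from the last row of the table (AUC > 0.95): 174 for cdf, the starred entry 225, and the unstarred entry 154. The function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def pmf(k):
        # Hypergeometric probability of k successes in the first row.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_obs = pmf(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(p for p in map(pmf, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


n_total = 484
adjusted_alpha = 0.05 / 4  # Bonferroni adjustment for four comparisons

# cdf (174) against the starred entry (225) and the unstarred entry (154).
p_starred = fisher_exact_two_sided(174, n_total - 174, 225, n_total - 225)
p_unstarred = fisher_exact_two_sided(174, n_total - 174, 154, n_total - 154)

print("starred entry significant:", p_starred < adjusted_alpha)
print("unstarred entry significant:", p_unstarred < adjusted_alpha)
```

Under these assumptions the starred entry falls below the adjusted threshold and the unstarred one does not, consistent with the markers in the table.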
The correlations between the training set statistics and BICEPP performance
| Training set statistics | Method | ||||
|---|---|---|---|---|---|
| (c) Maximum article count | -.20 | -.19 | -.23 | -.20 | -.26 |
| (d) Mean article count | .09 | .10 | .06 | .10 | .00 |
| (e) Median article count | .18 | .17 | .17 | .16 | .20 |
| (g) Variance of article counts | -.01 | -.00 | -.03 | -.01 | -.09 |
The numbers in the table are Spearman's rank correlation coefficients σ, calculated by correlating the training set statistics with the best AUC obtained from four machine learning algorithms (NB, IBk, SVM with polynomial and RBF kernels) for each of the 272 (of 484) drug characteristics with ≥10 positive examples. For each drug, the corresponding article count indicates how many abstracts were retrieved from the MEDLINE database by searching with the drug name. The entries and the category names in boldface indicate p < 0.0001 as determined by the test of rank correlation coefficient using the Fieller-Hartley-Pearson method [37]. Abbreviations of the method names: cdf: conditional document frequency; ctf: conditional term frequency; ctf-icdf: conditional term frequency-inverse conditional document frequency; Stemming: cdf of tokens reduced by Porter's stemming algorithm; Synonyms: cdf of tokens calculated by retrieving abstracts with both generic and trade names for a given drug.