
The SIDER database of drugs and side effects.

Michael Kuhn, Ivica Letunic, Lars Juhl Jensen, Peer Bork.

Abstract

Unwanted side effects of drugs are a burden on patients and a severe impediment in the development of new drugs. At the same time, adverse drug reactions (ADRs) recorded during clinical trials are an important source of human phenotypic data. It is therefore essential to combine data on drugs, targets and side effects into a more complete picture of the therapeutic mechanisms of action of drugs and the ways in which they cause adverse reactions. To this end, we have created the SIDER ('Side Effect Resource', http://sideeffects.embl.de) database of drugs and ADRs. The current release, SIDER 4, contains data on 1430 drugs, 5880 ADRs and 140 064 drug-ADR pairs, which is an increase of 40% compared to the previous version. For more fine-grained analyses, we extracted the frequency with which side effects occur from the package inserts. This information is available for 39% of drug-ADR pairs, 19% of which can be compared to the frequency under placebo treatment. SIDER furthermore contains a data set of drug indications, extracted from the package inserts using Natural Language Processing. These drug indications are used to reduce the rate of false positives by identifying medical terms that do not correspond to ADRs.
© The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.

Year:  2015        PMID: 26481350      PMCID: PMC4702794          DOI: 10.1093/nar/gkv1075

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

Reducing adverse drug reactions (ADRs) and elucidating their origin have long been a concern of physicians and researchers within their respective fields of medicine. In recent years, the increased availability of public data on drug targets and ADRs has enabled large-scale studies that go beyond individual fields of medicine to a systems biology or systems medicine approach. For example, it was shown that ADRs can be used to predict drug targets (1) and that causal relations between targets and ADRs can be elucidated (2,3). Data on ADRs are generated in two stages. First, during placebo-controlled clinical trials, the occurrence and frequency of ADRs are recorded. In phase III trials, thousands of patients are carefully monitored, and the ADRs observed during this phase are listed on the package inserts. Second, once the drug is on the market, surveillance continues (‘phase IV’) and doctors may report ADRs to systems such as the FDA Adverse Event Reporting System (AERS). ADRs from such reporting systems are drawn from much larger numbers of patients; however, they are more subject to confounding biases and the causality may thus be questionable. Typically, such ADRs are also added to package inserts in a separate section on post-marketing experience. In addition, ADRs are reported in the biomedical literature, e.g. in animal studies or in pre- or post-clinical trial studies, and in electronic health records, from where they can be extracted using text mining (4–6). To make ADRs amenable to academic research in a simple way, we developed SIDER (‘Side Effect Resource’) in 2010, when no such resource was freely available for academic researchers (7). The first version of SIDER contained not only drugs and their respective ADRs but also frequency information for both drug and placebo treatment. Users were able to check for the occurrence of specific ADRs on the SIDER website.
Thanks to the availability of downloadable files, SIDER has since been used in many studies, for example to identify metabolic dysregulation as a cause of ADRs (8), to investigate the effect of essential proteins on ADRs (9) and to predict drug indications (10). As a further use case, SIDER has been used as the benchmarking set for text-mining methods that extract ADR data from the literature (4,11), and other databases have incorporated data from SIDER. For example, ADReCS combined SIDER 2 data with an independent annotation effort and further added an ontology of ADRs that allows grouping of related ADRs (12). IntSide integrated data from SIDER 2 with drug–target and pathway data to pinpoint causes of ADRs (13). Despite no longer being the only freely available ADR resource, SIDER remains heavily used: on average, the SIDER 2 data set of ADRs was downloaded 74 times per month in the past year (unique IP addresses per month, August 2014 to July 2015). We present here a new release of the SIDER database with over 40% more drugs, ADRs and drug–ADR pairs than the previous version and more than twice as many drug–ADR pairs as the originally published version, SIDER 1 (see Table 1). We ensure a high quality of the extracted entities by manually annotating names, adding synonymous names and using an additional Natural Language Processing step. Our text-mining system creates a machine-readable database and a human-readable website with highlighted terms at the same time, making it possible for users to quickly trace the origin of extracted ADRs.
Table 1.

Data content of ADR databases

                             | ADReCS 1.2* | SIDER 1 | SIDER 2 | SIDER 4 | Increase between SIDER 2 and 4
Year of release              | 2015        | 2009    | 2012    | 2015    |
Number of drugs              | 1378        | 888     | 996     | 1430    | +44%
Number of ADRs               | 5984        | 1450    | 4192    | 5880    | +40%
Number of drug–ADR pairs     |             |         |         |         |
  total                      | 134022      | 62269   | 99423   | 140064  | +41%
  with frequency             | 50490       | 25068   | 42331   | 54772   | +29%
  with frequency for placebo | 0           | 3640    | 6334    | 10790   | +70%

*ADReCS statistics were accessed on July 30, 2015.


DATA COLLECTION AND QUALITY CONTROL

Drug labels with information for professionals were obtained from national registries and charity organizations. Structured Product Labels, as provided by the FDA in XML format, could be parsed directly. Labels that were only available as PDF were converted to HTML, preserving tabular data formatting to the extent possible. The initial conversion from the source documents to PDF removes the logical structure of the documents, such as headings. Converting from PDF to HTML yields documents with different styles for normal text and headings, but the actual text formatting varies between documents. We therefore developed heuristics to automatically identify section and subsection headings from the text formatting in the HTML labels. For example, we search for headings of the ADR and indications sections with different wordings to identify the text style used to indicate the different sections. We also maintain a list of section titles that indicate that the ADR or indications section has ended (e.g. ‘Interactions with other drugs’).

Named entity recognition (NER)

We used a dictionary-based approach to detect mentions of ADRs, diseases, drugs and proteins in the package inserts. Names of chemicals and proteins were taken from the STITCH 4 and STRING 9.1 databases (14,15). In particular, users can directly use identifiers from SIDER to identify drug targets in STITCH 4. In STITCH, stereoisomers and salt forms are usually merged into a common identifier, although the user has the option to view the isomers separately. Likewise, the SIDER download files contain identifiers both with and without stereochemistry, and links to PubChem are provided on the web frontend. Starting with SIDER 4, the major version number of SIDER is linked to the corresponding STITCH version. That is, once STITCH 5 has been released, we will create SIDER 5 with new chemical identifiers.
To create a dictionary of ADR and disease names, we pooled synonyms from the Unified Medical Language System (UMLS) Metathesaurus (version 2014AA) for all terms of the Medical Dictionary for Regulatory Activities (MedDRA) (version 16.1). We filtered by semantic type (Supplementary Table S1) and further manually created an exclusion list of concepts with the correct semantic type that did not refer to ADRs, such as ‘HIV positive’ or ‘family history of suicide’. Similarly, we manually inspected all names that occurred at least 100 times during NER to create a list of names to be blocked from the dictionary.
Names from the UMLS Metathesaurus often do not cover all possible permutations of words or their synonyms. To expand the dictionary, we created a list of candidate synonyms from frequently interchanged words. For each concept, we pooled the words of all names and added synonymous words. Then we scanned the package inserts for occurrences of these words. In this way, mentions of the concept could be detected even when the order of words had been changed. For example, consider the name ‘blood pressure decreased’. After adding a synonym, the algorithm might look for the words ‘blood’, ‘pressure’, ‘decreased’ and ‘lower’. It would therefore find new names like ‘decreased blood pressure’ and ‘lower blood pressure’. We manually annotated the most frequently occurring novel names and added only truly synonymous names to the dictionary.
After completing the dictionaries, entities were recognized using an NER engine that also accounts for orthographic variation, specifically case variation and insertion/removal of hyphens and spaces (16). Sentences in the ADR section that contain negations (e.g. ‘has not been observed’) or speculation (e.g. ‘is suspected’) were excluded (see Supplementary Table S2). By tightly integrating parsing of the labels and entity detection, we were able to produce marked-up HTML versions of the labels. In these files, all detected mentions of the ADR terms are highlighted, along with names of chemical compounds and proteins (Figure 1). In previous versions of SIDER, highlighting of matches and entity recognition were independent steps, which meant that in some cases the source of a detected term was not highlighted in the HTML version. Users can click on ADR terms to get more information from SIDER, and on proteins and compounds for relevant data from Reflect (17).
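The word-pooling idea behind the dictionary expansion can be illustrated with a minimal Python sketch. It is not the SIDER implementation: the synonym table `CANONICAL`, the function names and the example label text are assumptions made for illustration, and the sketch only proposes candidate names, which in the procedure described above would still be annotated manually.

```python
import re

# Hypothetical table of interchangeable words, mapping each variant to one
# canonical word. SIDER's actual synonym list is not reproduced here.
CANONICAL = {"lower": "decreased", "lowered": "decreased", "reduced": "decreased"}


def canonical_tokens(text):
    """Lower-case the text, split it into words and map variants to canonical words."""
    return [CANONICAL.get(tok, tok) for tok in re.findall(r"[a-z]+", text.lower())]


def candidate_names(concept_name, text):
    """Return word sequences in `text` that use the same (canonical) words as
    `concept_name` in any order, excluding the known name itself."""
    target = sorted(canonical_tokens(concept_name))
    n = len(target)
    found = set()
    # Work sentence by sentence so that candidates never span a sentence boundary.
    for sentence in re.split(r"[.;!?]", text):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        for i in range(len(tokens) - n + 1):
            window = tokens[i:i + n]
            if sorted(CANONICAL.get(tok, tok) for tok in window) == target:
                found.add(" ".join(window))
    found.discard(concept_name.lower())
    return found


label_text = ("Patients may experience decreased blood pressure. "
              "Lower blood pressure was also reported.")
print(sorted(candidate_names("blood pressure decreased", label_text)))
# ['decreased blood pressure', 'lower blood pressure']
```

Comparing sorted word lists makes the match independent of word order, which is the property the paragraph above relies on; a production system would additionally need to handle hyphenation and multi-word synonyms.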
Figure 1.

Navigating the SIDER website. Users can search for a drug to get an overview of its ADRs (A). Clicking on cells in the table of labels and ADRs takes the user to a separate page (B) where they can inspect all mentions of the ADR.

Detection of drug indications by natural language processing

After entity recognition, we used the Stanford Dependencies (18) package (version 3.4.1) to extract further information from the package inserts, analyzing the content of the sections on indications and ADRs. This Natural Language Processing (NLP) approach makes it possible to identify sentence fragments that refer to indications of drugs. As a pre-processing step, we replaced each entity that had been recognized in the previous step with an internal identifier. This way, entity names that consisted of multiple words were collapsed to a single noun, making correct parsing of the sentences easier. With the preprocessed sentences as input, Stanford Dependencies returns a list of dependencies, which are triplets consisting of the kind of dependency and the two connected words (the governor and the dependent). For example, parsing the sentence ‘DRUG may be used for the treatment of INDICATION’ returns this list of dependencies, consisting of seven triplets:
nsubjpass(used-4, DRUG-1)
aux(used-4, may-2)
auxpass(used-4, be-3)
root(ROOT-0, used-4)
det(treatment-7, the-6)
prep_for(used-4, treatment-7)
prep_of(treatment-7, INDICATION-9)
Analyzing this list of dependencies, we extracted the triplet ‘DRUG, used treatment, INDICATION’ and used custom rules to determine that this implies an indication for the drug (Supplementary Table S3). We used such rules to detect drug combinations, indications and pre-existing conditions of patients that do not refer to ADRs (e.g. ‘In patients with X, Y may occur’, see Supplementary Table S4).
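As an illustration of how such a dependency list can be reduced to a triplet, the following Python sketch parses the textual output shown above and applies a single hand-written rule. The rule, the regular expression and the function names are assumptions made for this example; the actual rules used in SIDER are those listed in Supplementary Table S3.

```python
import re

# Matches one dependency such as "prep_for(used-4, treatment-7)".
DEP_PATTERN = re.compile(r"(\w+)\((\S+?)-(\d+), (\S+?)-(\d+)\)")

PARSER_OUTPUT = (
    "nsubjpass(used-4, DRUG-1) aux(used-4, may-2) auxpass(used-4, be-3) "
    "root(ROOT-0, used-4) det(treatment-7, the-6) "
    "prep_for(used-4, treatment-7) prep_of(treatment-7, INDICATION-9)"
)


def parse_dependencies(text):
    """Turn the textual parser output into (relation, governor, dependent) tuples,
    where governor and dependent are (word, position) pairs."""
    return [(rel, (gov, int(gi)), (dep, int(di)))
            for rel, gov, gi, dep, di in DEP_PATTERN.findall(text)]


def extract_triplet(deps):
    """Hypothetical rule: a passive subject S of verb V, with V -prep_for-> N and
    N -prep_of-> O, yields the triplet (S, 'V N', O)."""
    subjects = {gov: dep for rel, gov, dep in deps if rel == "nsubjpass"}
    of_objects = {gov: dep for rel, gov, dep in deps if rel == "prep_of"}
    for rel, verb, noun in deps:
        if rel == "prep_for" and verb in subjects and noun in of_objects:
            return (subjects[verb][0], f"{verb[0]} {noun[0]}", of_objects[noun][0])
    return None


deps = parse_dependencies(PARSER_OUTPUT)
print(len(deps))              # 7
print(extract_triplet(deps))  # ('DRUG', 'used treatment', 'INDICATION')
```

Keying the lookups on (word, position) pairs rather than on the word alone keeps repeated words in a sentence from being confused with each other.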

Extraction of frequencies

Frequencies of ADRs were extracted from tables and free-text mentions. In case of tables, the header of the table was analyzed to find out if the reported frequencies are for drug or placebo treatment. We used a heuristic to exclude tables that contain other types of data, such as discontinuation frequencies. We also extracted frequency information from the text of the labels, searching for lists (e.g. ‘frequent: headache, fatigue’) and for numbers in parentheses (e.g. ‘headache (12%)’).
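The two free-text patterns mentioned above can be sketched with regular expressions. This is a simplified illustration in Python, not the extraction code used for SIDER: the set of frequency labels, the function names and the assumption that ADR terms have already been recognized by NER are all choices made for the example.

```python
import re

# Hypothetical vocabulary of qualitative frequency labels.
LIST_PATTERN = re.compile(
    r"\b(very common|common|frequent|infrequent|uncommon|rare)\s*:\s*([^.;]+)",
    re.IGNORECASE,
)


def list_frequencies(text):
    """Find lists such as 'frequent: headache, fatigue' and return
    a mapping from frequency label to the listed terms."""
    result = {}
    for label, items in LIST_PATTERN.findall(text):
        terms = [item.strip() for item in items.split(",") if item.strip()]
        result.setdefault(label.lower(), []).extend(terms)
    return result


def parenthesized_frequencies(text, adr_terms):
    """For ADR terms already found by NER, look for a percentage in
    parentheses directly after the term, e.g. 'headache (12%)'."""
    result = {}
    for term in adr_terms:
        pattern = re.compile(
            re.escape(term) + r"\s*\(\s*(\d+(?:\.\d+)?)\s*%\s*\)", re.IGNORECASE
        )
        match = pattern.search(text)
        if match:
            result[term] = float(match.group(1))
    return result


print(list_frequencies("Frequent: headache, fatigue; infrequent: rash."))
# {'frequent': ['headache', 'fatigue'], 'infrequent': ['rash']}
print(parenthesized_frequencies("Headache (12%) and nausea (3.5 %) were reported.",
                                ["headache", "nausea", "rash"]))
# {'headache': 12.0, 'nausea': 3.5}
```

Anchoring the percentage pattern to terms that NER has already recognized avoids having to guess where a term begins in running text.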

Filtering steps

To further reduce the rate of false positives, the following steps were undertaken. First, we manually inspected the words found at least 100 times and removed terms that did not correspond to ADRs or were ambiguous. Next, we used the extracted indications as a filter on ADRs found in the free text of the ADR section (in contrast to those contained in tables listing the frequency of ADRs). As described above, potential indications of drugs were detected in two ways: either by NER in the indications section, or by NLP (yielding indications and pre-existing conditions of patients). When a concept had been found by NER in the ADR section of a package insert, we discarded it as an ADR if the concept had been detected (i) by NLP on any package insert of the same drug or (ii) by NER in the indications section of the same package insert. In the latter case, filtering only applies to the same package insert because NER yields a higher rate of false-positive indications than NLP, as shown below.
To estimate the accuracy of the extracted indications, we matched them against a data set of drug indications derived from the resource drugs.com. We looked for exactly matching terms and used this external data set only for benchmarking the final set of indications (see Table 2). Among the overlapping drug–disease pairs, 61% of those found as an indication by NLP were also contained in the drugs.com data set. For those detected as preconditions, the fraction was 53%. Among terms found by text mining in the indications section, 44% were also contained in the external data set.
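The indication-based filter described above can be expressed compactly with set operations. The following Python sketch assumes simple dictionaries keyed by label and drug; these data structures, the function name and the example identifiers are hypothetical and serve only to make the two scopes of the rule explicit (drug-wide for NLP, label-specific for NER). It applies only to ADRs found in free text, not to those taken from frequency tables.

```python
def filter_free_text_adrs(adr_mentions, label_to_drug,
                          nlp_concepts_by_drug, ner_indications_by_label):
    """Discard a concept found in the ADR section of a label if it was
    (i) detected by NLP on any label of the same drug, or
    (ii) detected by NER in the indications section of the same label."""
    kept = {}
    for label, concepts in adr_mentions.items():
        drug = label_to_drug[label]
        blocked = (nlp_concepts_by_drug.get(drug, set())
                   | ner_indications_by_label.get(label, set()))
        kept[label] = concepts - blocked
    return kept


# Hypothetical example: one drug with two package inserts.
label_to_drug = {"label1": "drugA", "label2": "drugA"}
adr_mentions = {"label1": {"headache", "hypertension", "migraine", "epilepsy"}}
nlp_concepts_by_drug = {"drugA": {"hypertension"}}   # found by NLP on label2
ner_indications_by_label = {"label1": {"migraine"}, "label2": {"epilepsy"}}

print(filter_free_text_adrs(adr_mentions, label_to_drug,
                            nlp_concepts_by_drug, ner_indications_by_label))
# {'label1': {'headache', 'epilepsy'}} (set order may vary)
```

In the example, ‘epilepsy’ is kept for label1 because it was found by NER only in the indications section of a different label, whereas ‘hypertension’ is removed because NLP detections apply to all labels of the drug.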
Table 2.

Comparison of extracted indications and ADRs between SIDER 4 and drugs.com (accessed on February 12, 2015)

                                 | Indications (detected by NLP) | Pre-existing conditions (detected by NLP) | Indications (detected by NER) | ADRs (filtered set)
Counts within SIDER 4            |                               |                                           |                               |
  drugs                          | 1236                          | 826                                       | 1329                          | 1430
  concepts                       | 1338                          | 809                                       | 2583                          | 880
  drug–concept pairs             | 4681                          | 2616                                      | 12965                         | 140064
Comparison with drugs.com        |                               |                                           |                               |
  common drugs                   | 870                           | 615                                       | 915                           | 948
  common concepts                | 439                           | 292                                       | 553                           | 480
Number of drug–concept pairs     |                               |                                           |                               |
  only in drugs.com              | 1328                          | 1020                                      | 1034                          | 2216
  intersection                   | 1010                          | 499                                       | 1685                          | 349*
  only in SIDER 4                | 655                           | 434                                       | 2111                          | 27797
Overlap                          |                               |                                           |                               |
  relative to SIDER 4            | 61%                           | 53%                                       | 44%                           | 1.2%
  relative to drugs.com          | 43%                           | 33%                                       | 62%                           | 13%

*This intersection should be as low as possible, as it hints at false positives in SIDER, see discussion in text.

We also used the drug indication data from drugs.com to obtain an estimate of the false positive rate of ADRs in SIDER. In the previous paragraph, we quantified the overlap between indications in drugs.com and SIDER. Here, we tested the overlap between indications from drugs.com and ADRs found in SIDER. Thus, matching terms point to an apparent contradiction that could be explained as a false positive in either drugs.com or SIDER. Of 140064 drug–ADR pairs in SIDER, 28343 (20%) could be matched to drugs and diseases found on drugs.com. Of those 28343 drug–disease pairs, only 349 (1.2%) were listed as drug–indication pairs on drugs.com, showing that indications are only very rarely misinterpreted as ADRs.
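The overlap relative to SIDER 4 reported in Table 2 can be recomputed from the pair counts as intersection / (intersection + only in SIDER 4). The short Python sketch below does this using the counts from Table 2; the variable and function names are chosen for illustration only.

```python
# (intersection, only in SIDER 4) drug–concept pair counts taken from Table 2.
PAIR_COUNTS = {
    "indications (NLP)": (1010, 655),
    "pre-existing conditions (NLP)": (499, 434),
    "indications (NER)": (1685, 2111),
    "ADRs (filtered set)": (349, 27797),
}


def overlap_relative_to_sider(intersection, only_in_sider):
    """Fraction of SIDER pairs (restricted to common drugs and concepts)
    that are also listed as indications on drugs.com."""
    return intersection / (intersection + only_in_sider)


for name, (intersection, only_in_sider) in PAIR_COUNTS.items():
    fraction = overlap_relative_to_sider(intersection, only_in_sider)
    print(f"{name}: {100 * fraction:.1f}%")
```

For the three indication columns a high value is desirable, whereas for the ADR column the same ratio serves as an estimate of how often an indication has been misread as an ADR.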

DATA CONTENT AND AVAILABILITY

SIDER is available at http://sideeffects.embl.de/. The new release, SIDER 4, represents a large increase in the numbers of drugs, ADRs, drug–ADR pairs and drug frequency entries compared to previous versions (Table 1, Figure 2). The homepage allows users to interactively browse the database and to search for drugs and ADRs (Figure 1A). The SIDER website enables users to trace drug–side effect pairs to the drug labels: users can navigate to the drug's page and click on the side effect of interest. On the presented drug label, all instances of the side effect are marked (Figure 1B). In this way, users can quickly trace the origin of an extracted side effect in order to rule out false positives. The complete data set of side effects and the data set of indications are available for download from the SIDER website in text format, including PubChem and MedDRA identifiers. In addition, we have created a GitHub repository (linked from the SIDER download page) where users can report errors that they detect in SIDER. In this way, other users can filter the download files, and the authors can remove the source of the errors in upcoming versions of SIDER.
Figure 2.

Data content of SIDER. For SIDER versions 1, 2 and 4 the distribution of (A) drugs per ADR and (B) ADRs per drug is shown. These distributions do not follow a power law. Note that the current version of SIDER is the third release, but is designated SIDER 4 because it is based on the same set of chemicals as STITCH 4.

  17 in total

1.  Analysis of chemical and biological features yields mechanistic insights into drug side effects.

Authors:  Miquel Duran-Frigola; Patrick Aloy
Journal:  Chem Biol       Date:  2013-04-18

2.  Automatic construction of a large-scale and accurate drug-side-effect association knowledge base from biomedical literature.

Authors:  Rong Xu; QuanQiu Wang
Journal:  J Biomed Inform       Date:  2014-06-10       Impact factor: 6.317

3.  The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text.

Authors:  Evangelos Pafilis; Sune P Frankild; Lucia Fanini; Sarah Faulwetter; Christina Pavloudi; Aikaterini Vasileiadou; Christos Arvanitidis; Lars Juhl Jensen
Journal:  PLoS One       Date:  2013-06-18       Impact factor: 3.240

4.  PREDICT: a method for inferring novel drug indications with application to personalized medicine.

Authors:  Assaf Gottlieb; Gideon Y Stein; Eytan Ruppin; Roded Sharan
Journal:  Mol Syst Biol       Date:  2011-06-07       Impact factor: 11.429

5.  A pipeline to extract drug-adverse event pairs from multiple data sources.

Authors:  Srijyothsna Yeleswarapu; Aditya Rao; Thomas Joseph; Vangala Govindakrishnan Saipradeep; Rajgopal Srinivasan
Journal:  BMC Med Inform Decis Mak       Date:  2014-02-24       Impact factor: 2.796

6.  ADReCS: an ontology database for aiding standardization and hierarchical classification of adverse drug reaction terms.

Authors:  Mei-Chun Cai; Quan Xu; Yan-Jing Pan; Wen Pan; Nan Ji; Yin-Bo Li; Hai-Jing Jin; Ke Liu; Zhi-Liang Ji
Journal:  Nucleic Acids Res       Date:  2014-10-31       Impact factor: 16.971

7.  A side effect resource to capture phenotypic effects of drugs.

Authors:  Michael Kuhn; Monica Campillos; Ivica Letunic; Lars Juhl Jensen; Peer Bork
Journal:  Mol Syst Biol       Date:  2010-01-19       Impact factor: 11.429

8.  Systematic identification of proteins that elicit drug side effects.

Authors:  Michael Kuhn; Mumna Al Banchaabouchi; Monica Campillos; Lars Juhl Jensen; Cornelius Gross; Anne-Claude Gavin; Peer Bork
Journal:  Mol Syst Biol       Date:  2013       Impact factor: 11.429

9.  Extraction of potential adverse drug events from medical case reports.

Authors:  Harsha Gurulingappa; Abdul Mateen-Rajput; Luca Toldo
Journal:  J Biomed Semantics       Date:  2012-12-20

10.  Pharmacogenomic and clinical data link non-pharmacokinetic metabolic dysregulation to drug side effect pathogenesis.

Authors:  Daniel C Zielinski; Fabian V Filipp; Aarash Bordbar; Kasper Jensen; Jeffrey W Smith; Markus J Herrgard; Monica L Mo; Bernhard O Palsson
Journal:  Nat Commun       Date:  2015-06-09       Impact factor: 14.919
