SuperToxic: a comprehensive database of toxic compounds.

Ulrike Schmidt¹, Swantje Struck, Bjoern Gruening, Julia Hossbach, Ines S Jaeger, Roza Parol, Ulrike Lindequist, Eberhard Teuscher, Robert Preissner.

Abstract

In everyday life, we are confronted with a variety of toxic substances of natural or artificial origin. Toxins are already used, e.g. in medicine, but the number of known toxic compounds is still increasing, representing a tremendous potential for extracting new substances. As predictive toxicology gains importance, the careful and extensive investigation of known toxins is the basis for assessing the properties of unknown substances. To this end, we have collected toxic compounds from literature and web sources in the database SuperToxic. The current version of this database compiles about 60,000 compounds and their structures. These molecules are classified according to their toxicity, based on more than 2 million measurements. The SuperToxic database provides a variety of search options, such as name, CASRN, molecular weight and measured toxicity values. The implemented similarity searches yield information about possible biological interactions. Furthermore, connections to the Protein Data Bank, UniProt and the KEGG database are available, allowing the identification of targets and of the pathways in which the searched compounds are involved. This database is available online at: http://bioinformatics.charite.de/supertoxic.

Year: 2008 PMID: 19004875 PMCID: PMC2686515 DOI: 10.1093/nar/gkn850

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Toxins are hazardous substances that cause illness or damage to an exposed organism if inhaled, swallowed or absorbed through the skin. They are found throughout nature and are also widely used as drugs in medicine, as toxicity strongly depends on concentration. In nature, animals and plants use toxic substances as protection from predators. For example, poisonous mushrooms and plants use toxins to protect themselves against herbivores. Many snakes, scorpions and spiders produce poison to guard themselves against other animals. A number of these substances, originally used by animals or plants to poison their enemies, have become valuable in medicine. In cancer treatment, paclitaxel, a toxin from the yew tree (Taxaceae) (1), has been applied successfully in the treatment of breast cancer. Vinorelbine, an alkaloid from Catharanthus roseus, shows good results in the therapy of different carcinomas (2). Very successful in the fight against infectious diseases are the toxins of a variety of fungi, the antibiotics (3). These substances, originally produced by the fungi to protect themselves against bacterial infections, represent a great breakthrough in medicine, as an impressive number of medical conditions can now be cured. There are different measures to estimate toxicity: LD50 and LC50 (lethal dose or concentration at which 50% of a population dies) are widely established, but TGI (total growth inhibition), NOEL (no observed effect level) and LOEL (lowest observed effect level) are also used. The wide use of toxins demonstrates their scientific importance and confronts researchers with the question of the nature of toxicity. What makes a compound toxic? How can toxicity be detected for unknown compounds? To answer these questions, a close investigation of toxic compounds is indispensable, making it necessary to provide a collection of toxins.
Databases like Mvir (4) or SCORPION (5) are excellent sources of detailed information, for example, on compounds from Scorpiones or on DNA and protein sequence analysis. However, these databases are of limited use for a complete consideration of the structural and chemical properties of toxic compounds in science and industry. To solve this problem, we established the database SuperToxic, which provides a comprehensive collection of toxins from different sources (animals, plants, synthetic, etc.), combined with chemical features, as well as information about commercial availability. This dataset enables a detailed investigation of the correlations between chemical, functional and structural properties of toxins. Furthermore, these data can be used to evaluate the risk of using compounds in medicine or industry, and give valuable insight into the mechanisms of toxicity. While certain toxins affect many types of cell lines, some toxic compounds only interfere with defined cell types, leading to a specific toxicity. Cytostatic drugs, which are often used in chemotherapy, affect the cell cycle or the DNA replication mechanisms (6) and are therefore toxic to all living cells, although tumorigenic cells are more strongly affected due to their high rate of cell proliferation. In contrast, omeprazole, a drug for the treatment of gastric or duodenal ulcer, only affects cells in the stomach, as it needs an acidic environment to become active (7). Since it is of great interest to determine whether a compound interacts specifically with particular cells, SuperToxic is a valuable tool for retrieving such information. Additionally, the toxicity of an unknown substance can be estimated by comparison with structurally similar compounds of known toxicity.
Another application of this database is the estimation of health hazards for a variety of chemicals, especially with respect to the new European chemical management system REACH (registration, evaluation, authorization of chemicals) (8), a regulation for the registration, evaluation and confinement of chemicals. This regulation is necessary, as in many chemical production processes, for example, the fabrication of colors, varnishes or leather, the use of toxic compounds is almost inevitable. Therefore, it is essential to assess potentially hazardous effects in order to safeguard such substances during transportation, usage and storage. The vast usage of chemicals and new chemical registration programs, like REACH, demand alternatives to experimental validation. According to REACH, all producers or importers who introduce more than one ton per annum of a substance to the European Union must evaluate the chemical regarding its toxicity. In order to reduce cost-intensive animal testing of toxic compounds, the promotion of data exchange (9–11) and predictive toxicology are gaining importance and acceptance (12). There are several theoretical approaches, besides QSAR (quantitative structure–activity relationship) (13), which describe the relationship between toxicity and physicochemical properties of compounds. For such predictions, high-quality, comprehensive and well-structured databases are essential. The toxicity values and chemical information given by SuperToxic provide such a basis for hazard assessment.

THE DATABASE

SuperToxic comprises data from publicly available databases and scientific literature, assembling a vast number of toxic compounds. Currently, there are about 60 000 structures with corresponding properties stored in the database. Additionally, properties such as the number of hydrogen bond (H-bond) donors and acceptors, molecular weight and the octanol–water partition coefficient logP, which allow the evaluation of Lipinski's Rule of Five (14), can be found in the database. The web interface (available at http://bioinformatics.charite.de/supertoxic) provides different options to access the data, which are described after Figure 1. To demonstrate the functionality of the database, the toxicity and the chemical characterization of a new potential drug candidate (Hit1) are evaluated as an example: the structure can either be uploaded as a MOL file or be drawn using the molecule editor. The similarity search results in a list of matches to the query compound, ordered by similarity (Figure 1 II, columns 1–6). The information given for each match is itemized after the description of the search options.

Figure 1.

Flow chart of queries and search results in SuperToxic. (I) Search options of the web interface. (A) Search for toxic compounds via name, CASRN or NSC number and pathway. (B) Structure search: upload a MOL file, enter a SMILES or InChI code or draw the structure, e.g. of a potential drug candidate (Hit1). Based on this query structure, a similarity or substructure search can be performed. (C) Search for toxic compounds via properties, like molecular weight, number of atoms or rings, H-bond donors or acceptors, LogP or the toxicity value. (II) Result table of a similarity search (Hit1 as query structure), showing a summary for each compound. (A) A detailed view shows all properties. (B) All toxicity values for a compound. (C) Supplier information. (D) Link to KEGG pathways.

Within the ‘Toxin Search’ option (Figure 1, I-A), it is possible to perform a specific search, given the name of the compound, the Chemical Abstracts Services Registry Number (CASRN) or the NSC number, an identifier of the National Cancer Institute database. A search for a certain pathway lists all compounds associated with this pathway. The ‘Structure Search’ option (Figure 1, I-B) allows a structure upload via InChI (IUPAC International Chemical Identifier), SMILES (Simplified Molecular Input Line Entry System) or MOL file. Additionally, the structure can be drawn using a built-in molecule editor. The provided structure can be used as input for either a similarity screening or a substructure search. Another way to search the database is implemented in the ‘Property Search’ (Figure 1, I-C). Here, the definition of value ranges for certain attributes (e.g. the molecular weight, the logP, the number of rings, H-bond donors or acceptors) provides a list of all database entries fulfilling the conditions. To browse the whole database, the user can choose a letter or a number to display all database entries starting with the selection. Alternatively, all CASRN or NSC numbers available in the database can be listed.

For each match in the result table, the following information is given: a two-dimensional structure; a three-dimensional visualization of the structure; synonyms, including CASRN and NSC number; the empirical formula; toxicity information (Figure 1, II-B), comprising the dose, the testing type (e.g. LC50) and the organism or cell line the compound was tested on; and links to resources external to SuperToxic, namely order information for more than 60 different suppliers (Figure 1, II-C), pathway information (Figure 1, II-D), ligand information and target information.

For each compound in the result table, a separate ‘Detailed View’ window (Figure 1, II-A) is provided, displaying further structural and chemical properties, such as the number of rings, atoms, H-bond donors and acceptors, rotatable bonds, SMILES, logP and molecular weight. All this information can be downloaded in PDF format, together with the atomic coordinates in MOL format. Some of the compounds found to be similar to Hit1, such as alfacalcidol (Figure 1 II, red boxes), are very toxic. This finding suggests that, although the other similar compounds found are only slightly toxic, the candidate itself might be tested for toxicity starting with very low concentrations. Another useful feature of SuperToxic is the user upload interface, which allows the scientific community to contribute to the database. Several data items are required for the upload: the structure, toxicological information (type of toxicity, dose, unit, organism or cell line) and an email address for further correspondence. After manual curation, the database will be updated with the new compound.
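The range-based ‘Property Search’ described in this section can be illustrated with a minimal sketch. The record layout (plain dictionaries), the property names and the function name are assumptions made for illustration only; the actual SuperToxic schema and query code are not given in the text.

```python
def property_search(entries, ranges):
    """Return the entries whose properties all fall inside the given ranges.

    entries: list of dicts, one per compound (hypothetical record layout).
    ranges:  {property_name: (low, high)}; bounds are inclusive and either
             bound may be None to leave that side open.
    """
    def in_range(value, low, high):
        # An open bound (None) never excludes a value.
        return (low is None or value >= low) and (high is None or value <= high)

    return [
        entry
        for entry in entries
        if all(in_range(entry[name], low, high)
               for name, (low, high) in ranges.items())
    ]
```

Calling `property_search(entries, {"molecular_weight": (150, 500), "rings": (2, None)})` would keep only compounds with a molecular weight between 150 and 500 and at least two rings.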

METHODS

Data mining

SuperToxic was established on the basis of data from the publicly available databases PubMed, DSSTox (15) and NCI60 (16). Furthermore, the book ‘Biogene Gifte’ (17) was manually surveyed, making its data available online for the first time. Toxicity data were collected from literature by extensive text mining. First, a searchable index of the PubMed database was built. In the next step, the index was filtered for toxicity-related keywords and various patterns, like units or IUPAC names. Finally, all relevant text passages were manually curated with regard to the presence of any toxicity information. The data from all sources were merged to eliminate duplicated compounds. The database currently contains about 60 000 compounds. Also included are references to the origin of the compounds and about 600 entries in the Kyoto Encyclopedia of Genes and Genomes (KEGG) (18). To detect potential targets in biochemical pathways, approximately 400 compounds were identified in the Protein Data Bank (PDB) (19,20) and the corresponding targets were linked via more than 800 KEGG and 3600 UniProt (21) entries.
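The keyword and pattern filtering step can be sketched as follows. The keywords, the unit list and the regular expression are hypothetical; the text does not specify which patterns were actually used, and in the described workflow every hit is still curated manually.

```python
import re

# Hypothetical patterns: a toxicity measure, a short gap without digits,
# a numeric value and a unit. The real keyword and unit lists are not
# given in the text.
MEASURE = r"(?P<measure>LD50|LC50|TGI|NOEL|LOEL)"
VALUE = r"(?P<value>\d+(?:\.\d+)?)"
UNIT = r"(?P<unit>mg/kg|g/kg|mg/l|mM|nM)"
PATTERN = re.compile(MEASURE + r"\D{0,40}?" + VALUE + r"\s*" + UNIT)


def find_toxicity_mentions(text):
    """Return (measure, value, unit) tuples for candidate text passages."""
    return [
        (m.group("measure"), float(m.group("value")), m.group("unit"))
        for m in PATTERN.finditer(text)
    ]
```

A filter of this kind only proposes candidate passages; it cannot tell which compound or organism a value belongs to, which is why a manual curation step follows.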

Calculation of chemical properties

The calculations of chemical properties, e.g. molecular weight and number of H-bond donors and acceptors, were performed with functions from OpenBabel 2.1, an open source chemical toolbox (22) (http://openbabel.sourceforge.net/). To compute the properties for the structures, the MyChem extension (an implementation of the OpenBabel 2.1 library for MySQL) was used (http://mychem.sourceforge.net/).
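The computed properties are the inputs needed to evaluate Lipinski's Rule of Five, which the database description mentions. The sketch below assumes the widely used thresholds (molecular weight at most 500, logP at most 5, at most 5 H-bond donors, at most 10 H-bond acceptors) and the common convention of tolerating one violation; the class and function names are illustrative and are not part of OpenBabel or MyChem.

```python
from dataclasses import dataclass


@dataclass
class Properties:
    molecular_weight: float  # g/mol
    logp: float              # octanol-water partition coefficient
    hbond_donors: int
    hbond_acceptors: int


def lipinski_violations(p):
    """Count how many of the four Rule of Five criteria are violated."""
    checks = [
        p.molecular_weight > 500,
        p.logp > 5,
        p.hbond_donors > 5,
        p.hbond_acceptors > 10,
    ]
    return sum(checks)


def passes_rule_of_five(p, max_violations=1):
    """Assumed convention: a compound passes with at most one violation."""
    return lipinski_violations(p) <= max_violations
```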

Analysis of chemical and structural properties

For the complete dataset, the distributions of molecular weight, LogP and H-bond donors and acceptors were analyzed, with drugs (23) and natural compounds (24) serving as reference groups. A reduced dataset, derived from the NCI60 panel, was subdivided into three toxicity groups according to −log (LC50): slightly toxic, medium toxic and highly toxic compounds. For each group, the distribution of chemical properties was calculated separately to reveal interdependencies with toxicity. The results are shown under ‘Statistics’ on the SuperToxic website.
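The grouping by −log (LC50) can be sketched as below. The text does not state the group boundaries, the logarithm base or the concentration unit, so the cutoffs (4 and 6), base 10 and molar concentrations used here are assumptions chosen purely for illustration.

```python
import math


def neg_log_lc50(lc50_molar):
    """Return -log10(LC50), assuming LC50 is given as a molar concentration."""
    if lc50_molar <= 0:
        raise ValueError("LC50 must be a positive concentration")
    return -math.log10(lc50_molar)


def toxicity_group(lc50_molar, medium_cutoff=4.0, high_cutoff=6.0):
    """Assign one of three groups; both cutoffs are hypothetical."""
    score = neg_log_lc50(lc50_molar)
    if score >= high_cutoff:
        return "highly toxic"
    if score >= medium_cutoff:
        return "medium toxic"
    return "slightly toxic"
```

A lower LC50 means that less compound is needed to kill half of the cells, so a larger −log (LC50) corresponds to higher toxicity.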

Structural fingerprint

The similarity search is performed using so-called structural fingerprints. Each fingerprint is a binary string with a length of 1536 bits that encodes the chemical characteristics of a compound. Within this database, a combination of two fingerprints is used: (i) a 1024 bit fingerprint based on MDL; (ii) a 512 bit fingerprint encoding 317 structural properties defined as SMARTS patterns (http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html), provided by OpenBabel (http://openbabel.org/wiki/FP2). The first generates a fingerprint for each chemical structure; the second accounts for exact structural patterns. The application of this combined fingerprint improves the similarity and substructure searches and yields more detailed results. The fingerprints of all compounds were precalculated and stored in the database.
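How a 1024 bit and a 512 bit fingerprint add up to one 1536 bit string can be shown with a small sketch. Fingerprints are represented here as Python integers used as bit sets, and the concatenation order (first fingerprint in the low bits) is an assumption; the text does not describe the storage layout.

```python
FP_A_BITS = 1024  # length of the first fingerprint
FP_B_BITS = 512   # length of the SMARTS-pattern fingerprint


def combine_fingerprints(fp_a, fp_b):
    """Concatenate two fingerprints into one 1536 bit integer bit set."""
    # A non-zero result after shifting means bits outside the allowed width
    # (this also rejects negative numbers, which shift to -1).
    if fp_a >> FP_A_BITS or fp_b >> FP_B_BITS:
        raise ValueError("fingerprint does not fit its declared width")
    # Assumed layout: fp_a occupies bits 0..1023, fp_b bits 1024..1535.
    return (fp_b << FP_A_BITS) | fp_a
```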

Similarity search

During the similarity search, the fingerprint of the input structure is built and compared with the fingerprints of the database entries using the Tanimoto coefficient. It is a similarity index and defined as:

$$T(a,b) = \frac{N_{ab}}{N_a + N_b - N_{ab}}$$

$N_a$ and $N_b$ describe the number of bits set to 1 in the fingerprints of compounds a and b, respectively. $N_{ab}$ is the number of bit positions set to 1 in both fingerprints. A molecule with a Tanimoto coefficient ⩾0.85 to an active compound is assumed to be biologically active itself (25). The Tanimoto calculations are performed using MyChem.
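As a minimal illustration, the Tanimoto coefficient is the number of bits set in both fingerprints divided by the number of bits set in at least one of them. The sketch below again treats fingerprints as Python integers; it is not the MyChem implementation, and returning 0.0 for two empty fingerprints is a convention chosen here.

```python
def popcount(bits):
    """Number of bits set to 1 in a non-negative integer."""
    return bin(bits).count("1")


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as integer bit sets."""
    n_a = popcount(fp_a)
    n_b = popcount(fp_b)
    n_ab = popcount(fp_a & fp_b)  # bits set in both fingerprints
    union = n_a + n_b - n_ab      # bits set in at least one fingerprint
    # Assumed convention: two empty fingerprints get similarity 0.0.
    return n_ab / union if union else 0.0
```

For example, the fingerprints `0b1110` and `0b0111` each have three bits set and share two of them, which gives 2 / (3 + 3 − 2) = 0.5.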

Substructure search

For the substructure search, the database entries are first filtered according to the number of atoms and rings. Structures with fewer atoms or rings than the query molecule are not considered, which narrows down the search space. Afterwards, the fingerprint of the query structure is compared with the fingerprints of the remaining database entries. If all bits set to one in the query fingerprint are also set in the fingerprint of a database compound, the query structure is a substructure of this compound.
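The two filtering stages can be sketched as follows, with fingerprints as Python integers. The `Entry` record and the function name are hypothetical. In general cheminformatics practice, a fingerprint bit test of this kind is usually treated as a screen that may still admit false positives, so the function is named accordingly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    name: str
    n_atoms: int
    n_rings: int
    fingerprint: int  # integer used as a bit set


def passes_substructure_screen(query, candidate):
    """True if the candidate may contain the query as a substructure."""
    # Stage 1: a candidate with fewer atoms or rings cannot contain the query.
    if candidate.n_atoms < query.n_atoms or candidate.n_rings < query.n_rings:
        return False
    # Stage 2: every bit set in the query must also be set in the candidate.
    return (query.fingerprint & candidate.fingerprint) == query.fingerprint
```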

Server

SuperToxic is designed as a relational database implemented on a MySQL server. For chemical functionality, the MyChem/OpenBabel package is added. Web access is provided via an Apache Webserver 2.2. The web site is built in PHP5 and HTML. MarvinSketch is used for the molecule drawing and upload functions, and Jmol (26) for the visual inspection of compounds.

CONCLUSION AND FUTURE DIRECTIONS

SuperToxic is a rich source of toxicological data, combining structural, functional and chemical information with the corresponding toxicity values. Features like similarity screening and substructure search make it possible to characterize and estimate the potential toxicity of substances that have not been validated yet. The application of SuperToxic might help to reduce the amount of animal testing, e.g. for the risk assessment of new drugs, or to fulfill the new EU REACH requirements. Additionally, this database represents a valuable support for toxin research, as the information about the compounds will facilitate experimental design. The range of toxicity, possible targets, mode of action and chemical modifications that lower toxicity can be retrieved. Further enlargement of SuperToxic is planned: data concerning peptides and proteins will be added, and ecotoxicological information will be considered. In addition, an upload function enables the scientific community to contribute by adding new compounds or supplementary information. In order to provide up-to-date information, the database will be updated twice a year.

FUNDING

Deutsche Forschungsgemeinschaft (SFB 449); IRTG Berlin-Boston-Kyoto; Investitionsbank Berlin (IBB); Deutsche Krebshilfe. Funding for the open access charge: Deutsche Forschungsgemeinschaft (SFB-449). Conflict of interest statement. None declared.

23 in total

1. The Protein Data Bank.

Authors: H M Berman; J Westbrook; Z Feng; G Gilliland; T N Bhat; H Weissig; I N Shindyalov; P E Bourne
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

Review 2. SCORPION, a molecular database of scorpion toxins.

Authors: K N Srinivasan; P Gopalakrishnakone; P T Tan; K C Chew; B Cheng; R M Kini; J L Koh; S H Seah; V Brusic
Journal: Toxicon Date: 2002-01 Impact factor: 3.033

3. Do structurally similar molecules have similar biological activity?

Authors: Yvonne C Martin; James L Kofron; Linda M Traphagen
Journal: J Med Chem Date: 2002-09-12 Impact factor: 7.446

Review 4. The expanding role of predictive toxicology: an update on the (Q)SAR models for mutagens and carcinogens.

Authors: Romualdo Benigni; Tatiana I Netzeva; Emilio Benfenati; Cecilia Bossa; Rainer Franke; Christoph Helma; Etje Hulzebos; Carol Marchant; Ann Richard; Yin-Tak Woo; Chihae Yang
Journal: J Environ Sci Health C Environ Carcinog Ecotoxicol Rev Date: 2007 Jan-Mar Impact factor: 3.781

Review 5. Recent progress in the development of anticancer agents.

Authors: Sándor Eckhardt
Journal: Curr Med Chem Anticancer Agents Date: 2002-05

Review 6. The NCI60 human tumour cell line anticancer drug screen.

Authors: Robert H Shoemaker
Journal: Nat Rev Cancer Date: 2006-10 Impact factor: 60.716

7. From genomics to chemical genomics: new developments in KEGG.

Authors: Minoru Kanehisa; Susumu Goto; Masahiro Hattori; Kiyoko F Aoki-Kinoshita; Masumi Itoh; Shuichi Kawashima; Toshiaki Katayama; Michihiro Araki; Mika Hirakawa
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

8. The Blue Obelisk-interoperability in chemical informatics.

Authors: Rajarshi Guha; Michael T Howard; Geoffrey R Hutchison; Peter Murray-Rust; Henry Rzepa; Christoph Steinbeck; Jörg Wegner; Egon L Willighagen
Journal: J Chem Inf Model Date: 2006 May-Jun Impact factor: 4.956

9. CEBS--Chemical Effects in Biological Systems: a public data repository integrating study design and toxicity data with microarray and proteomics data.

Authors: Michael Waters; Stanley Stasiewicz; B Alex Merrick; Kenneth Tomer; Pierre Bushel; Richard Paules; Nancy Stegman; Gerald Nehls; Kenneth J Yost; C Harris Johnson; Scott F Gustafson; Sandhya Xirasagar; Nianqing Xiao; Cheng-Cheng Huang; Paul Boyer; Denny D Chan; Qinyan Pan; Hui Gong; John Taylor; Danielle Choi; Asif Rashid; Ayazaddin Ahmed; Reese Howle; James Selkirk; Raymond Tennant; Jennifer Fostel
Journal: Nucleic Acids Res Date: 2007-10-25 Impact factor: 16.971

10. The universal protein resource (UniProt).

Authors:
Journal: Nucleic Acids Res Date: 2007-11-27 Impact factor: 16.971
