
SCRIPDB: a portal for easy access to syntheses, chemicals and reactions in patents.

Abstract

The patent literature is a rich catalog of biologically relevant chemicals; many public and commercial molecular databases contain the structures disclosed in patent claims. However, patents are an equally rich source of metadata about bioactive molecules, including mechanism of action, disease class, homologous experimental series, structural alternatives, or the synthetic pathways used to produce molecules of interest. Unfortunately, this metadata is discarded when chemical structures are deposited separately in databases. SCRIPDB is a chemical structure database designed to make this metadata accessible. SCRIPDB provides the full original patent text, reactions and relationships described within any individual patent, in addition to the molecular files common to structural databases. We discuss how such information is valuable in medical text mining, chemical image analysis, reaction extraction and in silico pharmaceutical lead optimization. SCRIPDB may be searched by exact chemical structure, substructure or molecular similarity and the results may be restricted to patents describing synthetic routes. SCRIPDB is available at http://dcv.uhnres.utoronto.ca/SCRIPDB.

Year: 2011 PMID: 22067445 PMCID: PMC3245107 DOI: 10.1093/nar/gkr919

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

US patent information is in the public domain and describes innovations in the medical, biological, chemical and agricultural fields. Such relevant and accessible material is ideal for scientific analysis and, indeed, databases such as PubChem (1) and ChEBI (2,3) contain chemical structures disclosed by patents. While such databases are highly useful, structural databases are insufficient for a number of scientific investigations. The extraction of component structures from a patent discards information about chemical relationships. These relationships can be explicitly labelled, such as a molecule's role as reagent or product in a chemical synthesis. Relationships may also be implicitly embedded in the context of the complete patent. For example, molecules that co-occur in drug patent claims are likely to have similar biological behavior. These exemplar relationships have been valuable in statistical analyses for automated reaction extraction (4,5) and bioisostere discovery (6). Reaction extraction characterizes the molecular transformations that occur within a set of syntheses. Bioisostere discovery catalogs molecular substituents that participate in similar biological interactions. Such analyses require access to large data sets, which are often unavailable, proprietary, or expensive. In addition to the relationships among a patent's molecular structures, a patent's data files are directly useful. One use of patent files is the creation of data sets for optical structure recognition. The task in optical structure recognition is to parse a chemical image and recover the depicted molecular structure. The training and test sets therefore require correctly matched pairs of images and molecular structure files. As a final example, a patent's written contents can serve as a target for text analytics. Patent descriptions and claims constitute a large corpus of biomedical text, coarsely annotated by, e.g., patent classification and drug–disease pairs.
To address these gaps in available metadata, we have created SCRIPDB. While providing uncomplicated searching of the patent literature, we have been careful not to eliminate underlying information. Users of the database may download the full text of the patent as well as molecular structure files and images. Additional summary files were generated and augment, rather than replace, the original data. Thus, SCRIPDB permits researchers to effectively access the full information contained in the patent literature.

MATERIALS AND METHODS

Data collection and processing

The United States Patent and Trademark Office (USPTO) provides full patent text, drawings, and, since 2001, chemical structures in complex work units. This data became available as a free bulk download, hosted on Google servers, in June 2010. The raw data files comprise every granted patent, numbering several thousand per week, and totaling over 10 terabytes of data. However, most patents are not relevant to the biological, chemical or medical domains. Since 2001, patented chemical structures can be described using standard molecular file formats. The USPTO makes disclosed molecules available as either MDL Molfiles (MOL) or ChemDraw binary CDX files. We used the presence of these chemical structure files to identify patents of potential interest. Figure 1 shows the amount of data in SCRIPDB, as measured by the quantity of CDX files (Figure 1a), individual structures (Figure 1a), and patents (Figure 1b). These numbers are consistent with previous analyses of patent data (7).

Figure 1.

Cumulative SCRIPDB content. Although SCRIPDB includes patents from 2011, we show data through 2010, the last complete year. (a) shows the number of ChemDraw CDX structure files, and the structures described therein, available in SCRIPDB for various years. SCRIPDB contains 4 814 913 CDX files from 2001 through 2010, comprising 10 840 646 molecules. Duplicate molecules were filtered from each patent but not across patents, as described in the text. (b) shows the number of patents and details the subset containing reactions. For 2001 through 2010, SCRIPDB contains 107 560 patents, of which 25 048 contain synthetic reactions.

A particular structure will often be described with both a CDX file and a Molfile. However, the information contained in such parallel files is not always identical. For example, a CDX file can label a molecule's role in a reaction as reagent or product, whereas a Molfile cannot. We provide both CDX and Molfile data formats, as well as original TIFF and generated SVG images. For additional convenience, we collated the individual structure files into one structure-data file (SDF) per patent. During collation, the generated SDF files were filtered to remove duplicate molecules. The filtering ensures that duplicated structures are removed from individual patents' SDF files but not across patents, as shared molecules may be used to uncover legitimate relationships among the patents.
Patents often describe sets of molecules rather than, or in addition to, specific structures. Markush structure notation is commonly used to succinctly describe large (or infinite) molecular classes by choosing constituent molecular fragments from alternative substituents, positions, frequencies or homologies. The enumeration and comparison of Markush classes are significant challenges for cheminformatics systems (8). For example, Markush structures typically depict variable substituent placement as bonds that cross rings, as in Figure 2. These bonds frequently appear in Molfiles as separate propane molecules laid over the core scaffold. SCRIPDB provides basic Markush handling by extracting the molecular core and canonicalizing variable substituents, which permits Markush structures to be found via substructure searches. The original Markush structure is then retrievable from the original patent's structure or image files.
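The per-patent duplicate filtering described in this section can be illustrated with a minimal Python sketch. It assumes that each structure has already been reduced to a canonical SMILES string (SCRIPDB computes these with OpenBabel). The function names, the record layout and the patent identifiers are hypothetical and are not taken from the SCRIPDB code base.

```python
from collections import defaultdict


def dedupe_within_patents(records):
    """Remove repeated structures inside each patent, but not across patents.

    records: iterable of (patent_id, canonical_smiles) pairs (assumed layout).
    Returns a dict mapping patent_id to its unique SMILES in first-seen order.
    """
    seen = defaultdict(set)
    unique = defaultdict(list)
    for patent_id, smiles in records:
        if smiles not in seen[patent_id]:
            seen[patent_id].add(smiles)
            unique[patent_id].append(smiles)
    return dict(unique)


def shared_molecules(unique_by_patent):
    """Map each structure found in more than one patent to those patents."""
    index = defaultdict(set)
    for patent_id, smiles_list in unique_by_patent.items():
        for smiles in smiles_list:
            index[smiles].add(patent_id)
    return {s: sorted(p) for s, p in index.items() if len(p) > 1}


# Hypothetical records: ethanol appears twice in US-A and once in US-B.
records = [
    ("US-A", "CCO"),
    ("US-A", "CCO"),
    ("US-A", "c1ccccc1"),
    ("US-B", "CCO"),
]
unique = dedupe_within_patents(records)
links = shared_molecules(unique)
```

Because duplicates are kept across patents, the second function can still recover the link between the two hypothetical patents that share a molecule.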

Figure 2.

Example Markush structure from US Patent 6 268 504 (9) defining a chemical class via substituent, positional and frequency variations.

Implementation

Raw patent data was downloaded from Google's bulk download of USPTO granted patents with embedded image data (http://www.google.com/googlebooks/uspto-patents-redbook.html). Filtering, collation and de-duplication were performed on an IBM CL1350 cluster with 1 344 cores over 168 Infiniband-connected HS21-XM BladeServers and a DCS9550 storage system. The cluster runs the CentOS operating system, version 5.1, and manages coarse-grained parallelism with the Portable Batch System. SDF files were generated and duplicate structures were removed using OpenBabel, SVN revision 4487 (10). OpenBabel was also used to generate SVG files and compute canonical SMILES strings (http://www.daylight.com/smiles/) for display of retrieved search results, as shown in Figure 3.

Figure 3.

Sample of search results for molecules containing an acridine substructure.

The web interface to SCRIPDB was implemented in Python 2.6 using the Django web application framework, version 1.1 (11). Chemistry-specific search functionality, such as substructure searches or structural similarity using the Tanimoto coefficient (12) of OpenBabel FP2 linear structural fingerprints, is provided via integration with Pybel (13). Search structures may be specified via SMARTS queries, uploaded Molfiles, or via the interactive ChemWriter molecular editor (http://metamolecular.com/chemwriter/). ChemWriter is implemented in pure JavaScript and permits SCRIPDB to be accessed from any major browser for desktop or iPad without the installation of external plugins.
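The Tanimoto coefficient used for similarity searching is the number of fingerprint bits shared by two molecules divided by the number of bits set in either. The Python sketch below shows the calculation on plain sets of bit positions, which stand in for FP2 fingerprints. The function names, the example library and the 0.7 threshold are illustrative assumptions, not SCRIPDB's actual search code.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of set-bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        # Two empty fingerprints: returning 0.0 is a convention assumed here.
        return 0.0
    return len(a & b) / union


def rank_by_similarity(query_bits, library, threshold=0.7):
    """Return (name, score) pairs at or above threshold, most similar first.

    library: dict mapping a molecule name to its set of fingerprint bits
    (hypothetical layout).
    """
    scored = ((name, tanimoto(query_bits, bits)) for name, bits in library.items())
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


# Toy fingerprints; real FP2 fingerprints have many more bits.
library = {"mol_a": {1, 2, 3}, "mol_b": {1, 2, 3, 4}, "mol_c": {9}}
hits = rank_by_similarity({1, 2, 3}, library)
```

In this toy library, mol_a matches the query exactly, mol_b shares three of four bits, and mol_c falls below the threshold.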

RESULTS

Structures

New patents are granted each week, providing a steady stream of additional data for SCRIPDB. Here, we report results to the end of 2010, which is the last complete year of patent data. At the end of 2010, SCRIPDB contained 107 560 patents, including 4 814 913 non-redundant CDX structure files. Many structure files describe multiple molecules which, after de-duplication of molecules within a patent, yield a total of 10 840 646 molecules. Not only does this constitute a significant amount of total data (7) but the rate of structure disclosure appears to be growing. Both 2009 and 2010 had record numbers of molecules disclosed, at 1 259 097 and 1 639 522 structures, respectively. SCRIPDB data statistics are summarized in Figure 1. However, focusing solely on the chemical structures ignores valuable chemical information. For many applications, molecular relationships need to be analyzed at various levels of granularity; for example, within a single CDX file, a particular patent, a group of patents that refer to a specific disease, or within the entirety of the database. As seen in Figure 1, patents typically contain multiple structural files which, in turn, contain multiple structures. For example, US Patent 6 884 815 contains 8 187 structure files (14). Figure 4 shows the distribution of molecules per patent. The largest group contains patents that describe relatively few molecules (specifically, 10 or fewer). However, most patents catalog larger structural series. Two-thirds of patents contain more than 10 structures, over half of patents contain more than 20 structures, and almost 1 300 patents describe more than 1 000 structures each. A manual examination of a sample of patents shows that these structures often describe molecular analogs produced in the context of a medicinal chemistry optimization program.

Figure 4.

Structures per patent in SCRIPDB. While 34 789 patents contain ten or fewer structures, two-thirds of patents contain more than ten and 1 296 patents contain more than a thousand structures.
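As a rough consistency check on these proportions, the following Python sketch computes the share of patents above a given structure count. The helper name and the example counts are hypothetical, and using the 107 560 patents reported for 2001 through 2010 as the denominator is an assumption.

```python
def share_above(counts, threshold):
    """Fraction of patents whose structure count exceeds threshold."""
    counts = list(counts)
    if not counts:
        raise ValueError("counts must not be empty")
    return sum(1 for c in counts if c > threshold) / len(counts)


# Figures reported in the text and in the Figure 4 caption.
TOTAL_PATENTS = 107560  # assumed denominator (patents from 2001 through 2010)
TEN_OR_FEWER = 34789    # patents with ten or fewer structures

share_more_than_ten = 1 - TEN_OR_FEWER / TOTAL_PATENTS
print(f"{share_more_than_ten:.1%}")  # prints 67.7%
```

Under that assumption the share is about 67.7%, which agrees with the statement that two-thirds of patents contain more than ten structures. Given a list of per-patent structure counts, `share_above(counts, 20)` or `share_above(counts, 1000)` would give the other thresholds mentioned in the text.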

Synthetic reactions

An important relationship described in CDX files is chemical synthesis. Figure 1b details the subset of patents that contain reactions. These are the patents that contain a CDX file with a ReactionStep object (http://www.cambridgesoft.com/services/documentation/sdk/chemdraw/cdx/ReactionStep.htm). Of the 107 560 patents from 2001 to 2010, SCRIPDB contains 25 048 patents that describe syntheses. Fundamental molecular roles, such as reagent and product, are lost when a synthesis is split into separate structures. These relationships may be difficult to rederive, especially in multistep syntheses. Figure 5, which shows the number of reaction steps in SCRIPDB's syntheses, demonstrates that synthetic pathways requiring multiple reaction steps occur frequently in the patent literature. The 25 048 synthesis-containing patents describe a total of 341 764 individual reaction steps. While the most common are single-step syntheses, 52 462 syntheses (27.6%) have at least two reaction steps.

Figure 5.

Number of CDX data files that describe syntheses of various lengths.

DISCUSSION

In addition to their direct value in intellectual property licensing (15) and competitive business analysis (16), patents can serve as a useful resource for a variety of academic research. We examine the datasets used in previous research and show that the data available in SCRIPDB is sufficient, quantitatively and qualitatively, to provide value for future investigations. Additionally, the information in patents is complementary to data available in the conventional academic literature (17), suggesting additional insight may be derived from integration with existing datasets. Specifically, we survey patent data as a tool for the development of chemical image parsing, biological text mining, reaction extraction, and bioisostere discovery.

Patents as a source of chemical images

The problem of optical structure recognition is to automatically extract molecular structure information from images. A necessary resource in the training and evaluation of such systems is a set of molecular images for which the true molecular structure is already known. Ideally, the images are representative of chemical images as commonly used in practice. Validation sets comprising 6 185 images (18) and 454 images (19) were recently reported in the literature. By comparison, SCRIPDB provides millions of CDX structure and TIFF image files. This comprises a validation set several orders of magnitude larger than previously reported, even after eliminating possibly-confounding complex synthetic schemes. Furthermore, SCRIPDB contains SVG images produced by OpenBabel in addition to the original patent's TIFF images. This redundancy permits testing structure recognition algorithms for robustness to the idiosyncrasies of alternative image generation tools.

Patents as biomedical literature

Statistical analysis of the biomedical literature requires large quantities of freely accessible documents. Much of the biological text mining research has therefore used PubMed abstracts of journal articles while more recent full-text analyses have focused on Open Access journals (20). A large corpus of 162 259 full-text journal articles in HTML was used in the TREC Genomics track (21). The 107 560 patents in SCRIPDB constitute a document collection of comparable size. Additionally, patents are interestingly complex. Patents are inherently semistructured and are partially annotated by their embedded chemical structures, gene sequences (7), mathematical equations, and data tables. Patents also contain citations to related papers (22) and patents, permitting co-citation analysis. Patent relationships may be derived and analyzed based on shared terms in patent text, shared molecules, patent assignee (6) or ontological categorization (http://www.uspto.gov/web/offices/opc/documents/classescombined.pdf).

Patents as a reaction database

Libraries of chemical transforms are used for combinatorial library design (23), static synthetic feasibility (24), and full retrosynthetic analysis (5). Although such libraries can be constructed by hand (25), automated extraction is an effective and labor-saving alternative (4,5,26,27). Such systems determine molecular substructures that are modified in the same manner across many different syntheses. These modifications can then be treated as putative reactions, predicting that if the same molecular substructure is found in a new molecule, it can be changed in the same way. Automated transform extraction therefore requires a large corpus of example syntheses within which to find consistent molecular changes. In this capacity, SCRIPDB with its 341 764 reactions compares favorably to the 42 333 reactions in the Methods in Organic Synthesis database recently used for reaction extraction (5) or the 30 530 CCR reactions used to characterize functional group reactivity (28). While smaller than commercial databases containing millions of reactions, SCRIPDB has the virtue of being freely accessible.

Patents as a bioisostere catalog

After finding an initial lead molecule, medicinal chemists will typically create and test a large series of analogous compounds, seeking to increase binding affinity; improve absorption, distribution, metabolism, excretion and toxicity profiles; or avoid the intellectual property restrictions of competitors' patents. One systematic approach for exploring chemical space is to substitute bioisosteres, which are molecular fragments that have similar shape and chemical properties. For example, hydrogen and fluorine have similar van der Waals radii and the same valence. Exchanging a hydrogen with a fluorine permits the medicinal chemist to maintain the same molecular shape while optimizing the molecule's charge distribution. There is strong interest both in techniques to determine bioisosteres (29,30) and in idea-generation tools that propose alternative molecules en masse by modifying a lead molecule in silico using the same approach as a medicinal chemist (31). Southall and Ajay (6) investigated bioisosteric replacements in kinase patents by analyzing 116 550 compounds. The maximum common substructure was computed for pairs of compounds and the remainder of the molecules were identified as exchangeable chemical replacements. Southall and Ajay (6) were interested in the research strategies of drug companies, so only compared compounds found in patents assigned to different companies. A similar analysis can be performed for the compound series within a single patent (32). Each patent defines a set of bioisosteric replacements that were determined to be reasonable, interesting and synthetically feasible by a medicinal chemist. Figure 4 demonstrates that SCRIPDB contains sufficient patents with large chemical series to extract sensible chemical replacements.
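The pairwise replacement analysis can be sketched in Python under a strong simplification: each compound is assumed to be already decomposed into a shared core and one variable substituent, whereas the published analyses derive that split from a maximum common substructure computation. The function name and the labels in the example are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def replacement_pairs(decomposed):
    """Count how often two substituents are exchanged on the same core.

    decomposed: iterable of (core_label, substituent_label) pairs, assumed to
    come from an upstream maximum-common-substructure step not shown here.
    Returns a Counter keyed by alphabetically ordered substituent pairs.
    """
    substituents_by_core = defaultdict(set)
    for core, substituent in decomposed:
        substituents_by_core[core].add(substituent)

    pair_counts = Counter()
    for substituents in substituents_by_core.values():
        # Every pair of substituents seen on one core is a candidate replacement.
        for pair in combinations(sorted(substituents), 2):
            pair_counts[pair] += 1
    return pair_counts


# Hypothetical compound series from two scaffolds.
series = [
    ("core_A", "H"), ("core_A", "F"),
    ("core_B", "H"), ("core_B", "F"), ("core_B", "Cl"),
]
pairs = replacement_pairs(series)
```

A pair that recurs across many cores, such as the hydrogen and fluorine exchange in this toy series, is a stronger candidate replacement than one seen only once.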

CONCLUSION AND FUTURE WORK

The impetus of intellectual property protection creates a deluge of patents that carries enormous quantities of chemical and biological information. While patented molecules are accessible via chemical databases, the extraction of component structures from a patent needlessly removes information about chemical relationships. SCRIPDB is designed to make such metadata broadly accessible. We examined the information used in medical text mining, chemical image parsing, reaction extraction and the development of computational tools for lead optimization. We demonstrated that the quantity and quality of SCRIPDB's data compare favorably with existing commercial and free data sets. In many cases, it is complementary to existing data. In the future, we plan to reduce the manual intervention necessary for the incorporation of new patents. Automatic patent processing is critical for maintaining the completeness of SCRIPDB, since new patents are released weekly. Automated updating will also support deposition of non-redundant molecules into PubChem (1). In addition, we wish to pursue integration with other value-adding databases, such as ChEMBL (33) and CDIP (http://ophid.utoronto.ca/cdip), and provide programmatic access to SCRIPDB via RESTful web services. Currently SCRIPDB incorporates patents only from the United States, because the USPTO provides distinct structure files. However, valuable patent data is available from other countries' patent offices. Robust optical recognition of chemical structures would permit future integration with patent offices that provide molecules as images, such as the European Patent Office, the Japanese Patent Office, the World Intellectual Property Organization, the UK Intellectual Property Office and the Canadian Intellectual Property Office.

FUNDING

Computational analysis was supported in part by Canada Foundation for Innovation (CFI #12301 and CFI #203383); Ontario Research Fund (GL2-01-030); Canada Research Chair Program (to I.J., in part); Ontario Ministry of Health and Long Term Care (in part). Funding for open access charge: Ontario Research Fund (GL2-01-030).

Conflict of interest statement. None declared.

22 in total

1. Assessing synthetic accessibility of chemical compounds using machine learning methods.

Authors: Yevgeniy Podolyan; Michael A Walters; George Karypis
Journal: J Chem Inf Model Date: 2010-06-28 Impact factor: 4.956

2. Annotating patents with Medline MeSH codes via citation mapping.

Authors: Thomas D Griffin; Stephen K Boyer; Isaac G Councill
Journal: Adv Exp Med Biol Date: 2010 Impact factor: 2.622

3. Making "real" molecules in virtual space.

Authors: György Pirok; Nóra Maté; Jeno Varga; József Szegezdi; Miklós Vargyas; Szilard Dórant; Ferenc Csizmadia
Journal: J Chem Inf Model Date: 2006 Mar-Apr Impact factor: 4.956

4. ChEMBL. An interview with John Overington, team leader, chemogenomics at the European Bioinformatics Institute Outstation of the European Molecular Biology Laboratory (EMBL-EBI). Interview by Wendy A. Warr.

Authors: John Overington
Journal: J Comput Aided Mol Des Date: 2009-02-05 Impact factor: 3.686

5. CLiDE Pro: the latest generation of CLiDE, a tool for optical chemical structure recognition.

Authors: Aniko T Valko; A Peter Johnson
Journal: J Chem Inf Model Date: 2009-04 Impact factor: 4.956

6. Computer-assisted design of complex organic syntheses.

Authors: E J Corey; W T Wipke
Journal: Science Date: 1969-10-10 Impact factor: 47.728

7. The Blue Obelisk-interoperability in chemical informatics.

Authors: Rajarshi Guha; Michael T Howard; Geoffrey R Hutchison; Peter Murray-Rust; Henry Rzepa; Christoph Steinbeck; Jörg Wegner; Egon L Willighagen
Journal: J Chem Inf Model Date: 2006 May-Jun Impact factor: 4.956

8. PubChem: a public information system for analyzing bioactivities of small molecules.

Authors: Yanli Wang; Jewen Xiao; Tugba O Suzek; Jian Zhang; Jiyao Wang; Stephen H Bryant
Journal: Nucleic Acids Res Date: 2009-06-04 Impact factor: 16.971

9. Non-redundant patent sequence databases with value-added annotations at two levels.

Authors: Weizhong Li; Hamish McWilliam; Ana Richart de la Torre; Adam Grodowski; Irina Benediktovich; Mickael Goujon; Stephane Nauche; Rodrigo Lopez
Journal: Nucleic Acids Res Date: 2009-11-01 Impact factor: 16.971

10. ChEBI: a database and ontology for chemical entities of biological interest.

Authors: Kirill Degtyarenko; Paula de Matos; Marcus Ennis; Janna Hastings; Martin Zbinden; Alan McNaught; Rafael Alcántara; Michael Darsow; Mickaël Guedj; Michael Ashburner
Journal: Nucleic Acids Res Date: 2007-10-11 Impact factor: 16.971
