Kohulan Rajan, Henning Otto Brinkhaus, Achim Zielesny, Christoph Steinbeck.
Abstract
Structural information about chemical compounds is typically conveyed as 2D images of molecular structures in scientific documents. Unfortunately, these depictions are not a machine-readable representation of the molecules. With a backlog of decades of chemical literature in printed form not properly represented in open-access databases, there is a high demand for the translation of graphical molecular depictions into machine-readable formats. This translation process is known as Optical Chemical Structure Recognition (OCSR). Today, we are looking back on nearly three decades of development in this demanding research field. Most OCSR methods follow a rule-based approach in which the key step, vectorization of the depiction, is followed by the interpretation of vectors and nodes as bonds and atoms. In contrast, some of the latest approaches are based on deep neural networks (DNN). This review provides an overview of all methods and tools that have been published in the field of OCSR. Additionally, a small benchmark study was performed with the available open-source OCSR tools in order to examine their performance.
Keywords: Chemical data extraction; Chemical structure; Data mining; Machine learning; Named entity recognition; Open data; Optical chemical structure recognition
Year: 2020; PMID: 33372625; PMCID: PMC7541205; DOI: 10.1186/s13321-020-00465-0
Source DB: PubMed; Journal: J Cheminform; ISSN: 1758-2946; Impact factor: 5.514
Comparison of tools and methods published
| Tool name | Programming language used | Operating System compatibility | Open-source | Commercial or free availability (2020) | Ongoing development |
|---|---|---|---|---|---|
| Kekulé | C++ | Windows | No | Yes | No |
| OROCS | C | IBM OS/2 | No | No | No |
| CLiDE Pro | C++ | Windows | No | Yes | Yes |
| OSRA | C++ | Independent | Yes [a] | Yes | Yes |
| ChemReader | C++ | Windows | No | No | No |
| MolRec | Unknown | Unknown | No | No | Unknown |
| Imago | C++ | Independent | Yes | Yes | No |
| ChemOCR | Java | Independent | No | Yes | Yes |
| ChemInfty | Unknown | Windows | No | No | No |
| eChem | Unknown | Unknown | No | No | No |
| MLOCSR | Unknown | Only Web interface | No | Only web interface | Unknown |
| OCSR | Unknown | Unknown | No | No | Unknown |
| ChemRobot | Unknown | Unknown | No | No | Unknown |
| MolVec | Java | Independent | Yes | Yes | Yes |
| MSE-DUDL | Python | Independent | No | No | No |
| Chemgrapher | Python | Independent | No | No | Yes |
[a] Precompiled tool is only available commercially
Time elapsed and accuracy reported for the open-source OCSR tools
| Dataset | Metric | MolVec 0.9.7 | Imago 2.0 | OSRA 2.1 |
|---|---|---|---|---|
| USPTO (5719 images) | Time (min) | 28.65 | 72.83 | 145.04 |
| | Accuracy | 88.41% | 87.20% | 87.69% |
| UOB (5740 images) | Time (min) | 28.42 | 152.52 | 125.78 |
| | Accuracy | 88.39% | 63.54% | 86.50% |
| CLEF 2012 (961 images) | Time (min) | 4.41 | 16.03 | 21.33 |
| | Accuracy | 80.96% | 65.45% | 94.90% |
| JPO (450 images) | Time (min) | 7.50 | 22.55 | 16.68 |
| | Accuracy | 66.67% | 40.00% | 57.78% |
Fig. 1 (a) Accuracy (right; higher is better) and (b) total processing time (left; lower is better)
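
The benchmark table reports only total time and accuracy per dataset. To compare the tools on a per-image basis, the same figures can be turned into throughput (images per minute) and an approximate count of correctly recognised images. The following is a minimal sketch of that arithmetic. The function names `throughput`, `correct_images`, and `best_tool` are hypothetical helpers introduced here, and the derived values are not reported in the review itself.

```python
"""Derived metrics from the benchmark table (illustrative only)."""

# Values copied from the benchmark table:
# dataset -> (image count, {tool: (time in minutes, accuracy in percent)})
BENCHMARK = {
    "USPTO": (5719, {
        "MolVec 0.9.7": (28.65, 88.41),
        "Imago 2.0": (72.83, 87.20),
        "OSRA 2.1": (145.04, 87.69),
    }),
    "UOB": (5740, {
        "MolVec 0.9.7": (28.42, 88.39),
        "Imago 2.0": (152.52, 63.54),
        "OSRA 2.1": (125.78, 86.50),
    }),
    "CLEF 2012": (961, {
        "MolVec 0.9.7": (4.41, 80.96),
        "Imago 2.0": (16.03, 65.45),
        "OSRA 2.1": (21.33, 94.90),
    }),
    "JPO": (450, {
        "MolVec 0.9.7": (7.50, 66.67),
        "Imago 2.0": (22.55, 40.00),
        "OSRA 2.1": (16.68, 57.78),
    }),
}


def throughput(images: int, minutes: float) -> float:
    """Images processed per minute."""
    return images / minutes


def correct_images(images: int, accuracy_percent: float) -> int:
    """Approximate number of correctly recognised images (nearest integer)."""
    return round(images * accuracy_percent / 100)


def best_tool(dataset: str, by: str) -> str:
    """Return the fastest tool (by='time') or the most accurate (by='accuracy')."""
    _, tools = BENCHMARK[dataset]
    if by == "time":
        return min(tools, key=lambda name: tools[name][0])
    if by == "accuracy":
        return max(tools, key=lambda name: tools[name][1])
    raise ValueError("by must be 'time' or 'accuracy'")


if __name__ == "__main__":
    for dataset, (images, tools) in BENCHMARK.items():
        for tool, (minutes, accuracy) in tools.items():
            print(
                f"{dataset:10s} {tool:13s} "
                f"{throughput(images, minutes):7.1f} images/min, "
                f"~{correct_images(images, accuracy)} of {images} correct"
            )
```

Because accuracy is reported to two decimal places, the per-dataset counts are estimates. Read this way, the table shows MolVec with the shortest run time on every dataset, while OSRA has the highest accuracy on CLEF 2012.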