An Nguyen1, Yu-Chieh Huang1, Pierre Tremouilhac1, Nicole Jung2,3, Stefan Bräse4,5.
Abstract
We developed CHEMSCANNER, a software tool that extracts chemical information from ChemDraw binary (CDX) or ChemDraw XML-based (CDXML) files and retrieves ChemDraw schemes from DOC, DOCX or XML documents. This facilitates the reuse of chemical information embedded in the diverse documents that serve as standard storage and communication instruments in the chemical sciences (e.g. student theses, PhD theses, or publications). The extracted information is processed into reactions, molecules, and additional text and values, and can be accessed via the CHEMSCANNER UI. CHEMSCANNER supports export to Excel and CML, direct import of the extracted data into the open-source ELN Chemotion, and "copy and paste" of selected information. The software was designed with a focus on documents with molecular structure information embedded as CDX or CDXML, as these are the most common file formats for chemical drawings. The project aims to support chemists in their efforts to re-use chemistry research data by providing them with the missing tools for automated assembly of reaction data.
Keywords: CDX; CDXML; Chemical data extraction; Data mining; Molecule recognition
Year: 2019 PMID: 33431008 PMCID: PMC6907231 DOI: 10.1186/s13321-019-0400-5
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
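The abstract describes extracting molecules and accompanying text from CDXML files. As a rough illustration of what such a file holds, the following minimal sketch reads a hand-written CDXML-like snippet with Python's standard library. It is not ChemScanner's implementation. The element and attribute names (`fragment`, `n`, `b`, `t`, `s`, `Element`, `B`, `E`) follow the CDXML format as commonly documented, while the sample content and the function name `summarize_cdxml` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hand-written, heavily simplified sample (an assumption for illustration);
# real ChemDraw files carry many more attributes and nested objects.
SAMPLE_CDXML = """<CDXML>
  <page>
    <fragment id="1">
      <n id="2" p="100 100"/>
      <n id="3" p="120 100" Element="8"/>
      <b id="4" B="2" E="3"/>
    </fragment>
    <t id="5" p="200 90"><s>THF, 2 h</s></t>
  </page>
</CDXML>"""


def summarize_cdxml(xml_text):
    """Collect fragments (atoms and bonds) and page-level text labels."""
    root = ET.fromstring(xml_text)

    fragments = []
    for frag in root.iter("fragment"):
        # A node without an Element attribute is treated as carbon (6).
        atoms = [int(n.get("Element", "6")) for n in frag.findall("n")]
        # Each bond references its begin (B) and end (E) node ids.
        bonds = [(b.get("B"), b.get("E")) for b in frag.findall("b")]
        fragments.append({"id": frag.get("id"), "atoms": atoms, "bonds": bonds})

    texts = []
    for page in root.iter("page"):
        # Only text objects placed directly on the page, not atom labels.
        for t in page.findall("t"):
            texts.append("".join(s.text or "" for s in t.findall("s")))

    return {"fragments": fragments, "texts": texts}


summary = summarize_cdxml(SAMPLE_CDXML)
print(summary)
```

Turning the collected atoms and bonds into a molfile or SMILES, and interpreting free text such as solvents or reaction times, would require additional steps that this sketch leaves out.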
Fig. 1 Schematic representation of the core functions of the ChemScanner software [30]
Fig. 2 Example schematic representation of the procedure used to extract and process different kinds of information from CDX files
Fig. 3 Explanation of the analysis mode for objects included in CDX and CDXML files: a, correlation of geographic information and reaction role assignment; b, schematic summary of some successfully assigned scenarios
Fig. 4 Screenshot of the ChemScanner UI after extraction of a two-step reaction, with the most important features labeled (the size of the schemes was increased to improve readability in the given view). Labels: 1, upload of documents or files via drag and drop or by selecting stored files; 2, adding reagents or solvents via a dropdown list of 2000 entries; 3, preview of the original file as uploaded in the web application of the Chemotion ELN (this add-on function needs a ChemDraw license but is not essential for the ChemScanner function); 4, visualization of extracted structures to verify the result; 5, representation of textual information in the scheme, sorted according to the role of the identified items; 6, icons to access the functions “select reaction”, “add comment”, “copy reaction SMILES”, “copy molfiles” and “delete reaction”
Fig. 5 Main workflow describing the re-use of extracted information in the ELN. Dependencies of molecule identifiers and representation on the extracted information and its processing
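The caption of Fig. 3 refers to correlating the position of objects in a scheme with their reaction role. A minimal sketch of one such positional heuristic is shown below. It is an assumption made for illustration, not ChemScanner's actual algorithm: it handles only a single horizontal left-to-right arrow, and the names `Box` and `assign_roles` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Horizontal extent of a drawn object (hypothetical helper type)."""
    name: str
    x_min: float
    x_max: float

    @property
    def center_x(self):
        return (self.x_min + self.x_max) / 2


def assign_roles(arrow_tail_x, arrow_head_x, boxes):
    """Assign roles from horizontal position relative to one arrow.

    Assumed rule: objects centered left of the arrow tail are reactants,
    objects centered right of the arrow head are products, and objects
    within the arrow's horizontal span (above or below it) are reagents.
    """
    roles = {"reactants": [], "reagents": [], "products": []}
    for box in boxes:
        if box.center_x < arrow_tail_x:
            roles["reactants"].append(box.name)
        elif box.center_x > arrow_head_x:
            roles["products"].append(box.name)
        else:
            roles["reagents"].append(box.name)
    return roles


scheme = [Box("A", 0, 40), Box("Pd/C", 60, 90), Box("B", 120, 160)]
print(assign_roles(50, 100, scheme))
```

Multi-step schemes, vertical or curved arrows, and overlapping objects would need a more careful geometric treatment than this one-dimensional comparison.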