William Digan, Aurélie Névéol, Antoine Neuraz, Maxime Wack, David Baudoin, Anita Burgun, Bastien Rance.
Abstract
BACKGROUND: The increasing complexity of data streams and computational processes in modern clinical health information systems makes reproducibility challenging. Clinical natural language processing (NLP) pipelines are routinely leveraged for the secondary use of data. Workflow management systems (WMS) have been widely used in bioinformatics to handle the reproducibility bottleneck.
Keywords: containerization; meaningful use; natural language processing; reproducibility of results; workflow
Year: 2021 PMID: 33319904 PMCID: PMC7936396 DOI: 10.1093/jamia/ocaa261
Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 4.497
Literature queries used to identify articles related to reproducibility
| Topics | Query |
|---|---|
| Bioinformatics reproducibility articles | |
| NLP or clinical NLP reproducibility features | |
| Identification of NLP framework | |
Abbreviations: EHR, electronic health record; NLP, natural language processing.
Figure 1. Reproducibility articles sorted by level of analysis and research field. The scope category (tool or WMS) is also shown.
Characterization of reproducibility recommendations collected from the literature. Each recommendation is assigned a topic and a simple description. The coverage of the recommendation in a given article is noted as present (✓) or absent (-)
| Topics | ID | Features | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| Traceability | R01 | Provenance metadata | ✓ | ✓ | – | ✓ | ✓ | – | – | – |
| | R02 | Generating execution logs | ✓ | ✓ | – | ✓ | – | ✓ | ✓ | ✓ |
| | R03 | System metadata (eg, RAM, CPU, OS, etc) | ✓ | ✓ | – | – | – | – | – | – |
| | R04 | Record parameters of tools | ✓ | ✓ | – | – | – | ✓ | – | ✓ |
| | R05 | Recording intermediate results | ✓ | ✓ | – | ✓ | ✓ | ✓ | – | – |
| Versioning | R06 | Use of version control for workflows | ✓ | ✓ | ✓ | ✓ | – | ✓ | – | ✓ |
| | R07 | Use of version control for tools | ✓ | ✓ | ✓ | ✓ | – | ✓ | – | ✓ |
| | R08 | Use of version control for resources | ✓ | ✓ | ✓ | ✓ | – | ✓ | – | ✓ |
| | R09 | Archived tools versions | ✓ | ✓ | – | ✓ | ✓ | ✓ | – | ✓ |
| | R10 | Archived WMS versions | – | – | – | ✓ | – | ✓ | – | ✓ |
| | R11 | Archived resources versions | – | – | – | ✓ | – | ✓ | – | ✓ |
| | R12 | Archived input data versions | ✓ | – | – | ✓ | – | ✓ | – | ✓ |
| Standardization | R13 | Standard objects identifier input data | ✓ | – | – | ✓ | ✓ | ✓ | – | – |
| | R14 | Standard objects identifier output data | ✓ | – | – | ✓ | ✓ | ✓ | – | – |
| | R15 | Standard objects identifier resources | ✓ | – | – | ✓ | ✓ | ✓ | – | – |
| | R16 | Standard objects identifier tools | ✓ | – | – | ✓ | ✓ | ✓ | – | – |
| | R17 | Use of research objects | ✓ | ✓ | – | – | – | – | ✓ | ✓ |
| | R18 | Containerization | ✓ | ✓ | – | – | – | – | – | ✓ |
| | R19 | Use of relative paths within tools / no hard-coded paths | ✓ | – | ✓ | – | – | – | – | ✓ |
| | R20 | Use of standard folder organization at workflow level (eg, BagIt) | ✓ | – | – | – | – | – | – | – |
| | R21 | Use of standard folder organization at tools level (eg, BagIt, Django project, Eclipse plugin, etc) | ✓ | – | – | – | – | – | – | – |
| | R22 | Presence of a README (tool) | – | – | – | – | ✓ | – | ✓ | ✓ |
| | R23 | Presence of a README (workflow) | – | – | – | – | ✓ | – | ✓ | ✓ |
| | R24 | Availability of a full documentation | – | – | – | – | ✓ | – | ✓ | ✓ |
| | R25 | Tools use a standard framework (eg, UIMA, Docker) | – | – | – | – | – | – | ✓ | – |
| | R26 | Input data in a standard format (eg, BioC, JsonNLP) | ✓ | – | – | – | – | – | ✓ | – |
| | R27 | Output data in a standard format (eg, BioC, JsonNLP) | ✓ | – | – | – | – | – | ✓ | – |
| Usability | R28 | Absence of manual steps | ✓ | – | – | ✓ | – | ✓ | – | ✓ |
| | R29 | Ability to scale up | ✓ | – | – | – | – | – | – | ✓ |
| | R30 | Ability to resume a workflow run | ✓ | ✓ | – | – | – | – | – | – |
| | R31 | Ability to customize the workflow | ✓ | – | – | – | – | – | – | – |
| | R32 | Management of multiple programming languages | – | – | – | – | – | – | – | – |
| | R33 | Workflow modularity (use or share parts of the workflow) | ✓ | ✓ | – | – | – | – | – | – |
| | R34 | Licensing | ✓ | – | – | – | ✓ | – | – | ✓ |
| | R35 | Benchmark data and performance distributed with the tools | ✓ | – | – | – | – | – | ✓ | ✓ |
| | R36 | Identification of tools to be tailored locally (eg, preprocessing, local rules) | – | – | – | – | ✓ | – | – | – |
| Shareability | R37 | Workflow publicly accessible | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | – |
| | R38 | Tools publicly accessible | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | – |
| | R39 | Input data publicly accessible | ✓ | – | ✓ | ✓ | ✓ | ✓ | ✓ | – |
| | R40 | Resources publicly accessible | – | – | ✓ | – | – | ✓ | ✓ | – |
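The traceability recommendations (R01-R05) can be prototyped without any particular WMS. The sketch below is illustrative only and is not taken from any of the reviewed tools; `run_with_provenance`, the `tokenize` step, and the log file name are hypothetical. It wraps a pipeline step so that its parameters, system metadata, and intermediate result are written to a JSON execution log:

```python
import json
import platform
import time
from pathlib import Path

def run_with_provenance(tool_name, tool_fn, params, log_path):
    """Run one pipeline step and record provenance metadata (R01),
    an execution log (R02), system metadata (R03), the tool's
    parameters (R04), and its intermediate result (R05)."""
    start = time.time()
    result = tool_fn(**params)
    record = {
        "tool": tool_name,
        "parameters": params,               # R04: record tool parameters
        "system": {                         # R03: system metadata
            "os": platform.system(),
            "python": platform.python_version(),
            "machine": platform.machine(),
        },
        "started": start,                   # R02: execution log entries
        "duration_s": time.time() - start,
        "intermediate_result": result,      # R05: intermediate result
    }
    Path(log_path).write_text(json.dumps(record, indent=2))
    return result

# Hypothetical tokenization step, used only to exercise the wrapper
def tokenize(text):
    return text.lower().split()

tokens = run_with_provenance("tokenize", tokenize,
                             {"text": "Patient denies chest pain"},
                             "tokenize_log.json")
```

In a real pipeline the same wrapper would be applied to every step, and the log records would accumulate into the kind of provenance trail the WMS articles describe.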
Figure 2. Classification of recommendations at tool and workflow level: 21 recommendations are applicable to both tools and workflows, 12 to workflows only, and 7 to tools only.
Summary information on the NLP frameworks selected for our study
| | cTakes | CLAMP | GATE | ScispaCy | TextFlows | OpenMinteD | LAPPS Grid Galaxy |
|---|---|---|---|---|---|---|---|
| Project initiators | Mayo Clinic | School of Biomedical Informatics at the University of Texas Health at Houston | The University of Sheffield, South Yorkshire, England | Allen Institute for Artificial Intelligence, Seattle, WA, USA | Jožef Stefan Institute, Ljubljana, Slovenia | Athena Research and Innovation Center in Information, Communication and Knowledge Technologies | Vassar College, Poughkeepsie, NY, USA; Brandeis University; Carnegie Mellon |
| Availability | Open source | Upon request | Open source | Open source | Open source | Open source | Open source |
| Licensing | Apache | / | | Apache | MIT | Apache | Apache |
| Language | Java | Java | Java | Python | Python | Python/Java | Python/Java |
| Source | | | | | | | |
| Demo | | | | | | | |
| Tools format | UIMA | UIMA | GATE | spaCy | widget | Galaxy tool | Galaxy tool |
| Container | | | | | | | |
NLP frameworks described in the frame of the reproducibility recommendations: each recommendation is noted as present (✓), partially present (/), or absent or undocumented (–) in each NLP framework.
| ID | Features | cTakes | CLAMP | GATE | ScispaCy | TextFlows | OpenMinteD | LAPPS Grid Galaxy |
|---|---|---|---|---|---|---|---|---|
| R01 | Provenance metadata | ✓ | – | – | – | ✓ | ✓ | ✓ |
| R02 | Generating execution logs | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| R03 | System metadata (eg, RAM, CPU, OS, etc) | ✓ | ✓ | ✓ | – | ✓ | ✓ | ✓ |
| R04 | Record parameters of tools | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| R05 | Recording intermediate results | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| R06 | Use of version control for workflows | – | – | – | – | – | – | – |
| R07 | Use of version control for tools | – | – | – | – | – | – | – |
| R08 | Use of version control for resources | – | – | – | – | – | – | – |
| R09 | Archived tools versions | – | – | – | – | – | – | |
| R10 | Archived WMS versions | – | – | – | – | – | – | – |
| R11 | Archived resources versions | – | – | – | – | – | – | – |
| R12 | Archived input data versions | – | – | – | – | – | – | – |
| R13 | Standard objects identifier input data | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| R14 | Standard objects identifier output data | ✓ | ✓ | – | ✓ | ✓ | ✓ | ✓ |
| R15 | Standard objects identifier resources | – | – | ✓ | – | ✓ | ✓ | ✓ |
| R16 | Standard objects identifier tools | – | – | ✓ | – | ✓ | ✓ | ✓ |
| R17 | Use of research objects | – | – | – | – | – | – | – |
| R18 | Containerization | | – | – | | | | |
| R19 | Use of relative paths within tools / no hard-coded paths | – | – | – | – | – | – | – |
| R20 | Use of standard folder organization at workflow level (eg, BagIt) | – | – | – | – | – | ✓ | ✓ |
| R21 | Use of standard folder organization at tools level (eg, BagIt, Django project, eclipse plugin, etc) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| R22 | Presence of a README (tool) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| R23 | Presence of a README (workflow) | ✓ | ✓ | – | – | – | – | ✓ |
| R24 | Availability of a full documentation | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| R25 | Tools use a standard framework (eg, UIMA, Docker, Galaxy) | ✓ | ✓ | ✓ | – | – | – | ✓ |
| R26 | Input data in a standard format (eg, BioC, JsonNLP, LIF) | / | – | – | – | – | – | ✓ |
| R27 | Output data in a standard format (eg, BioC, JsonNLP, LIF) | / | – | – | – | – | – | ✓ |
| R28 | Absence of manual steps | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| R29 | Ability to scale up | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| R30 | Ability to resume a workflow run (after failure) | – | – | – | ✓ | – | ✓ | ✓ |
| R31 | Ability to customize the workflow | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| R32 | Management of multiple programming languages | – | – | – | – | – | ✓ | ✓ |
| R33 | Workflow modularity (use or share parts of the workflow) | – | – | – | – | – | ✓ | ✓ |
| R34 | Licensing | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| R35 | Benchmark data and performance distributed with the tools | ✓ | ✓ | ✓ | ✓ | – | – | – |
| R36 | Identification of tools to be tailored locally (eg, preprocessing, local rules) | – | – | – | – | – | – | – |
| R37 | Workflow publicly accessible | ✓ | ✓ | – | ✓ | ✓ | ✓ | ✓ |
| R38 | Tools publicly accessible | – | – | ✓ | ✓ | – | / | ✓ |
| R39 | Input data publicly accessible | – | – | – | – | – | – | – |
| R40 | Resources publicly accessible | – | – | – | ✓ | – | – | – |