Piotr Przybyła, Matthew Shardlow, Sophie Aubin, Robert Bossy, Richard Eckart de Castilho, Stelios Piperidis, John McNaught, Sophia Ananiadou.
Abstract
Text mining is a powerful technology for quickly distilling key information from vast quantities of biomedical literature. However, to harness this power the researcher must be well versed in the availability, suitability, adaptability, interoperability and comparative accuracy of current text mining resources. In this survey, we give an overview of the text mining resources that exist in the life sciences to help researchers, especially those employed in biocuration, to engage with text mining in their own work. We categorize the various resources under three sections: Content Discovery looks at where and how to find biomedical publications for text mining; Knowledge Encoding describes the formats used to represent the different levels of information associated with content that enable text mining, including those formats used to carry such information between processes; Tools and Services gives an overview of workflow management systems that can be used to rapidly configure and compare domain- and task-specific processes, via access to a wide range of pre-built tools. We also provide links to relevant repositories in each section to enable the reader to find resources relevant to their own area of interest. Throughout this work we give a special focus to resources that are interoperable, that is, those that have the crucial ability to share information, enabling smooth integration and reusability.
Year: 2016 PMID: 27888231 PMCID: PMC5199186 DOI: 10.1093/database/baw145
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
A comparison of popular metadata schemata, used to encode information about publications
| Name | Last updated | Domain | Main use |
|---|---|---|---|
| Dublin Core (DC)/DC Metadata Initiative (DCMI) | June 2012 | Generic | Widely accepted standard |
| Journal Article Tag Suite (JATS) | Actively Maintained | Journal Articles | Open access journals |
| DataCite | Actively Maintained | Research Data and Publications | Citations |
| CrossRef | Actively Maintained | Research Data and Publications | Citations |
| BibJSON | Actively Maintained | Bibliographic information | Bibliographic metadata |
| CERIF | Actively Maintained | Research Information | European research |
| CKAN | Actively Maintained | Generic | Data management portals |
Different formats describe different types of items as shown in the ‘Domain’ and ‘Main Use’ columns.
http://dublincore.org/
https://jats.nlm.nih.gov/
https://www.datacite.org/
http://www.crossref.org/
http://okfnlabs.org/bibjson/
http://www.eurocris.org/cerif/main-features-cerif
http://ckan.org/
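As an illustration of what a record in one of these schemata looks like, the following is a minimal sketch that builds a Dublin Core description of this article using only the Python standard library. The field values come from the record above; the `metadata` wrapper element and the `doi:`/`pmid:` identifier prefixes are illustrative choices, not requirements of the standard.

```python
import xml.etree.ElementTree as ET

# Namespace of the 15 core Dublin Core elements.
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_dc_record(fields):
    """Return a wrapper element with one Dublin Core child per value."""
    ET.register_namespace("dc", DC_NS)
    record = ET.Element("metadata")  # wrapper name is an illustrative choice
    for name, values in fields.items():
        for value in values:
            child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
            child.text = value
    return record


record = build_dc_record({
    # Two of the eight authors, for brevity.
    "creator": ["Przybyła, Piotr", "Ananiadou, Sophia"],
    "date": ["2016"],
    "identifier": ["doi:10.1093/database/baw145", "pmid:27888231"],
    "source": ["Database (Oxford)"],
})
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Dublin Core elements are all optional and repeatable, which is why the sketch accepts a list of values per field.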
A comparison of metadata schemata used for documenting language resources
| Name | Last Updated | Domain | Main use |
|---|---|---|---|
| TEI | Actively Maintained | Documents | Encoding text corpora |
| CMDI | Actively Maintained | Generic | Infrastructure for metadata profiles |
| META-SHARE | Actively Maintained | Language Resources | Metadata schema for language resources and services documentation |
| LRE Map | Updated at each LREC conference (biennial) | Language Resources | Metadata schema for language resources |
http://www.tei-c.org/
http://www.clarin.eu/content/component-metadata, http://www.clarin.eu/ccr/
http://www.meta-net.eu/meta-share/metadata-schema, http://www.meta-share.org/portal/knowledgebase/home
http://www.resourcebook.eu/searchll.php
A comparison of vocabularies and ontologies for metadata description, used in conjunction with metadata schemata to give meaningful descriptions of resources
| Title | Domain | Format |
|---|---|---|
| Medical Subject Headings (MeSH) | Medicine | XML |
| EDAM (EMBRACE Data and Methods) ontology | Bioinformatics | OWL, OBO |
| Dewey Decimal Classification (DDC) | Library classification | – |
| Universal Decimal Classification (UDC) | Library classification | – |
| Library of Congress Subject Headings (LCSH) | Library classification | – |
| EuroVoc | Document classification | XML, SKOS/RDF |
| Semantic Web for Research Communities (SWRC) | Research communities | OWL |
| CASRAI dictionary | Research administration information | HTML |
| Bibliographic Ontology (BIBO) | Bibliographic information (citations and bibliographic references) | RDF/RDFS |
| COAR Resource Type Vocabulary | Open access repositories of research outputs | SKOS |
| PROV Ontology (PROV-O) | Provenance information | OWL2 |
| Open Digital Rights Language (ODRL) | Digital Rights Management, Licensing | RDF/XML |
| Creative Commons Rights Expression Language (ccREL) | Intellectual Property Rights, Digital Rights Management, Licensing | RDF |
A wide variety of formats and sizes, suitable for different domains, is reported above. Although it is difficult to compare size due to different formats, we have presented the resources in approximate order of the number of items held in each at the time of writing from most to least.
https://www.nlm.nih.gov/mesh/
http://edamontology.org/page
https://www.oclc.org/dewey.en.html
http://www.udcc.org/index.php/site/page?view=about
http://id.loc.gov/authorities/subjects.html
http://eurovoc.europa.eu/
http://ontoware.org/swrc/
http://dictionary.casrai.org/Main_Page
http://bibliontology.com/
https://www.coar-repositories.org/
https://www.w3.org/TR/prov-o/
https://www.w3.org/ns/odrl/2/ODRL21
https://wiki.creativecommons.org/wiki/CC_REL
A comparison of popular sources for the discovery of and access to publications for TM
| Title | Publications | Articles access | Type | Domain |
|---|---|---|---|---|
| OpenAIRE | 14.6 million | Abstracts, some full text articles, reports and project deliverables, open access | Aggregator | Open |
| Connecting Repositories (CORE) | 30.5 million | Abstracts, full text articles, open access | Aggregator | Open |
| Bielefeld Academic Search Engine (BASE) | 91.9 million | Abstracts, full text articles, books and multimedia documents, software and datasets, many open access | Aggregator | Open |
| PubMed | 26 million | Citations, abstracts, no full text articles (in principle) | Aggregator | Biomedical, life sciences |
| PubMed Central (PMC) | 3.9 million | Abstracts and full text of journal articles, open access subset | Repository | Biomedical, life sciences |
| MEDLINE | 22 million | Citations, abstracts | Aggregator | Biomedical, life sciences |
| Biodiversity Heritage Library | 109,382 | Abstracts, full text articles, citations, open access | Repository | Biodiversity |
| arXiv | 1.2 million | Full preprints and abstracts | Repository | Biology, physics, computer science, mathematics |
We have made a distinction between modes of operation in the ‘Type’ column.
https://www.openaire.eu/
https://core.ac.uk/
https://www.base-search.net/about/en/
http://www.ncbi.nlm.nih.gov/pubmed
http://www.ncbi.nlm.nih.gov/pmc/
https://www.nlm.nih.gov/pubs/factsheets/medline.html
http://www.biodiversitylibrary.org/
http://arxiv.org/
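For programmatic access to a source such as PubMed, a request URL is typically assembled from a database name and a list of record identifiers. The sketch below assumes NCBI's public E-utilities `efetch` endpoint, which is not described in the table above; the helper name `efetch_url` is hypothetical, and the code only builds the URL without sending a request.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities (an assumption; not listed in the table).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(pmids, db="pubmed", retmode="xml"):
    """Build an efetch URL for the given PubMed identifiers (no request is sent)."""
    query = urlencode(
        {"db": db, "id": ",".join(str(p) for p in pmids), "retmode": retmode},
        safe=",",  # keep the comma-separated ID list readable
    )
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"


url = efetch_url([27888231])
print(url)
```

Any real use should follow the provider's usage policy and rate limits before fetching records in bulk.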
A comparison of annotation formats used in TM
| Model | Domain | Serialization formats | API | Type |
|---|---|---|---|---|
| BioC | Biomedical | XML | Reference APIs in multiple languages | Stand-off |
| BioNLP shared task TSV | Biomedical | TSV | No | Stand-off |
| BRAT format | Generic | TSV | No | Stand-off |
| Pubtator | Biomedical | TSV | No | Stand-off |
| TEI | Generic | XML | Via XSLT | Stand-off |
| NIF | Generic | RDF | No | Stand-off |
| LIF | Generic | RDF | Reference API in Java | Stand-off |
| IOB | Generic | TSV | Third-party APIs in several languages | In-line |
| Open Annotation | Generic | RDF | No | Stand-off |
| CAS (UIMA) | Generic | XML (XMI) | Reference APIs in Java and C++ | Stand-off and in-line |
| GATE annotation format | Generic | Several | Reference API in Java | Stand-off and in-line |
| LAF/GrAF | Generic | XML | No | Stand-off |
| PubAnnotation | Generic | JSON | REST API to annotation store | Stand-off |
‘API’ stands for application programming interface and refers to whether there is a suitable library for use with this format. The domain column denotes the typical category of information encoded with this format.
http://bioc.sourceforge.net/
http://2011.bionlp-st.org/home/file-formats
http://brat.nlplab.org/standoff.html
http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/
http://www.tei-c.org/Guidelines/P5/
http://www.tei-c.org/Tools/Stylesheets/
http://persistence.uni-leipzig.org/nlp2rdf/
http://wiki.lappsgrid.org/interchange/
http://mvnrepository.com/artifact/org.lappsgrid/vocabulary
http://www.w3.org/ns/oa
https://uima.apache.org/d/uimaj-2.7.0/references.html#ugr.ref.cas
https://uima.apache.org/downloads/releaseDocs/2.1.0-incubating/docs/html/tutorials_and_users_guides/tutorials_and_users_guides.html
https://gate.ac.uk/sale/tao/splitch5.html
http://jenkins.gate.ac.uk/job/GATE-Nightly/javadoc/
ISO 24612:2012 – http://www.iso.org/iso/catalogue_detail.htm?csnumber=37326
http://www.pubannotation.org/docs/annotation-format/
http://www.pubannotation.org/docs/api/
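The difference between the stand-off and in-line types in the table can be made concrete with a small conversion. The sketch below takes stand-off annotations given as character offsets and produces in-line IOB tags over whitespace tokens. It assumes that spans align with token boundaries and do not overlap; the example sentence, offsets and labels are invented for illustration.

```python
import re


def standoff_to_iob(text, spans):
    """Convert stand-off (start, end, label) character spans to IOB tags.

    Tokens are whitespace-delimited. A token is tagged only when it lies
    fully inside a span; spans are assumed to start on a token boundary.
    """
    tagged = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in spans:
            if match.start() >= start and match.end() <= end:
                prefix = "B-" if match.start() == start else "I-"
                tag = prefix + label
                break
        tagged.append((match.group(), tag))
    return tagged


text = "Mutations in BRCA1 increase breast cancer risk"
spans = [(13, 18, "Gene"), (28, 41, "Disease")]  # illustrative annotations
for token, tag in standoff_to_iob(text, spans):
    print(f"{token}\t{tag}")
```

The stand-off form leaves the source text untouched and can hold overlapping annotations, whereas the in-line form ties one tag to each token, which is why a conversion like this one has to assume non-overlapping spans.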
A comparison of formats for the encoding of different types of knowledge resources
| Format | Resource type | Serialization | Libraries available |
|---|---|---|---|
| TMF/TBX | Terminologies | XML | Yes |
| LMF | Lexica | LMF | No |
| SKOS | Thesauri | RDF | Yes (RDF) |
| OWL | Ontologies | Several | Yes |
| OBO | Ontologies | Own | Yes |
| Ontolex | Lexica relative to ontologies | RDF | Yes |
| TMX | Translation memories | XML | Yes |
| XLIFF | Translation memories | XML | Yes |
‘Libraries available’ refers to whether there is a suitable library for use with this format.
http://www.tbxinfo.net/
http://www.tbxinfo.net/tbx-downloads/
http://www.lexicalmarkupframework.org/
https://www.w3.org/TR/skos-reference/
https://www.w3.org/2004/02/skos/tools
https://www.w3.org/OWL/
http://owlapi.sourceforge.net/
ftp://ftp.geneontology.org/pub/go/www/GO.format.obo-1_4.shtml
http://oboedit.org/?page=javadocs
https://www.w3.org/community/ontolex/wiki/Final_Model_Specification
https://github.com/cimiano/ontolex/blob/master/Ontologies/ontolex.owl
http://xml.coverpages.org/tmxSpec971212.html
http://docs.transifex.com/api/tm/
http://docs.oasis-open.org/xliff/xliff-core/xliff-core.html
http://www.opentag.com/xliff.htm#Resources
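To show what reading one of these formats involves, here is a minimal sketch of a parser for `[Term]` stanzas in the OBO flat-file format. It is not one of the libraries referred to in the table and does not cover the full specification: it ignores escape sequences, trailing modifiers and non-Term stanzas. The sample text is a short illustrative snippet written in the style of a Gene Ontology file.

```python
def parse_obo_terms(obo_text):
    """Parse [Term] stanzas into a dict keyed by term id.

    Each term maps a tag (e.g. "name", "is_a") to the list of its values.
    """
    terms = {}
    current = None  # tag/value pairs of the [Term] stanza being read, if any
    for raw in obo_text.splitlines():
        line = raw.split("!", 1)[0].strip()  # drop trailing "! comment"
        if not line:
            continue
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are collected.
            current = {} if line == "[Term]" else None
            continue
        if current is None or ":" not in line:
            continue  # header line or a stanza type we skip
        tag, value = line.split(":", 1)
        value = value.strip()
        current.setdefault(tag, []).append(value)
        if tag == "id":
            terms[value] = current
    return terms


sample = """
format-version: 1.2

[Term]
id: GO:0008150
name: biological_process
namespace: biological_process

[Term]
id: GO:0009987
name: cellular process
namespace: biological_process
is_a: GO:0008150 ! biological_process

[Typedef]
id: part_of
name: part of
"""

terms = parse_obo_terms(sample)
print(terms["GO:0009987"]["is_a"])
```

For production use, an established library is the safer choice, since real OBO files rely on features this sketch skips.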
A comparison of popular knowledge resources, typically used in TM for the life sciences
| Name | Type | Domain | Size | Format | License |
|---|---|---|---|---|---|
| UniProt | Knowledge base | Proteomics | 63 million sequences | Own, RDF, FASTA | CC |
| UMLS | Thesaurus | Biomedical | 3.2 million concepts | Own | Proprietary |
| Gene Ontology | Ontology | Genetics | 44 000 terms | OBO | CC |
| Agrovoc | Thesaurus | Agriculture | 32 000 concepts | RDF | CC |
| HPO | Vocabulary | Human phenotype | 10 000 terms | OBO, OWL, RDF | Free to use |
| CNO | Vocabulary | Neuroscience | 395 classes | OWL, RDF | CC |
| CARO | Ontology | Anatomy | 96 classes | OBO, OWL | Unspecified |
These resources differ in terms of type, domain and intended use. These differences make size difficult to compare as different resources have different base elements. Nonetheless, we have presented the table in an approximate order of size from largest to smallest.
http://www.uniprot.org/
https://www.nlm.nih.gov/research/umls/
http://geneontology.org/
http://aims.fao.org/vest-registry/vocabularies/agrovoc-multilingual-agricultural-thesaurus
http://human-phenotype-ontology.github.io/
https://bioportal.bioontology.org/ontologies/CNO
http://www.obofoundry.org/ontology/caro.html
Repositories for the curation of language resources, indexing language resources that are useful for the general domain and the life sciences
| Title | Available records | Type of content | Accessibility (Download) | Accessibility (Upload) | Domain |
|---|---|---|---|---|---|
| ELRA Catalogue of Language Resources | 1137 | Corpora, lexica | Some paid | Restricted | Language technology |
| LDC catalogue | Over 900 resources | Corpora | Some paid | Restricted | Language technology |
| VEST Registry | 118 | Vocabularies, standards, tools | Open | Registration upon request | Agriculture, food, environment |
| AgroPortal | 35 | Vocabularies | Open | Registration upon request | Agriculture, environment |
| BioPortal | 576 | Vocabularies | Open | Registration upon request | Biology, health |
| CLARIN VLO | 876 743 records | Various | Open | Upon request | Language technology |
| META-SHARE | More than 2700 | Corpora, lexica, language descriptions, tools/services | Open | Registration upon request | Language technology |
| Stav corpora | 30 | Annotated corpora | Open | Closed | Biomedical |
http://catalog.elra.info/
https://catalog.ldc.upenn.edu/
http://aims.fao.org/vest-registry
http://agroportal.lirmm.fr/
http://bioportal.bioontology.org/
https://www.clarin.eu/content/virtual-language-observatory
http://metashare.elda.org/
http://corpora.informatik.hu-berlin.de/
A comparison of popular interoperability frameworks and supported workflows
| Name | Workflow description language | Workflow engine | Programming language | License |
|---|---|---|---|---|
| Alvis | Alvis | Alvis | Java | ALv2 |
| Apache UIMA | Aggregates, CPE, UIMA AS, RUTA, UIMA DUCC | Aggregates, CPE, UIMA AS, RUTA, UIMA DUCC | Java/C++ | ALv2 |
| GATE Embedded | GATE Applications | GATE Embedded | Java | LGPL |
| Heart of Gold | Yes (unnamed) | MoCoMan | Java/Python | LGPL |
http://www.quaero.org/module_technologique/alvis-nlp-alvis-natural-language-processing/
https://uima.apache.org/
https://gate.ac.uk/family/embedded.html
http://heartofgold.dfki.de/
A comparison of popular analytics packages
| Name | Native processing framework support | Programming language | Repository | License |
|---|---|---|---|---|
| Apache OpenNLP | UIMA | Java | Maven | ALv2 |
| NLP4J (aka Emory NLP) | No | Java | Maven | ALv2 |
| FreeLing | No | C++ | No | AGPL + commercial |
| NLTK | No | Python | PyPI | ALv2 |
| LingPipe | No | Java | Maven | AGPL + commercial |
| Stanford CoreNLP | No | Java | Maven | GPL + commercial |
https://opennlp.apache.org/
https://github.com/emorynlp/nlp4j
http://nlp.lsi.upc.edu/freeling/
http://www.nltk.org/
http://alias-i.com/lingpipe/
http://stanfordnlp.github.io/CoreNLP/
A comparison of popular component collections
| Name | Focus area | Processing framework | Repository | Programming language | License |
|---|---|---|---|---|---|
| Apache cTAKES | Medical records | UIMA | Maven | Java | ALv2 |
| Bluima | Biomedical | UIMA | Maven | Java | ALv2 |
| ClearTK | Machine learning | UIMA | Maven | Java | BSD/GPL |
| DKPro Core | Linguistic analysis | UIMA | Maven | Java | ALv2/GPL |
| JCoRe | Biomedical | UIMA | Maven | Java | LGPL/GPL |
| BioNLP-UIMA | Biomedical | UIMA | Maven | Java | BSD |
| GATE built-in component collection | Linguistic analysis and information extraction | GATE | GATE | Java | LGPL/GPL |
| NaCTeM collection | | UIMA | None | Java | Proprietary |
| Semantic Software Lab collection | Biomedical | GATE | GATE | Java | LGPL/GPL |
http://ctakes.apache.org/
https://github.com/BlueBrain/bluima
https://cleartk.github.io/cleartk/
https://dkpro.github.io/dkpro-core/
http://julielab.github.io/
http://bionlp.sourceforge.net/
https://gate.ac.uk/
http://argo.nactem.ac.uk/
http://www.semanticsoftware.info
A comparison of popular analytics workbenches
| Name | Processing framework | UI | Component collection | External repositories | License |
|---|---|---|---|---|---|
| Argo | UIMA | Web-based (service) | NaCTeM | No | Proprietary |
| CLARIN-D WebLicht | Proprietary | Web-based (service) | Built-in | No | Proprietary |
| GATE Developer | GATE | Installable application | Built-in, external | GATE Repositories | LGPL |
| U-Compare | UIMA | Installable application | Built-in | No | Proprietary |
| UIMA Ruta | UIMA | Installable application (Eclipse plugin) | UIMA-based (e.g. DKPro Core, …) | Yes (via Maven) | ALv2 |
| LAPPS Grid Galaxy | UIMA + GATE via Galaxy | Web-based, installable application | Multiple (e.g. GATE, DKPro Core, …) | Galaxy Tool Shed | ALv2 |
http://argo.nactem.ac.uk/
http://www.clarin-d.de/en/language-resources-and-services/weblicht
https://gate.ac.uk/family/developer.html
http://nactem.ac.uk/ucompare/
https://uima.apache.org/ruta.html
http://galaxy.lappsgrid.org/
A comparison of general purpose workflow engines
| Name | Description of modules | License | Example domains | Component creation | Language |
|---|---|---|---|---|---|
| ELKI | Data mining algorithms; clustering; outlier detection; dataset statistics; benchmarking, etc. | GNU AGPL | Cluster benchmarking | Programming new Java components | Java |
| Galaxy | Genome research; data access; visualization components | AFL 3 | Bioinformatics | Command line tools | Python |
| Kepler | Wide variety of components | BSD | Bioinformatics, data monitoring | Java components, R scripts, Perl, Python, compiled C code, WSDL services | Java |
| KNIME | Univariate and multivariate statistics; data mining; time series; image processing; web analytics; TM; network analysis; social media analysis | GPL3 | Business intelligence, financial data analysis | Java, Perl, Python code fragments | Java (Eclipse plugin) |
| Pegasus | Shell scripts; command line tools | Apache | Astronomy, bioinformatics, earthquake science | Command line | Java, Python, C |
| Pipeline Pilot | Chemistry; biology; materials modelling; simulation | Proprietary | Chemicals, energy, consumer packaged goods, aerospace | Users cannot create components | C++ |
| Taverna | Wide variety of components | LGPL | Bioinformatics, astronomy, chemo-informatics, health informatics | WSDL, SOAP and REST services, Beanshell scripts, local Java API, R scripts | Java |
| Triana | Audio, image, signal and text processing; physics studies | Apache | Signal processing | Programming new Java components | Java |
| SADI | Access to the databases and analytical tools for bioinformatics | BSD | Bioinformatics | Web services | OWL, RDF, SPARQL |
These can be used for a variety of scientific programming applications, of which one is TM. We have provided some examples of the typical usages of these resources in the table above.
http://elki.dbs.ifi.lmu.de/
https://galaxyproject.org/
https://kepler-project.org/
https://www.knime.org/knime-analytics-platform
https://pegasus.isi.edu/
http://accelrys.com/products/collaborative-science/biovia-pipeline-pilot/
http://www.taverna.org.uk/
http://www.trianacode.org/
http://sadiframework.org/content/
A comparison of repositories for tools and services that can be redeployed in the text miner's workflow
| Title | Available records | Accessibility | Status | Domain | Category |
|---|---|---|---|---|---|
| BioCatalogue | 1,184 | Open access/use and open registration | Running, last updated in 2015 | Life sciences | Registry |
| Biodiversity Catalogue | 71 | Open access/use and open registration | Running, last updated in 2015 | Biodiversity | Registry |
| Orbit | 89 | Open access/registration requires approval | Running, last updated in 2015 | Biomedical informatics | Registry |
| AnnoMarket | 60 | Paid for (customers can pay to use any service, third parties can upload their own services and data to sell) | Closed, last updated in 2014 | General | Platform |
| META-SHARE | More than 2,765 | Restricted (anyone can access but addition of new resources requires registering as a META-SHARE member) | Running, last updated in 2016 | General | Registry |
| LRE Map | 3985 | Closed (no option to add own resources) | Running, closed source | General | Registry |
| ALVEO | 34 | Open use of services; uploading of services locked | Running, last updated in 2016 | General | Platform |
| Language Grid | 142 | Open use of services for non-profit and research; uploading of services for members | Running, last updated in 2015 | General | Platform |
| LAPPS Grid | 45 | Open use of services; uploading of services locked | Running, last updated in 2016 | General | Platform |
| QT21 | 598 | Open browsing and use of services, restricted registry | Beta, closed source | General | Platform |
| LINDAT/CLARIN | 1162 | Open | Running, last updated in 2016 | Open | Registry |
| CLARIN Virtual Language Observatory | 880 915 | Open | Running, last updated in 2016 | Open | Registry |
There is a large variation in the size and accessibility of these repositories.
https://www.biocatalogue.org/
https://github.com/myGrid/biocatalogue
https://www.biodiversitycatalogue.org/
https://github.com/myGrid/biocatalogue
https://orbit.nlm.nih.gov/
https://annomarket.com
https://github.com/annomarket
http://www.meta-share.eu
https://github.com/metashare/META-SHARE
http://www.resourcebook.eu
http://alveo.edu.au
https://github.com/Alveo
http://langrid.org
http://svn.code.sf.net/p/servicegrid/code
http://www.lappsgrid.org
https://github.com/lappst
http://www.qt21.eu
https://lindat.mff.cuni.cz/en/
https://github.com/ufal/lindat-dspace
https://vlo.clarin.eu/
https://github.com/clarin-eric/VLO