Donald C Comeau, Rezarta Islamaj Doğan, Paolo Ciccarese, Kevin Bretonnel Cohen, Martin Krallinger, Florian Leitner, Zhiyong Lu, Yifan Peng, Fabio Rinaldi, Manabu Torii, Alfonso Valencia, Karin Verspoor, Thomas C Wiegers, Cathy H Wu, W John Wilbur.
Abstract
A vast amount of scientific information is encoded in natural language text, and the quantity of such text has become so great that it is no longer economically feasible to have a human as the first step in the search process. Natural language processing and text mining tools have become essential to facilitate the search for and extraction of information from text. This has led to vigorous research efforts to create useful tools and to create humanly labeled text corpora, which can be used to improve such tools. To encourage combining these efforts into larger, more powerful and more capable systems, a common interchange format to represent, store and exchange the data in a simple manner between different language processing systems and text mining tools is highly desirable. Here we propose a simple extensible mark-up language format to share text documents and annotations. The proposed annotation approach allows a large number of different annotations to be represented including sentences, tokens, parts of speech, named entities such as genes or diseases and relationships between named entities. In addition, we provide simple code to hold this data, read it from and write it back to extensible mark-up language files and perform some sample processing. We also describe completed as well as ongoing work to apply the approach in several directions. Code and data are available at http://bioc.sourceforge.net/. Database URL: http://bioc.sourceforge.net/
Year: 2013 PMID: 24048470 PMCID: PMC3889917 DOI: 10.1093/database/bat064
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. BioC process sequence.
Figure 2. BioC.dtd.
Elements in the BioC.dtd
| Element | Description |
|---|---|
| collection | A group of documents, usually from a known corpus. |
| source | Name of the corpus or other source where the documents were obtained. |
| key | Reference to a separate document describing the details of the BioC XML file. It should include all information needed to interpret the data in the file, such as the types used to describe passages and annotations. For example, if a file includes part-of-speech tags, the key file should describe the part-of-speech tags used. An HTML URL would also be a useful way to reference a key file. |
| date | Date when the documents were extracted from the original corpus. It may be as simple as YYYYMMDD, but any reasonable format described in the key file is acceptable. |
| infon | Key-value pairs can record essentially arbitrary information. Attribute: key. For example, key = ‘type’ will be particularly common. For PubMed documents, passage ‘type’ might signal ‘title’ or ‘abstract’. For annotation elements, it might indicate ‘noun phrase’, ‘gene’ or ‘disease’. The semantics encoded in the infon key-value pairs should be described in the key file. |
| document | A document in the collection. A single, complete and stand-alone document. |
| id | id of the document in the parent corpus. Should be unique in the collection. |
| passage | One portion of the document. PubMed documents have a title and an abstract. Structured abstracts could have additional passages. For full-text documents, passages could be sections such as Introduction, Materials and Methods or Conclusion. Another option would be paragraphs. Passages impose a linear structure on the document. |
| offset | Where the element occurs in the parent document. Offsets should be sequential, avoid overlap and identify an element’s position in the document. An element’s position is specified with respect to the whole document and not relative to its parent element’s position. |
| text | The original text of the element. |
| sentence | One sentence of the passage. |
| annotation | Stand-off annotation. Attribute: id. |
| location | Location of the annotated text. Multiple locations indicate a multispan annotation. Attributes: offset, length. |
| relation | Relation between annotations and/or other relations. Attribute: id. |
| node | The annotations and/or other relations in this relation. Attributes: refid, role. |
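
The element descriptions above can be made concrete with a small sketch. The Python below is an illustration only, not the code distributed by the authors: it parses a hand-written collection with the standard library and lists each annotation with its infons and locations. The element and attribute names follow the publicly distributed BioC DTD as understood here; the source name, date, key file name, document id and the infon value ‘disease’ are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the source name, date, key file name and document id
# are invented for illustration. The single annotation marks "lung cancer".
BIOC_XML = """<collection>
  <source>example-corpus</source>
  <date>20130101</date>
  <key>example.key</key>
  <document>
    <id>doc-1</id>
    <passage>
      <infon key="type">abstract</infon>
      <offset>0</offset>
      <text>The efficacy of computed tomography (CT) screening for early lung cancer detection in heavy smokers is currently being tested by a number of randomized trials.</text>
      <annotation id="D1">
        <infon key="type">disease</infon>
        <location offset="61" length="11"/>
        <text>lung cancer</text>
      </annotation>
    </passage>
  </document>
</collection>"""


def read_annotations(xml_text):
    """Return one dict per annotation: document id, annotation id, infons, spans, text."""
    root = ET.fromstring(xml_text)
    rows = []
    for document in root.iter("document"):
        doc_id = document.findtext("id")
        for annotation in document.iter("annotation"):
            # Infons are free-form key-value pairs; their meaning lives in the key file.
            infons = {i.get("key"): i.text for i in annotation.findall("infon")}
            # More than one location element means a multispan annotation.
            spans = [(int(loc.get("offset")), int(loc.get("length")))
                     for loc in annotation.findall("location")]
            rows.append({
                "document": doc_id,
                "id": annotation.get("id"),
                "infons": infons,
                "spans": spans,
                "text": annotation.findtext("text"),
            })
    return rows


if __name__ == "__main__":
    for row in read_annotations(BIOC_XML):
        print(row)
```

Because annotations are stand-off, the passage text is never modified; a consumer that does not understand a given infon key can skip it and still read the rest of the file.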
Figure 3. The exampleCollection.xml.
Figure 4. The exampleCollection.key file describing the elements of the exampleCollection.xml file.
Figure 5. The exampleSentence.xml.
Figure 6. The exampleAnnotation.xml.
Possible annotations in the BioC format
| id | Infon key: value | Location offset | Location length | Text | Comments |
|---|---|---|---|---|---|
| T4 | | 25 | 10 | Tomography | Part of speech tagging |
| L14 | | 92 | 7 | Smokers | Lemmatization of token |
| A1 | | 16 | 19 | Computed tomography | Abbreviation (ABRV) definition in text |
| A2 | | 37 | 2 | CT | Abbreviation in text |
| D1 | | 61 | 11 | Lung cancer | Disease name mention in text |
| D1 | | | | | Concept in terminology resource |
| E1 | | 16 | 19 | Computed tomography screening | Segmented mention annotation |
| | | 41 | 9 | | |
The efficacy of computed tomography (CT) screening for early lung cancer detection in heavy smokers is currently being tested by a number of randomized trials.
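
The offsets and lengths listed for this sentence can be checked mechanically. The sketch below assumes zero-based character offsets and that the sentence starts at document offset 0; both assumptions are consistent with the numbers in the table but are not stated in it. Joining the pieces of a multispan annotation with a single space, and comparing case-insensitively because the table capitalizes the first letter of each text entry, are choices made for this illustration.

```python
SENTENCE = (
    "The efficacy of computed tomography (CT) screening for early lung cancer "
    "detection in heavy smokers is currently being tested by a number of "
    "randomized trials."
)

# (annotation id, [(offset, length), ...], text as printed in the table)
ANNOTATIONS = [
    ("T4", [(25, 10)], "Tomography"),
    ("L14", [(92, 7)], "Smokers"),
    ("A1", [(16, 19)], "Computed tomography"),
    ("A2", [(37, 2)], "CT"),
    ("D1", [(61, 11)], "Lung cancer"),
    ("E1", [(16, 19), (41, 9)], "Computed tomography screening"),
]


def covered_text(text, spans, base_offset=0):
    """Slice each (offset, length) span out of text and join the pieces with a space.

    base_offset is the document offset of the first character of text, since
    positions are given with respect to the whole document.
    """
    pieces = []
    for offset, length in spans:
        start = offset - base_offset
        pieces.append(text[start:start + length])
    return " ".join(pieces)


def mismatched_ids(text, annotations, base_offset=0):
    """Return ids whose spans do not reproduce the listed text (ignoring case)."""
    bad = []
    for ann_id, spans, expected in annotations:
        if covered_text(text, spans, base_offset).lower() != expected.lower():
            bad.append(ann_id)
    return bad


if __name__ == "__main__":
    for ann_id, spans, _ in ANNOTATIONS:
        print(ann_id, spans, repr(covered_text(SENTENCE, spans)))
    print("mismatches:", mismatched_ids(SENTENCE, ANNOTATIONS))
```

A check of this kind is useful when exchanging files between tools, since an off-by-one in offsets silently shifts every stand-off annotation.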