Ronilda Lacson, Erik Pitzer, Christian Hinske, Pedro Galante, Lucila Ohno-Machado.
Abstract
BACKGROUND: This study describes a large-scale manual re-annotation of data samples in the Gene Expression Omnibus (GEO), using variables and values derived from the National Cancer Institute thesaurus. A framework is described for creating an annotation scheme for various diseases that is flexible, comprehensive, and scalable. The annotation structure is evaluated by measuring coverage and agreement between annotators.
Year: 2009 PMID: 19761564 PMCID: PMC2745681 DOI: 10.1186/1471-2105-10-S9-S10
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
(a) Taken from GDS, showing three axes ("cell line", "disease state", and "stress") with corresponding values; (b) taken from GDS, showing cell line descriptors.
| Cell line | HTB26 |
| Cell line | HT29 |
| Disease state | Breast cancer |
| Disease state | Colon cancer |
| Stress | Caspase inactivated |
| Stress | DNA fragmented |
| Cell line | Breast tumor |
| Cell line | Colon tumor |
Figure 1. Illustration of concepts derived from NCI thesaurus used for variables and values.
Criteria for measuring agreement
| Strict similarity | Exactly the same variable value between annotators. |
| Semantic similarity | There is lexical discordance, but the words match to the same concept. This subsumes hierarchical similarity. |
| Partial similarity | Partial agreement, some degree of discordance. |
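The three criteria above can be read as an ordered check applied to each pair of annotated values. The following is a minimal sketch of that reading, not the procedure used in the study: the `CONCEPTS` lookup is a hypothetical stand-in for a thesaurus mapping, its entries are illustrative only, the case and whitespace normalization is an assumption, and the prefix or shared-token test for partial similarity is an assumed heuristic, since the table does not define "some degree of discordance" further.

```python
from enum import Enum
from typing import Iterable, Mapping, Tuple


class Agreement(Enum):
    STRICT = "strict"
    SEMANTIC = "semantic"
    PARTIAL = "partial"
    NONE = "none"


# Hypothetical, illustrative mapping from a normalized value to a concept label.
# These entries are NOT taken from the NCI thesaurus.
CONCEPTS = {
    "tumor": "neoplasm",
    "neoplasm": "neoplasm",
}


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(value.lower().split())


def classify(a: str, b: str, concepts: Mapping[str, str]) -> Agreement:
    """Label one pair of annotator values with the strongest criterion it meets."""
    a_norm, b_norm = normalize(a), normalize(b)
    if a_norm == b_norm:
        return Agreement.STRICT
    if not a_norm or not b_norm:
        return Agreement.NONE
    concept_a = concepts.get(a_norm)
    if concept_a is not None and concept_a == concepts.get(b_norm):
        return Agreement.SEMANTIC
    # Assumed heuristic: one value extends the other, or they share a token.
    shares_token = bool(set(a_norm.split()) & set(b_norm.split()))
    if a_norm.startswith(b_norm) or b_norm.startswith(a_norm) or shares_token:
        return Agreement.PARTIAL
    return Agreement.NONE


def cumulative_rates(labels: Iterable[Agreement]) -> Tuple[float, float, float]:
    """Return percentages for strict, strict+semantic, and strict+semantic+partial."""
    labels = list(labels)
    if not labels:
        return (0.0, 0.0, 0.0)
    strict = sum(1 for label in labels if label is Agreement.STRICT)
    semantic = sum(1 for label in labels if label is Agreement.SEMANTIC)
    partial = sum(1 for label in labels if label is Agreement.PARTIAL)
    total = len(labels)
    return (
        100.0 * strict / total,
        100.0 * (strict + semantic) / total,
        100.0 * (strict + semantic + partial) / total,
    )
```

Under these assumptions, a pair such as stage `2` versus `2a` counts as partial, while `no` versus `yes` counts as no agreement. The cumulative form of `cumulative_rates` is a design choice that assumes each looser criterion also includes the stricter ones.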
Disease categories annotated from GEO
| Breast Cancer | 41 |
| Colon Cancer | 30 |
| Inflammatory Bowel Disease (IBD) | 30 |
| Insulin Dependent Diabetes Mellitus (DM) | 21 |
| Rheumatoid Arthritis (RA) | 19 |
| Systemic Lupus Erythematosus (SLE) | 32 |
Sample variables that are annotated for three disease categories – breast and colon cancer and rheumatoid arthritis
| Breast cancer | Age | ER/PR |
| Colon cancer | Duke staging | |
| Rheumatoid arthritis | Cell type | |
Coverage of the top ten variables
| Variable | NCI thesaurus concept code | Coverage (%) |
| Tissue | C12801 | 99.7 |
| Cell line | C16403 | 99.5 |
| Disease state | C2991 | 98.9 |
| Sample type | C70713 | 98.0 |
| Genetically modified | C16621+C42629 | 92.8 |
| Treatment | C49236 | 76.2 |
| Treatment type | C49236+C27993 | 71.5 |
| Time series | C18235 | 67.2 |
| Gender | C17357 | 59.9 |
| Age | C25150 | 53.2 |
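A coverage figure of this kind can be computed as the share of samples that carry a value for a given variable. The sketch below assumes that definition, which the table itself does not spell out. The sample records and the rule that only absent, `None`, or blank values count as missing are illustrative assumptions, not data or rules from the study.

```python
from typing import Iterable, Mapping, Optional


def is_annotated(value: Optional[str]) -> bool:
    """Assumed rule: a value counts as present unless it is None or blank."""
    return value is not None and value.strip() != ""


def coverage(samples: Iterable[Mapping[str, Optional[str]]], variable: str) -> float:
    """Percentage of samples that have a value for `variable`."""
    samples = list(samples)
    if not samples:
        return 0.0
    covered = sum(1 for sample in samples if is_annotated(sample.get(variable)))
    return 100.0 * covered / len(samples)


# Made-up sample annotations, for illustration only.
example_samples = [
    {"Tissue": "breast", "Gender": "female"},
    {"Tissue": "colon", "Gender": ""},
    {"Tissue": "colon"},
    {"Gender": "male"},
]

tissue_coverage = coverage(example_samples, "Tissue")
gender_coverage = coverage(example_samples, "Gender")
```

With these four made-up records, `tissue_coverage` is 75.0 and `gender_coverage` is 50.0. Whether a placeholder value such as "unknown" should count as covered is a separate decision that this sketch leaves open.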
Inter-annotator agreement
| Strict | 89.3 |
| Semantic | 91.0 |
| Semantic + Partial | 92.2 |
Disagreement between Annotators
| Variable | Annotator 1 | Annotator 2 |
| Treatment type | unknown | no |
| Treatment | unknown | yes |
| Sample type | unknown | tumor |
| Stage | 2 | 2a |
| TNM classification | T4b N2a M0 | T4b N2a M3b |
| Family history | no | yes |