Shweta Bagewadi, Subash Adhikari, Anjani Dhrangadhariya, Afroza Khanam Irin, Christian Ebeling, Aishwarya Alex Namasivayam, Matthew Page, Martin Hofmann-Apitius, Philipp Senger.
Abstract
Neurodegenerative diseases are chronic, debilitating conditions characterized by progressive loss of neurons, and they represent a significant health care burden as the global elderly population continues to grow. Over the past decade, high-throughput technologies such as Affymetrix GeneChip microarrays have provided new perspectives on the pathomechanisms underlying neurodegeneration. Public transcriptomic data repositories, namely Gene Expression Omnibus and curated ArrayExpress, enable researchers to conduct integrative meta-analyses, increasing the power to detect differentially regulated genes in disease and to explore patterns of gene dysregulation across biologically related studies. The reliability of retrospective, large-scale integrative analyses depends on an appropriate combination of related datasets, which in turn requires detailed meta-annotations capturing the experimental setup. In most cases, we observe huge variation in compliance with defined standards for submitted metadata in public databases. Much of the information needed to complete or refine meta-annotations is distributed across the associated publications. For example, tissue preparation or comorbidity information is frequently described in an article's supplementary tables. Several value-added databases have employed additional manual efforts to overcome this limitation. However, none of these databases explicate annotations that distinguish human and animal models in the context of neurodegeneration. Therefore, adopting a more specific disease focus, in combination with dedicated disease ontologies, will better empower the selection of comparable studies with refined annotations to address the research question at hand. In this article, we describe the detailed development of NeuroTransDB, a manually curated database containing metadata annotations for neurodegenerative studies.
The database contains more than 20 dimensions of metadata annotations within 31 mouse, 5 rat and 45 human studies, defined in collaboration with domain disease experts. We elucidate the step-by-step guidelines used to critically prioritize studies from public archives and their metadata curation and discuss the key challenges encountered. Curated metadata for Alzheimer's disease gene expression studies are available for download. Database URL: www.scai.fraunhofer.de/NeuroTransDB.html
Year: 2015 PMID: 26475471 PMCID: PMC4608514 DOI: 10.1093/database/bav099
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. Overall workflow for curation of gene expression studies related to neurodegeneration from public archives. The first step involves automated retrieval of gene expression studies (along with metadata) from public archives such as GEO and ArrayExpress. The related studies were then assigned to one of two prioritization classes (priority 1 or priority 2), based on specific experimental variables. Next, manual curation was applied to capture missing metadata information for priority 1 studies. All the harvested metadata were normalized using standard vocabularies. Both raw and normalized data are stored in NeuroTransDB.
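The normalization step in this workflow, with raw and normalized values stored side by side, could be sketched roughly as follows. This is only an illustration: the synonym table `GENDER_SYNONYMS` and the function `normalize_gender` are hypothetical, and the vocabularies actually used by NeuroTransDB are not listed in this section.

```python
# Hypothetical synonym table for one field; the real controlled
# vocabularies behind NeuroTransDB are not given in this section.
GENDER_SYNONYMS = {
    "m": "male", "male": "male", "man": "male",
    "f": "female", "female": "female", "woman": "female",
}


def normalize_gender(raw):
    """Return (raw, normalized); normalized is None for unknown values."""
    key = raw.strip().lower()
    return raw, GENDER_SYNONYMS.get(key)


# Keep the raw value next to the normalized one, as the caption describes.
records = [normalize_gender(value) for value in ["M", " Female ", "unknown"]]
```

Returning `None` for unrecognized values leaves them visible for a curator instead of silently guessing a term.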
Figure 2. Automated data retrieval of Alzheimer’s disease specific gene expression studies from ArrayExpress and GEO. Here, the dotted line represents the sequence of queries performed. Alzheimer’s disease specific experiment IDs were automatically retrieved from GEO and ArrayExpress, using keywords, through eSearch and a REST service, respectively. Metadata information was extracted by automatically parsing the sample information files (SDRF and SOFT) of these experiment IDs.
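A minimal sketch of the GEO side of this retrieval is given below, assuming the NCBI E-utilities eSearch endpoint with `db=gds` and the usual SOFT layout (`^SAMPLE = ...` followed by `!Sample_characteristics_ch1 = key: value` lines). The function names, the query term and the sample text are illustrative assumptions; the keywords actually used by the authors are not given here, and no network request is made.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, retmax=500):
    """Build (but do not send) an eSearch query against GEO DataSets."""
    return ESEARCH + "?" + urlencode({"db": "gds", "term": term, "retmax": retmax})


def parse_soft_samples(soft_text):
    """Collect 'key: value' characteristics per sample from SOFT text."""
    samples = {}
    current = None
    for line in soft_text.splitlines():
        line = line.strip()
        if line.startswith("^SAMPLE"):
            current = line.split("=", 1)[1].strip()
            samples[current] = {}
        elif line.startswith("^"):
            current = None  # another entity (series, platform) begins
        elif current and line.startswith("!Sample_characteristics_ch1"):
            value = line.split("=", 1)[1].strip()
            if ":" in value:
                key, val = value.split(":", 1)
                samples[current][key.strip().lower()] = val.strip()
    return samples


# Illustrative query term and made-up SOFT fragment.
url = build_esearch_url("Alzheimer's disease")
soft = """^SAMPLE = GSM0000001
!Sample_title = hippocampus, AD
!Sample_characteristics_ch1 = age: 84 years
!Sample_characteristics_ch1 = gender: female
^SAMPLE = GSM0000002
!Sample_characteristics_ch1 = age: 79 years
"""
parsed = parse_soft_samples(soft)
```

Fields that are absent from the SOFT file simply do not appear in the per-sample dictionary, which makes the gaps left for manual curation easy to list.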
Figure 3. Experiment prioritization for metadata curation. All the downloaded Alzheimer’s disease experiments were first checked for their disease relevance. Experiments that were falsely retrieved are marked as unrelated. The remaining experiments were classified into one of two priority classes based on the experiment type: in vivo or in vitro studies. For priority 1, we considered direct/primary samples from humans or animal models, such as brain tissue, blood, etc. Experiments that were conducted on derived sample sources, such as cell lines, were put into the priority 2 class.
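The decision rule in this caption can be written down as a small function. The sketch below is hypothetical: the classification in the study was done manually by curators, and the source sets contain only the examples named in the caption.

```python
# Example sample sources taken from the caption; a real list would be longer.
PRIMARY_SOURCES = {"brain tissue", "blood"}
DERIVED_SOURCES = {"cell line"}


def prioritize(disease_related, sample_source):
    """Assign an experiment to a class following the caption's rule."""
    if not disease_related:
        return "unrelated"
    source = sample_source.strip().lower()
    if source in DERIVED_SOURCES:
        return "priority 2"
    if source in PRIMARY_SOURCES:
        return "priority 1"
    # Anything not covered by the example sets is left to a curator.
    return "needs manual review"


labels = [
    prioritize(True, "Brain tissue"),
    prioritize(True, "cell line"),
    prioritize(False, "blood"),
    prioritize(True, "saliva"),
]
```

The relevance check comes first, so a falsely retrieved experiment is never given a priority class.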
Detailed description of Neurodegenerative disease metadata fields outlined for human, mouse and rat
| Annotation type | Metadata fields | Description of the annotation | Relevancy for NDD | Examples | References |
|---|---|---|---|---|---|
| Organism attributes | Age | Age of the organism | Main factor for predisposition to disease | 84 years, 9 months | ( |
| Organism attributes | Gender | Gender of the organism | Possible disproportionate effect arising from difference in anatomy and hormonal composition | Male, female | ( |
| Organism attributes | Phenotype | Clinical phenotypes of the organism from which the sample was extracted | Supports comparative analysis for underlying pathomechanisms based on the observable/measurable characteristics | Healthy control, early incipient | ( |
| Organism attributes | Behavioural Effect | Description of behavioural changes occurring in organism due to treatment or other effects | Impact of developed drug or other environmental factors to treat or reduce the disease/disease symptoms | Reduced agitation/aggression | ( |
| Organism attributes | Disease type | The disease occurrence is due to hereditary or effect of environmental factors | To distinguish the genetic variability and complexity between the two types during analysis | Sporadic, familial | ( |
| Organism attributes | Stage | Disease stage of the organism from which the sample was extracted | Capability to distinguish severity of the affected disease | Incipient, severe, BRAAK II | ( |
| Organism attributes | Cause of death | Reason for the organism’s death | To determine if Alzheimer’s disease or its associated comorbidities are major contributors to death rate | Respiratory disorder | ( |
| Organism attributes | Comorbidity | Existence of another disease other than Alzheimer’s | To determine the impact of another disease on Alzheimer’s disease aetiology and progression | Type 2 diabetes | ( |
| Sample annotations | Post mortem duration (PMD) | Duration from death till the sample extraction from the dead organism | To assess quality and reliability of the sample obtained by measuring RNA integrity that is influenced by natural degradation of the sample after death | 2.5 hours | ( |
| Sample annotations | pH | pH value of the extracted post-mortem sample | Indicator of agonal status and RNA integrity | 6 | ( |
| Sample annotations | Functional effect | Description of functional effects observed | Observed changes such as gene expression, post-translation, or pathway due to external effects | Decreased expression of BDNF gene, reduced Aβ toxicity | ( |
| Sample annotations | Brain region | Brain region of the extracted sample | Provides information of pathogenesis and disease progression, as AD does not affect all the brain regions simultaneously | Hippocampus | ( |
| Sample annotations | Cell and cell parts | Type of cells or cell parts extracted from the sample for analysis (if any) | To determine cell type specific expression influencing pathogenesis and regional vulnerability | Synaptoneurosome, neurons and astrocyte | ( |
| Sample annotations | Body Fluid | Type of body fluid used for analysis | Could serve as biomarkers for early diagnosis and therapy monitoring | CSF, blood | ( |
The table provides a list of metadata fields, confirmed by disease experts, critical for NDD meta-analysis. The selected fields are classified as organism attributes and sample annotations based on their relevancy to organism or sample source.
Detailed description of additional metadata fields, defined specifically for mouse and rat models
| Annotation type | Metadata fields | Description of the annotation | Relevancy for NDD | Examples | References |
|---|---|---|---|---|---|
| Organism attributes | Physical injury | Method used to cause brain injury in animal models | Consideration for analysing plaque formation in animal models to mimic disease symptoms in human | Traumatic brain injury, ischemia reperfusion injury | ( |
| Organism attributes | Type of treatment | Description of chemical, drug, genetic or diet treatment | Consideration for determining the effect of treatment on animal models either to mimic or treat the disease/symptoms | Long-term pioglitazone, BDNF treated | ( |
| Organism attributes | Dosage | Detailed description of the dosage associated with “type of treatment” description | Consideration of the right quantity of the substance for determining the effect on animal models either to mimic or treat the disease/symptoms | Total polyphenol 6 mg/kg/day, received drinking water without ACE inhibitor | ( |
| Organism attributes | Mouse/rat strain name | Mouse model official or author given name | To determine the effect of different manipulated animal models in recapitulating key AD features capable of extrapolating to human studies | C57BL/6-129 hybrid, Sprague–Dawley rat | ( |
| Organism attributes | Mouse/rat weight | Weight of the animal model during analysis | Establishing a causative link to metabolic disruption | 100–150 g | ( |
These additional metadata fields are defined by disease experts as critical for translating mouse/rat model outcomes to human, in the field of neurodegenerative diseases.
Figure 4. Semi-automated workflow for metadata curation. Automatically extracted metadata fields are rechecked by the curators. To capture missing fields, curators browse the experiment description pages in GEO, ArrayExpress (AE) or GEO2R. Where the information is still incomplete, the associated full-text publications and their supplementary material are read. All the extracted metadata annotations are stored in NeuroTransDB. In between, where feasible, the automated extraction is improved on the basis of the curators’ experience. This process is carried out every six months.
Figure 5. Distribution of MIAME and MINSEQE scores for all automatically retrieved Alzheimer’s disease gene expression experiments in the ArrayExpress database (for human, mouse and rat), as of December 2014. The percentage is calculated as (total number of AD experiments with a certain score)/(total number of AD experiments). ‘NA’ denotes experiments that were not present in ArrayExpress. These scores reflect the data submitters’ adherence to compliance standards, which is needed for re-investigation and reproducibility. A large percentage of experiments fall under score 4, which shows that the required minimum information is still incomplete. The list of experiment IDs, along with their associated scores, used for generating these statistics is provided in Supplementary File S1.
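The percentage defined in this caption can be computed as in the sketch below. The scores are made up for illustration and are not the values from Supplementary File S1; `score_distribution` is a hypothetical helper name.

```python
from collections import Counter


def score_distribution(scores):
    """Percentage of experiments per score: count(score) / total * 100."""
    counts = Counter(scores)
    total = len(scores)
    return {score: 100.0 * n / total for score, n in counts.items()}


# Made-up scores; "NA" marks experiments not present in ArrayExpress.
scores = [4, 4, 5, "NA", 3, 4, 5, "NA"]
dist = score_distribution(scores)
```

Counting "NA" as its own category keeps the denominator equal to the total number of retrieved AD experiments, as in the caption's formula.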
Figure 6. Priority classification statistics for Alzheimer’s disease gene expression experiments retrieved from ArrayExpress and GEO (for human, mouse and rat). Alzheimer’s disease experiments were retrieved using keywords. Applying the experiment prioritization guidelines, they were manually classified into one of the priority classes. Among them, 20% of the experiments were not related to Alzheimer’s disease. The digits on the bars represent the number of experiments.
Figure 7. Coverage of basic metadata annotation fields for human AD priority 1 samples with automated retrieval and manual curation. Automated retrieval involved programmatically downloading the metadata information from ArrayExpress and GEO. For missing meta-annotations, we applied a manual curation step to harvest information from the published articles and their associated supplementary materials. These statistics show that manual curation accuracy for basic annotations, such as the patient’s clinical manifestations and raw file information, is highly dependent on data availability.
Figure 8. Frequency distribution and trend analysis of human priority 1 Alzheimer’s disease gene expression experiments for availability of five basic annotation fields in the GEO/ArrayExpress sample page versus manual curation. The five basic annotations considered here are age, gender, stage, phenotype and raw filename. (A) The red and blue lines represent the linear trend in the availability of meta-annotations for experiments (represented as dots) over the years, which has declined. (B) The black line represents the mean number of annotation fields filled. The shift in the mean of the distribution shows that manual curation plays a very important role in capturing missing metadata information.
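The mean number of filled basic fields compared in panel (B) could be computed along the following lines. The field names follow the caption, while the records and the helper `filled_count` are made-up assumptions; each dictionary stands for one experiment.

```python
from statistics import mean

BASIC_FIELDS = ("age", "gender", "stage", "phenotype", "raw filename")


def filled_count(annotations):
    """Number of the five basic fields that carry a non-empty value."""
    return sum(
        1 for field in BASIC_FIELDS if annotations.get(field) not in (None, "")
    )


# Made-up experiments, before and after manual curation.
automated = [
    {"age": "84 years", "gender": "female"},
    {"raw filename": "sample_01.CEL", "stage": ""},
]
curated = [
    {"age": "84 years", "gender": "female", "stage": "BRAAK II",
     "phenotype": "healthy control"},
    {"raw filename": "sample_01.CEL", "stage": "severe", "age": "9 months"},
]

mean_automated = mean([filled_count(e) for e in automated])
mean_curated = mean([filled_count(e) for e in curated])
```

Empty strings are treated as missing, so a field that exists in the sample page but holds no value does not count as filled.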