Son Doan, Ko-Wei Lin, Mike Conway, Lucila Ohno-Machado, Alex Hsieh, Stephanie Feudjio Feupe, Asher Garland, Mindy K Ross, Xiaoqian Jiang, Seena Farzaneh, Rebecca Walker, Neda Alipanah, Jing Zhang, Hua Xu, Hyeon-Eui Kim.
Abstract
The database of genotypes and phenotypes (dbGaP) developed by the National Center for Biotechnology Information (NCBI) is a resource that contains information on various genome-wide association studies (GWAS) and is currently available via NCBI's dbGaP Entrez interface. The database is an important resource, providing GWAS data that can be used for new exploratory research or cross-study validation by authorized users. However, finding studies relevant to a particular phenotype of interest is challenging, as phenotype information is presented in a non-standardized way. To address this issue, we developed PhenDisco (phenotype discoverer), a new information retrieval system for dbGaP. PhenDisco consists of two main components: (1) text processing tools that standardize phenotype variables and study metadata, and (2) information retrieval tools that support queries from users and return ranked results. In a preliminary comparison involving 18 search scenarios, PhenDisco showed promising performance for both unranked and ranked search comparisons with dbGaP's search engine Entrez. The system can be accessed at http://pfindr.net.
Keywords: DBGAP; GWAS; Information Retrieval; Natural Language Processing; Phenotype Standardization; Text Mining
Year: 2013 PMID: 23989082 PMCID: PMC3912702 DOI: 10.1136/amiajnl-2013-001882
Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 4.497
Figure 1. Screenshot of the PhenDisco system. The top panel contains a search input box with concept-based search (i.e., expandable terms) as the default.
Figure 2. Components of the PhenDisco system: (1) the sdGaP (semantic-driven genotypes and phenotype) database, which contains standardized phenotype variables and study metadata from dbGaP, and (2) information retrieval tools that parse input queries, map them into the information model, and return ranked studies. sdGaP consists of data from dbGaP that are mapped into our information model, as well as study metadata.
List of 18 user-defined queries used for pilot evaluation
| Case no. | Query |
|---|---|
| 1 | Asthma |
| 2 | Asthma AND ‘African American’ |
| 3 | Asthma AND ‘African American’ AND Hispanic |
| 4 | Asthma AND ‘African American’ AND ‘skin test’ |
| 5 | Asthma AND ‘African American’ AND Hispanic AND ‘skin test’ |
| 6 | Asthma AND ‘African American’ AND FEV1 |
| 7 | Asthma AND ‘African American’ AND Hispanic AND FEV1 |
| 8 | Asthma AND ‘skin test’ |
| 9 | COPD |
| 10 | ‘Chronic obstructive pulmonary disease’ AND Caucasian |
| 11 | ‘Chronic obstructive pulmonary disease’ AND Caucasian AND ‘high cholesterol’ |
| 12 | COPD AND hypercholesterolemia |
| 13 | COPD AND FVC |
| 14 | ‘Chronic obstructive pulmonary disease’ AND Caucasian AND FVC |
| 15 | ‘Myocardial infarction’ |
| 16 | ‘Myocardial infarction’ AND black |
| 17 | MI AND BMI |
| 18 | ‘Myocardial infarction’ AND black AND BMI |
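The queries in the table are conjunctions of single terms and quoted phrases. As a rough illustration of how such a query could be split into terms before matching, here is a minimal sketch. The function names `parse_and_query` and `matches_all` are hypothetical, and plain substring matching stands in for PhenDisco's concept-based search, which is not described in enough detail here to reproduce.

```python
import re


def parse_and_query(query: str) -> list[str]:
    """Split a conjunctive query into terms, keeping quoted phrases whole.

    Assumption: terms are joined only by the upper-case keyword AND, and
    quoted phrases do not themselves contain ' AND '.
    """
    # Normalize typographic quotes to a plain apostrophe.
    normalized = query.replace("\u2018", "'").replace("\u2019", "'")
    terms = []
    for part in re.split(r"\s+AND\s+", normalized.strip()):
        # Drop surrounding whitespace and quote characters.
        term = part.strip().strip("'").strip()
        if term:
            terms.append(term)
    return terms


def matches_all(terms: list[str], text: str) -> bool:
    """Return True if every term occurs in the text (case-insensitive).

    Assumption: naive substring matching, used here only for illustration.
    """
    lowered = text.lower()
    return all(term.lower() in lowered for term in terms)


if __name__ == "__main__":
    terms = parse_and_query("Asthma AND \u2018African American\u2019 AND FEV1")
    print(terms)
    # Hypothetical study description, not taken from dbGaP.
    print(matches_all(terms, "FEV1 measurements in African American children with asthma"))
```

Substring matching would not treat "COPD" and "chronic obstructive pulmonary disease" as the same concept, which is exactly the kind of gap that standardized phenotype variables are meant to close.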
Information retrieval performance of PhenDisco versus dbGaP on 18 user case queries
| System | Precision | Recall | F-measure | MRP (top 5) | MAP |
|---|---|---|---|---|---|
| dbGaP Entrez | 0.0756 | 0.5278 | 0.1321 | 0.0600 | 0.0756 |
| PhenDisco | 0.3000 | 0.9722 | 0.4552 | 0.4000 | 0.2971 |
MRP (top 5) is the mean rank precision over the top five retrieved studies; MAP is the mean average precision.
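For readers unfamiliar with these measures, the sketch below shows the standard textbook definitions for a single query. It is not the authors' evaluation code. The study identifiers are made up, and reading "rank precision at top five" as precision at k = 5 (hits in the top k divided by k) is an assumption. The table values would then be means of such per-query scores over the 18 queries, although the exact averaging procedure is not stated in this record.

```python
def precision_recall_f1(retrieved, relevant):
    """Unranked precision, recall, and F-measure for one query."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


def precision_at_k(ranked, relevant, k=5):
    """Fraction of the top-k ranked results that are relevant.

    Assumption: divides by k even if fewer than k results were returned.
    """
    relevant_set = set(relevant)
    return sum(1 for item in ranked[:k] if item in relevant_set) / k


def average_precision(ranked, relevant):
    """Average of the precision values at each rank where a relevant item appears,
    divided by the total number of relevant items."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant_set:
            hits += 1
            total += hits / rank
    return total / len(relevant_set)


def mean(values):
    """Arithmetic mean, used to average per-query scores across queries."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


if __name__ == "__main__":
    # Hypothetical ranked result list and relevance judgments for one query.
    ranked = ["study_a", "study_b", "study_c", "study_d", "study_e"]
    relevant = {"study_a", "study_c", "study_x"}
    print(precision_recall_f1(ranked, relevant))
    print(precision_at_k(ranked, relevant, k=5))
    print(average_precision(ranked, relevant))
```

Averaging per-query F-measures generally gives a different number than taking the harmonic mean of the averaged precision and recall, which may explain why the F-measure column does not follow exactly from the other two columns.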