Literature DB >> 23926434

Text Categorization of Heart, Lung, and Blood Studies in the Database of Genotypes and Phenotypes (dbGaP) Utilizing n-grams and Metadata Features.

Mindy K Ross¹, Ko-Wei Lin, Karen Truong, Abhishek Kumar, Mike Conway.

Abstract

The database of Genotypes and Phenotypes (dbGaP) allows researchers to understand phenotypic contribution to genetic conditions, generate new hypotheses, confirm previous study results, and identify control populations. However, effective use of the database is hindered by suboptimal study retrieval. Our objective is to evaluate text classification techniques to improve study retrieval in the context of the dbGaP database. We utilized standard machine learning algorithms (naive Bayes, support vector machines, and the C4.5 decision tree) trained on dbGaP study text and incorporated n-gram features and study metadata to identify heart, lung, and blood studies. We used the χ(2) feature selection algorithm to identify features that contributed most to classification performance and experimented with dbGaP associated PubMed papers as a proxy for topicality. Classifier performance was favorable in comparison to keyword-based search results. It was determined that text categorization is a useful complement to document retrieval techniques in the dbGaP.

Entities: CellLine Chemical Disease Gene Species

Keywords: GWAS; database; genome-wide association studies; natural language processing 35/45; text categorization; text classification

Year: 2013 PMID： 23926434 PMCID： PMC3728208 DOI： 10.4137/BII.S11987

Source DB: PubMed Journal: Biomed Inform Insights ISSN： 1178-2226

Introduction

In 2003, the National Institute of Health (NIH) required that certain funded projects include a data-sharing plan. Motivated by the idea of understanding phenotypic influence on genetic disease, these policies were later expanded to include NIH funded genome-wide associated studies (GWAS).1,2 In order to facilitate implementation, a central data repository, the database of Genotypes and Phenotypes (dbGaP), was created by the National Center for Biotechnology Information (NCBI) to provide researchers access to the genotypic and phenotypic information. The database includes specific phenotype variables, statistical summaries of genetic information, and offers potential to access individual level data if approved by an NIH Data Access Committee. The database is growing at a rapid pace. In August 2011, 187 top-level studies (a study comprised of sub-studies) were archived in dbGaP, and by December 2012 there were 357 top-level studies.3 The existence of a publicly accessible database, however, does not guarantee information is available in a suitable form for efficient retrieval, study replication, identification of control populations, or new hypothesis generation.4 Major contributing factors to suboptimal study reuse are that phenotypic variable names are not standardized and related concepts are not effectively mapped. Approximately 130,000 variable names exist in the database and many are redundant. For example, systolic blood pressure is represented by SBP, systolic_BP, and other variations such that a string match-based search may miss these synonyms. Alternatively, if the search term “white” is entered as a keyword search instead of “Caucasian,” for example, the retrieved results for each search do not match. Genetic studies are expensive and time-consuming, hence maximizing data reuse is of paramount importance.5 We are approaching this problem by aligning phenotype variable descriptions to a standard information model in two ways. First, our group has developed PhenDisco—Phenotype Discoverer (http://pfindr.net/)—a robust tool for researchers to query and upload studies in a standardized fashion by retrofitting phenotypes in dbGaP with ontologies using natural language processing. Second, we are enhancing PhenDisco by integrating automatic document classification techniques into the system. This paper is focused on the second objective as we explore approaches to automatic document classification, motivated by the need to provide enhanced search functionality to the PhenDisco system. The exploding number of scientific publications in recent decades and an increasing number of databases makes automated document classification in biomedicine extremely important to provide accurate data retrieval, organize topics of interest for research, and streamline costs of data curation. While there is a growing body of literature in the field of biomedical text classification, a search of PubMed and Google Scholar revealed no publications about text classification applied to dbGaP. We aim to (1) describe in detail the attributes of dbGaP studies, and (2) improve text categorization utilizing n-grams and metadata features. We focus on heart (cardiac), lung (pulmonary and/or respiratory), and blood (heme) studies. These categories were chosen due to their importance to the dbGaP host organization, the National Heart, Lung, and Blood Institute (NHLBI).

Methods

Three hundred and seventeen studies were available in dbGaP on July 1, 2012. Each title and abstract was manually reviewed and annotated by MKR and KWL into heart, lung, blood, and other categories. Interrater reliability was calculated using the R programming language (package IRR) implementation of Cohen’s kappa (two raters) and Fleiss’s kappa (three raters).6 We confirmed the need for enhanced retrieval by performing a simple manual keyword search experiment. Search terms (asterisk = wildcard search) for heart studies were heart and/or card*. Lung study terms were lung and/or pulm* and/or resp*. The terms blood and/or heme* were entered for blood studies. Three machine learning algorithms used successfully for text classification in the past were applied in this work: naïve Bayes (NB), support vector machines (SVM), and the C4.5 decision tree learning algorithm (Weka v. 3.6.8 using default parameters). The NB algorithm is frequently utilized for text classification with good results despite its assumption that features are independent. The SVM algorithm is commonly used because it is robust and resilient to over-fitting. Decision tree algorithms are another commonly applied tool, but incur the risk of mistakes early in the training process.7 Below we describe the experiments in detail. First, we explain the study metadata features. Second, we detail text classification experiments integrating n-grams and metadata features. Finally, we outline the χ2 feature selection algorithm experiments. In addition to text classification based directly on dbGaP studies, we also experimented with using PubMed listed abstracts associated with each study as proxy representation of dbGaP documents. All classifiers were evaluated using Weka with stratified 10-fold cross validation. Evaluation metrics used were accuracy, precision, recall, and F-measure (Table 1). The F-measure is the harmonic mean between precision and recall. Results of the manual keyword search demonstrate the opportunity for improvement in accuracy, precision, recall, and F-measure (Table 2).

Table 1

Contingency table and measurement definitions.

	Correct label	Incorrect label
Assigned label	a	b

Not assigned label	c	d

Accuracy = a +d/a +b +c +d
Precision = a/a +b
Recall = a/a +c
F-measure = 2 × precision × recall/precision + recall

Table 2

dbGaP keyword search results.

	Heart	Lung	Blood
Accuracy	0.64	0.66	0.41
Precision	0.31	0.14	0.07
Recall	0.72	0.75	0.76
F-measure	0.43	0.23	0.13

Notes: dbGaP manual keyword search results of heart, lung and blood studies based on gold-standard label assignment by investigators MKR and KWL.

Metadata features

Each dbGaP study Web page is organized into descriptive sections (Fig. 1). The sections are: study description, authorized access, publicly available data, study inclusion/exclusion criteria, molecular data, study history, selected publications, disease related to study (MeSH terms), links to other NCBI resources, authorized data access requests, and study attribution (principal/co-investigators and funding source). We focused on journal publications, MeSH terms, principal/co-investigators, and funding source as these features have been shown to increase classification accuracy in previous studies.8–10

Figure 1

Metadata features of studies in dbGaP.

Note: Each dbGaP study is organized into descriptive sections.

Abbreviations: dbGaP, Database of Genotypes and Phenotypes; IRB, Institutional Review Board; ftp, File Transfer Protocol; MeSH, Medical Subject Headings; NCBI, National Center for Biotechnology Information.

n-gram and metadata experiments

The first set of experiments used n-gram document representation derived from study titles and descriptions in addition to selected metadata. n-grams provide a means of facilitating statistical analysis by characterizing text in terms of n-consecutive sequences of tokens. For example, performed pulmonary function test separates into unigrams of performed, pulmonary, function, and test and bigrams of performed pulmonary, pulmonary function, and function test, which assists identification as a pulmonary test. Another test, such as cardiac function stress test, can be differentiated by unigrams of cardiac, function, stress, and test and bigrams cardiac function, function stress, and stress test. Metadata for each study was automatically extracted from the dbGaP website (http://www.ncbi.nlm.nih.gov/gap) using Python scripts. For our first set of experiments we employed unigrams, bigrams, and the study metadata (journals, MeSH terms, principal and co-investigators, and funding source). Accuracy, precision, recall, and F-measure were calculated using Weka.

Feature selection experiment

Our next experiment focused on feature set optimization. Yang and Pedersen have shown that feature selection (in particular the χ2 and information gain feature selection algorithms) can improve classification accuracy for some text classification tasks.11,12 We used the Weka implementation of the χ2 feature selection algorithm in this work. All n-grams (unigrams/bigrams) and metadata features were combined and the feature selection threshold (optimal number of features for best performance) was determined experimentally by incrementing the number of features by 100 until the maximum number of features had been reached. We ran feature selection experiments using the NB and SVM algorithms. Accuracy, precision, recall, and F-measure were calculated for each additional feature combination and the optimal threshold was determined.

PubMed experiment

Due to the relative paucity of training data in dbGaP, our last experiment examined feasibility of using study-associated PubMed indexed articles as a representation for topicality (eg, if 20 PubMed heart studies are associated with a dbGaP study, but only one lung study, then the study is likely to be a heart study). A similar method was applied to classifying company and university websites with some success by Ghani et al.13 To develop our training corpus, MKR manually chose 100 PubMed articles associated with dbGaP studies at random in each category of heart, lung, and blood and 300 PubMed studies unrelated to heart, lung, or blood topics for a total of 600 PubMed studies. A binary classifier was used for each category. For example, a study would be categorized as heart or other, lung or other, and blood or other. In each dbGaP study, the associated PubMed studies are found in the selected publications section under which the study authors, title, and journal article are recorded with a PubMed hyperlink. Of the chosen studies, approximately 60 PubMed studies directly associated with dbGaP studies in each category of heart, lung, and blood. However, not all dbGaP studies have an associated PubMed study; therefore, 40 studies with the same topicality were chosen at random from the PubMed database (with same search terms used in the keyword experiments), for a total of 100 studies per heart, lung, and blood category. MC and KWL categorized MKR’s chosen studies and inter-rater reliability was calculated using Fleiss’ Kappa for three raters. Each discrepant classification was assigned a label after discussion and majority vote. A corpus was created from the PubMed study title and abstract and categorized using NB and SVM classifiers with n-gram (unigram and bigram) based feature representation.

Results

We detail our results in four sections. First, we report on some salient characteristics of the dbGaP metadata. Second, we describe the n-gram and metadata classification experiments. Third, we present results from the feature selection experiment. Finally, we state the findings from our PubMed proxy article experiment. In this section we report details of four types of metadata associated with dbGaP studies: journals, MeSH terms, principle and co-investigators, and funding sources.

Journals

There were 4707 journals publications linked to all dbGaP studies. Of these, 606 were unique instances. The mean number of articles per dbGaP study was 15.4 ± 101.81, ranging from 0–1514. The journals most frequently linked to dbGaP studies were of general topicality, cardiology, epidemiology, or stroke (Table 3).

Table 3

Most frequent journals linked to dbGaP studies.

Journal name	Frequency	Percentage (n = 4707)
American Journal of Epidemiology	269	5.7
Circulation	228	4.8
American Journal of Cardiology	146	3.1
JAMA: Journal of the American Medical Association	127	2.7
Atherosclerosis	125	2.7
Stroke	124	2.6
American Heart Journal	118	2.5
The New England Journal of Medicine	116	2.5
Nature Genetics	110	2.3
Diabetes Care	100	2.1
Annals of Epidemiology	91	1.9

Note: Cardiovascular-related subject matter was most common among journals.

Medical Subject Heading terms

On average, dbGaP studies are associated with 2.24 ± 5.46 Medical Subject Heading (MeSH) terms, ranging from 0–69. There were 771 total terms, with cardiovascular disease, stroke, obesity, and smoking being the most common topics. Terms with a frequency of at least six are represented in Table 4.

Table 4

Most frequent MeSH terms in dbGaP.

MeSH terms	Frequency	Percentage (n = 771)
Myocardial infarction	22	2.9
Cardiovascular diseases	20	2.6
Stroke	20	2.6
Obesity	15	2.0
Smoking	10	1.3
Diabetes mellitus, type 2	8	1.0
Prostatic neoplasms	7	0.9
Breast neoplasms	6	0.8
Cardiovascular system	6	0.8
Cholesterol	6	0.8
Dementia	6	0.8
Diabetic nephropathies	6	0.8
Heart disease	6	0.8
Intermittent claudication	6	0.8
Lung neoplasms	6	0.8
Osteoporosis	6	0.8
Parkinson disease	6	0.8
Risk factors	6	0.8

Notes: Medical Subject Heading (MeSH) terms utilized with frequency of least six in dbGaP. The most frequent terms are those relating to cardiovascular disease, obesity, and smoking.

Principal and co-investigators

There were 903 principal investigators and co-principal investigators associated with studies in dbGaP. On average each investigator was associated with a mean of 3.99 ± 8.22 distinct studies. Fifty-one investigators were associated with three or more dbGaP studies.

Funding sources

The funding agency most frequently linked to dbGaP studies supported 111 studies. The mean number of studies supported per funding body was 3.44 ± 0.93 ranging from 1–111. Studies were predominately funded by the National Human Genome Research Institute (36%) and the National Heart, Lung, and Blood Institute (12%). The most frequent funding sources are listed in Table 5.

Table 5

Most frequent funding sources in dbGaP.

Funding source agency	Frequency	Percentage (n = 359)
National Human Genome Research Institute	111	36.4
National Heart, Lung, and Blood Institute	38	12.5
National Cancer Institute	21	6.9
National Institute on Aging	7	2.3
Carlos Slim Health Institute	7	2.3
Geisinger Clinic	4	1.3
National Center for Research Resources	4	1.3
National Institute of Diabetes and Digestive and Kidney Diseases	4	1.3
National Institute of Drug Abuse	4	1.3
National Institute of Neurological Disorders and Stroke	4	1.3
The Canadian Institutes for Health Research	4	1.3
Prostate Cancer Foundation	3	1.0
Wellcome Trust	3	1.0
National Institute of Allergy and Infectious Diseases	3	1.0
Amyotrophic Lateral Sclerosis Association	3	1.0

Notes: Of the most frequently encountered funding sources in dbGaP, the majority are from the National Human Genome Research Institute, the National Heart, Lung, and Blood Institute, and the National Cancer Institute.

Results from the n-gram (unigram and bigram) and metadata feature experiments were favorable. We constructed our training corpus from dbGaP study descriptions, titles and histories. With all corpus study descriptions combined, there were a total of 78,709 words with an average of 258.06 words ± 229.63 (range 0–1419) per study. The study histories contained a total of 24,706 words with a mean of 81 ± 186.97 (range 0–1822). We considered unigrams (one word units) and bigrams (two-word units) as potential features.

Inter-rater agreement

MKR and KWL manually classified dbGaP studies into the categories of heart, lung, and blood. Cohen’s Kappa score was 0.86, 0.72, and 0.77, respectively indicating acceptable agreement. The discrepancies were reviewed, discussed and a final category was determined.

Heart studies

There were 46 heart studies in the database. The best performing algorithm was the C4.5 approach with accuracy of 92.5% and F-measure of 76.3. The second- best performing algorithm was SVM with 90.2% accuracy and F-measure of 65.9. The unigram feature yielded the best result regardless of the metadata feature combination (Table 6). This is substantial improvement over the keyword search accuracy of 64% and F-measure of 43.

Table 6

Results from n-gram and metadata experiments: heart studies.

Feature combination	Accuracy			F-measure

	C4.5	SVM	NB	C4.5	SVM	NB
Unigrams	92.5	90.2	85.6	76.3	65.9	48.8
Bigrams	88.9	89.8	83.6	59.5	58.7	24.2
Funding Sources	83.3	83.2	79.7	0.0	10.5	3.1
Journals	83.6	82.0	82.6	16.7	20.3	34.6
MeSH	89.8	89.8	84.6	56.3	58.7	27.7
Principal Investigator	86.2	86.6	78.7	30.0	34.9	90.2
Unigrams_Funding Sources_Journals	92.5	89.8	85.6	76.3	47.6	65.2
Unigrams_Funding Sources	92.5	90.2	85.6	76.3	65.9	47.6
Unigrams_Journals	92.5	89.8	85.6	76.3	65.2	47.6
Unigrams_MeSH_Journals	92.5	89.8	85.6	76.3	65.2	47.6

Notes: Best-performing combinations of metadata (Journals, MeSH Terms, Principal and Co-Investigators, Funding Sources) and n-grams (Unigrams and Bigrams) featured in heart category. The underscore separates each feature. In addition to results highlighted in table, all other combinations of features involving unigrams with C4.5 algorithm provided good performance with accuracy of 92.5% and F-measure of 76.3. MeSH = Medical Subject Headings. The highest accuracies and F-measures are in bold typeset.

Lung studies

There were 20 studies classified as lung in the dbGaP database. In this case, the C4.5 algorithm performed adequately with regards to accuracy, but was not the highest overall performing learning algorithm in terms of F-measure. The SVM classifier achieved the highest overall score of 95.1% for accuracy and 44.4 for F-measure when the metadata features funds, MeSH terms, and journals were combined (Table 7). Similar to heart studies, this was a noticeable improvement over keyword search for lung studies, which provided an accuracy of 66% and F-measure of 23.

Table 7

Results from n-gram and metadata experiments: lung studies.

Features	Accuracy			F-measure

	C4.5	SVM	NB	C4.5	SVM	NB
Unigrams	91.8	94.1	90.8	24.2	25.0	6.7
Bigrams	91.5	94.1	91.2	13.3	25.0	0.0
Funding Sources	93.4	93.4	90.5	0.0	0.0	0.0
Journals	93.8	93.8	88.5	17.4	17.4	14.6
MeSH	94.1	94.8	91.5	35.7	38.5	7.1
Principal and Co-Investigators	94.4	94.8	90.2	26.1	33.3	0.0
Funding Sources_Principal and Co-Investigators	94.4	94.8	94.8	33.3	26.1	33.3
Funding Sources_MeSH_Journals	93.4	94.8	91.5	33.3	23.1	7.1
Funding Sources_MeSH	94.1	95.1	91.8	35.7	44.4	7.4
MeSH_Principal and Co-Investigators	94.8	94.8	91.5	38.5	38.5	7.1

Notes: Best-performing combinations of metadata (Journals, MeSH Terms, Principal and Co-Investigators, Funding Sources) and n-gram (Unigrams and Bigrams) features in lung category. The underscore separates each feature. MeSH = Medical Subject Headings. The highest accuracies and F-measures are in bold typeset.

Blood studies

There were 28 blood studies in the training set. The best performing classifier was SVM in conjunction with MeSH features, with 92.1% accuracy and 33.3 F-measure. The best performing combination of features was unigrams, funds, and journals with an accuracy of 92.1% and F-measure of 29.4 as shown in Table 8. The scores decreased as increasing features were added likely due to over-fitting of the model. While the results did not increase to the degree of the heart and lung cases, there was an improvement over keyword accuracy of 41% and F-measure of 13.

Table 8

Results from n-gram and metadata experiments: blood studies.

Features	Accuracy			F-measure

	C4.5	SVM	NB	C4.5	SVM	NB
Unigrams	89.5	91.8	87.21	33.3	24.2	4.9
Bigrams	89.2	91.2	88.2	26.7	12.9	0.0
Funding Sources	90.2	88.9	86.9	0.0	0.0	0.0
Journals	90.5	90.2	84.9	0.0	0.0	8.0
MeSH	91.2	92.1	89.8	12.9	33.3	16.2
Principal and Co-Investigators	90.8	91.5	85.3	0.0	23.5	0.0
Unigrams_Funding Sources_Journals	89.5	92.1	87.9	33.3	29.4	5.1
Unigrams_Funding Sources	89.5	91.8	87.5	33.3	24.3	5.0
Unigrams_Journals	89.5	91.8	87.9	33.3	24.2	5.1
Funding Sources_MeSH_Principal and Co-Investigators	91.2	92.1	87.9	33.3	9.8	12.9

Notes: Best-performing combinations of metadata (Journals, MeSH Terms, Principal and Co-Investigators, Funding Sources) and n-gram (Unigrams and Bigrams) features in blood category. The underscore separates each feature. MeSH = Medical Subject Headings. The highest accuracies and F-measures are in bold typeset.

Although the unigrams and metadata feature-based classifiers out-performed the keyword search available in dbGaP, we continued with the feature selection experiment to attempt further performance increase. The 20 most discriminating features in each category as determined by the χ2 feature selection algorithm are noted in Table 9. Performance of the SVM and NB learning algorithms was improved by determining the optimal number of features in cross-validation experiments (Table 10). For heart studies, the threshold was 500 features with best performance by SVM algorithm (F-measure, 83.1). In lung studies the threshold was 2400 features with best results by NB algorithm (F-measure, 78.8). The blood study threshold was 2200 features with best performance by NB algorithm (F-measure, 72.7). All groups achieved substantial improvement, particularly over keyword search (Fig. 2).

Table 9

Most frequently encountered discriminating features.

Heart	Lung	Blood
coronary	lung	LEUKEMIA
infarction	pulmonary	leukemia
myocardial infarction	the_lung	leukemia_all
coronary_artery	of_smoking	acute
artery	of_lung	tumor
myocardial	pulmonary_disease	lymphoblastic_leukemia
heart_disease	lung_cancer	lymphoblastic
nhlbi	controlled clinical trials	acute_lymphoblastic
heart	esp	transformation
MYOCARDIAL INFARCTION	was_evaluated	altered_in
hypertension	cessation_and	patient_samples
cardiovascular	expiratory	ACUTE LYMPHOBLASTIC LEUKEMIA
institute_nhlbi	seattlego_the	ACUTE MYELOID LEUKEMIA
atherosclerosis	trial_with	number_analysis
STROKE	lung_function	leukemia_cll
Stroke	SMOKING CESSATION	CHRONIC LYMPHOCYTIC LEUKEMIA
risk_factors	smoking_cessation	platelet
atherosclerosis_risk	into_two	cll
in_communities	obstructive_pulmonary	landscape_of
of_atherosclerosis	bronchodilator	we_performed

Notes: Notable discriminating features in heart, lung, and blood categories. Words in capital letters are MeSH terms. Underscore indicates bigram.

Abbreviations: esp, Exome Sequencing Project; Seattlego_the, participating group in ESP; cll, Chronic Lymphocytic Leukemia; all, Acute lymphoblastic Leukemia; nhlbi, National Heart, Lung, and Blood Institute.

Table 10

Optimal number of features.

	Optimal # of features	F-measure (feature selection)	F-measure (keyword search)	F-measure (n-grams and metadata)
Heart	500	83.1 (SVM)	43.0	65.9
Lung	2400	78.8 (NB)	23.0	44.4
Blood	2200	72.7 (NB)	13.0	33.3

Notes: Comparison of experiment results for heart, lung, and blood studies reported based on highest scoring learning algorithm and F-measure.

Abbreviations: SVM, Support Vector Machines; NB, Naïve Bayes.

Figure 2

F-Measure feature selection thresholds: heart, lung, blood.

Notes: Best performing χ2 feature selection algorithm results for heart, lung and blood studies. The arrow indicates the cutoff for best performance: Heart = 500 features, lung = 2400, and blood = 2400.

Abbreviations: NB, Naïve Bayes; SVM, Support Vector Machines.

We found PubMed studies effective metadata to represent topicality of dbGaP studies (Table 11). The SVM algorithm using unigrams achieved the highest F-measure for heart, lung, and blood studies (98.2, 97.1, and 95.7, respectively). Accuracy was also highest with the SVM algorithm and unigrams (97.0, 95.1, and 92.6, respectively).

Table 11

PubMed metadata.

	Heart		Lung		Blood

	Accuracy	F-measure	Accuracy	F-measure	Accuracy	F-measure
Unigrams NB	89.5	94.0	85.4	91.9	87.5	92.9
Bigrams NB	83.4	91.0	83.6	91.1	83.4	90.9
Unigrams SVM	97.0	98.2	95.1	97.1	92.6	95.7
Bigrams SVM	95.9	97.6	88.7	93.6	87.5	93.0

Notes: Support Vector Machine (SVM) algorithm performed best with unigram features to identify topicality of the database of Genotypes and Phenotypes (dbGaP) studies based on related PubMed studies. The highest accuracies and F-measures are in bold typeset.

Inter-rater agreement was calculated across three raters (MKR, KWL, and MC). Both MKR and KWL have a background in clinical medicine, while MC’s background is in informatics. The Fleiss’ Kappa for m raters was 0.97 with percent-agreement of 95.9.6 Cohen’s kappa statistic was applied for pairwise comparison. The Cohen’s Kappa and percent-agreement between MKR/KWL, MKR/MC, and KWL/MC were 98.7 and 99, 95.7 and 96.7, and 94.9 and 96.1, respectively. The discrepancies were examined. Most were about classification of diagnosis that fall into multiple categories. For example, stroke and hypertension were classified by some raters as other because they were considered neurologic disease rather than cardiovascular disease.

Discussion

For all categories—heart, lung, and blood—we found that text classification methods improved document identification compared to keyword based approach. By utilizing n-grams and metadata, the highest F-measures achieved for heart, lung, and blood studies were 65.9, 44.4, and 33.3, respectively. With feature selection threshold implemented, the best F-measures were 83.1, 78.8, and 72.7, respectively. Text classification of biomedical-related documents is an active area of research and our dbGaP database results are comparable to previous approaches in the literature. Donaldson et al14 identified PubMed literature on the topic of protein-protein interactions using an SVM algorithm, gaining a classification accuracy of 90%.14 Dobrokhotov et al15 used probabilistic classification to classify and rank PubMed literature according to human genes of interest, achieving 59% precision and 69% recall.15 Miotto et al16 applied classification and regression trees (CART) and artificial neural networks (ANN) machine learning algorithms to PubMed abstracts to identify allergen cross-reactivity papers, finding that a bag-of-words document representation performed best overall.16 In 2007, Wang et al9 used a NB text classifier to demonstrate improved automated document classification in an immune epitope database. Features used were authors, journal, and MeSH headings with a sensitivity (precision) of 95% and specificity of 51.1%.9 In 2008, Poulter et al17 created an information retrieval system for Medline with a NB classifier incorporating MeSH terms and journals titles to retrieve articles of interest in specific domains. The average precision varied from 0.69 to 0.92 depending on the topic.17 In 2009, Conway et al18 demonstrated that a combination of n-gram and semantic features in conjunction with a NB classifier yielded the best classification results for disease outbreak reports.18 Botsis et al19 identified cases of anaphylaxis to influenza vaccine from a database of adverse events using a rule-based classifier that incorporated keywords of anaphylaxis as part of the feature set with sensitivity of 79% and specificity of 94%. The standard machine learning classifiers resulted in F-measures ranging from 0.70 to 0.81.19 In 2009, Denecke and Baehr20 utilized additional document metadata such as keywords, titles, journal, and conference information to achieve favorable results for a document classifier.20 More recently Wei and Collier10 demonstrated that for the task of classifying full-text research papers according to model organism, a NB classifier using features derived from MeSH terms, journal and gene names, provided the best classification accuracy.10 In our experiments, the highest accuracy and F-measure overall for heart studies was based upon the unigram feature alone. This was attributed to the fact that the heart category contained the highest number of studies and the most homogenous diagnoses, such as coronary artery disease and myocardial infarction. For lung studies, the funding source was the most discriminating attribute. One possible explanation for this was the relatively small size of the lung document training set, and to explore this we reviewed the C4.5 classifier output. We discovered many terms used to determine decision nodes were not specific to lungs such as the name of the clinical trial, whereas the heart study output better captures cardiac medical terms (Fig. 3). In regard to the blood studies, we speculate that the lower performance was because their topicality is comprised of heterogeneous terms including clotting disorders, leukemia, and platelets.

Figure 3

C4.5 decision tree nodes to determine heart vs. lung label.

Notes: The C4.5 decision tree nodes, represented by boxes, chosen for classification of (A) lung studies and (B) heart studies. The decision algorithm chooses a term at each node that best divides the training set into the desired category. For lung studies, the terms chosen to best identify the class are not exclusive to lung, rather they are names of study trials; whereas, the heart category terms are more generalizable. Therefore the decision tree algorithm performs the best for heart studies.

Abbreviations: Eagle, Environment and Genetics in Lung cancer Etiology study sponsored by the National Cancer Institute (NCI); ESP, Exome Sequencing Project sponsored by the National Heart, Lung, and Blood Institute; CAD, Coronary artery disease.

In regard to the inter-rater reliability of classifying the PubMed experiments, MKR and KWL’s scores were more aligned because of their clinical backgrounds and experience with manual classification. The categorization of studies is not straightforward. There are diagnoses that may belong to multiple categories and require clinical expertise to determine appropriate classification. For example, although stroke involves the neurologic system, it is generally thought of as a cardiovascular disorder. Other indistinct diagnoses are pulmonary embolism and pulmonary hypertension. Although the root cause of embolism is typically a blood coagulation abnormality, and pulmonary hypertension is vascular in origin, these can be considered lung diseases as recognized by the American Lung Association. This concept of classifying studies into only one category when in reality diagnoses may fall under multiple categories, is a shortcoming of our methods. This is something to be addressed in the future as we optimize the algorithms and is important to consider when deciding which researchers will perform manual categorization of training corpus documents as domain knowledge has a decisive role.

Conclusion

Although relatively small, the number of studies in dbGaP is rapidly increasing. We demonstrated that using a document classifier based on n-grams and structured metadata yields better document retrieval results than the keyword-based search currently available in dbGaP. Without feature selection for heart studies, which had the largest amount of data in the training set, the C4.5 algorithm with unigrams had the best performance. For lung studies, the SVM classifier with funding sources and MeSH terms contributed the most to successful classification. The SVM algorithm was most effective when MeSH terms were utilized to identify blood studies, closely followed by the combination of unigrams, funding sources, and journals. In future work, we plan to employ features derived from MetaMap21 and other natural language processing tools in order to further improve classification accuracy and integrate this into the PhenDisco system. We plan to expand these experiments in topicality and build classifiers for other conditions of interest to dbGaP users (eg, asthma, chronic obstructive pulmonary disease, myocardial infarction, diabetes, etc.). The methods presented in this paper are not only suitable for dbGaP, but can also add structure to new databases or retrofit existing databases.

12 in total

1. Combining NLP and probabilistic categorisation for document and term selection for Swiss-Prot medical annotation.

Authors: Pavel B Dobrokhotov; Cyril Goutte; Anne-Lise Veuthey; Eric Gaussier
Journal: Bioinformatics Date: 2003 Impact factor: 6.937

2. Supporting the curation of biological databases with reusable text mining.

Authors: Olivo Miotto; Tin Wee Tan; Vladimir Brusic
Journal: Genome Inform Date: 2005

3. The NCBI dbGaP database of genotypes and phenotypes.

Authors: Matthew D Mailman; Michael Feolo; Yumi Jin; Masato Kimura; Kimberly Tryka; Rinat Bagoutdinov; Luning Hao; Anne Kiang; Justin Paschall; Lon Phan; Natalia Popova; Stephanie Pretel; Lora Ziyabari; Moira Lee; Yu Shao; Zhen Y Wang; Karl Sirotkin; Minghong Ward; Michael Kholodov; Kerry Zbicz; Jeffrey Beck; Michael Kimelman; Sergey Shevelev; Don Preuss; Eugene Yaschenko; Alan Graeff; James Ostell; Stephen T Sherry
Journal: Nat Genet Date: 2007-10 Impact factor: 38.330

4. Classifying disease outbreak reports using n-grams and semantic features.

Authors: Mike Conway; Son Doan; Ai Kawazoe; Nigel Collier
Journal: Int J Med Inform Date: 2009-05-15 Impact factor: 4.046

5. Text mining for the Vaccine Adverse Event Reporting System: medical text classification using informative feature selection.

Authors: Taxiarchis Botsis; Michael D Nguyen; Emily Jane Woo; Marianthi Markatou; Robert Ball
Journal: J Am Med Inform Assoc Date: 2011-06-27 Impact factor: 4.497

6. Towards classifying species in systems biology papers using text mining.

Authors: Qi Wei; Nigel Collier
Journal: BMC Res Notes Date: 2011-02-04

7. PreBIND and Textomy--mining the biomedical literature for protein-protein interactions using a support vector machine.

Authors: Ian Donaldson; Joel Martin; Berry de Bruijn; Cheryl Wolting; Vicki Lay; Brigitte Tuekam; Shudong Zhang; Berivan Baskin; Gary D Bader; Katerina Michalickova; Tony Pawson; Christopher W V Hogue
Journal: BMC Bioinformatics Date: 2003-03-27 Impact factor: 3.169

8. Literature retrieval and mining in bioinformatics: state of the art and challenges.

Authors: Andrea Manconi; Eloisa Vargiu; Giuliano Armano; Luciano Milanesi
Journal: Adv Bioinformatics Date: 2012-06-21

9. MScanner: a classifier for retrieving Medline citations.

Authors: Graham L Poulter; Daniel L Rubin; Russ B Altman; Cathal Seoighe
Journal: BMC Bioinformatics Date: 2008-02-19 Impact factor: 3.169

10. Automating document classification for the Immune Epitope Database.

Authors: Peng Wang; Alexander A Morgan; Qing Zhang; Alessandro Sette; Bjoern Peters
Journal: BMC Bioinformatics Date: 2007-07-26 Impact factor: 3.169

3 in total

1. Learning regular expressions for clinical text classification.

Authors: Duy Duc An Bui; Qing Zeng-Treitler
Journal: J Am Med Inform Assoc Date: 2014-02-27 Impact factor: 4.497

2. PhenDisco: phenotype discovery system for the database of genotypes and phenotypes.

Authors: Son Doan; Ko-Wei Lin; Mike Conway; Lucila Ohno-Machado; Alex Hsieh; Stephanie Feudjio Feupe; Asher Garland; Mindy K Ross; Xiaoqian Jiang; Seena Farzaneh; Rebecca Walker; Neda Alipanah; Jing Zhang; Hua Xu; Hyeon-Eui Kim
Journal: J Am Med Inform Assoc Date: 2013-08-29 Impact factor: 4.497

3. Predicting the occurrence of surgical site infections using text mining and machine learning.

Authors: Daniel A da Silva; Carla S Ten Caten; Rodrigo P Dos Santos; Flavio S Fogliatto; Juliana Hsuan
Journal: PLoS One Date: 2019-12-13 Impact factor: 3.240

3 in total