Michael A Gillette (1,2,3), D R Mani (1), Christopher Uschnig (1,4), Karell G Pellé (4), Lola Madrid (5,6), Sozinho Acácio (6), Miguel Lanaspa (5,6), Pedro Alonso (5,6), Clarissa Valim (4,7), Steven A Carr (1), Stephen F Schaffner (1,4), Bronwyn MacInnis (1,4), Danny A Milner (1,3,4,8), Quique Bassat (5,6,9,10,11), Dyann F Wirth (1,4)

1. Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA.
2. Division of Pulmonary and Critical Care Medicine, Massachusetts General Hospital, Boston, Massachusetts, USA.
3. Harvard Medical School, Boston, Massachusetts, USA.
4. Harvard T. H. Chan School of Public Health, Department of Immunology and Infectious Diseases, Boston, Massachusetts, USA.
5. ISGlobal, Hospital Clínic-Universitat de Barcelona, Barcelona, Spain.
6. Centro de Investigação em Saúde de Manhiça (CISM), Maputo, Mozambique.
7. Department of Global Health, Boston University School of Public Health, Boston, Massachusetts, USA.
8. American Society for Clinical Pathology, Chicago, Illinois, USA.
9. Catalan Institution for Research and Advanced Studies (ICREA), Barcelona, Spain.
10. Pediatric Infectious Diseases Unit, Pediatrics Department, Hospital Sant Joan de Déu (University of Barcelona), Barcelona, Spain.
11. Consorcio de Investigación Biomédica en Red de Epidemiología y Salud Pública (CIBERESP), Madrid, Spain.
Abstract
BACKGROUND: Differential etiologies of pediatric acute febrile respiratory illness pose challenges for all populations globally, but especially in malaria-endemic settings because the pathogens responsible overlap in clinical presentation and frequently occur together. Rapid identification of bacterial pneumonia with high-quality diagnostic tools would enable appropriate, point-of-care antibiotic treatment. Current diagnostics are insufficient, and the discovery and development of new tools are needed. We report a unique biomarker signature identified in blood samples to accomplish this.

METHODS: Blood samples from 195 pediatric Mozambican patients with clinical pneumonia were analyzed with an aptamer-based, high-dynamic-range, quantitative assay (~1200 proteins). We identified new biomarkers using a training set of samples from patients with established bacterial, viral, or malarial pneumonia. Proteins with significantly variable abundance across etiologies (false discovery rate <0.01) formed the basis for predictive diagnostic models derived from machine learning techniques (Random Forest, Elastic Net). Validation on a dedicated test set of samples was performed.

RESULTS: Significantly different abundances between bacterial and viral infections (219 proteins) and between bacterial infections and mixed (viral and malaria) infections (151 proteins) were found. Predictive models achieved >90% sensitivity and >80% specificity, regardless of number of pathogen classes. Bacterial pneumonia was strongly associated with neutrophil markers, in particular markers of degranulation, including HP, LCN2, LTF, MPO, MMP8, PGLYRP1, RETN, SERPINA1, S100A9, and SLPI.

CONCLUSIONS: Blood protein signatures highly associated with neutrophil biology reliably differentiated bacterial pneumonia from other causes. With appropriate technology, these markers could provide the basis for a rapid diagnostic for field-based triage for antibiotic treatment of pediatric pneumonia.
Pediatric febrile respiratory illness is a leading cause of mortality and morbidity globally. Identifying the etiology (bacterial [1], viral, or, less commonly, malaria [2, 3]) is crucially important but difficult due to similar clinical presentations. The critical need globally is to identify bacterial infections [3] so they can be treated appropriately and reduce mortality [4, 5]. Rapid bacterial diagnosis is challenged by current diagnostic tests: laborious microbiological culture or molecular testing methods, if available, often lack sensitivity to detect bacterial pathogens [6], as do radiological evaluations (through chest X-ray or ultrasound), with equal limitation in availability. Malaria or viral infections and bacterial secondary coinfections commonly occur together, increasing the challenge of a specific, treatable diagnosis [7, 8].

Host cellular responses to bacterial, viral, and malaria infections are distinct, being chiefly neutrophilic, lymphocytic, or monocytic, respectively, and represent prime targets as diagnostic indicators. To date, these approaches are not sufficiently reliable [9-13], based on a recommended benchmark [14] of thresholds for sensitivity (desirable, ≥95%; acceptable, ≥90%) and specificity (≥90% and ≥80%). We hypothesized that the distinctive cellular host responses could be detected at the protein level. We test this hypothesis based on the differential expression of proteins in pediatric febrile respiratory illness blood specimens from southern Mozambique, where malaria is endemic. Febrile respiratory illness cases were classified by available gold standards, using highly specific case definitions, to 1 of 3 underlying causes (bacteria, viruses, or malaria) or to a combination (“mixed infections”).

Proteins were assayed with SOMAScan technology (Somalogic, Boulder, CO), an array-based modified aptamer platform covering a range of biological pathways including inflammation, signal transduction, and immune processes. This quantitative assay of approximately 1200 proteins simultaneously offers a high dynamic range and has modest sample requirements (150 μL plasma) [15].

The resulting protein expression data were used to create machine learning–based models for distinguishing bacterial from viral or malaria infections. The same data, along with data from our prior RNA- and protein-based studies [9, 13], provided the basis for pathway analyses, to help confirm the underlying biology of the host response.
METHODS
Study Design
The study recruited 2 groups of children (<10 years of age) at the Manhiça District Hospital in Mozambique as follows: (1) children with febrile respiratory illness admitted to the hospital fulfilling the “clinical pneumonia” criteria (as defined by the World Health Organization [WHO]) and (2) afebrile and asymptomatic healthy community controls used to establish a baseline. Febrile respiratory illness cases were assigned by all available gold-standard tests to 1 of 3 underlying causes—bacteria, viruses, or malaria—or to a combination (“mixed infections”).
Study Population and Sample Classification Procedure
Children with fever at admission (>37.5°C axillary temperature) or prior 24-hour history of fever meeting the WHO case definition for clinical pneumonia (increased respiratory rate and cough or difficulty breathing) [16] were selected for the study. Informed consent was obtained from parents/guardians. All children underwent anteroposterior chest radiography; images were independently interpreted following the WHO-recommended guidelines for pneumonia diagnosis by 2 experienced clinicians [17].

Patients were classified as having clinical pneumonia associated with bacterial, malaria, or viral infection using the criteria described in Valim et al [13], with minor modifications. In brief, patients were classified as bacterial pneumonia when pathogenic bacteria were isolated (or detected through reverse transcription–polymerase chain reaction [RT-PCR]) from blood or pleural exudate, and after confirming the absence of malarial infection. Viral pneumonia required the detection in the nasopharyngeal aspirate (NPA) of a viral respiratory pathogen, no isolated bacteria in the blood culture or RT-PCR, no “endpoint pneumonia” in the chest X-ray, and negative malaria microscopy. Finally, a malaria case required a positive malaria smear microscopy (according to predetermined parasitemia thresholds in relation to age [18]), normal chest X-ray, and no detectable bacterial infection. We analyzed our case definitions against ALMANACH criteria (Supplementary Material, Supplementary Methods and Supplementary Table 7).

To address the known insensitivity of blood culture for bacterial pneumonia, cases were also assigned a bacterial etiology if the NPA was negative for virus but the patient had leukocytosis and a dense radiographic consolidation (endpoint pneumonia) based on consensus of 2 independent experts. Since NPAs are often positive on RT-PCR for potential viral respiratory pathogens even in clinically well children, the detection of a virus in the NPA did not alter the class assignments for confirmed bacterial or malarial cases. See Supplementary Figure 1 for a comprehensive flowchart for patient classification.

In addition, patient samples with mixed infections were also included in the study (for details see Supplementary Table 3). “Virus and probable bacterial secondary coinfection” samples were virus positive, culture- and PCR-negative for bacteria but with leukocytosis and radiographic endpoint pneumonia, suggestive of a secondary bacterial infection.
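The case definitions above can be read as a small decision procedure. The following Python sketch is only an illustration of that reading: the field names and the `classify` function are hypothetical, the full flowchart is in Supplementary Figure 1 (not reproduced here), and mixed categories other than the virus and probable bacterial coinfection are left as "unclassified".

```python
from dataclasses import dataclass


@dataclass
class Findings:
    """Hypothetical summary of one patient's test results."""
    bacteria_detected: bool   # pathogenic bacteria by culture or PCR (blood or pleural exudate)
    malaria_positive: bool    # smear above the age-specific parasitemia threshold
    virus_in_npa: bool        # viral respiratory pathogen detected in the NPA
    leukocytosis: bool
    endpoint_pneumonia: bool  # dense radiographic consolidation
    normal_chest_xray: bool


def classify(f: Findings) -> str:
    """Simplified reading of the case definitions in the text (an assumption)."""
    # Confirmed bacterial case; a virus in the NPA does not change the class.
    if f.bacteria_detected and not f.malaria_positive:
        return "bacterial"
    if not f.bacteria_detected and not f.malaria_positive:
        # Leukocytosis plus endpoint pneumonia suggests a bacterial cause
        # even when culture and PCR are negative.
        if f.leukocytosis and f.endpoint_pneumonia:
            if f.virus_in_npa:
                return "virus and probable bacterial secondary coinfection"
            return "bacterial"
        if f.virus_in_npa and not f.endpoint_pneumonia:
            return "viral"
    # Malaria case; again, a virus in the NPA does not change the class.
    if f.malaria_positive and f.normal_chest_xray and not f.bacteria_detected:
        return "malaria"
    return "unclassified"


culture_positive = classify(Findings(True, False, True, False, False, False))
probable_bacterial = classify(Findings(False, False, False, True, True, False))
viral_case = classify(Findings(False, False, True, False, False, True))
malaria_case = classify(Findings(False, True, True, False, False, True))
coinfection = classify(Findings(False, False, True, True, True, False))
```

The order of the checks matters: confirmed bacterial and malaria cases are decided without looking at the NPA result, which mirrors the statement that viral detection did not alter those assignments.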
SOMAScan Protein Assay
The SOMAScan assay uses SOMAmers (Slow Off-rate Modified Aptamers) to capture proteins and translates binding events into signals measured in relative fluorescence units, which are directly proportional to target protein abundance in the sample, calculated by a standard curve generated for each protein–SOMAmer pair. The dynamic range is enhanced by 3 serial dilutions, with the least concentrated dilution used to quantify the most abundant proteins (approximate micromolar concentration in the original sample), and the most concentrated used for the least abundant proteins (femtomolar to picomolar concentration) [15]. Samples were assayed in 2 batches (15 samples replicated to verify consistency); the SOMAScan assay used in the first set of 167 samples quantified 1129 proteins, and the SOMAScan assay used in the second set of 49 samples quantified 1279 proteins. In the 2 batches, 96.4% (161/167) and 100% (49/49) of samples passed Somalogic normalization acceptance criteria.

We use Somalogic protein marker labels throughout this article (Supplementary Data File 1 provides full protein names).
Protein Marker Selection and Predictive Model Building
Selection was based on statistical significance of differences in marker abundance between the bacteria versus virus (BvV) and bacteria versus malaria or virus (BvVM) comparisons. Classifiers discriminated (1) BvV and (2) BvVM using the 219 and 151 statistically significant (false discovery rate [FDR] <0.01) markers, respectively, and their corresponding surrogates. Using optimal subsets of N-protein markers (n = 5, 10, 15, 25, 50, 100) identified using genetic algorithms, 2-class Random Forest (RF) and Elastic Net (EN) models were constructed, achieving predictive results with high sensitivity and specificity with a small subset of markers (see Figure 1) (details in the “Data Analysis Pipeline” in the Supplementary Appendix and Supplementary Figure 3).
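To make the FDR <0.01 cutoff concrete, the sketch below applies a Benjamini-Hochberg adjustment to a list of per-marker P values. This is an assumption for illustration: the text does not state which FDR procedure or which underlying test was used (details are in the Supplementary Appendix), and the P values shown are invented.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-marker P values from a two-group comparison.
raw_p = [0.0001, 0.0004, 0.02, 0.03, 0.5]
adj_p = benjamini_hochberg(raw_p)

# Markers passing the FDR < 0.01 threshold used in the text.
selected = [i for i, q in enumerate(adj_p) if q < 0.01]
```

In this toy input only the first two markers pass the threshold; the third has a raw P value of 0.02 but an adjusted value above 0.03.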
Figure 1.
Data analysis workflow. A total of 210 samples passed QC on the SOMAScan assay to quantify 1107 proteins. The 171 single etiology samples were classified as malaria, virus, or bacteria and included 12 repeats that were randomly split between the training and validation datasets; the 4 repeats that ended up in the validation dataset were excluded from downstream analysis. The remaining 39 samples consisted of 16 healthy community controls and 23 samples with mixed etiology. Single etiology samples were divided into a training set of 120 and a validation set of 47 samples. The training data were used for identifying differentially expressed markers between bacteria and virus, or bacteria and malaria or virus samples. Genetic algorithms were used to select the best 5, 10, 15, 25, 50, and 100 markers. Classifiers for BvV and BvVM were trained using RF and EN algorithms. Models were tuned using cross-validation, and final model performance was assessed using the validation data. In order to contend with the situation where a marker is unavailable (eg, due to difficulty in measuring the marker in a clinical setting), we determined a set of surrogate markers for each differential marker using information correlation, a criterion based on mutual information. We then assessed model performance when 10% or 20% of differential markers were substituted with their corresponding surrogates. Abbreviations: BvV, bacteria vs virus; BvVM, bacterial vs malaria or virus; EN, Elastic Net; QC, quality control; RF, Random Forest.
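The surrogate-marker step in the workflow can be sketched with a plain binned mutual information estimate. This is an assumption: the "information correlation" criterion itself is defined in the Supplementary Appendix and is not reproduced here, and the marker names and values below are invented. The sketch ranks candidate markers by how much information they share with a target marker and picks the best one as its surrogate.

```python
import math
from collections import Counter


def rank_bins(values, n_bins=3):
    """Assign each value to an equal-frequency bin based on its rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


def mutual_information(x, y, n_bins=3):
    """Mutual information (in nats) between two binned abundance vectors."""
    bx, by = rank_bins(x, n_bins), rank_bins(y, n_bins)
    n = len(x)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    return sum(
        (c / n) * math.log((c * n) / (px[a] * py[b]))
        for (a, b), c in joint.items()
    )


def best_surrogate(target, candidates):
    """Name of the candidate marker sharing the most information with target."""
    return max(candidates, key=lambda name: mutual_information(target, candidates[name]))


# Hypothetical abundances for one differential marker and two candidates.
target = [1, 2, 3, 4, 5, 6]
candidates = {"M1": [2, 4, 6, 8, 10, 12], "M2": [3, 1, 2, 6, 4, 5]}
surrogate = best_surrogate(target, candidates)
```

Because the estimate works on rank-based bins, any monotone transformation of a marker gives the same score, which is convenient when a surrogate is measured on a different scale.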
Biological Processes and Pathways
To better understand the biological significance of the differentially expressed proteins, differential markers were used as input to the Metascape Gene Annotation and Analysis Resource (http://metascape.org) to query multiple ontology resources including KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway, Gene Ontology (GO) Biological Processes, Reactome Gene Sets, Canonical Pathways, and CORUM. Both 3-way (bacteria vs malaria vs virus) and binary (BvV) comparisons were explored (see Supplementary Appendix for details).
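Enrichment tools of this kind commonly score a term with a hypergeometric tail probability: the chance of seeing at least the observed overlap between the marker list and the term's gene set if markers were drawn at random from a background. The sketch below shows that calculation in general form; whether it matches Metascape's scoring in every detail, and which background set was used in this study, are assumptions not confirmed by the text. The numbers are a toy example.

```python
from math import comb


def hypergeom_enrichment_p(universe, term_size, list_size, overlap):
    """P(X >= overlap) when list_size items are drawn without replacement
    from a universe that contains term_size annotated items."""
    upper = min(term_size, list_size)
    total = comb(universe, list_size)
    tail = sum(
        comb(term_size, k) * comb(universe - term_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 10 background proteins, 4 annotated to the term,
# 3 differential markers, 2 of which carry the annotation.
p_toy = hypergeom_enrichment_p(universe=10, term_size=4, list_size=3, overlap=2)
```

A smaller tail probability means the overlap is harder to explain by chance; in practice such values are then corrected for the number of terms tested.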
Comparative Marker Analysis Between Technologies
To assess whether markers identified as indicating bacterial infection were consistent across technology platforms, extensive comparisons were made between this and 2 previous marker studies of the same patients (RNA-sequencing and multiplex bead-based protein immunoassays); both studied different but overlapping samples within the same study population (see details in the Supplementary Appendix).
RESULTS
Patient Characteristics
Between July 2010 and November 2014, 576 patients were recruited as inpatients, along with 117 community controls. A total of 195 patients under 10 years of age with acute febrile respiratory illness met the stringent inclusion criteria and were included in this analysis. To identify differentially expressed proteins between underlying etiologies, patients were characterized as having bacterial (69 patients), malaria (42 patients), viral (48 patients), or mixed (23 patients) infections (for details see Supplementary Table 3). Thirteen healthy subjects were included as controls. The classification scheme was similar to that previously described (see Supplementary Figure 1 for a patient classification flowchart) [9, 13]. No significant differences in age, sex, weight, height, nutritional status, or duration of hospital admission were observed between bacterial, viral, and malaria sample sets (see Table 1 and Supplementary Table 1 for patient demographic and disease characteristics). Case-fatality rates were high (6%) for the bacterial group, whereas none of the malaria or viral cases died. Malnutrition was highly prevalent among the 3 groups, and human immunodeficiency virus (HIV) prevalence was also high, although significantly higher among the bacterial group. Bacterial cases had the highest leukocyte count and respiratory rates. Malaria cases were the most anemic, had the highest mean axillary temperature, and had the lowest respiratory rates. Viral cases had the lowest leukocyte count, had lower mean axillary temperature, and were less anemic. Neutrophil levels were statistically higher for the bacterial etiology, but the overlap between etiologies was too great for this to serve as a classifier.
Table 1.
Patient Demographic and Disease Characteristics at Admission
Characteristic | Bacteria and PCR Bacteria: Values | n | Malaria: Values | n | Virus: Values | n | P^a
Features on admission (signs, symptoms, and laboratory results) | | | | | | |
Age, mean (SD), m | 29.7 (29.4) | 69 | 26.3 (23.3) | 42 | 19.4 (19.9) | 48 | .12
Female sex, n (%) | 24 (45) | 69 | 28 (67) | 42 | 23 (48) | 48 | .07
Clinical examination results on arrival, mean (SD) | | | | | | |
Weight, kg | 10.2 (4.8) | 69 | 10.3 (4.6) | 42 | 9 (3.4) | 48 | .44
Height, cm | 80.3 (18.7) | 69 | 79.7 (18) | 42 | 74.5 (14.9) | 48 | .35
MUAC, cm | 13.5 (2) | 69 | 14.3 (2) | 40 | 14.0 (1.5) | 48 | .75
Temperature, °C | 38.2 (1.2) | 69 | 38.4 (1.4) | 42 | 37.6 (1.1) | 48 | .041
Respiratory rate, cycles per minute | 60.3 (14.7) | 69 | 53.1 (8.7) | 40 | 56.7 (9.4) | 48 | .02
Nutritional status | | | | | | |
WAZ > −1 SD, n (%) | 16 (24.6) | 69 | 17 (40.5) | 42 | 24 (50.0) | 48 |
WAZ −1 SD to −3 SDs (low to severe underweight), n (%) | 30 (52.2) | 69 | 18 (42.9) | 42 | 17 (35.4) | 48 |
WAZ < −3 SDs (severe underweight), n (%) | 12 (23.2) | 69 | 7 (16.7) | 42 | 7 (14.6) | 48 |
WAZ, mean (SD) | −2 (1.8) | 69 | −1.5 (1.7) | 42 | −1.5 (1.7) | 48 | .09
Anemia status on admission | | | | | | |
Hemoglobin, mean (SD), g/dL | 8.6 (2.2) | 69 | 7.4 (2.3) | 40 | 10.0 (2.1) | 48 | <.0001
HCT (%), mean (SD) | 26.1 (6.2) | 69 | 22.1 (7) | 40 | 29.7 (5.9) | 48 | <.0001
No anemia (HCT >33%), n (%) | 4 (6) | 69 | 2 (5.0) | 40 | 14 (29.2) | 48 |
Mild anemia (HCT 25–33%), n (%) | 31 (46.3) | 69 | 11 (27.5) | 40 | 25 (52.1) | 48 |
Moderate anemia (HCT 15–25%), n (%) | 32 (47.8) | 69 | 22 (55.0) | 40 | 8 (16.7) | 48 |
Severe anemia (HCT ≤15%), n (%) | 0 (0) | 69 | 5 (12.5) | 40 | 1 (2.1) | 48 |
Microbiology and other laboratory results on admission | | | | | | |
HIV status positive, n (%) | 25 (36.2) | 69 | 3 (7.1) | 42 | 5 (10.4) | 48 | <.0001
Viral coinfections, n (%) | 32 (46) | 69 | 22 (52) | 42 | - | - | .56^b
Positive blood culture, n (%) | 29 (42.0) | 69 | 0 (0) | 42 | 0 (0) | 48 |
WBC count, mean (SD), 10^3/μL | 21.3 (13.1) | 69 | 14.2 (8.4) | 41 | 10.8 (2.7) | 48 | <.0001
Neutrophil granulocytes, mean (SD), 10^3/μL | 13.7 (9.8) | 57 | 5.2 (3) | 32 | 4.8 (2.4) | 45 | <.0001
Plasmodium density, geometric mean (SD), parasites/μL | 0 (0) | 69 | 5.9 (5.8) | 42 | 0 (0) | 48 |
Malaria positive (microscopy positive) | 0 (0) | 69 | 42 (100) | 42 | 0 (0) | 48 | <.0001
Chest X-ray results, n (%) | | | | | | |
Normal | 9 (15.0) | 69 | 42 (100) | 42 | 27 (56.3) | 48 | <.0001
Other infiltrate/abnormality | 7 (11.7) | 69 | 0 (0) | 42 | 21 (43.8) | 48 |
Primary endpoint pneumonia | 44 (73.3) | 69 | 0 (0) | 42 | 0 (0) | 48 |
Evolution during admission | | | | | | |
Length of admission, median (IQR), d | 4.1 (2–6.1) | 6 | 3.4 (2–5) | 4 | 3.8 (2.2–4.8) | 4 | .29
Case-fatality rate (in-hospital death), n (%) | 3 (4.4) | 6 | 0 (0) | 4 | 0 (0) | 4 | .15
The “bacteria” group includes blood or pleural fluid culture-positive samples, samples PCR positive for respiratory pathogens, and samples with positive leukocytosis and a dense radiographic consolidation (endpoint pneumonia) as independently assessed by 2 experts. Samples that were culture or PCR positive for contaminant bacteria were excluded. Abbreviations: HCT, hematocrit; HIV, human immunodeficiency virus; IQR, interquartile range; MUAC, middle upper arm circumference; PCR, polymerase chain reaction; SD, standard deviation; WAZ, weight-for-age z score, z-score cutoff point of < −2 SDs and < −3 SDs is classified as low weight for age and severe undernutrition, respectively; WBC, white blood cell.
a P values for continuous variables were estimated through analysis of variance (Kruskal-Wallis test). P values for categorical variables used chi-square test.
b P value of the categorical variable was estimated through Fisher’s exact test.
From the 195 patients, 210 peripheral blood samples (including 15 replicates, 4 of which were excluded from downstream analysis) were assayed for protein composition using the SOMAScan platform (see Figure 1). Sample characteristics and designations of single (167 samples) and mixed infections with controls (39 samples) can be found in Supplementary Tables 2 and 3, respectively.
Differential Markers
Using the SOMAScan data, 219 and 151 differentially expressed protein markers (FDR <0.01) were identified in the BvV comparison (Supplementary Table 4, heatmap in Figure 2) and the BvVM comparison (Supplementary Table 4, heatmap in Supplementary Figure 2), respectively. The differential protein expression signatures determined by SOMAscan are shown in the heatmap in Figure 2A. This signal is manifest only after marker selection; unsupervised clustering in the space of the entire 1107 protein panel does not reveal a clear dominant structure related to infectious etiology (Supplementary Figure 2). Box-and-whisker plots of the 100 top-ranked markers are depicted in Supplementary Figure 4.
Figure 2.
A, Hierarchically clustered heatmap of normalized SOMAscan expression values for 219 significant markers (FDR <0.01) from the SOMAScan BvV comparison in the space of all single etiology bacterial and viral samples in this study (see Supplementary Figure 10 for full resolution with details). Top track: viral (yellow) and bacterial (blue) etiology. B, Top 10 rank-ordered protein markers (highest to lowest, left to right) in our BvV and BvVM marker sets. Abbreviations: BvV, bacteria vs virus; BvVM, bacterial vs malaria or virus; CCL23, C-C motif chemokine 23; CSF3, granulocyte colony-stimulating factor; CX3CL1, fractalkine; ESD, S-formylglutathione hydrolase; FDR, false discovery rate; HP, haptoglobin; IL1RL1, interleukin-1 receptor-like 1; IL6, interleukin-6; ITIH4, inter-ɑ-trypsin inhibitor heavy chain H4; KYNU, kynureninase; LCN2, neutrophil gelatinase-associated lipocalin; max, maximum; min, minimum; NTN4, netrin-4; PLA2G2A, phospholipase A2; RETN, resistin; S100A9, protein S100-A9; SERPINA1, ɑ1-antitrypsin; SLPI, anti-leukoproteinase.
Performance of Predictive Diagnostic Models
Our chief aim was to develop a protein-based biomarker panel to distinguish bacterial from other etiologies of clinical pneumonia with accuracy that would support clinical decision making. The RF and EN models had generally similar performance, with RF models performing slightly better overall (see Supplementary Table 5) and declining in performance more smoothly with fewer input markers. We therefore focused subsequent analyses on RF results.

In single etiology samples, the performance of the BvV model (evaluated on the held-aside validation samples) was excellent. Sensitivity and specificity for bacterial cases using all 219 markers were 90% and 100%, respectively, meeting the Foundation for Innovative New Diagnostics (FIND) proposed criteria for a diagnostic test of these characteristics [14]. Furthermore, sensitivity and specificity remained at 90% and 85% with only 5 markers, potentially simplifying the translation to a field-deployable diagnostic. Accuracy was 94% (95% confidence interval [CI]: .79–.99) and 88% (95% CI: .71–.96), with 219 and 5 markers, respectively (Table 2A and Supplementary Table 5).
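As a worked check of how these figures follow from a confusion matrix, the sketch below recomputes sensitivity, specificity, and accuracy for the 219-marker BvV model from the validation counts in Table 2A (17 bacterial samples predicted as bacterial, 2 predicted as viral, and no viral samples predicted as bacterial among the 13 viral samples). The function name is hypothetical and bacterial infection is treated as the positive class.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, and accuracy from confusion matrix counts."""
    sensitivity = tp / (tp + fn)   # bacterial cases correctly called bacterial
    specificity = tn / (tn + fp)   # nonbacterial cases correctly called nonbacterial
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return sensitivity, specificity, accuracy


# BvV validation set, all 219 markers: 19 bacterial and 13 viral samples.
sens, spec, acc = diagnostic_metrics(tp=17, fp=0, fn=2, tn=13)
```

With these counts the accuracy is 30/32 (about 0.94) and the specificity is 13/13 (1.00), as reported; the sensitivity is 17/19 (about 0.895), which the text reports as 90%.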
Table 2.
Single Etiology and Mixed-Infection Validation Set Predictive Diagnostic Results
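(A) Single Etiology Samples

Confusion matrices (rows, predicted class; columns, actual class)
Comparison | No. of markers | Predicted | Actual: Bacteria | Actual: Virus
Bacteria vs Virus (n = 32) | All significant markers (219) | Bacteria | 17 | 0
Bacteria vs Virus (n = 32) | All significant markers (219) | Virus | 2 | 13
Bacteria vs Virus (n = 32) | 5 markers | Bacteria | 17 | 2
Bacteria vs Virus (n = 32) | 5 markers | Virus | 2 | 11
Bacteria vs Virus or Malaria (n = 47) | All significant markers (151) | Bacteria | 13 | 0
Bacteria vs Virus or Malaria (n = 47) | All significant markers (151) | Virus or malaria | 6 | 28
Bacteria vs Virus or Malaria (n = 47) | 5 markers | Bacteria | 10 | 10
Bacteria vs Virus or Malaria (n = 47) | 5 markers | Virus or malaria | 9 | 18

Performance
Comparison | No. of markers | Accuracy | 95% CI | Sensitivity | Specificity
Bacteria vs Virus (n = 32) | All significant markers (219) | 0.94 | (.79, .99) | 0.90 | 1.00
Bacteria vs Virus (n = 32) | 5 markers | 0.88 | (.71, .96) | 0.90 | 0.85
Bacteria vs Virus or Malaria (n = 47) | All significant markers (151) | 0.87 | (.74, .95) | 0.68 | 1.00
Bacteria vs Virus or Malaria (n = 47) | 5 markers | 0.60 | (.44, .74) | 0.53 | 0.64

Accuracy by number of markers, Bacteria vs Virus
No. of markers | 219 | 100 | 50 | 25 | 15 | 10 | 5
Accuracy | 0.94 | 0.94 | 0.97 | 0.91 | 0.84 | 0.84 | 0.88
10% Surrogates | 0.94 | 0.91 | 0.94 | 0.88 | 0.88 | 0.88 | 0.91
20% Surrogates | 0.97 | 0.91 | 0.97 | 0.91 | 0.81 | 0.78 | 0.88

Accuracy by number of markers, Bacteria vs Virus or Malaria
No. of markers | 151 | 100 | 50 | 25 | 15 | 10 | 5
Accuracy | 0.87 | 0.83 | 0.81 | 0.85 | 0.77 | 0.77 | 0.60
10% Surrogates | 0.83 | 0.83 | 0.81 | 0.83 | 0.77 | 0.74 | 0.62
20% Surrogates | 0.85 | 0.83 | 0.81 | 0.83 | 0.77 | 0.77 | 0.57

(B) Mixed-infection samples

Confusion matrices (rows, predicted class; columns, actual class)
Comparison | No. of markers | Predicted | Actual: Bacteria | Actual: Virus
Bacteria vs Virus (n = 39) | All significant markers (219) | Bacteria | 19 | 3
Bacteria vs Virus (n = 39) | All significant markers (219) | Virus | 1 | 16
Bacteria vs Virus (n = 39) | 5 markers | Bacteria | 16 | 4
Bacteria vs Virus (n = 39) | 5 markers | Virus | 4 | 15
Bacteria vs Virus or Malaria (n = 39) | All significant markers (151) | Bacteria | 13 | 4
Bacteria vs Virus or Malaria (n = 39) | All significant markers (151) | Virus or malaria | 7 | 15
Bacteria vs Virus or Malaria (n = 39) | 5 markers | Bacteria | 17 | 6
Bacteria vs Virus or Malaria (n = 39) | 5 markers | Virus or malaria | 3 | 13

Performance
Comparison | No. of markers | Accuracy | 95% CI | Sensitivity | Specificity
Bacteria vs Virus (n = 39) | All significant markers (219) | 0.90 | (.76, .97) | 0.95 | 0.84
Bacteria vs Virus (n = 39) | 5 markers | 0.79 | (.64, .91) | 0.80 | 0.79
Bacteria vs Virus or Malaria (n = 39) | All significant markers (151) | 0.72 | (.55, .85) | 0.65 | 0.79
Bacteria vs Virus or Malaria (n = 39) | 5 markers | 0.77 | (.61, .89) | 0.85 | 0.68

Performance by number of markers, Bacteria vs Virus
No. of markers | 219 | 100 | 50 | 25 | 15 | 10 | 5
Accuracy | 0.90 | 0.90 | 0.90 | 0.90 | 0.74 | 0.85 | 0.79
Sensitivity | 0.95 | 0.90 | 0.95 | 0.90 | 0.85 | 0.85 | 0.80
Specificity | 0.84 | 0.89 | 0.84 | 0.90 | 0.63 | 0.84 | 0.79

Performance by number of markers, Bacteria vs Virus or Malaria
No. of markers | 151 | 100 | 50 | 25 | 15 | 10 | 5
Accuracy | 0.72 | 0.77 | 0.67 | 0.90 | 0.85 | 0.82 | 0.77
Sensitivity | 0.65 | 0.80 | 0.80 | 0.90 | 0.95 | 0.95 | 0.85
Specificity | 0.79 | 0.74 | 0.53 | 0.89 | 0.74 | 0.68 | 0.68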
(A) Confusion matrices and performance specifications for models using all (219 and 151, respectively) markers and 5 markers, as well as accuracy for models using 5, 10, 15, 25, 50, 100, and 219 markers with 0%, 10%, or 20% surrogates. The BvV validation set contains 19 bacteria and 13 virus samples, and the BvVM set contains 19 bacteria, 15 malaria, and 13 virus samples. (B) Confusion matrices and performance statistics with all (151) or only 5 markers depicted. The mixed-infection test set contains 39 samples. All samples that contain the term “bacteria” are considered positive bacterial pneumonia cases. Abbreviations: BvV, bacteria vs virus; BvVM, bacterial vs malaria or virus; CI, confidence interval.
The BvVM RF model had an accuracy of 87% (95% CI: .74–.95), a specificity of 100%, and a sensitivity of 68%. When decreasing the panel size to only 5 markers, accuracy decreased to 60%, specificity to 64%, and sensitivity to 53% (see Table 2 and Supplementary Table 5).

On healthy controls and mixed-infection samples, the BvV RF model performed well, with 95% sensitivity, 84% specificity, and 90% accuracy (95% CI: .76–.97) (Table 2B). The model correctly predicted the majority of bacterial infections and bacterial coinfections, successfully distinguishing these from nonbacterial infections (malaria and/or virus). Supplementary Table 6 depicts BvV and BvVM RF model statistics on mixed-infection samples without controls. To compare, we found that the clinical ALMANACH models were uniformly inferior to our molecular predictors, with a particularly dramatic loss of specificity (see Supplementary Table 7 for details).
Genetic Algorithm-Derived and Surrogate Markers
Marker subsets (with “n” ranging from 5 to 100 markers) were selected using genetic algorithms. Since the results can be nondeterministic, the method was re-run multiple times. Across all runs of the genetic algorithm, interleukin 1 receptor type 1, high mobility group box 1, programmed cell death 1 ligand 2, roundabout guidance receptor 2, and pregnancy-associated plasma protein A were the 5 protein markers most often selected. For the BvVM models, the most-selected markers were lymphotoxin ɑ2/B1 protein (LTA.LTB1), TPI1, ɑ1-antitrypsin (SERPINA1), IGFBP2, and ROR1 (see Supplementary Data File 2 for complete marker lists).

We next assessed whether models were robust to replacement of individual markers by corresponding surrogates. This provides an index of model stability and has practical relevance when converting predictive models into diagnostics, which may require marker substitution for technical reasons. The RF (and EN) classifiers for both BvV and BvVM proved to be robust to the choice of specific markers: classifier accuracy did not significantly decline even when 20% of the markers were replaced with surrogates (Supplementary Figure 5).
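The idea of re-running a stochastic subset search and tallying which markers are chosen most often can be illustrated with a toy genetic algorithm. Everything specific here is an assumption: the encoding, the crossover and mutation operators, the population settings, and above all the fitness function, which in a real pipeline would be a cross-validated classifier score rather than the invented per-marker weights used below. The authors' actual procedure is described in the Supplementary Appendix.

```python
import random
from collections import Counter


def genetic_select(n_markers, k, fitness, generations=40, pop_size=30, seed=0):
    """Toy genetic algorithm returning a k-marker subset with high fitness."""
    rng = random.Random(seed)
    population = [frozenset(rng.sample(range(n_markers), k)) for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]        # keep the fitter half (elitism)
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            pool = sorted(a | b)                 # crossover: draw k markers from the union
            child = set(rng.sample(pool, k))
            if rng.random() < 0.3:               # mutation: swap one marker for another
                child.remove(rng.choice(sorted(child)))
                outside = [m for m in range(n_markers) if m not in child]
                child.add(rng.choice(outside))
            children.append(frozenset(child))
        population = parents + children
    return max(population, key=fitness)


# Hypothetical per-marker usefulness scores standing in for model performance.
weights = [20 - i for i in range(20)]


def toy_fitness(subset):
    return sum(weights[m] for m in subset)


best = genetic_select(n_markers=20, k=5, fitness=toy_fitness, seed=0)

# Re-run with different seeds and count how often each marker is selected.
selection_counts = Counter(
    m for seed in range(10) for m in genetic_select(20, 5, toy_fitness, seed=seed)
)
most_selected = [m for m, _ in selection_counts.most_common(5)]
```

Fixing the seed makes a single run repeatable, while varying it across runs exposes how stable the selected subset is, which is the point of counting selections across runs.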
Biological Processes and Pathway Analysis
To gain insight into the biology underlying the markers of bacterial and viral infection, multiple databases were queried for functional and pathway annotations. Terms significantly enriched in the bacterial or viral pneumonia marker sets were automatically clustered into nonredundant groups (details in Methods). Marker support for terms is shown in Figure 3A and the top 20 clusters in Figure 3B. Individual terms (and therefore clusters) could be supported by both bacterial and viral markers. Most clusters had support from both etiologies, but a subset (blue or red circles in Figure 3) was strongly associated with a single etiology. Two GO clusters, “chemotaxis” and “regulation of neurogenesis,” were driven almost exclusively by viral markers, while “response to bacterium” (our top ranked GO term with 39 gene hits), “regulated exocytosis,” “antimicrobial humoral response,” “positive regulation of response to external stimulus,” and “signaling by interleukins” were driven almost exclusively by bacterial markers.
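One common way to score term enrichment of this kind is a hypergeometric (one-sided Fisher) test. The sketch below assumes that test; the Methods describe the procedure actually used. The background of 1107 proteins, the 219-marker list, and the 39 hits come from the text, whereas the term size of 80 is a hypothetical value chosen only for illustration.

```python
from math import comb

def hypergeom_enrichment_p(background, term_size, list_size, hits):
    """P(X >= hits) when list_size proteins are drawn without replacement
    from a background in which term_size proteins carry the annotation."""
    total = comb(background, list_size)
    upper = min(term_size, list_size)
    tail = sum(
        comb(term_size, x) * comb(background - term_size, list_size - x)
        for x in range(hits, upper + 1)
    )
    return tail / total

# Illustrative inputs: term_size=80 is an assumption, not a value from the study
p_value = hypergeom_enrichment_p(background=1107, term_size=80,
                                 list_size=219, hits=39)
print(f"enrichment P value: {p_value:.3g}")
```

Restricting the background to the proteins measured on the assay, rather than the whole genome, is itself a design choice that changes the P value, so the numbers here should not be compared with those in Figure 3.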
Figure 3.
Pathways and gene enrichment analysis with differential markers shared between this study, RNA-sequencing, and RBM multiplex assay studies with the same study population. A–C, Clustered terms enriched in our bacteria vs virus 2-class comparison. Each node represents 1 term describing a biological process or pathway. Edges connect similar terms (similarity score [κ] >0.3); the thickness of the edge represents the similarity score. Each term is represented by a circle node, where the size is proportional to the number of input markers. The underlying file can be found as an additional supplementary file (“Cytoscape BvV network”). A, Distribution of support for each node from bacterial (red) and viral (blue) markers (ie, each pie sector is proportional to the number of hits that originated from a particular marker list). B, Nodes colored by their membership in 1 of the top 20 clusters. Each cluster is named for the term (node) with the best P value. Inset table: neutrophil degranulation, considered as a sub-pathway of regulated exocytosis, was detected as the major biological GO pathway shared between the BvVM marker set of this study and RNA-sequencing data. Of the 18 bacterial markers overlapping between the studies, 10 markers are directly involved in neutrophil degranulation (see panel E for all 18 markers). C, Bacteria vs virus marker set with nodes colored by P value. The darker the color, the more statistically significant the node (see legend for P value ranges). D and E, RBM and SOMAScan protein aliases were converted into their gene names to compare markers between studies. D, Overlap of selected marker sets: SOMAScan (BvVM, n = 156), RNA-sequencing (BvVM, n = 431), and RBM immunoassay (BvV and BvM, n = 21). E, Two direct comparisons of marker sets derived through the same approach (BvV and BvVM); filled circles indicate a marker identified in the specified analysis. 
Filled circles depict markers that overlapped within each of the 2 direct comparisons, not overlap among the 4 individual marker sets. The color indicates the direction of expression change. Red: upregulation in bacterial samples; dark blue: downregulation. Light gray: the marker was not detected or not included in at least 1 of the 2 marker sets. Abbreviations: ALPL, alkaline phosphatase; BvV, bacteria vs virus; BvVM, bacterial vs malaria or virus; CHIT1, chitinase 1; CKM, creatine kinase, M-type; CST7, cystatin F; GO, Gene Ontology; HGF, hepatocyte growth factor; HP, haptoglobin; IL18R1, interleukin 18 receptor 1; IL6, interleukin-6; IL18RAP, interleukin 18 receptor accessory protein; LCN2, neutrophil gelatinase-associated lipocalin; LTF, lactotransferrin; MMP8, matrix metalloproteinase-8; MPO, myeloperoxidase; OSM, oncostatin M; PGLYRP1, peptidoglycan recognition protein 1; PLXNC1, plexin C1; RETN, resistin; S100A9, protein S100-A9; SERPINA1, α1-antitrypsin; SLPI, anti-leukoproteinase.
Neutrophil-related biological processes emerged as a key biological theme associated with bacterial infection. In particular, the “regulated exocytosis” GO cluster (34 gene hits) represents mostly neutrophil- or leukocyte-related terms. Within the top 36 GO clusters (out of 1388 total clusters, ranked by P value), 6 highly significant clusters consisting of 14 to 26 gene hits each were identified as neutrophil processes (“migration,” “mediated-immunity,” “activation,” “degranulation,” “activation involved in immune response,” and “chemotaxis”). Notably, no other cell type or subpopulation besides neutrophils appeared within the first 243 rank-ordered GO clusters. The “neutrophil degranulation” cluster was particularly prominent in markers that were identified by both SOMAScan and RNA-sequencing; it contained 10 of the 24 markers that emerged from that cross-platform comparison (Figure 3B, 3D, and 3E). 
This was further demonstrated by 3-way enrichment heatmap comparisons (Supplementary Figure 9).
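The Figure 3 networks link terms whose similarity score κ exceeds 0.3. Assuming κ is Cohen's kappa computed on each pair of terms' gene membership (the Methods give the definition actually used), the edge rule can be sketched as follows. The gene names `g1` to `g12` and the term memberships are hypothetical placeholders, not data from the study.

```python
def term_kappa(genes_a, genes_b, universe):
    """Cohen's kappa between two terms' binary membership over a gene universe."""
    universe = set(universe)
    n = len(universe)
    a = set(genes_a) & universe
    b = set(genes_b) & universe
    both = len(a & b)
    neither = n - len(a | b)
    observed = (both + neither) / n
    expected = (len(a) * len(b) + (n - len(a)) * (n - len(b))) / n**2
    if expected == 1.0:
        return 1.0  # both terms contain every gene, or both contain none
    return (observed - expected) / (1 - expected)

# Hypothetical input marker list and term memberships
universe = [f"g{i}" for i in range(1, 13)]
term_a = {"g1", "g2", "g3", "g4", "g5", "g6"}
term_b = {"g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"}
term_c = {"g1", "g2", "g9", "g10"}

kappa_ab = term_kappa(term_a, term_b, universe)
kappa_ac = term_kappa(term_a, term_c, universe)
edges = [pair for pair, k in (("a-b", kappa_ab), ("a-c", kappa_ac)) if k > 0.3]
print(edges)
```

Unlike a raw overlap count, kappa corrects for the agreement expected by chance, so two large terms that share genes only incidentally do not get connected.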
Comparisons Between Datasets and Technologies
To assess the consistency of the results, we compared gene marker sets from similar marker-focused studies of the same population using different technologies. First, the current BvVM marker set was compared with markers found in our previously published RNA-sequencing approach [9]. Of the 1107 proteins included in our SOMAScan assay, 78 were represented by genes from the set of 600 significant differentially expressed markers in the RNA-sequencing analysis (of ~12 000 expressed genes) (Supplementary Data File 1). Twenty-five of these 78 genes (corresponding to 24 proteins) proved to be statistically significant markers in our comparison (Supplementary Figure 6).
In the RNA data, 18 of those 24 proteins were markers for bacterial infection and 6 were markers for malaria infection. A heatmap of these markers highlights the strong class distinctions (Supplementary Figure 7). Haptoglobin (HP) is markedly down and hemoglobin up in malaria samples, but the majority of markers are elevated in bacterial samples (Figure 3, and see Supplementary Figures 6 and 8 for details on the malaria markers). When we used the SOMAScan data for these 24 markers to build RF and EN models, they performed similarly to 25 protein marker models optimized by the genetic algorithm (Supplementary Table 5), suggesting that those 24 markers would also be good candidate markers for a diagnostic assay.
We also compared the SOMAScan marker sets with findings from a previous protein-based immunoassay (the Rules-Based Medicine [RBM] multiplex immunoassay) [13]. Five markers were identified as differential markers for bacterial pneumonia in both datasets: CKM, HP, IL6, myeloperoxidase (MPO), and SERPINA1 (Figure 3, Supplementary Figure 6). Three markers, HP, MPO, and SERPINA1, were identified as significant markers in all 3 studies (SOMAScan, multiplex immunoassay, and RNA-sequencing) despite the very different methodologies employed (Venn diagram in Figure 3D). 
Two markers appear in both the SOMAScan and multiplex immunoassay data as likely markers for malaria infection, VCAM1 and APCS [13].
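The cross-platform comparison above amounts to mapping protein aliases to gene names and intersecting the resulting sets. The sketch below uses only bacterial markers named in this article, so the sets are partial; the complete lists are in the supplementary files. The alias spellings and the helper `to_gene_names` are hypothetical.

```python
# Alias-to-gene map built from the abbreviations listed for Figure 3 (partial)
ALIAS_TO_GENE = {
    "haptoglobin": "HP",
    "myeloperoxidase": "MPO",
    "alpha-1-antitrypsin": "SERPINA1",
    "interleukin-6": "IL6",
    "creatine kinase, m-type": "CKM",
}

def to_gene_names(aliases, alias_map=ALIAS_TO_GENE):
    """Map protein aliases to gene symbols; unknown aliases are returned, not guessed."""
    genes, unmapped = set(), set()
    for alias in aliases:
        key = alias.strip().lower()
        if key in alias_map:
            genes.add(alias_map[key])
        else:
            unmapped.add(alias)
    return genes, unmapped

# Partial bacterial marker sets, restricted to markers named in the text
neutrophil_degranulation = {"HP", "LCN2", "LTF", "MPO", "MMP8",
                            "PGLYRP1", "RETN", "SERPINA1", "S100A9", "SLPI"}
somascan = neutrophil_degranulation | {"CKM", "IL6"}
rnaseq = set(neutrophil_degranulation)
rbm, unmapped = to_gene_names(["Haptoglobin", "Myeloperoxidase",
                               "Alpha-1-antitrypsin", "Interleukin-6",
                               "Creatine kinase, M-type"])

shared_protein_assays = somascan & rbm
shared_all_three = somascan & rnaseq & rbm
print(sorted(shared_protein_assays))
print(sorted(shared_all_three))
```

Reporting unmapped aliases separately, rather than dropping them silently, keeps a failed name conversion from being mistaken for a true absence of overlap.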
DISCUSSION
We present diagnostic models based on aptamer-derived blood protein signatures that accurately discriminate bacterial from viral infections of pediatric febrile respiratory illness with as few as 5 protein markers (94% accuracy, 90% sensitivity, 85% specificity), meeting or exceeding the FIND-sponsored expert consensus guidelines on diagnostics for bacterial pneumonia [14]. Accurate discrimination of bacterial infection from both viral and malaria etiologies was achieved with 25 markers.
Because the BvV model was highly predictive, we investigated the proteins to understand the processes that typify bacterial and viral infections. Gene enrichment and pathway analyses showed neutrophil-dominated processes in bacterial infections. The consistency of a neutrophilic host-response signature is highlighted by common signals (18 bacterial markers) across prior studies at both the RNA and protein level, despite model and platform differences (Figure 3E). Reinforcing this observation, a cross-platform 24-marker set, highly enriched for neutrophil-associated proteins (neutrophil degranulation being prominent), proved to be equally effective in differentiating bacterial from other etiologies (Supplementary Table 5). Ten of the 18 bacterial markers were associated with bacterial airway inflammation, modifying, mitigating, or augmenting neutrophil immunological responses. For example, SERPINA1 and SLPI are both protease inhibitors regulating neutrophil elastase activity [19-21].
Diagnosing bacterial pneumonia in pediatric populations is a challenge in all countries, and better diagnostics are needed to improve antibiotic stewardship and mortality outcomes. We addressed the limitations of the current WHO clinical pneumonia definition with strict additional criteria, laboratory testing, and consensus review to produce the best possible set of pneumonia cases. 
Our objective was to develop protein-based predictors that could eventually be ported to a field-deployable device for discriminating bacterial from nonbacterial pneumonia. While larger validation studies are needed, this study provides strong evidence that a blood-based protein panel of limited size can achieve the sensitivity and specificity required to guide clinical decisions regarding antibiotic therapy. By identifying biologically plausible sets of markers, we have established the groundwork for the development of a point-of-care test, particularly considering that some of these markers (HP, SERPINA1, MPO, etc) are relatively simple to measure. We identified surrogate proteins that can be exchanged for markers in our models without loss of accuracy, allowing flexibility in developing a diagnostic test. Although optimized for single etiology samples, our models performed well in mixed infections representing the natural complexity of febrile respiratory illness. Importantly, these markers seem to discriminate appropriately, even in the context of a high underlying prevalence of malnutrition or HIV, such as that in Manhiça, southern Mozambique [22, 23]. This is a significant benchmark, as a predictor must be effective across the spectrum of real-life clinical scenarios. Finally, our study provided insights into the host-response biology reflected in our discriminant marker proteins. These observations may inform marker selection in future prospective studies and, together with our specific models and markers, may facilitate the development of optimized markers for pneumonia diagnosis, with the eventual transition to point-of-care tests that are needed to change future clinical practice, particularly in settings where case-fatality rates for common infections remain high and diagnostic tools scarce.
Supplementary Data
Supplementary materials are available at Clinical Infectious Diseases online. Consisting of data provided by the authors to benefit the reader, the posted materials are not copyedited and are the sole responsibility of the authors, so questions or comments should be addressed to the corresponding author.
Authors: Kerry-Ann F O'Grady; Paul J Torzillo; Alan R Ruben; Debbie Taylor-Thomson; Patricia C Valery; Anne B Chang Journal: Pediatr Pulmonol Date: 2011-09-13
Authors: Jacob Silterra; Michael A Gillette; Miguel Lanaspa; Karell G Pellé; Clarissa Valim; Rushdy Ahmad; Sozinho Acácio; Katherine D Almendinger; Yan Tan; Lola Madrid; Pedro L Alonso; Steven A Carr; Roger C Wiegand; Quique Bassat; Jill P Mesirov; Danny A Milner; Dyann F Wirth Journal: J Infect Dis Date: 2017-01-15 Impact factor: 5.226
Authors: Melinda M Pettigrew; Janneane F Gent; Yong Kong; Martina Wade; Shane Gansebom; Anna M Bramley; Seema Jain; Sandra L R Arnold; Jonathan A McCullers Journal: BMC Infect Dis Date: 2016-07-08 Impact factor: 3.090
Authors: Soumyaroop Bhattacharya; Alex F Rosenberg; Derick R Peterson; Katherine Grzesik; Andrea M Baran; John M Ashton; Steven R Gill; Anthony M Corbett; Jeanne Holden-Wiltse; David J Topham; Edward E Walsh; Thomas J Mariani; Ann R Falsey Journal: Sci Rep Date: 2017-07-26 Impact factor: 4.379
Authors: Sabine Dittrich; Birkneh Tilahun Tadesse; Francis Moussy; Arlene Chua; Anna Zorzet; Thomas Tängdén; David L Dolinger; Anne-Laure Page; John A Crump; Valerie D'Acremont; Quique Bassat; Yoel Lubell; Paul N Newton; Norbert Heinrich; Timothy J Rodwell; Iveth J González Journal: PLoS One Date: 2016-08-25 Impact factor: 3.240