Optimization and clinical validation of a pathogen detection microarray.

Christopher W Wong, Charlie Lee Wah Heng, Leong Wan Yee, Shirlena W L Soh, Cissy B Kartasasmita, Eric A F Simoes, Martin L Hibberd, Wing-Kin Sung, Lance D Miller.

Abstract

DNA microarrays used as 'genomic sensors' have great potential in clinical diagnostics. Biases inherent in random PCR-amplification, cross-hybridization effects, and inadequate microarray analysis, however, limit detection sensitivity and specificity. Here, we have studied the relationships between viral amplification efficiency, hybridization signal, and target-probe annealing specificity using a customized microarray platform. Novel features of this platform include the development of a robust algorithm that accurately predicts PCR bias during DNA amplification and can be used to improve PCR primer design, as well as a powerful statistical concept for inferring pathogen identity from probe recognition signatures. Compared to real-time PCR, the microarray platform identified pathogens with 94% accuracy (76% sensitivity and 100% specificity) in a panel of 36 patient specimens. Our findings show that microarrays can be used for the robust and accurate diagnosis of pathogens, and further substantiate the use of microarray technology in clinical diagnostics.

Year:  2007        PMID: 17531104      PMCID: PMC1929155          DOI: 10.1186/gb-2007-8-5-r93

Source DB:  PubMed          Journal:  Genome Biol        ISSN: 1474-7596            Impact factor:   13.583


Background

Timely, accurate and sensitive detection of infectious disease agents is still difficult today, despite a long history of progress in this area. Traditional methods of culture and antibody-based detection still play a central role in microbiological laboratories despite the problems of the delay between disease presentation and diagnosis, the limited number of organisms that can be detected by these approaches, and the 'hit-or-miss' nature of the diagnostic process, which depends on a clinical prediction of the infectious source [1]. Faster diagnosis of infections would reduce morbidity and mortality, for example, through the earlier implementation of appropriate antimicrobial treatment. During the past few decades, various methods have been proposed to achieve this, with those based on nucleic acid detection, including PCR and microarray-based techniques, seeming the most promising. These approaches are beginning to rapidly decrease laboratory turnaround times so that results can be available within 2-6 hours compared to perhaps 24 hours. Future developments may see this reduced even further; and through the development of point-of-care devices, perhaps enable the clinician to make the diagnosis directly at the bed-side [2,3]. While pathogen microarrays and their utility in discovering emerging infectious diseases such as SARS have been described, technical problems related to accuracy and sensitivity of the assay prevent their routine use in patient care [4-9]. For microarrays to become a standard diagnostic tool, the following questions must be addressed: what are the factors that influence probe design and performance? How is a pathogen 'signature' measured and detected? What is the specificity and sensitivity of an optimized detection platform? Can detection algorithms distinguish co-infecting pathogens and closely related viral strains? [10-12]. 
Noisy signals caused by cross-hybridization artifacts present a major obstacle to the interpretation of microarray data, particularly for the identification of rare pathogen sequences present in a complex mixture of nucleic acids. For example, in clinical specimens, contaminating nucleic acid sequences, such as those derived from the host tissue, will cross-hybridize with pathogen-specific microarray probes above some threshold of sequence complementarity. This can result in false-positive signals that lead to erroneous conclusions. Similarly, the pathogen sequence, in addition to binding its specific probes, may cross-hybridize with other non-target probes (that is, probes designed to detect other pathogens). This latter phenomenon, though seemingly problematic, could provide useful information for pathogen identification to the extent that such cross-hybridization can be accurately predicted. With various metrics to assess annealing potential and sequence specificity, microarray probes have traditionally been designed to ensure maximal specific hybridization (to a known target) with minimal cross-hybridization (to non-specific sequences). However, in practice we have found that many probes, though designed using optimal in silico parameters, do not perform according to expectations for reasons that are unclear (CW Wong et al., unpublished data). Here, we report the results of a systematic investigation of the complex relationships between viral amplification efficiency, hybridization signal output, target-probe annealing specificity, and reproducibility of pathogen detection using a custom designed microarray platform. Our findings form the basis of a novel methodology for the in silico prediction of pathogen 'signatures', shed light on the factors governing viral amplification efficiency and demonstrate the important connection between a viral amplification efficiency score (AES) and optimal probe selection. 
Finally, we describe a new statistics-based pathogen detection algorithm (PDA) to link this all together, permitting confident identification of organisms entirely by prediction, and evaluate the entire platform in relation to conventional PCR techniques in a cohort of patients with lower respiratory illness.

Results and discussion

Empirical determination of cross-hybridization thresholds on a pathogen detection microarray

To systematically investigate the dynamics of array-based pathogen detection, we created an oligonucleotide array using Nimblegen array synthesis technology [13]. The array was designed to detect up to 35 RNA viruses using 40-mer probes tiled at an average 8-base resolution across the full length of each genome (53,555 probes; Figure S1 and Table S1 in Additional data file 1). Together with 7 replicates for each viral probe, and control sequences for array synthesis and hybridization (see Materials and methods), the array contained a total of 390,482 probes. Initially, we studied virus samples purified from cell lines, reverse-transcribed and PCR-amplified with virus-specific primers (instead of random primers). This allowed us to study array hybridization dynamics in a controlled fashion, without the complexity of cross-hybridization from human RNA and random annealing dynamics, which occur with random primers. We then applied our findings to clinical samples amplified using random primers. SARS coronavirus and Dengue serotype 1 genomic cDNA were amplified in entirety (as confirmed by sequencing), labeled with Cy3 and hybridized separately on microarrays. The SARS sample hybridized well to the SARS tiling probes, with all 3,805 SARS-specific probes displaying fluorescent (Cy3) signal well above the detection threshold (determined by probe signal intensities >2 standard deviations (SD) above the mean array signal intensity; Figure 1a). Cross-hybridization with other pathogen probe sets was minimal, observed only for other members of Coronaviridae and a few species of Picornaviridae and Paramyxoviridae, consistent with the observation that SARS shares little sequence homology with other known viruses [14]. The hybridization pattern of Dengue 1, on the other hand, was more complex (Figure 1b). First, we observed that hybridization to the Dengue 1 probe set was partially incomplete (that is, there were regions absent of signal) due to sequence polymorphisms. 
The Dengue 1 sample hybridized on the array was cultured from a 1944 Hawaiian isolate, whereas the array probe set was based on the sequence of a Singaporean strain S275/90, isolated in 1990 [15]. Sequencing the entire genomes of these 2 isolates revealed that the array probes that failed to hybridize each contained at least 3 mismatches (within a 15-base stretch) to the sample sequence. Second, we observed that cross-hybridization occurred to some degree with almost all viral probe sets present on the array, particularly with probes of other Flaviviridae members, consistent with the fact that the 4 Dengue serotypes share 60-70% homology. To understand the relationship between hybridization signal output and annealing specificity, we first compared all probe sequences to each viral genome using two measures of similarity: probe hamming distance (HD) and maximum contiguous match (MCM). HD measures the overall similarity distance of two sequences, with low scores for similar sequences [16,17]. MCM measures the number of consecutive bases that are exact matches, with high scores for similar sequences [17,18].
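The two similarity measures can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and comparing the probe only against full-length, ungapped genome windows is a simplifying assumption.

```python
def hamming_distance(probe: str, target: str) -> int:
    """Count mismatched positions between two equal-length sequences (HD)."""
    if len(probe) != len(target):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(probe, target))


def max_contiguous_match(probe: str, target: str) -> int:
    """Length of the longest run of consecutive identical positions (MCM)."""
    longest = run = 0
    for a, b in zip(probe, target):
        run = run + 1 if a == b else 0
        longest = max(longest, run)
    return longest


def best_scores(probe: str, genome: str) -> tuple[int, int]:
    """Slide the probe along the genome; return (lowest HD, highest MCM).

    Assumption: only full-length, ungapped windows are considered.
    """
    n = len(probe)
    windows = [genome[i:i + n] for i in range(len(genome) - n + 1)]
    return (min(hamming_distance(probe, w) for w in windows),
            max(max_contiguous_match(probe, w) for w in windows))
```

A probe identical to some stretch of the genome scores HD = 0 and MCM equal to the probe length, which corresponds to the tiling probes of that genome.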
Figure 1

Heatmap of microarray probe signal intensities. Cells corresponding to probes are aligned in genomic order and colored according to the signal intensity-color scales shown. Hybridization signatures corresponding to (a) SARS Sin850 or (b) Dengue 1 Hawaiian isolate are shown.

We calculated the HD and MCM scores for every probe relative to the Hawaiian Dengue 1 isolate and observed that these scores correlated negatively (HD) and positively (MCM) with probe signal intensity (Figure 2). All probes on the array with high similarity to the Hawaiian Dengue 1 genome, that is, HD ≤ 2 (n = 942) or MCM ≥ 27 (n = 627), hybridized with median signal intensity 3 SD above detection threshold. Although 98% of probes were detectable at the low HD range from 0-4, or high MCM range from 18-40, median probe signal intensity decreased at every increment of sequence distance (Figure 2). Median signal intensity dropped off sharply to background levels at HD = 7 and MCM = 15, with 43% and 46% detectable probes, respectively. The majority of probes (>96%, n > 51,000) had HD scores between 8 and 21 and/or MCM scores between 0 and 15, of which only 1.23% and 1.57%, respectively, were detectable.
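The per-HD detection rates quoted above can be tabulated with a short sketch like the following. It assumes the detection threshold is the array-wide mean plus 2 SD, as stated in the text; the use of the sample (rather than population) standard deviation and the function names are assumptions.

```python
from collections import defaultdict
from statistics import mean, stdev


def detection_threshold(signals, n_sd=2.0):
    """Array-wide threshold: mean signal plus n_sd standard deviations.

    Assumption: sample standard deviation; the source does not specify.
    """
    return mean(signals) + n_sd * stdev(signals)


def detectable_fraction_by_hd(probes, threshold):
    """probes: iterable of (hd, signal) pairs.

    Returns {hd: fraction of probes with signal above threshold}.
    """
    counts = defaultdict(lambda: [0, 0])  # hd -> [detected, total]
    for hd, signal in probes:
        counts[hd][1] += 1
        if signal > threshold:
            counts[hd][0] += 1
    return {hd: detected / total for hd, (detected, total) in sorted(counts.items())}
```

The same grouping can be applied to MCM scores by passing (mcm, signal) pairs instead.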
Figure 2

Relationship between probe HD, probe MCM and probe signal intensity. Average probe signal intensity and percentage of detectable probes (signal intensity > mean + 2 SD) decrease as HD increases and MCM decreases. The optimal cross-hybridization thresholds (HD ≤ 4 or MCM ≥ 18), where >98% of probes can be detected, are shaded in blue.

At the optimal similarity thresholds HD ≤ 4 and MCM ≥ 18, >98% of probes could be detected with median signal intensity 2 SD above detection threshold, whereas adjusting the similarity threshold down 1 step to HD ≤ 5 and MCM ≥ 17 would result in only approximately 85% probe detection and median signal intensity approximately 1.2 SD above detection threshold (Figure 2). Using these optimal HD and MCM thresholds to guard against cross-hybridization, we binned all probes into specific 'recognition signature probe sets' (that is, r-signatures) most likely to specifically detect a given pathogen, and we defined r-signatures for each of the 35 pathogen genomes represented on the array (Table 1). Each pathogen's r-signature comprised tiling probes derived from its genome sequence (HD = 0, MCM = 40), as well as cross-hybridizing probes derived from other pathogens (HD ≤ 4, MCM ≥ 18). According to these criteria, a given probe could belong to multiple different r-signatures, thereby maximizing probe-level evidence for pathogen detection.
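The binning step can be illustrated as follows. This is a sketch under stated assumptions: the input layout (a dictionary keyed by probe and pathogen) and the function name are hypothetical, and the thresholds are combined with "and", as in the Table 1 column heading.

```python
def build_r_signatures(scores, max_hd=4, min_mcm=18):
    """Bin probes into per-pathogen recognition signatures.

    scores: dict mapping (probe_id, pathogen) -> (hd, mcm), where hd and mcm
    are the probe's best scores against that pathogen's genome.
    A probe may end up in several signatures.
    """
    signatures = {}
    for (probe_id, pathogen), (hd, mcm) in scores.items():
        if hd <= max_hd and mcm >= min_mcm:
            signatures.setdefault(pathogen, set()).add(probe_id)
    return signatures
```

Tiling probes (HD = 0, MCM = 40) pass the filter trivially for their own genome, so the same rule covers both specific and cross-hybridizing probes.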
Table 1

Binning of probes into specific pathogen signature probe sets

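No. | Pathogen | Family | Genome size (nt) | Total tiling probes | Top 20% AES* (a) | GC content filter (b) | Human genome filter (c) | No. of filtered probes left (d = a - (b + c)) | No. of predicted cross-hybridizing probes (HD ≤ 4 and MCM ≥ 18) (e) | No. of probes in pathogen r-signature (d + e)
1 | LCMV | Arenaviridae | 10,056 | 1,283 | 348 | 2 | 8 | 338 | 0 | 338
2 | Hantaan | Bunyaviridae | 6,533 | 834 | 156 | 5 | 5 | 146 | 6 | 152
3 | Sin Nombre | Bunyaviridae | 6,562 | 837 | 182 | 1 | 2 | 179 | 6 | 185
4 | 229E | Coronaviridae | 27,317 | 3,495 | 494 | 11 | 11 | 472 | 0 | 472
5 | OC43 | Coronaviridae | 30,738 | 3,937 | 634 | 15 | 22 | 597 | 3 | 600
6 | SARS | Coronaviridae | 29,711 | 3,805 | 575 | 8 | 2 | 565 | 1 | 566
7 | Dengue serotype 1 | Flaviviridae | 10,717 | 1,370 | 230 | 2 | 8 | 220 | 8 | 228
8 | Dengue serotype 2 | Flaviviridae | 10,722 | 1,370 | 241 | 0 | 9 | 232 | 11 | 243
9 | Dengue serotype 3 | Flaviviridae | 10,707 | 1,370 | 230 | 0 | 4 | 226 | 13 | 239
10 | Dengue serotype 4 | Flaviviridae | 10,649 | 1,361 | 229 | 1 | 7 | 221 | 3 | 224
11 | Japanese encephalitis | Flaviviridae | 10,976 | 1,404 | 310 | 3 | 2 | 305 | 12 | 317
12 | West Nile | Flaviviridae | 10,962 | 1,401 | 320 | 2 | 2 | 316 | 9 | 325
13 | Yellow fever | Flaviviridae | 10,862 | 1,389 | 255 | 2 | 3 | 250 | 2 | 252
14 | Hepatitis B | Hepadnaviridae | 3,215 | 409 | 147 | 14 | 0 | 133 | 0 | 133
15 | Influenza A† | Orthomyxoviridae | 12,561 | 1,582 | 510 | (b + c = 16) | | 494 | 0 | 494
16 | Influenza B | Orthomyxoviridae | 14,452 | 1,822 | 665 | 5 | 18 | 642 | 2 | 644
17 | Human papillomavirus type 10 | Papillomaviridae | 7,919 | 1,011 | 287 | 16 | 9 | 262 | 0 | 262
18 | hMPV | Paramyxoviridae | 13,335 | 1,705 | 322 | 44 | 17 | 261 | 0 | 261
19 | Newcastle disease | Paramyxoviridae | 15,186 | 1,943 | 329 | 0 | 2 | 327 | 3 | 330
20 | Nipah | Paramyxoviridae | 18,246 | 2,335 | 389 | 12 | 5 | 372 | 0 | 372
21 | Parainfluenza 1 | Paramyxoviridae | 15,600 | 1,995 | 330 | 8 | 13 | 309 | 2 | 311
22 | Parainfluenza 2 | Paramyxoviridae | 15,646 | 2,002 | 333 | 10 | 2 | 321 | 0 | 321
23 | Parainfluenza 3 | Paramyxoviridae | 15,462 | 1,979 | 409 | 28 | 23 | 358 | 3 | 361
24 | RSV B | Paramyxoviridae | 15,225 | 1,948 | 383 | 28 | 4 | 351 | 4 | 355
25 | Echovirus 1 | Picornaviridae | 7,397 | 945 | 238 | (b + c = 11) | | 227 | 22 | 249
26 | Enterovirus A | Picornaviridae | 7,413 | 946 | 193 | 3 | 0 | 190 | 8 | 198
27 | Enterovirus B | Picornaviridae | 7,389 | 944 | 179 | 0 | 4 | 175 | 22 | 197
28 | Enterovirus C | Picornaviridae | 7,401 | 945 | 183 | 0 | 0 | 183 | 4 | 187
29 | Enterovirus D | Picornaviridae | 7,390 | 944 | 155 | 0 | 3 | 152 | 8 | 160
30 | Foot and mouth disease | Picornaviridae | 8,115 | 1,036 | 194 | 14 | 3 | 177 | 0 | 177
31 | Hepatitis A | Picornaviridae | 7,478 | 955 | 163 | 1 | 6 | 156 | 0 | 156
32 | Rhinovirus A (type 89) | Picornaviridae | 7,152 | 913 | 191 | 6 | 6 | 179 | 1 | 180
33 | Rhinovirus B | Picornaviridae | 7,212 | 920 | 197 | 2 | 2 | 193 | 0 | 193
34 | HIV 1 | Retroviridae | 9,181 | 1,174 | 191 | 4 | 0 | 187 | 0 | 187
35 | Rubella | Togaviridae | 9,755 | 1,246 | 117 | 65 | 0 | 52 | 0 | 52
 | Total | | 419,242 | 53,555 | | | | 9,768 | | 9,921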

*AES scores for all tiling probes were ranked together and only those probes in the top 20th percentile were retained. †Segment 7 of Influenza A was omitted during probe design.

We next considered other non-specific hybridization phenomena that could affect performance of our r-signature probes. For example, we observed a linear relationship between probe signal and %GC content (data not shown). Consistent with previous observations, we found that probes <40% GC hybridized with diminished signal intensities, while probes with >60% GC content showed higher signal intensities [19,20]. Thus, we censored probes with GC <40% or >60% from the r-signatures, despite optimal HD or MCM values. Furthermore, as cross-hybridization with human sequences could also confound results, we compared all probes to the human genome assembly (build 17) by BLAST using a word size of 15 [21]. Probes with an expectation value of 100 were also censored (Table 1). While the ideal pathogen r-signature would be one where all probes would hybridize to the target sequence at detectable levels, polymorphic variation between the probes (derived from a consensus sequence) and the actual target would be expected to impede the performance of the r-signature probes at some level. To test this hypothesis, we compared the ratios of detectable to undetectable probes across all r-signatures in the context of the hybridization involving the Hawaiian Dengue 1 isolate. Although the Dengue 1 sequence used to derive the Dengue 1 r-signature was approximately 5% different from the Hawaiian isolate, the detectable probe ratio of the Dengue 1 specific probes was 151/152 (99%), 12 times higher than that for the nearest Dengue serotype signature, suggesting that moderate polymorphic variation is quite tolerable, allowing, in this case, for discernment of the correct pathogen.
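The GC-content censoring described above amounts to a simple range check. A minimal sketch follows; the function names are hypothetical, and treating exactly 40% and 60% as passing is an assumption drawn from the wording "<40% or >60%".

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a probe sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def passes_gc_filter(seq: str, low: float = 0.40, high: float = 0.60) -> bool:
    """Keep probes whose GC fraction lies within [low, high]."""
    return low <= gc_fraction(seq) <= high
```

The human-genome filter is not sketched here because it depends on an external BLAST search whose exact settings beyond the word size are not given in the text.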

Predicting genome-wide amplification bias

Random priming amplification, rather than primer-specific amplification, is preferred for identifying unknown pathogens in clinical specimens. However, in initial experiments using random priming amplification to identify known pathogens, we frequently observed incomplete hybridization of the pathogen genome marked by interspersed genomic regions not detected by the probes. An example involving the amplification of respiratory syncytial virus (RSV) B from a human nasopharyngeal aspirate is shown in Figure 3. In preliminary analyses, sequence polymorphisms, probe GC content and genome secondary structure failed to explain this phenomenon, suggesting that it might result from a PCR-based amplification bias stemming from differential abilities of the random primers to bind to the viral genome at the reverse transcription (RT) step. The random primer used in our experiments was a 26-mer composed of a random nonamer (3') tagged with a fixed 17-mer sequence (5'-GTTTCCCAGTCACGATA) [4,9,22]. Intra-primer secondary structure formation, such as dimer and hairpin formation between the 17-mer tag and nonamer, and probe melting temperature are known to influence binding efficiency [23,24]. To explore our hypothesis, we designed an algorithm to model the RT-PCR process using experimental data (see Additional data file 1 for details). Briefly, it calculates the probability that a 500-1,000 base-pair product (average size range of PCR product) can be generated from each possible starting position in the genome assuming that a nonamer in the random primer mix will complement the viral sequence perfectly. This probability is reduced when intra-primer hairpin formation is predicted, and increased according to degree of complementarity between tag sequence and viral sequence. In this manner, the probability that each nucleotide will be successfully PCR-amplified is reflected in its AES (see supplemental methods in Additional data file 1 and [25]). 
To validate the algorithm, we ranked the hybridization signal intensities for all 1,948 probes tiled across the RSV B genome and compared them to their AES values (Figure 3). We observed that high AES correlates significantly with probe hybridization signal intensity above the detection threshold (P = 2.2 × 10^-16; Fisher's exact test). In another experiment involving a patient sample positive for metapneumovirus (hMPV), the probes tiled across the hMPV genome showed a similar result (P = 1.3 × 10^-9). Repeatedly, we observed that higher AES correlated with greater probe detection, with, on average, >70% detection for probes in the top 20% AES (see supplemental methods in Additional data file 1).
Figure 3

Measurement and application of AES. An RSV patient sample was amplified using original primer A1 (black line), or AES-optimized primer (blue line). The probes that have detectable signal above threshold are shown in purple in the corresponding heatmaps. For primer A1, the detectable regions correspond to regions that have higher AES scores than undetectable regions.

While HD, MCM, %GC and sequence uniqueness were valuable parameters for probe selection, they did not take into account PCR bias, and were insufficient predictors of probe performance when considered in the absence of AES (Figure 4). We found that using only the probes within the top 20% AES (Table 1) substantially improved the efficacy of our prediction algorithm (discussed in the following section). In total, after applying all probe selection criteria, the r-signatures utilized 9,768 of the >50,000 unique probes initially included on the array.
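Selecting the top 20% of probes by AES can be sketched as below. The AES values themselves come from the model described in Additional data file 1 and are treated here as given; the function name, the pooled input dictionary and the tie-breaking behaviour are assumptions.

```python
def top_fraction_by_aes(aes_by_probe, fraction=0.20):
    """Return the probe IDs in the top `fraction` when ranked by AES.

    aes_by_probe: dict mapping probe_id -> AES, pooled across all genomes
    (the Table 1 footnote states that all tiling probes were ranked together).
    Ties at the cut-off are broken by input order (an arbitrary choice).
    """
    ranked = sorted(aes_by_probe, key=aes_by_probe.get, reverse=True)
    keep = max(1, int(len(ranked) * fraction + 1e-9))
    return set(ranked[:keep])
```

The surviving probes would then pass through the GC, human-genome and HD/MCM filters to form the final r-signatures.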
Figure 4

Effects of probe filtering criteria on r-signature probe detection. The 1,948 probes tiled across the RSV B genome were binned according to different filtering criteria and plotted against the percentage of probes with detectable signal. Measurements reflect the average of five experiments.

We next hypothesized that amplification efficiency scoring could be used to select an optimal tag sequence (that is, for the RT-PCR primers) for achieving uniformly high AES across viral genomes, thus globally maximizing PCR efficiency (see supplemental methods in Additional data file 1 and [25]). Briefly, we generated 10,000 primer sequences, eliminated those that formed self-dimers, and calculated AES for every genome based on each candidate primer tag. Primer A2, which had the highest average AES for all 35 viruses present on the array, was selected as the 'AES-optimized' primer. In a comparative study of eight patient samples (five RSV, three hMPV), we observed that primer A2 showed a marked improvement in overall PCR efficiency in amplifying both RSV and hMPV over the original primer, A1 (Figures S2 and S3 in Additional data file 1). The increased PCR efficiency contributed to increased hybridization of DNA to the probes, and is reflected in the uniformly higher signal intensities observed using primer A2. Consequently, >70% of viral probes had signal intensities above detection threshold when using primer A2, compared to approximately 20% using primer A1 (ANOVA test, P = 0.00026; Figure S3 in Additional data file 1).
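The tag-selection procedure (generate candidates, drop self-dimers, keep the tag with the highest average AES) can be outlined as follows. This is only a sketch: the self-dimer criterion (a 3' end that is reverse-complementary to part of the same primer) is a hypothetical stand-in because the source does not state the rule used, and `mean_aes` is a caller-supplied placeholder for the AES model.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def forms_self_dimer(primer: str, min_run: int = 4) -> bool:
    """Hypothetical criterion: the 3' end can anneal to a copy of the primer."""
    three_prime = primer[-min_run:]
    return reverse_complement(three_prime) in primer


def random_tags(n: int, length: int = 17, seed: int = 0) -> list[str]:
    """Generate n random candidate tag sequences (seeded for repeatability)."""
    rng = random.Random(seed)
    return ["".join(rng.choice("ACGT") for _ in range(length)) for _ in range(n)]


def pick_best_tag(candidates, mean_aes):
    """Return the non-dimerizing tag with the highest mean AES.

    mean_aes: callable mapping a tag to its average AES across genomes
    (placeholder for the model in Additional data file 1).
    Raises ValueError if every candidate is rejected.
    """
    usable = [tag for tag in candidates if not forms_self_dimer(tag)]
    return max(usable, key=mean_aes)
```

In use, `pick_best_tag(random_tags(10000), mean_aes)` would mirror the 10,000-candidate screen described in the text, given a real scoring function.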

PDA: an algorithm for detecting pathogens

We observed that while the signal intensities for all pathogen r-signatures approximate a normal distribution, a large proportion of probes comprising the signature of a detectable pathogen have relatively strong signal intensities resulting in a right-skewed distribution (Figure 5a). We reasoned that analysis of the tails of the signal intensity distributions for each r-signature might better enable not only the identification of an infecting pathogen, but also the presence of co-infecting pathogens in the same sample. Thus, we devised a robust statistics-based PDA that analyzes the distribution of probe signal intensities relative to the in silico r-signatures (see supplemental methods in Additional data file 1 and [25]). The PDA software comprises two parts: evaluation of signal intensity of probes in each pathogen r-signature using a modified Kullback-Leibler Divergence (KL); and statistical analysis of modified KL scores using the Anderson-Darling test.
Figure 5

Distribution of probe signal intensities and WKL scores. RNA isolated from an RSV-infected patient was hybridized onto the array. (a) Distribution of probe signal intensities of all 53,555 probes (red) and r-signature probes for an absent pathogen, for example, parainfluenza-1 (dotted line), show a normal distribution. The distribution of signal intensity for RSV r-signature probes is positively skewed, with higher signal intensities in the tail of the distribution. (b) Distribution frequency of WKL scores for the 35 pathogen r-signatures with the majority ranging between -5 and 3. A non-normal WKL score distribution is observed (P < 0.05 by Anderson-Darling test). The presence of a pathogen is indicated by a non-normal distribution caused by outlier WKL = 17, corresponding to RSV. Excluding the RSV r-signature WKL score results in a normal distribution. From this computation, we conclude that RSV is present in the hybridized sample.

Since the original KL cannot reliably determine differences in the tails of a probability distribution, and is highly dependent on the number of probes per genome and the size of each signal intensity bin, we incorporated the Anderson-Darling statistic to give more weight to the tails of each distribution. By using a cumulative distribution function instead of the original probability distribution, the P value generated is independent of the binning criteria, eliminating errors that occur if a particular signal intensity bin is empty [26,27]. We call our modified KL divergence the 'weighted Kullback-Leibler divergence' (WKL), where Q(j) is the cumulative distribution function of the signal intensities of the probes in P found in bin b, and Q̄(j) is the cumulative distribution function of the signal intensities of the probes in P̄ found in bin b.
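A weighted KL divergence of the kind described here can be written as follows. This is a sketch consistent with the verbal description (a KL-type sum over cumulative distributions, with an Anderson-Darling-style weight that emphasizes the tails); the number of bins k, the bin index j and the overbar notation for the comparison probe set are assumptions, and the supplemental methods in Additional data file 1 remain the reference for the exact form.

```latex
\mathrm{WKL}\left(P \,\middle\|\, \bar{P}\right)
  = \sum_{j=0}^{k-1}
    \frac{Q(j)\,\log\!\left(\dfrac{Q(j)}{\bar{Q}(j)}\right)}
         {\sqrt{\bar{Q}(j)\,\bigl[\,1-\bar{Q}(j)\,\bigr]}}
```

The denominator grows small when the reference cumulative distribution is close to 0 or 1, so differences in the tails contribute more to the score than differences near the median.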
R-signatures representing absent pathogens should have normal signal intensity distributions and thus relatively low WKL scores, whereas those representing present pathogens should have high, statistically significant outlying WKL scores (Figure 5b). In the second part of PDA, the distribution of WKL scores is subjected to an Anderson-Darling test for normality. If P < 0.05, the WKL distribution is considered not normal, implying that the pathogen with an outlying WKL score is present. Upon identification of a pathogen, that pathogen's WKL score is left out, and a separate Anderson-Darling test is performed to test for the presence of co-infecting pathogens. In this manner, the procedure is iteratively applied until only normal distributions remain (that is, P > 0.05). The PDA algorithm is extremely fast, capable of making a diagnosis from a hybridized microarray in less than 10 seconds.
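The iterative second stage can be sketched with the standard library alone. This is an illustration rather than the PDA implementation: it uses the small-sample-adjusted Anderson-Darling statistic with the commonly tabulated 5% critical value (0.752) instead of computing a P value, and the function names and the `min_scores` guard are assumptions.

```python
from math import log
from statistics import NormalDist, mean, stdev

AD_CRITICAL_5PCT = 0.752  # tabulated 5% value for the adjusted statistic (assumed convention)


def anderson_darling_adjusted(values):
    """Adjusted A^2 statistic for normality with estimated mean and SD."""
    n = len(values)
    mu, sd = mean(values), stdev(values)
    std_normal = NormalDist()
    # Clamp CDF values so log() never receives 0 for extreme outliers.
    cdf = [min(max(std_normal.cdf((v - mu) / sd), 1e-12), 1 - 1e-12)
           for v in sorted(values)]
    s = sum((2 * i + 1) * (log(cdf[i]) + log(1.0 - cdf[n - 1 - i]))
            for i in range(n))
    a2 = -n - s / n
    return a2 * (1 + 0.75 / n + 2.25 / n ** 2)


def detect_pathogens(wkl_scores, min_scores=8):
    """Repeatedly remove the top-scoring r-signature while the WKL
    distribution is judged non-normal; return the removed pathogens."""
    remaining = dict(wkl_scores)
    detected = []
    while (len(remaining) >= min_scores
           and anderson_darling_adjusted(list(remaining.values())) > AD_CRITICAL_5PCT):
        top = max(remaining, key=remaining.get)
        detected.append(top)
        del remaining[top]
    return detected
```

With 34 background scores that follow a normal distribution and a single outlying score of 17, the loop would report only the outlier and then stop, mirroring the RSV example in Figure 5b.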

Microarray performance on clinical specimens

To assess the clinical utility of the pathogen prediction platform, we analyzed 36 nasal wash specimens according to the workflow illustrated in Figure 6. These specimens were obtained from children under 4 years of age with lower respiratory tract infections (LRTI), of whom 14 were hospitalized for severe disease and 22 had ambulatory LRTI. The clinical diagnosis of these patients was bronchiolitis or pneumonia. All 36 specimens had been previously analyzed for the presence of hMPV, and RSV A and B using real-time PCR. Twenty-one specimens tested positive for one or more viruses, while fifteen were PCR-negative for all three. All specimens were analyzed by microarray in a blinded fashion (Table 2).
Figure 6

Schema of pathogen detection process. AD, Anderson-Darling.

Table 2

Comparison of microarray and real-time PCR performance in detection of pathogen genera (HRV, pneumovirus)

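Patient ID | Array | WKL | P value | PDA genus diagnosis | PCR diagnosis | PCR Ct value | Virus copy no.
111 | 35915 | | | ND | ND | |
122 | 35887 | 20.87 | 2.47 × 10^-29 | Pneumovirus | Pneumovirus | 24.8 | 5.0 × 10^4
133 | 71180 | 22.33 | 6.93 × 10^-62 | Pneumovirus | Pneumovirus | 25.1 | 4.0 × 10^4
165 | 66691 | 16.95 | 3.49 × 10^-4 | Pneumovirus | Pneumovirus | 27.9 | 3.9 × 10^3
185* | 66696 | | | ND | ND | |
254 | 70935 | 25.02 | 2.87 × 10^-39 | Pneumovirus | Pneumovirus | 22 | 5.4 × 10^5
261* | 66697 | | | ND | ND | |
283* | 63781 | 23.99 | 2.28 × 10^-25 | Pneumovirus | HRV | 28.3 | 6.1 × 10^4
 | | 14.07 | 4.66 × 10^-11 | HRV | | |
312* | 66701 | | | ND | Pneumovirus | 33.7 | 44
321* | 71006 | | | ND | Pneumovirus | 31.1 | 340
324* | 35259 | 20.61 | 3.55 × 10^-94 | Pneumovirus | Pneumovirus | 21.4 | 3.0 × 10^6
331* | 66698 | | | ND | HRV | 31.7 | 3.6 × 10^3
337 | 71192 | 21.73 | 3.49 × 10^-14 | Pneumovirus | Pneumovirus | 26.2 | 1.1 × 10^5
 | | 8.3 | 1.92 × 10^-4 | HRV | HRV | 29.1 | 3.1 × 10^4
355 | 35662 | 18.00 | 2.97 × 10^-40 | Pneumovirus | Pneumovirus | 20.3 | 6.7 × 10^6
368* | 66702 | | | ND | ND | |
374 | 66695 | | | ND | Pneumovirus | 34.1 | 500
378 | 70933 | 13.82 | 7.77 × 10^-17 | Pneumovirus | Pneumovirus | 23.9 | 5.4 × 10^5
393* | 71189 | 25.41 | 1.15 × 10^-18 | HRV | HRV | 30.2 | 2.1 × 10^5
412 | 35890 | 19.66 | 2.42 × 10^-49 | Pneumovirus | Pneumovirus | 23.5 | 6.9 × 10^5
414 | 71025 | 49.91 | 1.18 × 10^-65 | Pneumovirus | Pneumovirus | 22.3 | 3.9 × 10^5
 | | | | | HRV | 33 | 2.6 × 10^3
461 | 66699 | | | ND | ND | |
478 | 71027 | | | ND | Pneumovirus | 34.8 | 18
483* | 36053 | 12.17 | 1.47 × 10^-12 | Pneumovirus | Pneumovirus | 24.8 | 2.9 × 10^5
554 | 70997 | 78.55 | 4.59 × 10^-120 | HRV | HRV | 23.5 | 1.5 × 10^6
573 | 66700 | 38.09 | 6.26 × 10^-22 | HRV | HRV | 22.2 | 3.6 × 10^6
639* | 71182 | 9.23 | 7.91 × 10^-6 | HRV | ND | |
699 | 71007 | | | ND | ND | |
769 | 73067 | 24.62 | 3.70 × 10^-52 | Pneumovirus | Pneumovirus | 25.7 | 2.5 × 10^4
818 | 70927 | 10.40 | 1.63 × 10^-8 | HRV | HRV | 34.2 | 1.2 × 10^3
832 | 73068 | 13.52 | 4.54 × 10^-6 | Pneumovirus | Pneumovirus | 28.2 | 3.1 × 10^3
 | | 40.43 | 1.73 × 10^-36 | Pneumovirus | Pneumovirus | 23.8 | 1.2 × 10^5
841 | 73070 | 22.11 | 6.80 × 10^-50 | Pneumovirus | Pneumovirus | 20.9 | 4.5 × 10^6
 | | | | | | 35.4 | 8
 | | | | | HRV | 29.2 | 3.3 × 10^4
853* | 66690 | | | ND | ND | |
859 | 71188 | 72.17 | 1.42 × 10^-128 | HRV | HRV | 24.5 | 2.8 × 10^6
892* | 68359 | 12.43 | 5.77 × 10^-5 | HRV | Pneumovirus | 34 | 27
 | | | | | HRV | 32.3 | 4.2 × 10^3
913 | 71028 | 40.67 | 1.60 × 10^-50 | Pneumovirus | Pneumovirus | 19.1 | 4.7 × 10^6
924* | 66703 | 12.79 | 2.56 × 10^-6 | Pneumovirus | Pneumovirus | 31.5 | 250
 | | | | | Pneumovirus | 33.7 | 630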

*Hospitalized patients. †RSV A patient samples. ND, none detected.

As the RSV A full-genome sequence has not been published, our array was not designed to specifically detect this virus. Thus, we first assessed array performance using only results from the 16 patients diagnosed with either hMPV or RSV B by PCR (Table 3). Of this cohort, the microarray correctly detected the presence of hMPV or RSV B in 13/16 samples. This corresponds to an assay specificity of 100%, sensitivity of 76%, and diagnostic accuracy of 94%. All 4 false negatives (patients 374, 841, 892, and 924; in patient 841 the microarray detected RSV B but missed the co-infecting hMPV) had Ct values >33.5, which is near the detection limit of real-time PCR, and thus perhaps beyond the range of detection by microarray.
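The reported figures follow the usual confusion-matrix definitions, sketched below. The 13 true positives and 4 false negatives come from the text and Table 3 (17 PCR-confirmed hMPV or RSV B infections, counting the co-infection in patient 841). The true-negative count of 55 is an assumption, not stated in the source: it corresponds to 36 specimens × 2 targets = 72 determinations and is used only because it is consistent with the reported 94% accuracy.

```python
def diagnostic_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }


# tp and fn are taken from the text; tn=55 is a hypothetical denominator.
metrics = diagnostic_metrics(tp=13, fn=4, fp=0, tn=55)
```

Under these counts the sensitivity is 13/17, which rounds to the 76% quoted above.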
Table 3

Comparison of microarray and real-time PCR performance in detecting RSV B or hMPV

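Patient ID | Array | WKL | P value | PDA diagnosis | PCR diagnosis | PCR Ct value | Virus copy no.
122 | 35887 | 20.87 | 2.47 × 10^-29 | hMPV | hMPV | 24.8 | 5.0 × 10^4
133 | 71180 | 22.33 | 6.93 × 10^-62 | hMPV | hMPV | 25.1 | 4.0 × 10^4
165 | 66691 | 16.95 | 3.49 × 10^-4 | hMPV | hMPV | 27.9 | 3.9 × 10^3
254 | 70935 | 25.02 | 2.87 × 10^-39 | hMPV | hMPV | 22 | 5.4 × 10^5
769 | 73067 | 24.62 | 3.70 × 10^-52 | hMPV | hMPV | 25.7 | 2.5 × 10^4
832 | 73068 | 13.52 | 4.54 × 10^-6 | hMPV | hMPV | 28.2 | 3.1 × 10^3
892* | 68359 | | | ND | hMPV | 34 | 27
324* | 35259 | 20.61 | 3.55 × 10^-94 | RSV B | RSV B | 21.4 | 3.0 × 10^6
355 | 35662 | 18.00 | 2.97 × 10^-40 | RSV B | RSV B | 20.3 | 6.7 × 10^6
374 | 66695 | | | ND | RSV B | 34.1 | 500
378 | 70933 | 13.82 | 7.77 × 10^-17 | RSV B | RSV B | 23.9 | 5.4 × 10^5
412 | 35890 | 19.66 | 2.42 × 10^-49 | RSV B | RSV B | 23.5 | 6.9 × 10^5
483* | 36053 | 12.17 | 1.47 × 10^-12 | RSV B | RSV B | 24.8 | 2.9 × 10^5
924* | 66703 | | | ND | RSV B | 33.7 | 630
337 | 71192 | 21.73 | 3.49 × 10^-14 | RSV B | RSV B | | 1.1 × 10^5
841 | 73070 | 22.66 | 4.21 × 10^-50 | RSV B | RSV B | 20.9 | 4.4 × 10^6
 | | | | | hMPV | 35.4 | 8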

*Hospitalized patients. ND, none detected.

We next assessed array performance in the group of patients PCR-positive for RSV A (n = 7) and PCR-negative for all tested viruses (n = 15). The microarray made only two positive calls in this group, both for RSV B. Interestingly, both RSV B calls corresponded to high-titre RSV A specimens by PCR (patients 414 and 913), suggesting that certain probe sets can detect the presence of related, but unspecified, viruses. Analysis of the published RSV A partial genome sequence (923 bp, Genbank ID: AF516119) revealed that 7 probes on our microarray had 100% identity to RSV A. We created an 'RSV A r-signature' comprising these 7 probes, enabling the specific detection of RSV A by microarray in 4/7 patient samples PCR-positive for RSV A (patients 414, 832, 913, and 924). Although the performance of this small r-signature was not as robust as the other virus r-signatures (median size: 249 probes), it suggested that it was feasible to pursue a 'viral discovery' approach using r-signatures created to detect viruses at the family or genus level that were related to those species already represented on the microarray. Specifically, we binned probes into family- or genus-level r-signatures by relaxing our similarity criteria (to HD ≤ 5 or MCM ≥ 25) and selecting probes common to genome sequences within families and genera for the Picornaviridae family, Paramyxoviridae family, rhinovirus genus (HRV) and pneumovirus genus (inclusive of RSV and hMPV). Upon re-analysis of all 36 samples, we identified the presence of pneumovirus in 17 specimens as expected (1 false positive, patient 283), and additionally detected the presence of HRV in 9 specimens (Table 2). As HRV was a novel discovery, we re-screened all 36 samples by PCR and found HRV in 11 specimens. All but one of the nine HRV calls by microarray were confirmed by PCR.
The single unconfirmed HRV call was intriguing given that the genomic diversity of the more than 100 known rhinovirus serotypes makes detection by PCR notoriously difficult [28]. As the real-time PCR primers were capable of identifying only approximately 70% of rhinovirus strains, it is possible that the microarray correctly detected a rhinovirus strain that PCR failed to detect. Similarly, the pneumovirus genus detected in patient 283 could not be verified by RT-PCR, possibly owing to subtle genetic variations that prevented primer annealing. Thus, the greater genomic coverage afforded by the microarray might, in some cases, provide a more sensitive and accurate detection capability than pathogen-specific PCR. Though the microarray identified the majority of HRV and RSV A samples using the genus-level r-signatures, it failed to detect three samples positive for HRV and three positive for RSV A by real-time PCR. These false negatives had an average Ct value >32, again suggesting a detection threshold close to that of real-time PCR. However, the fact that the microarray also made a number of accurate discoveries in the 30-35 Ct range suggests a considerable degree of detection variability in the titre range above an approximately 30 Ct equivalency. Notably, the microarray correctly detected the presence of co-infecting pathogens in two samples (337 and 832), demonstrating the unique potential of this microarray platform to reveal complex disease etiologies.
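The relaxed similarity criteria used for the family- and genus-level r-signatures can be illustrated with a minimal sketch. It assumes that HD denotes the Hamming distance and MCM the maximum contiguous match between a probe and an equal-length genome window (neither term is defined in this section), that comparison is ungapped and on one strand only, and that a probe is "common" to a group when it passes in every genome of that group. All function names are hypothetical and are not taken from the authors' software.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def max_contiguous_match(a: str, b: str) -> int:
    """Length of the longest run of consecutive matching positions."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best


def best_scores(probe: str, genome: str) -> tuple[int, int]:
    """Lowest HD and highest MCM over all ungapped windows of the genome."""
    n = len(probe)
    best_hd, best_mcm = n, 0
    for i in range(len(genome) - n + 1):
        window = genome[i:i + n]
        best_hd = min(best_hd, hamming_distance(probe, window))
        best_mcm = max(best_mcm, max_contiguous_match(probe, window))
    return best_hd, best_mcm


def in_relaxed_signature(probe, genomes, max_hd=5, min_mcm=25):
    """True if the probe meets HD <= max_hd or MCM >= min_mcm in every genome.

    Requiring a pass in every genome is an assumed reading of
    'probes common to genome sequences within families and genera'.
    """
    for genome in genomes:
        hd, mcm = best_scores(probe, genome)
        if not (hd <= max_hd or mcm >= min_mcm):
            return False
    return True


# Toy example with a synthetic 40-mer (not a real viral sequence).
probe = "ACGT" * 10
exact = "TT" + probe + "GG"            # contains the probe exactly
variant = list(probe)
for pos, base in ((0, "C"), (10, "T"), (20, "C")):
    variant[pos] = base                # introduce three mismatches
variant = "".join(variant)
unrelated = "A" * 60

print(best_scores(probe, exact))
print(best_scores(probe, variant))
print(in_relaxed_signature(probe, [exact, variant]))
print(in_relaxed_signature(probe, [exact, unrelated]))
```

A real implementation would also need to handle the reverse complement and would likely use an indexed search rather than this quadratic scan.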

Alternative methods of array design and pathogen detection

Though pathogen detection by microarray is a young field, a number of different platforms and approaches have been described, each with important attributes. For example, the array described by Wang et al. [9] is based on probes designed to recognize the most conserved viral domains, facilitating the detection of a taxonomic fingerprint that provides powerful clues to viral identity with minimal probe usage. Lin et al. [8], on the other hand, described a probe-dense resequencing array capable of detecting a smaller set of predefined pathogens, but with higher detection specificity, including the ability to discern highly related subtypes. The microarray described herein represents a blend of these two concepts, integrating a probe tiling approach for substantial genomic coverage (though with lower probe density than a resequencing array), with a taxonomy-based strategy for binning probes into pathogen recognition signatures. Thus, our analytical output includes both family- and genus-level predictions (for r-signatures restricted to conserved probes) as well as species-specific predictions (for r-signatures composed of conserved and unique probes). Indeed, this capability allowed us to detect and accurately identify viruses in clinical samples (Table 2).
Central to pathogen prediction are the algorithms that weigh the microarray data against pre-defined recognition signatures. Unfortunately, few such algorithms exist, and only one algorithm, E-Predict, has been reported and validated [5,29,30]. E-Predict matches hybridization signatures with predicted pathogen signatures derived from the theoretical free energy of hybridization for each microarray probe. To examine the performance of E-Predict on our microarray platform, we analyzed a number of samples with both E-Predict and our PDA algorithm. When applied to our microarray data, E-Predict performed well, with its first prediction tending to be the correct one (Table S2 in Additional data file 1). However, for each specimen, a number of false positive calls were also made, which seemed to reflect species with considerable sequence similarity to the true infecting pathogen (Table S2 in Additional data file 1). For example, in patient sample 412, E-Predict detected RSV (the correct pathogen), but also multiple species of coronavirus (which share some sequence similarity with RSV), yet real-time PCR using pancoronavirus primers as well as primers specific for strains OC43 and 229E indicated the absence of coronavirus from this sample (Figure S4 in Additional data file 1). These false positive calls can be explained by the fact that E-Predict is geared less towards identifying and distinguishing specific pathogen strains, and more towards elucidating the best possible candidates as supported by the available probes. Thus, E-Predict is particularly advantageous in situations where a pathogen's sequence is not fully known [5].
In contrast, our PDA algorithm is designed to make calls with greater species-level resolution. A major strength of PDA is its ability to specifically identify sequence-characterized and co-infecting pathogens with low false positivity. This is aptly demonstrated by the ability of PDA to detect specifically the presence of Dengue 1 in the clinical sample, where 7 of the 35 viruses on the array are from the Flaviviridae family, including 4 dengue serotypes that share 70% sequence homology. The benefits of using both algorithms simultaneously for detecting both known and novel pathogens should be further evaluated.
An important discovery in this study was that the composition of the random primer tag has a significant impact on the efficiency of viral genome amplification, as assessed by an amplification efficiency score (AES). The measurement of amplification efficiency allowed us to predict which probes would provide the most informative recognition signatures, markedly improving our pathogen prediction capability. Moreover, this finding allowed us to design AES-optimized primers that increased the amplification efficiency of our samples, resulting in greater sensitivity of pathogen detection. Whether multiplex RT-PCR using a variety of AES-designed primer tags can further increase amplification efficiency warrants further investigation. Additionally, it is feasible that other tag-based PCR applications, such as the generation of DNA libraries and enrichment of RNA for resequencing, may benefit from primer optimization using the AES algorithm.
DNA microarrays have the potential to revolutionize clinical diagnostics through their ability to simultaneously investigate thousands of potential pathogens in order to make a diagnosis. However, questions remain regarding their sensitivity and reliability. In this work, we investigated the myriad factors that influence microarray performance in the context of virus detection in clinical specimens, and describe an optimized platform capable of identifying individual and co-infecting viruses with high accuracy and sensitivity that brings microarray technology closer to the clinic. Future improvements will include significant reductions in microarray manufacturing and usage costs. Multiplex microarray formats and 're-usable' arrays are developing technologies that promise to drive down these costs. Furthermore, alternative technologies, such as beads [31], microfluidics [32,33] and nanotube microarrays [34], might provide advantages in both assay cost and speed relative to traditional microarray platforms. Technology considerations aside, the advantages of a highly parallel, nucleic acid-based screening approach for detecting disease pathogens are clear. Validations in larger patient cohorts and in diverse clinical settings will be an important next step towards establishing the clinical role of pathogen detection microarrays.

Materials and methods

Microarray synthesis

Complete genome sequences of 35 clinically relevant human viruses (Table S1 in Additional data file 1) were downloaded from the NCBI Taxonomy Database [35] and used to generate 40-mer probe sequences tiled across each genome and overlapping at an average 8-base resolution. Seven replicates of each probe were synthesized at random positions on the microarray using Nimblegen proprietary technology [13]. For quality control purposes, 10,000 random sequence probes with 40-60% GC content were included to assess background signal levels. Additional controls included 400 probes to human immune genes (positive controls) and 162 probes to a plant virus, PMMV (negative control). In total, 390,482 probes were synthesized on the array.
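The tiling scheme described above can be sketched as follows. This is an illustrative sketch only: it assumes a fixed 8-base step, whereas the text reports an average 8-base resolution, and it omits any probe filtering the authors may have applied. The function names `tile_probes` and `gc_fraction` are hypothetical.

```python
def tile_probes(genome: str, probe_len: int = 40, step: int = 8):
    """Return (start, end, sequence) tuples for probes tiled across a genome.

    Positions are 1-based and inclusive. A constant step is an assumption;
    the paper states only an average spacing of 8 bases.
    """
    genome = genome.upper()
    probes = []
    for start in range(0, len(genome) - probe_len + 1, step):
        seq = genome[start:start + probe_len]
        probes.append((start + 1, start + probe_len, seq))
    return probes


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases, e.g. to check the 40-60% GC window
    quoted for the random-sequence control probes."""
    return (seq.count("G") + seq.count("C")) / len(seq)


# Toy 100-base sequence (synthetic, not a viral genome).
toy_genome = "ACGT" * 25
probes = tile_probes(toy_genome)
print(len(probes), probes[0][:2], probes[-1][:2])
print(gc_fraction(probes[0][2]))
```

With a fixed step, consecutive probes overlap by 32 bases, so each genome position away from the ends is covered by about five probes.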

Sample preparation, microarray hybridization and staining

Dengue (ATCC #VR-1254) was cultured as per ATCC recommendations and Sin850 SARS was cultured as described [36]. Clinical specimens (nasopharyngeal washes) were obtained from an Indonesian pediatric population using a standardized WHO protocol as described [37]. The patients were all aged between 0 and 48 months, showed symptoms of LRTI, and were diagnosed with bronchiolitis or pneumonia when they visited the clinic between February 1999 and February 2001. Of these patients, 14 were subsequently hospitalized. The samples were stored at -80°C in RNAzol (Leedo Medical Laboratories, Inc., Friendswood, TX, USA). RNA was later extracted from samples with RNAzol according to the manufacturer's instructions [38,39], resuspended in RNA storage solution (Ambion, Inc., Austin, TX, USA) and frozen at -80°C until further use. A detailed protocol is provided in the supplemental methods in Additional data file 1. Briefly, RNA was reverse transcribed to cDNA using tagged random primers as described [9,40]. The original primer A1 was 5' GTTTCCCAGTCACGATANNNNNNNNN, and the AES-optimized primer A2 was 5' GATGAGGGAAGATGGGGNNNNNNNNN. The cDNA was then amplified by random PCR, fragmented, end-labeled with biotin, hybridized onto the microarray and stained as previously described [19], with one exception: 0.82 M tetramethylammonium chloride (TMAC) was added to NimbleGen's hybridization buffer to minimize nonspecific hybridization.

Real-time PCR for clinical samples

A 20 μl reaction mixture containing 2 μl of the purified patient RNA, 5 U of MuLV reverse transcriptase, 8 U of recombinant RNase inhibitor, and 10 μl of 2X universal PCR Master Mix with no UNG (all from Applied Biosystems, Foster City, CA, USA) was combined with 0.9 μM primer and 0.2 μM (RSV B and hMPV), 0.3 μM (HRV) or 0.5 μM (RSV A) probe. The primers and probe sequences for hMPV were: 5'-AGCAAAGCAGAAAGTTTATTCGTTAA-3'; 5'-ACCCCCCACCTCAGCATT-3'; and 5'-FAM-ATTCATGCAAGCTTATGGTGCTGGTCAAA-TAMRA-3'. Primers and probes for RSV [41] and HRV [42] have been described. Samples underwent reverse transcription at 48°C for 30 minutes, then were heated at 95°C for 10 minutes and amplified by 40 cycles of 15 s at 95°C and 1 minute at 60°C on an ABI Prism 7900HT Sequence Detection System (Applied Biosystems). During amplification, fluorescence emissions were monitored at every thermal cycle. The threshold cycle (Ct) is the cycle at which significant fluorescence is first detected. Ct values were converted to copy number using a control plasmid of known concentration: RSV A, 5.06 × 10^9 copies had a Ct value of 10.469; RSV B, 2.61 × 10^9 copies had a Ct value of 11.897; hMPV, 7.51 × 10^9 copies had a Ct value of 10.51; HRV, 1.73 × 10^7 copies had a Ct value of 20.20.
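The text does not state how Ct values were converted to copy number from these calibration points. One common approach, shown here purely as an assumption, is a single-point calibration in which the template doubles every cycle (100% amplification efficiency), so that each additional cycle of Ct corresponds to a two-fold lower starting copy number. The function name `ct_to_copies` and the `efficiency` parameter are hypothetical; the calibration values are those quoted above, with the exponents read as powers of ten.

```python
# Calibration points from the text: virus -> (plasmid copies, Ct value).
CALIBRATION = {
    "RSV A": (5.06e9, 10.469),
    "RSV B": (2.61e9, 11.897),
    "hMPV": (7.51e9, 10.51),
    "HRV": (1.73e7, 20.20),
}


def ct_to_copies(virus: str, ct: float, efficiency: float = 1.0) -> float:
    """Estimate starting copy number from a Ct value.

    Assumes a single-point calibration and a constant per-cycle
    amplification factor of (1 + efficiency); efficiency = 1.0 means
    perfect doubling. This is an illustrative model, not the authors'
    documented procedure.
    """
    ref_copies, ref_ct = CALIBRATION[virus]
    return ref_copies * (1.0 + efficiency) ** (ref_ct - ct)


# A sample with Ct 32 under the assumed model.
print(f"{ct_to_copies('RSV A', 32.0):.3g}")
```

Under this model a standard curve built from a plasmid dilution series would be preferable, since it estimates the actual efficiency instead of assuming it.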

One-step real-time PCR for coronavirus

Frozen live cultures of human coronavirus OC43 and 229E were purchased from ATCC (Cat #VR-1558, VR-740) for use as positive controls. RNA was extracted from these cultures using RNA Mini Kit (Qiagen, Hilden, Germany) in accordance with the manufacturer's instructions. The samples were amplified using diagnostic primer pairs for pancoronavirus, OC43 and 229E as previously described [43].

Data analysis

Microarrays were scanned at 5 μm resolution using an Axon 4000b scanner and Genepix 4 software (Molecular Devices, Sunnyvale, CA, USA). Signal intensities were extracted using Nimblescan 2.1 software (NimbleGen Systems, Madison, WI, USA). Using an automated script (J George and V Vega), we calculated the median signal intensity and standard deviation from the seven replicates of each probe. The probe signal intensities were sorted by genome and arranged in sequence order, then reformatted into CDT format for graphical viewing of signal intensities in Java Treeview [44]. In parallel, the probe median signal intensities were analyzed using PDA to determine which pathogen was present, and the associated confidence level of prediction. The AES and PDA algorithms are described in detail in the Results section and all algorithms, formulae, software and microarray data are available on the supplemental website [25] and in Additional data file 1.
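The replicate summarization step can be sketched as below. This is not the script written by the authors, which is not reproduced in the text; the function name `summarize_replicates`, the input layout (a mapping from probe identifier to its replicate intensities) and the use of the sample standard deviation are all assumptions.

```python
import statistics


def summarize_replicates(intensities_by_probe):
    """Return {probe_id: (median, standard deviation)} over replicate spots.

    Uses the sample standard deviation (n - 1 denominator); the paper does
    not say which form was used.
    """
    summary = {}
    for probe_id, values in intensities_by_probe.items():
        summary[probe_id] = (statistics.median(values),
                             statistics.stdev(values))
    return summary


# Made-up intensities for two hypothetical probes, seven replicates each.
raw = {
    "probe_0001": [1, 2, 3, 4, 5, 6, 7],
    "probe_0002": [500, 480, 510, 495, 9000, 505, 490],
}
for probe_id, (med, sd) in summarize_replicates(raw).items():
    print(probe_id, med, round(sd, 2))
```

The second probe shows why a median is a sensible summary here: one outlying replicate inflates the standard deviation but leaves the median close to the other six values.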

Additional data files

The following additional data are available with the online version of this paper. Additional data file 1 includes supplementary materials and methods, figures, tables, pathogen microarray data and software.

Additional data file 1

All files are available for download in PDF, JPG, GIF, TIFF, HTML or ZIP formats as indicated on the webpage [25].
Supplementary methods: sample amplification and microarray protocols (PDF); RT-PCR modeling and amplification efficiency score (AES); pathogen detection algorithm (PDA).
Supplementary figures.
Figure S1: Probe design schema. Probes (40-mers) were tiled at an average 8-base resolution across each of the 35 viral genomes in the manner depicted above. Numbers represent the start and end positions of each probe.
Figure S2: Choice of primer tag in random RT-PCR has significant effect on PCR efficiency. Heatmap of probe signal intensities for a clinical hMPV sample following random RT-PCR using original primer (a) A1 or (b) AES-optimized primer A2.
Figure S3: Comparison of amplification efficiency of original primer A1 and AES-optimized primer A2. RNA from patients infected with RSV B (n = 5) or hMPV (n = 3) was reverse-transcribed and amplified using primer A1 or A2 and the percentage of r-signature probes with signal above detection threshold was determined.
Figure S4: Diagnostic PCR results for RSV patient 412 show that the patient does not have a coronavirus infection. (a) PCR using pancoronavirus primers. Lane 1, 1 kb ladder; lane 2, blank; lane 3, OC43 coronavirus positive control; lane 4, 229E coronavirus positive control; lane 5, RSV patient 412; lane 6, PCR primers and reagents only, as a negative control. (b) PCR using OC43 specific primers. Lane 1, 50 bp ladder; lane 2, blank; lane 3, OC43 coronavirus positive control; lane 4, RSV patient 412; lane 5, purified RSV from ATCC; lane 6, PCR negative control. (c) PCR using 229E specific primers. Lane 1, 229E coronavirus positive control; lane 2, RSV patient 412; lane 3, PCR negative control; lane 4, 1 kb ladder.
Supplementary tables.
Table S1: List of genomes represented on the pathogen detection microarray.
Table S2: Comparison of E-Predict and PDA algorithms.
Pathogen microarray data: data have been deposited in NCBI's Gene Expression Omnibus and are accessible through GEO accession number GSE3779 [45].
Software downloads. Amplification efficiency score software: Primerselect Readme.txt; Primerselect.java. Pathogen detection algorithm (PDA): WKL Readme.txt; WKL.cpp.