Literature DB >> 26067233

Preferential Amplification of Pathogenic Sequences.

Fang Ge¹, Jayme Parker², Sang Chul Choi¹, Mark Layer¹, Katherine Ross³, Bernard Jilly³, Jack Chen².

Abstract

The application of next generation sequencing (NGS) technology in the diagnosis of human pathogens is hindered by the fact that pathogenic sequences, especially viral, are often scarce in human clinical specimens. This known disproportion leads to the requirement of subsequent deep sequencing and extensive bioinformatics analysis. Here we report a method we called "Preferential Amplification of Pathogenic Sequences (PATHseq)" that can be used to greatly enrich pathogenic sequences. Using a computer program, we developed 8-, 9-, and 10-mer oligonucleotides called "non-human primers" that do not match the most abundant human transcripts, but instead selectively match transcripts of human pathogens. Instead of using random primers in the construction of cDNA libraries, the PATHseq method recruits these short non-human primers, which in turn, preferentially amplifies non-human, presumably pathogenic sequences. Using this method, we were able to enrich pathogenic sequences up to 200-fold in the final sequencing library. This method does not require prior knowledge of the pathogen or assumption of the infection; therefore, it provides a fast and sequence-independent approach for detection and identification of human viruses and other pathogens. The PATHseq method, coupled with NGS technology, can be broadly used in identification of known human pathogens and discovery of new pathogens.

Entities: CellLine Chemical Disease Gene Mutation Species

Mesh：

Substances：
DNA, Viral

Year: 2015 PMID： 26067233 PMCID： PMC4464073 DOI： 10.1038/srep11047

Source DB: PubMed Journal: Sci Rep ISSN： 2045-2322 Impact factor: 4.379

Next generation sequencing (NGS) technologies12, including 2nd and 3rd generation DNA sequencing platforms, have started a revolution in genomics and provided opportunities for its broad application in many other fields345, including the diagnosis of human pathogens678910. Examples of NGS application in the fields of virology and infectious diseases include: 1) epidemiology investigation of infectious disease outbreaks1112; 2) etiologic diagnosis of viral infections using a meta-genomic approach1314; 3) discovery of new human viruses4; and 4) discovery of other new pathogenic viruses15. Detailed reviews offer an introduction to NGS technology applications in virus discovery and clinical/diagnostic virology7810. However, NGS technology is still a research tool, rather than a diagnostic tool, and cannot be used in current infectious disease diagnostic laboratories due to 1) the scarcity of pathogen sequences in human clinical samples; 2) the necessary subsequent requirement of extensive deep sequencing; and 3) the complexity of bioinformatics analysis required in order to identify the pathogenic sequences. For example, the average viral genome in a human clinical sample is about 1-100 per 10 million human genome sequence reads. Many laboratories have developed various strategies, from consensus PCR assays that use degenerate primers to computational subtraction of large sequence data in order to find possible unknown pathogens, with little success. These “search for a needle in a haystack” strategies have proven to be a very difficult task. To make NGS technology a practical tool for detecting human pathogens, the key is to greatly increase the presence of pathogenic sequences in a clinical sample. To address this challenge, we developed a method we called “Preferential Amplification of Pathogenic Sequences (PATHseq)” which can be used to preferentially amplify non-human sequences in a clinical sample. This method is based on the following facts: 1) active infection is the result of pathogenic gene expression, which produces RNAs, or pathogenic transcripts; 2) only about 3% of the human genome generates transcripts. Among these, the top 1,000 and 2,000 most abundant human transcripts comprise more than 65% and 72% of all human transcripts, respectively16; 3) by selectively excluding the amplification of these abundant human transcripts, we can preferentially amplify pathogenic transcripts in human clinical samples; 4) pathogenic transcripts can be further enriched through subtractive hybridization against a reference (normal) human transcription library (human transcriptome). The PATHseq technology, in combination with NGS technology, has the potential to provide comprehensive and unbiased detection of human pathogens responsible for any infectious disease.

Results

The most abundant human transcripts

The recent completion of the Encyclopedia of DNA Elements (ENCODE) project17 provides a genome-wide “landscape of transcription in human cells” in 14 different cell lines. Although the size of the human genome is huge (containing over 3 billion base pairs (bp)), it encodes only about 20,000 protein-coding genes, accounting for a very small fraction (approximately 2%) of the genome. Based on the publicly available ENCODE database16, the total human large transcripts (>200 bp RNAs) in GM12878 (a cell line that contributed most to the ENCODE database) are 161,999. Among these, 86,248 transcripts are reproducible (in a duplicated experiment). These 86,248 transcripts are defined as human transcriptome (Table 1). A recent report found that most protein-coding genes have one major transcript expressed at significantly higher level than others, and in human tissues these major transcripts contribute almost 85 percent to the total mRNA18. Given that the average length of human mRNAs is 1.3 kb19, the complexity can be reduced by 26.8 times (3,000,000/(86,248 × 1.3)), if we sequence cDNA instead of genomic DNA. This strategy has been successfully used by several laboratories in search of human pathogens4520. However, this strategy is still impractical for diagnostic laboratories, because the number of human transcripts is still too large compared to the relative scarcity of pathogenic transcripts.

Table 1

The most abundant human transcripts.

Most abundant human transcripts	RPKM	% of total human transcriptome
Top 1000	23391.45	65.52%
Top 2000	25847.42	72.40%
Top 3000	27355.52	76.62%
Top 4000	28440.66	79.66%
Top 5000	29287.62	82.04%
Top 6000	29973.64	83.96%
Top 7000	30544.67	85.56%
Top 8000	31035.38	86.93%
Top 9000	31463.82	88.13%
Top 10000	31838.97	89.18%
Top 20000	34018.78	95.29%
All 86248	35700.85	100%

RPKM: Reads per kilobase of transcript per million mapped reads.

In order to solve this problem, we developed an alternative strategy using the most abundant human transcripts (Table 1). Our strategy relies on the removal of the most abundant human transcripts from clinical samples to selectively enrich the pathogenic sequences and further reduce the sequencing complexity. As shown in Table 1, we found the most abundant 1,000 and 2,000 transcripts comprised about 65% and 72% of all human transcripts, respectively, based on ENCODE data17.

Non-human oligonucleotides

We further developed a computer program to look for specific patterns in the human transcriptome database. We used the following steps to find the shortest unmatched k-mers of nucleotides in the human transcriptome. First, we collated a set of the Ensembl transcript names (e.g., ENST00000387347), and retrieved DNA sequences corresponding to the transcript names from the human cDNA sequences of the Ensembl release 73 available at ftp://ftp.ensembl.org/pub/release-73/fasta/homo_sapiens/cdna. Secondly, we searched for the shortest unmatched k-mers in the collated set of DNA sequences2122. The computer program counted k-mers for a given k or the size of substring in a DNA sequence. It started from k = 1, and checked if all of the possible k-mers occurred at least once. It stopped when it reached a k value where there was at least one k-mer that was not found as a substring in the set of transcripts. As predicted, we found that human transcript sequences are not randomly distributed. Using this computer program, we were able to generate a set of 88 8-mer oligonucleotides (Table 2) that do not match the sequences of the 2,000 most abundant human transcripts. This set of oligonucleotides is, therefore, named as “non-human oligonucleotides”. In other words, by using this set of oligos as primers in the construction of cDNA library, we can get rid of 72% of human transcripts from clinical samples, greatly increasing the chance of selectively targeting pathogenic sequences. Theoretically, this set of primers has the probability to amplify any sequences larger than 5,958 bp (48 × 8 /88), which should include almost all human pathogens (both viruses and bacteria). To test how likely this set of short oligonucleotides can cover all known human viruses, we performed an in silico analysis and found that among 386 of all known human viruses (Supplementary Information S1), this set of 88 8-mer oligonucleotides can cover 327 (85%) of them. The remaining 59 unmatched human viruses are usually small human viruses (Supplementary Information S2), which includes human immunodeficiency virus 1 (HIV-1). To cover all know human pathogenic viruses, we also developed a new list of 81 9-mer oligos that do not match the top 2,000 most abundant human transcripts while cover all known human pathogenic viruses (see below In silico experiment).

Table 2

A list of 88 8-mer oligonucleotides that do not match the sequences of the 2,000 most abundant human transcripts.

AAACGCGA	ACGAACCG	ATGCGATA	CCGTAGTA	CGCAATAT	CGGGTCGA	CTAATACG	GTTACGCG	TAGCGAAT	TATCCGAC	TCGTCGAT
AACGCATA	ACGAATAA	ATGCGTTA	CGAACGTA	CGCGATAC	CGGTAAGC	CTTAGCGA	TAACCGTT	TAGCGTAC	TATCGCTA	TGTAAGCG
AATAACGC	ACGATAGG	ATTAGCGT	CGAATAAC	CGCGCGTA	CGGTAGAT	GATACGTA	TAACGTAA	TAGCGTAT	TATCGGAC	TTAACGTA
AATATCGT	ACGCGATA	ATTGCGAC	CGACGTAC	CGCGGTTA	CGGTAGTA	GCGAATAT	TAAGCGCG	TAGTAACG	TATCGGTA	TTACGATA
AATATTCG	ATACCGGT	ATTGTACG	CGATAGGT	CGCGTAAT	CGGTCGAT	GCGACGTA	TAAGGTCG	TAGTCGAG	TATCGGTC	TTAGTCGA
AATCGGTA	ATACGTAC	CAATCGCG	CGATAGTA	CGCGTATA	CGTATATC	GCGTAATT	TAATACGT	TAGTCGGT	TCGAATAG	TTATAGCG
ACACGTTA	ATAGCGCA	CCCTAACG	CGATATCC	CGCGTATC	CGTATTCG	GTACCGTA	TAGAGTCG	TATAGCGC	TCGCGTAT	TTATATCG
ACCGGTTA	ATAGCGCG	CCGGTAAT	CGATCGTA	CGCTAAAA	CGTCGAAT	GTATAACG	TAGATCCG	TATCACGC	TCGGTAAC	TTATCGCG

Using the same computer program, we generated several sets of 7-, 8-, 9-, and 10-mer oligonucleotides that selectively amplify non-human sequences during construction of the cDNA library (Supplemental Information S7-10), which is summarized in Table 3. As shown in Table 3, there are 197 9-mer oligonucleotides that do not match the sequence of 500 bp 3’end of all 86,248 human transcripts. To make this strategy work, we introduced ddNTP and used a mixture of ddNTP with normal dNTP (1% ddATP in dNTP solution) in the construction of cDNA library. ddNTP lacks the OH needed to continue the elongation of the DNA strand. When ddATP is added to the reaction, the elongation of the strand stops once the ddATP is added to the new strand. Using this set of 9-mer oligos, the likelihood to find a match in a random sequence is 49 × 9 /197 = 11,976 bp. Most human pathogens have larger genome sizes than this.

Table 3

Numbers of oligonucleotides that do not match the sequences of human transcripts.

	Top 1000 transcripts			Top 2000 transcripts			Top 4000 transcripts			All 86,248 transcripts
	Full transcript	1000bp -3'End	500bp -3'End	Full transcript	1000bp -3'End	500bp -3'End	Full transcript	1000bp -3'End	500bp -3'End	Full transcript	1000bp -3'End	500bp -3'End
7-mer	1	2	65			4			1
8-mer				88	455		1	74
9-mer										1	20	197

Preferential Amplification of Pathogenic Sequences (PATHseq)

As shown in Fig. 1, PATHseq procedure includes: (1) Total mRNAs are extracted and purified from clinical samples; (2) A primer (P1) is designed to specifically transcribe mRNA into first-strand cDNAs (anti-sense) while introducing a T7 promoter/primer sequence into the cDNA; (3) The ribonuclease activity of RNase H cleaves RNA in a DNA/RNA duplex, allowing the synthesis of secondary cDNA strands; (4) A set of 88 specific 8-mer oligonucleotides (Table 2) is used as primers for the synthesis of secondary cDNA strands. Because these primers do not amplify the 2,000 most abundant human mRNAs, about 72% of all human mRNAs is eliminated from amplification, preferentially amplifying non-human (pathogenic) sequences; (5) Using the T7 promoter introduced in step 2, the T7 RNA polymerase synthesizes RNAs with double-stranded DNA as template; (6) Because the T7 promoter is attached to the poly(A) end, newly generated RNAs are anti-sense; (7) Human reference cDNA library was created using the same set of 8-mer primers (Table 2), plus a poly d(T) primer (not P1 primer). Normal (non-pathogenic) human mRNAs are used as templates when the library is constructed. Sense strands of human reference cDNAs are separated using poly d(T) beads. The beads are further used as the solid phase for subtractive hybridization. Newly generated anti-sense human RNAs from step 6 are captured (hybridized) by these cDNAs and specifically degraded by RNase H in RNA-DNA hybrids. RNase H does not digest single or double-stranded DNA; (8) Pathogenic RNAs are greatly enriched because they do not hybridize to human reference cDNAs; (9) A T7 primer is used to synthesize the first cDNA strands; (10) Again, RNase H cleaves RNAs in a DNA/RNA duplex; (11) Synthesized RNAs are anti-sense; (12) Step 6 through12 form a cycle in which pathogenic RNAs are repeatedly enriched.

Figure 1

Schematic representation of the PATHseq (Preferential Amplification of Pathogenic Sequences) method.

(1) Total mRNAs from clinical sample, including human mRNAs and relatively scarce pathogenic mRNAs; (2) Total mRNAs are transcribed into first strand cDNAs with P1 primer; (3) RNase H cleaves RNAs in RNA-DNA duplex; (4) Reverse transcriptase (RT) synthesizes secondary cDNA strands with P2 primers; (5) T7 RNA polymerase synthesizes RNAs in the presence of T7 promoter; (6) Synthesized anti-sense RNAs; (7) Synthesized RNAs are hybridized to human reference (non-pathogenic) cDNA library coated on a solid phase. RNase H cleaves bound RNAs (human RNAs) in RNA-DNA duplex; (8) Pathogenic RNAs are enriched; (9) Reverse transcription; (10) RNase H cleaves RNAs in RNA-DNA duplex; (11) T7 RNA polymerase synthesizes RNAs; (12) New RNAs synthesized from enriched pathogenic RNAs are amplified 100-1000 fold.

Computational subtraction

To further analyze sequence data, we developed a computer program to subtract sequencing reads that match human sequences2324 and assemble them into contiguous sequences for direct comparison with the GenBank databases of nucleic acids using BLASTN software2526. Using this method, any non-matching sequences representing potential pathogenic sequences will be enriched and remain in the final dataset4520. Figure. 2 shows the approach for computational subtraction. In Step 1, to ensure reasonably high quality of the reads, we trimmed the sequences at the 5’-end by 20 base pairs and also trimmed them by quality score from the 3’-end using 30 as the threshold score27. In Step 2, using a short read aligner called STAR28, we aligned the remaining reads on the human genome of primary assembly sequences unmasked that we fetched from ENSEMBL. Reads aligned to multiple loci in the reference human genome were also considered to be unmapped reads and were filtered out to reduce the false positive rate. In Step 3, to obtain longer non-human origin reads, unaligned reads were assembled using a de novo assembler named Trinity29. In Step 4, contigs were blasted against NCBI “nr” nucleotide sequence database for sequences that were similar to the de novo assembled unaligned sequences3031. In Step 5, the assembled unaligned sequences were assigned to species by the taxonomic unit using the NCBI nucleotide sequence database search result and the taxonomy database32.

Figure 2

Workflow of identification of sequences with foreign origin.

Step 1 is of quality check of the reads, removing low quality reads and trimming low quality bases. Step 2 is to filter out unmapped reads on a host genome. Step 3 is of de novo assembling unmapped reads into contig sequences. Step 4 is to search the BLAST nucleotide sequence database for sequences similar to the contig sequences. Step 5 is to select contig sequences that are mapped on foreign origin sequences in the nucleotide sequence database.

Proof of concept 1 – enrichment of viral sequences from human cell line

We tested the PATHseq method for the enrichment of pathogenic (viral) sequences from Kaposi’s sarcoma-associated herpesvirus (KSHV) (also known as human herpesvirus 8 or HHV-8)-infected BCBL-1 cells using two different approaches, quantitative real time PCR (qPCR) and next generation sequencing. For qPCR, we designed and optimized KSHV-specific primer sets to monitor the enrichment of viral transcripts compared with cellular house-keeping genes, beta-actin and GAPDH, as controls of background human sequences (Table 4). We also included two viral genomic DNA controls, KSHV-OriL and KSHV-OriR. These two viral genomic locations do not generate viral transcripts. As shown in Fig. 3A, we measured relative enrichment of various viral transcripts from at least three independent experiments compared with the cellular house-keeping gene control, beta-actin. Up to 164-fold enrichment (ORF50F1) could be enriched by PATHseq method. The different enrichment level among various viral genes can be explained as the result of viral transcript size and viral sequence matches with 8-mer oligonucleotides used in PATHseq method.

Table 4

Primers for HHV-8/KSHV quantitative (real time) PCR.

Name	Forward	Sequence	Reverse	Sequence	Start	Finish	gDNA(bp)	cDNA(bp)
ORFK8	K8S-F	AGACAGCTGCAGCAGGCATT	K8-Rnew	CTGCTGGCACATTCGCATCA	75648	75851	203	122
ORF40	ORF40S-F	GCTTTGGAGCCTGAGCAATG	ORF40-R	ATGCGATGAGAATACAAGAT	61742	61982	240	114
ORF50F1	ORF50-F1	AGGTGTGCCGTGTAGAGATT	ORF50-R1	TGCTTTCGTTTGGGTGTTGT	74140	74208	68	68
ORF50F2	ORF50-F2	CGCGCTGTTGTCCAGTATTC	ORF50-R2	CCACCAGAAGGTGACGGTAT	74491	74613	122	122
ORF50	ORF50-F	GCGCAAGATGACAAGGGTAA	RTApath-R2	CAAGCTTGGAACATTCTTTC	71698	74613	2915	199
ORF57	ORF57path-F	AGGGCATCCTAGAGGACTCT	ORF57-R	GGGTTCGGACAATTGCTCGT	82203	82411	208	101
β-actin	b-actin-F	ACGTGGACATCCGCAAAGAC	b-actin-R	CAAGAAAGGGTGTAACGCAACTA	945	1252		308
GAPDH	GAPDH-F	GAAGGTGAAGGTCGGAGTC	GAPDH-R	GAAGATGGTGATGGGATTTC	180	405		226
KSHV-OriL	KSHV-OriL-F	GCTAGTGAGTCACGGGCCTG	KSHV-OriL-R	GTAACAGTTGGTTAACCCGT	23946	24117	171
KSHV-OriR	KSHV-OriR-F	ATCCGGCCGTCCTGGGCAGC	KSHV-OriR-R	GGGACGAGGAAAAAGTACGC	122167	122228	61

Figure 3

Enrichment of pathogenic sequences by PATHseq method.

(A) Quantitative real time PCR results indicating enrichment of viral transcripts by PATHseq method from KSHV/HHV-8 latently infected BCBL-1 cell line. ORFK8, ORF40, ORF50F1, ORF50F2, ORF50, and ORF57 are individual lytic viral genes. Actin and GAPDH are cellular house-keeping genes for control; KSHV-OriL and KSHV-OriR are two viral genomic DNA controls; Results are shown from at least three repeats with standard deviation. (B) Next generation sequencing results showing enrichments of viral sequencing reads by PATHseq method. Two independent runs using Illumina MiSeq system were performed.

We further tested the PATHseq method for the enrichment of KSHV viral sequences by next generation sequencing. As shown in Fig. 3B, we performed two independent NGS runs, each consisted of a control (non-enriched) and PATHseq-enriched sequencing libraries. For Run 1, 6 and 493 viral reads could be identified from a total of 11,552,534 and 12,356,296 sequencing reads for control and PATHseq libraries, respectively. For Run 2, 76 and 2,682 viral reads could be identified from a total 13,988,804 and 15,103,847 reads, respectively. The overall enrichments of KSHV viral reads among total sequenced reads increased from 0.0052% to 0.399% (76.8-fold) and 0.0543% to 1.7757% (32.7-fold), respectively.

Proof of concept 2 – enrichment of viral sequences from virus-spiked human samples

We tested the PATHseq method for the enrichment of viral sequences from human samples spiked with multiple viruses. Whole blood was obtained from a healthy blood donor and peripheral blood mononuclear cells (PBMC) were isolated from whole blood using standard procedure. Total RNAs from either PBMC or virus-infected cells were extracted and purified using Qiagen RNeasy Mini Kit. Total RNAs from infected cells were spiked into human PBMC RNAs with a ratio of 1 to 1,000 based on Nanodrop readings. After enrichment by the PATHseq method, quantitative real time PCR (qPCR) was performed with individual viruses, along with two cellular controls, beta-actin and GAPDH. As shown in Fig. 4, we measured relative enrichment of individual viral transcripts from at least three independent experiments compared with cellular house-keeping gene control, beta-actin, which was set to 1. Various human viruses were enriched differently from 31 fold (Influenza A virus) to 242 fold (for Human herpesvirus 6B). The different levels of enrichment for different viruses could be attributed to the fact that the list of 88 8-mer nucleotides used in the PATHseq method matches disproportionately to different viruses. As shown in Table 5, There are 66 8-mer oligonucleotides that match the sequence of human herpesvirus 6B, and 57, 3, 2, 4, and 14 match sequences of human herpesvirus 8, hepatitis C virus, influenza A virus, human parainfluenza virus, and human adenovirus C, respectively. In contrast, there are no oligonucleotides from the list of 88 8-mer oligos that match the genome sequence of human immunodeficiency virus. As a result, there was no enrichment (0.98 fold) of HIV viral sequences from spiked human sample as shown in Fig. 4.

Figure 4

Enrichment of spiked viral sequences.

Human mRNAs isolated from peripheral blood mononuclear cells (PBMC) were spiked with viral mRNAs from cell culture infected with human herpesvirus 6B (HHV-6B), human herpesvirus 8 (HHV-8), hepatitis C virus (HCV), Influenza A virus, human Parainfluenza Virus Type 1 (HPIV1), human immunodeficiency virus (HIV), and human Adenovirus Type C. Beta-actin and GAPDH are cellular controls. PATHseq method was performed to enrich the viral sequences. qPCRs were used to monitor the relative enrichments (in fold) of individual virus to cellular gene control, beta-actin. Results are shown from at least three repeats with standard deviation.

Table 5

Match of 88 8-mer oligonucleotides in spiked virus genomes.

Clinical diagnosis of unknown infection

The PATHseq method was further tested by investigating an unknown clinical respiratory infection and successfully identified a new variant of Streptococcus pneumonia (ASVL_JC_001) from a clinical specimen with bronchitis & pulmonary inflammation33. Using PATHseq coupled with NGS, we generated a total of 16,031,250 sequencing reads and eventually identified 118,200 (0.73%) reads as S. pneumonia sequences (Fig. 5). These reads formed clusters enriched by the 88 8-mer oligonucleotides and further assembled into 2067 contig sequences. Sequencing analysis of this strain shows atypical features of S. pneumonia as it shows alpha-hemolytic colonies (Supplemental Information S3) and bile solubility but was resistant to optochin (Supplemental Information S4). Antimicrobial susceptibility testing shows that this strain was sensitive to cefotaxime, chloramphenicol, oxacillin, penicillin, tetracycline, and vancomycin, but resistant to erythromycin and ethyl hydrocupreine (Taxo P), and partially resistant to sulfamethoxazole trimethoprim (Supplemental Information S4). Metabolic biochemical assay indicated that this variant behaved more like Streptococcus pneumonia than to closely related specie Streptococcus mitis (Supplemental Information S5). PCR assay of housekeeping genes for S. pneumonia indicated that this variant harbors genes encoding the virulence factors pneumolysin (ply) and the major autolysin (lytA), both of which are normally associated with pneumococci (Supplemental Information S6). Whole genome sequencing was performed on an Illumina MiSeq version 2 system33. The total genome of this variant is 2,092,532 base pairs long, and has a GC content of 40.3%. We annotated the genome assembly by using the NCBI Prokaryotic Genomes Automatic Annotation Pipeline34. Among a total of 83 contigs, 69 contigs harbor annotations of genes, CDS, and RNAs. We found 2158 putative genes and 2051 CDS. We also found 5 rRNAs, 44 tRNAs and 1 ncRNA for this variant33.

Figure 5

Application of PATHseq method in identifying an unknown infection.

Initial raw sequencing reads were 16,031,250, after subtracting human sequences, 118,200 (0.7%) reads were identified as S. pneumonia sequences.

In silico experiment

We performed in silico experiment to test how likely the PATHseq method can enrich known human viral pathogens. We first generated a set of 199 human pathogenic viruses based on NCBI and ViralZone human viruses databases (viralzone.expasy.org/) (Supplementary Information S11). As summarized in Table 6, we generated the sets of 8-, 9-, and 10-mer oligonucleotides that do not match the most abundant human transcripts at top 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 20000, and all 75987, respectively. For example, there are 329 8-mer oligonucleotides that do not match the top 1000 most abundant human transcripts. Among these 329 oligos, there are 62 that match at least one known human pathogenic virus and cover 86.9% of all human pathogenic viruses. Please note there are some minor differences in the total number of human transcripts between Table 5 (75,987) and Table 1 and 3 (86,248) because Table 5 used the UCSC Genome Browser assembly (ID: hg38), while Table 1 and 3 used ENCODE database16. Please also note that the total number of human viruses (386) (Supplementary Information S1) is different from the total number of human pathogenic viruses, which is 199 as listed in Supplementary Information S11. Overall, this in silico analysis indicates that by recruiting the 81 9-mer oligos (Table 6), we can exclude the top 2000 most abundant human transcripts while still covering 100% of known human pathogenic viruses including HIV, which was missed from coverage by 88 8-mer oligos (Table 5 and Fig. 4). By recruiting the 171 10-mer oligos, we can exclude the top 20,000 most abundant human transcripts while still covering more than 95% of known human pathogenic viruses. Because the rest 55,987 transcripts only count for less than 5% of all human transcripts, none of them would have more abundance than the viral genome.

Table 6

In Silico Analysis of human virome coverage by short non-human primers.

The Most Abundant Human Transscripts
	Top 1000	Top 2000	Top 3000	Top 4000	Top 5000	Top 6000	Top 7000	Top 8000	Top 9000	Top 10000	Top 20000	All 75987
Number of 8-mers unmatched to the most abundant human transcripts	329	44	9	4	2	1	1	0	0	0	0	0
Among above 8-mers, number that match at least one known human virus	62	26	8	4	2	1	1	0	0	0	0	0
human virome coverage by above 8-mers	86.9%	41.2%	16.6%	11.6%	9.0%	8.0%	8.0%	0.0%	0.0%	0.0%	0.0%	0.0%
Number of 9-mers unmatched to the most abundant human transcripts	23473	8883	4402	2411	1493	953	651	455	347	249	28	1
Among above 9-mers, number that match at least one known human virus	57	81	87	91	94	85	83	69	62	51	14	1
human virome coverage by above 9-mers	100%	100%	98.5%	94.0%	87.9%	76.9%	70.4%	61.8%	57.8%	46.7%	16.1%	2.0%
Number of 10-mers unmatched to the most abundant human transcripts	351888	203816	139254	100542	76937	60510	49753	41374	35737	30336	10053	1075
Among above 10-mers, number that match at least one known human virus	164	167	174	179	177	180	180	181	179	179	171	81
human virome coverage by above 10-mers	100%	100%	100%	100%	100%	100%	100%	100%	100%	100%	95.5%	48.2%

Maximal number of sequencing samples

The maximal number of samples that can be loaded onto a sequencing run depends on two factors: 1) the minimal sequencing coverage required for successfully identifying a pathogen, usually the sequencing depth to achieve at least 1 RPKM (Reads per kilobase of transcript per million mapped reads) of pathogen sequences is required; 2) the capacity of the sequencing instrument. Using the same computer program we developed, we generated a total of 40 10-mer oligonucleotides that do not match the sequences to any of the total 161,999 human transcripts (Table 7). These oligos can be divided into two sets and used as adaptors for the construction of sequencing libraries. The maximum combination of these 20 + 20 adaptors can generate 400 (20 × 20 = 400) sample identifiers (barcodes) from A1 to T20. Therefore, a maximum total of 400 samples can be separately labeled by these two sets of adaptors, mixed into one sample run for sequencing and then separated by its own sample identifier (barcode) from A1 to T20 (Table 7).

Table 7

Index/Adaptor sequences for sequencing library.

Adaptor	Sequence	Sequence	Adaptor
A	ACGCGTATGA	CGTAATACGT	1
B	ACGTAGCGTG	CGTAATCGGT	2
C	ATACGCGACT	CGTACAAACG	3
D	ATCGACGCAA	CGTACGAAAC	4
E	ATCGTTCGAC	CGTACGTTAG	5
F	ATTCGATCGC	GCGCGATAGG	6
G	CCGTCGAAGT	GCGCGTAAAT	7
H	CGAACGAATC	GTACGCGACT	8
I	CGACGTATTG	GTCGAACGAG	9
J	CGATACGTTC	TAACGTATCG	10
K	CGATCTAACA	TAACGTCGGC	11
L	CGATTCGGTT	TACGCGATTG	12
M	CGCCCGTTAA	TAGCGAACGC	13
N	CGCGATAGTG	TAGCGACGCA	14
O	CGCGTGTTAT	TATGCGACGC	15
P	CGGATCGTTA	TCGATCGGTG	16
Q	CGGTACGCAT	TCGCGAAATT	17
R	CGGTCGTAGA	TCGCGAATGA	18
S	CGTAACGACT	TCGTTCGTAC	19
T	CGTAACTAGG	TTATCGCGCA	20

Discussion

Next-generation sequencing technology provides broad detection of infectious agents in a sequence independent manner and is rapidly being adapted for the clinical diagnosis of human pathogens. However, NGS technology is still a research tool, rather than a diagnostic tool, and cannot be used in current infectious disease diagnostic laboratories due to the scarcity of pathogen sequences in human clinical samples and necessary subsequent deep sequencing and intensive bioinformatics analysis in order to identify the pathogenic sequences. To solve this problem, many laboratories have developed various strategies. A ribosomal RNA depletion method is wildly used in RNA-sequencing (RNA-Seq) because large ribosomal RNA (rRNA) constitutes approximately 90% RNA species in total RNA35363738. We compared commercial rRNA depletion method (Epicentre’s Ribo-Zero rRNA Removal Core Kit, Cat. No. RZC110424) with our PATHseq method. The rRNA depletion method does not distinguish human RNA from pathogenic RNA, therefore, the amount of human RNA is still overwhelming in the final sequencing library. However, we believe the best way is to combine these two methods together, i.e. to recruit rRNA depletion prior to the application of PATHseq method, instead of polyA enrichment, because some viruses do not produce mRNA. Other methods for enrichment of pathogenic sequences include a capture-based approach using virus-specific DNA fragments as probes through hybridization39, and an approach using methyl-CpG binding domain (MBD) to separate methylated host DNA from microbial DNA based on differences in CpG methylation density40. The PATHseq method provides an innovative way to preferentially amplify pathogenic sequences over host sequences through the use of non-human but pathogenic-specific short primers. Theoretically, these primers are short enough to broadly detect pathogens (viruses and bacteria) above a given threshold genome size. For example, the set of 88 8-mer primers has the probability to amplify any sequences larger than 5,958 bp (48 × 8 /88), which should include almost all human pathogens (both viruses and bacteria). However, we noticed that some small viruses including human immunodeficiency virus, could not be enriched by this set of 88 8-mer primers. In addition, some viruses, which do not produce mRNAs, would be missed by the PATHseq method. To address this problem, we further generated a list of 81 9-mer oligos (Table 6). These 9-mer oligos do not match the top 2,000 most abundant human transcripts while covering all known human pathogenic viruses. Using these short non-human primers instead of random primers in the construction of cDNA library, the PATHseq method enables efficient enrichment of pathogenic sequences. With significant enrichment of pathogenic sequences in the final sequencing library, it is possible that more samples can be put into a single run to reduce overall cost and turnaround time. The PATHseq method, in combination with NGS technology, has the potential to provide comprehensive and unbiased (sequence-independent) detection of human known as well as unknown (novel) pathogens.

Materials and Methods

BCBL-1 cell line

The body-cavity-based lymphoma cell line, BCBL-1, is latently infected with Kaposi’s sarcoma-associated herpesvirus (KSHV)/human herpesvirus 8 (HHV-8) with average virus copy number ~50 per cell41. BCBL-1 is grown in RPMI1640 (HyClone Laboratories, Logan, Utah) supplemented with 10% fetal bovine serum (FBS) at 37 oC in a 5% CO2 condition.

Human specimen with spiked viruses

Single donor human whole blood was purchased from Fisher Scientific (Thermo Fisher Scientific, Waltham, MA, Cat. No. 50-177-224). Peripheral blood mononuclear cells (PBMC) were purified from whole blood using standard procedure42. Infecting cells with human herpesvirus 6B was described previously43. Hepatitis C patient serum was diagnosed and processed in this lab. Supernatants from infected cell cultures for Influenza A virus (Cat. No. VR-1736), Human Parainfluenza Virus Type 1 (HPIV1)(Cat. No. VR-94), Human Adenovirus Type C (Cat. No. VR-1) were purchased from ATCC. Total RNAs from either PBMC or supernatant of virus-infected cells were extracted and purified using Qiagen RNeasy Mini Kit (Qiagen Inc., Valencia, CA, Cat. No. 74104). mRNAs were purified using NEB Magnetic mRNA Isolation Kit (Cat. No. S1550S). Total RNAs from HIV transfected Jurket cells were described previously44. Ratio of spiked viral mRNAs to human PBMC mRNAs were 1 to 1,000 based on Nanodrop reading.

Clinical specimen with unknown respiratory infection

Sputum from the lower respiratory tract (bronchi and lungs) was obtained from a patient with an unknown clinical bronchitis & pulmonary inflammation at Alaska State Public Health Laboratories. All experiments were performed in accordance with relevant guidelines and regulations and were approved by the Institution Review Board of University of Alaska Fairbanks. Informed consent was obtained from the subject. Total RNAs were extracted and purified using Qiagen RNeasy Mini Kit (Qiagen Inc., Valencia, CA) with some modification. Briefly, sputum was suspended in 500 μl of lysis buffer with 20 μl of proteinase K (20 mg/ml) and incubated for 1 hour at 56 oC. The sample was further extracted with phenol/chloroform several times until there was no white protein layer between two phases. The clean aqueous phase was then processed according to Qiagen protocol. Purified total RNA was then used to prepare a sequencing library using the PATHseq method described below.

Procedure for coupling magnetic beads with oligonucleotides

Primer P1 was specifically designed to contain T7 promoter sequence, in addition to poly d(T) sequences (P1: 5’-ACGGCCTAATACGACTCACTATAGGGTTTTTTTTTTTTTTTTTTVN-3’). When synthesized, a primary amino group was attached to the 5’-end of P1 with a standard (C6) spacer arm (Integrated DNA Technologies, Coralville, IA). Modified primer P1 was manually coupled to Pierce NHS-activated magnetic beads (Thermo Fisher Scientific Inc., Rockford, IL), according to the manufacturer’s instruction. The final concentration of P1-coupled magnetic beads was 10 mg/ml.

Poly(A) mRNA isolation and first strand cDNA synthesis with P1-Magnetic Beads

Total RNA was extracted using Qiagen’s RNase Mini Kit (Qiagen, Valencia, CA) according to the manufacturer’s instruction. 5 μg of total RNA was diluted with nuclease-free water to a final volume of 50 μl in a nuclease-free 0.2 ml PCR tube. 10 μl of P1-Magnetic Beads were washed twice with 100 μl of RNA Binding Buffer (1 M LiCl, 40 mM Tris HCl, pH 7.5, 2 mM EDTA, and 0.1% Triton X-100). Beads were re-suspended in 50 μl of RNA Binding Buffer and added to the 50 μl of total RNA sample. Tubes were placed on the thermal cycler and heated at 65 °C for 5 minutes and held at 4 °C to denature the RNA and facilitate binding of the poly(A) mRNA to the beads. Tubes were removed from the thermal cycler and incubated at room temperature for 5 minutes to allow the RNA to bind to the beads. Tubes were then placed on the magnetic rack at room temperature for 2 minutes to separate the poly(A) mRNA bound to the beads from the solution, supernatant was removed and discarded. Beads were washed twice with 100 μl of Wash Buffer (150 mM LiCl, 20 mM Tris HCl, pH 7.5, 1.0 mM EDTA, and 0.01% Triton X-100) to remove unbound RNA, each time with all the supernatant being removed and discarded. Beads were equilibrated with 100 μl 1x reverse transcriptase (RT) buffer and separated by magnetic field. Beads bound with poly(A) mRNA were re-suspended in 12 μl nuclease-free water, 4 μl dNTP mix (10 mM each), 2 μl of 10x RT buffer, 1 μl of RNAse inhibitor, and 1 μl of M-MuLV reverse transcriptase were added to the solution. Mixture was incubated at 42°C for one hour and then inactivated at 90 °C for 3 min. Magnetic beads with first strand cDNA synthesis were separated and the supernatant being removed and discarded.

PATHseq

88 octamer (Table 2) and 197 nonamer (Extended Data Table 4) oligonucleotides were synthesized using Fisher Custom Oligos service (Thermo Fisher Scientific, Waltham, MA). Oligos were suspended in nuclease-free water at final concentration of 10 μM. The set of 88 octamer oligos were used in all experiments in this report. Normal peripheral blood mononuclear cells (PBMCs) were isolated from human whole blood of a single male donor (Thermo Fisher Scientific, Waltham, MA). Total reference RNAs were extracted using Qiagen RNeasy Kit. mRNAs were isolated from total RNAs using NEBNext Poly(A) mRNA Magnetic Isolation Module according to manufacturer’s protocols (NEB). Reference cDNA library was constructed using NEB First Strand Synthesis Protocol with M-MuLV Reverse Transcriptase42. cDNAs were purified using Qiagen MinElute Reaction Cleanup Kit. To perform PATHseq method, magnetic beads with first strand cDNA synthesis were equilibrated with 100 μl 1x PATHseq buffer (40 mM Tris-HCl, pH 7.9, 20 mM MgCl2, 10 mM DTT, and 2 mM spermidine) and separated by magnetic field. Supernatant was removed and discarded. Beads were re-suspended in 8.4 μl nuclease-free water, then the following solutions were added: 2 μl of 10x PATH buffer, 2 μl Octamers (10 μM), 1 μl dNTP mix (25 mM), 3.2 μl rNTP mix (25 mM), 0.4 μl Pyrophosphase (Inorganic (E. coli), 100 units/ml, New England BioLabs, Ipswich, MA), 1 μl reference human cDNAs (0.1 μg/1 μl), 1 μl M-MuLV Reverse Transcriptase (NEB), and 1 μl T7 RNA polymerase (NEB). Final volume was 20 μl. Mixture were incubated at 40 °C for 3 hours with gentle agitation. PATHseq cDNA library was purified using Qiagen MinElute Reaction Cleanup Kit.

Next generation sequencing

Next generation sequencing was performed with Illumina MiSeq Desktop Sequencer version 2 (Illumina, San Diego, CA). Sequencing library was prepared using Nextera DNA Sample Preparation Kit (Illumina) according to the product’s guide. Sequencing run was carried out using Illumina MiSeq Reagent Kit v2 (500-cylces), which generates 24–30 million paired-end reads of 2 × 250 bp length, total output of 7.5-8.5 Gb.

Sequencing data analysis

Raw sequencing data was filtered by in-house scripts: 1) Remove reads with 3 N; 2) Remove reads contaminated by adapter (default: 15 bases overlapped by reads and adapter); 3) Remove reads with a certain proportion of low quality (20) bases (40% as default, parameter setting at 36 bp); 4) Remove duplication contamination. Using a computer program called STAR28, quality sequencing reads were aligned against the human genome primarily assembled from ENSEMBL (http://uswest.ensembl.org/index.html). Reads aligned to multiple loci in the reference human genome were also considered as unmapped reads and filtered out to reduce the false positive rate. To obtain longer non-human origin reads, unaligned reads were further assembled using a de novo assembly computer program named Trinity29, resulting in larger contig sequences. The de novo assembled unaligned sequences were blasted against the nucleotide sequence database known as NCBI “nr” database. Finally, the assembled sequences were identified using the NCBI genomic BLAST database for “Microbes” including bacteria, fungi, and viruses32.

Quantitative real time PCR (qPCR)

qPCRs were performed with ABI’s StepOnePlus Real-Time PCR Systems. Reactions were set up with ABI’s SYBR Select Master Mix (Life Technologies, Carlsbad, CA) according to product’s instruction.

Accession numbers

The genome sequence data of S. pneumonia strain ASVL_JC_0001 identified in this study has been deposited at DDBJ/EMBL/GenBank under the accession number JJMK01000000, and consists of sequences JJMK01000001 - JJMK01000083.

Additional Information

How to cite this article: Ge, F. et al. Preferential Amplification of Pathogenic Sequences. Sci. Rep. 5, 11047; doi: 10.1038/srep11047 (2015).

44 in total

1. A greedy algorithm for aligning DNA sequences.

Authors: Z Zhang; S Schwartz; L Wagner; W Miller
Journal: J Comput Biol Date: 2000 Feb-Apr Impact factor: 1.479

2. Stem cell transcriptome profiling via massive-scale mRNA sequencing.

Authors: Nicole Cloonan; Alistair R R Forrest; Gabriel Kolle; Brooke B A Gardiner; Geoffrey J Faulkner; Mellissa K Brown; Darrin F Taylor; Anita L Steptoe; Shivangi Wani; Graeme Bethel; Alan J Robertson; Andrew C Perkins; Stephen J Bruce; Clarence C Lee; Swati S Ranade; Heather E Peckham; Jonathan M Manning; Kevin J McKernan; Sean M Grimmond
Journal: Nat Methods Date: 2008-05-30 Impact factor: 28.547

Review 3. Sequencing technologies - the next generation.

Authors: Michael L Metzker
Journal: Nat Rev Genet Date: 2009-12-08 Impact factor: 53.242

Review 4. Next-generation sequencing technologies in diagnostic virology.

Authors: Luisa Barzon; Enrico Lavezzo; Giulia Costanzi; Elisa Franchin; Stefano Toppo; Giorgio Palù
Journal: J Clin Virol Date: 2013-03-21 Impact factor: 3.168

5. Clonal integration of a polyomavirus in human Merkel cell carcinoma.

Authors: Huichen Feng; Masahiro Shuda; Yuan Chang; Patrick S Moore
Journal: Science Date: 2008-01-17 Impact factor: 47.728

Review 6. Pathogen discovery from human tissue by sequence-based computational subtraction.

Authors: Yaohui Xu; Nicole Stange-Thomann; Griffin Weber; Ronghai Bo; Sheila Dodge; Robert G David; Karen Foley; Javad Beheshti; Nancy L Harris; Bruce Birren; Eric S Lander; Matthew Meyerson
Journal: Genomics Date: 2003-03 Impact factor: 5.736

7. A new arenavirus in a cluster of fatal transplant-associated diseases.

Authors: Gustavo Palacios; Julian Druce; Lei Du; Thomas Tran; Chris Birch; Thomas Briese; Sean Conlan; Phenix-Lan Quan; Jeffrey Hui; John Marshall; Jan Fredrik Simons; Michael Egholm; Christopher D Paddock; Wun-Ju Shieh; Cynthia S Goldsmith; Sherif R Zaki; Mike Catton; W Ian Lipkin
Journal: N Engl J Med Date: 2008-02-06 Impact factor: 91.245

8. Novel orthobunyavirus in Cattle, Europe, 2011.

Authors: Bernd Hoffmann; Matthias Scheuch; Dirk Höper; Ralf Jungblut; Mark Holsteg; Horst Schirrmeier; Michael Eschbaumer; Katja V Goller; Kerstin Wernike; Melina Fischer; Angele Breithaupt; Thomas C Mettenleiter; Martin Beer
Journal: Emerg Infect Dis Date: 2012-03 Impact factor: 6.883

9. Draft Genome Sequence of an Atypical Strain of Streptococcus pneumoniae Isolated from a Respiratory Infection.

Authors: Sang Chul Choi; Jayme Parker; Vincent P Richards; Katherine Ross; Bernard Jilly; Jack Chen
Journal: Genome Announc Date: 2014-08-14

10. Database indexing for production MegaBLAST searches.

Authors: Aleksandr Morgulis; George Coulouris; Yan Raytselis; Thomas L Madden; Richa Agarwala; Alejandro A Schäffer
Journal: Bioinformatics Date: 2008-06-21 Impact factor: 6.937

6 in total

Review 1. Respiratory Syncytial Virus: Infection, Detection, and New Options for Prevention and Treatment.

Authors: Cameron Griffiths; Steven J Drews; David J Marchant
Journal: Clin Microbiol Rev Date: 2017-01 Impact factor: 26.132

2. Application of next generation sequencing for the detection of human viral pathogens in clinical specimens.

Authors: Jayme Parker; Jack Chen
Journal: J Clin Virol Date: 2016-11-22 Impact factor: 3.168

3. Whole Genome Sequencing of Enterovirus species C Isolates by High-Throughput Sequencing: Development of Generic Primers.

Authors: Maël Bessaud; Serge A Sadeuh-Mba; Marie-Line Joffret; Richter Razafindratsimandresy; Patsy Polston; Romain Volle; Mala Rakoto-Andrianarivelo; Bruno Blondel; Richard Njouom; Francis Delpeyroux
Journal: Front Microbiol Date: 2016-08-26 Impact factor: 5.640

4. Investigation of a Canine Parvovirus Outbreak using Next Generation Sequencing.

Authors: Jayme Parker; Molly Murphy; Karsten Hueffer; Jack Chen
Journal: Sci Rep Date: 2017-08-29 Impact factor: 4.379

5. Enrichment of Viral Nucleic Acids by Solution Hybrid Selection with Genus Specific Oligonucleotides.

Authors: Andrei A Deviatkin; Alexander N Lukashev; Mikhail M Markelov; Larisa V Gmyl; German A Shipulin
Journal: Sci Rep Date: 2017-08-29 Impact factor: 4.379

Review 6. Viral metagenomics and blood safety.

Authors: V Sauvage; M Eloit
Journal: Transfus Clin Biol Date: 2016-01-09 Impact factor: 1.406

6 in total