Ping Qiu, Richard Stevens, Bo Wei, Fred Lahser, Anita Y M Howe, Joel A Klappenbach, Matthew J Marton.
Abstract
Genotyping of hepatitis C virus (HCV) plays an important role in the treatment of HCV. As new genotype-specific treatment options become available, it has become increasingly important to have accurate HCV genotype and subtype information to ensure that the most appropriate treatment regimen is selected. Most current genotyping methods are unable to detect mixed genotypes from two or more HCV infections. Next-generation sequencing (NGS) allows for rapid and low-cost mass sequencing of viral genomes and provides an opportunity to probe the viral population from a single host. In this paper, the possibility of using short NGS reads for direct HCV genotyping without genome assembly was evaluated. We surveyed the publicly available genetic content of three HCV drug target regions (NS3, NS5A, NS5B) to determine whether these genes contained genotype-specific regions that could predict genotype. Six genotypes and 38 subtypes were included in this study. An automated HCV genotyping method based on phylogenetic analysis was implemented and used to assess different HCV target gene regions. Candidate regions of 250 bp each were found in all three genes that carry enough genetic information to predict HCV genotypes and subtypes. Validation using public datasets showed 100% genotyping accuracy. To test whether these 250-bp regions were sufficient to identify mixed genotypes, we developed a random primer-based method to sequence HCV plasma samples containing mixtures of two HCV genotypes in different ratios. We were able to determine the genotypes without ambiguity and to quantify the ratio of the abundances of the mixed genotypes in the samples. These data provide a proof of concept that this random-primed, NGS-based short-read genotyping approach does not need prior information about the viral population and is capable of detecting mixed viral infection.
Year: 2015 PMID: 25830316 PMCID: PMC4382110 DOI: 10.1371/journal.pone.0122082
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Number of sequences included in the validation set for each gene.
For subtypes with >100 sequences, 100 random sequences were chosen. For the rest of the subtypes, all sequences except the sequences used in the reference set were retained for validation.
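The selection rule in this caption can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function name `build_validation_set`, its parameters, and the input layout are hypothetical, and it assumes that reference-set sequences are excluded before the 100-sequence cap is applied, which the caption does not state explicitly.

```python
import random

def build_validation_set(seqs_by_subtype, reference_ids, cap=100, seed=0):
    """Pick validation sequences per subtype.

    seqs_by_subtype: dict mapping subtype (e.g. "1a") to a list of accession IDs.
    reference_ids:   set of accession IDs already used in the reference set.
    cap:             maximum number of sequences kept per subtype (100 in the caption).
    Assumption: reference sequences are removed first, then the cap is applied.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    validation = {}
    for subtype, seq_ids in seqs_by_subtype.items():
        # Drop anything that already serves as a reference sequence.
        candidates = sorted(s for s in seq_ids if s not in reference_ids)
        if len(candidates) > cap:
            # Large subtypes: keep a random sample of `cap` sequences.
            validation[subtype] = sorted(rng.sample(candidates, cap))
        else:
            # Small subtypes: keep every remaining sequence.
            validation[subtype] = candidates
    return validation
```

Called with a subtype of 150 accessions and one of 3, the sketch returns 100 randomly chosen accessions for the first and all non-reference accessions for the second.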
List of entries from the validation dataset whose predicted genotype was discordant with the genotype reported in the Los Alamos (LA) HCV database.
| Accession | Los Alamos Genotype | Predicted Genotype | Comments |
|---|---|---|---|
| EF032895 | 1a | 1b | Annotated as 1b in the original NCBI record. Wrong annotation in the LA HCV DB. |
| KC967478 | 2b | 2a | NCBI annotation is 2a, which suggests that the LA genotype is mis-annotated. Best BLAST hit in NCBI has no annotation and the second-best hit is annotated as 2b/2a. |
| KC967477 | 2b | 2a | NCBI annotation is 2a, which suggests that the LA genotype is mis-annotated. Data were submitted in the same batch by the same laboratory as KC967478. |
| KC967479 | 2c | 2a | No annotation in the original NCBI submission. Best BLAST hit is annotated as 2a, which suggests that the LA genotype is mis-annotated. |
| JX227953 | 2k | 2i | No subtype annotation in the original NCBI submission. Best BLAST hit in NCBI is annotated as genotype 2. LA annotation is ambiguous. Our full-gene phylogenetic analysis shows 2i. |
| JX227952 | 2k | 1b | Original annotation in NCBI is ambiguous and is called 2k/1b. Best BLAST hit in GenBank is HQ537006.1 (95% identity). HQ537006 is annotated as 2k/1b. LA annotation is ambiguous. Full-gene phylogenetic analysis shows 1b. |
Comments provide the rationale for concluding that the discordant genotypes are mis-annotated in the LA HCV DB. The genotypes of all entries were confirmed by phylogenetic analysis using full gene sequences.
Genotyping prediction accuracy of 250-bp windows of NS3, NS5A and NS5B subregions.
| Start on NS3 | End on NS3 | NS3 Start on HCV | NS3 Genotyping Accuracy (%) | Start on NS5A | End on NS5A | NS5A Start on HCV | NS5A Genotyping Accuracy (%) | Start on NS5B | End on NS5B | NS5B Start on HCV | NS5B Genotyping Accuracy (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 250 | 3421 | 90.6 | 1 | 250 | 6259 | 100.0 | 1 | 250 | 7603 | 99.8 |
| 101 | 350 | 3521 | 86.5 | 101 | 350 | 6359 | 94.3 | 101 | 350 | 7703 | 100.0 |
| 201 | 450 | 3621 | 85.4 | 201 | 450 | 6459 | 83.3 | 201 | 450 | 7803 | 98.8 |
| 301 | 550 | 3721 | 97.2 | 301 | 550 | 6559 | 85.8 | 301 | 550 | 7903 | 98.6 |
| 401 | 650 | 3821 | 99.6 | 401 | 650 | 6659 | 86.5 | 401 | 650 | 8003 | 99.5 |
| 501 | 750 | 3921 | 99.1 | 501 | 750 | 6759 | 99.7 | 501 | 750 | 8103 | 98.8 |
| 601 | 850 | 4021 | 94.6 | 601 | 850 | 6859 | 90.0 | 601 | 850 | 8203 | 100.0 |
| 701 | 950 | 4121 | 100.0 | 701 | 950 | 6959 | 92.2 | 701 | 950 | 8303 | 97.2 |
| 801 | 1050 | 4221 | 99.8 | 801 | 1050 | 7059 | 84.1 | 801 | 1050 | 8403 | 98.4 |
| 901 | 1150 | 4321 | 99.8 | 901 | 1150 | 7159 | 84.5 | 901 | 1150 | 8503 | 98.1 |
| 1001 | 1250 | 4421 | 92.3 | 1001 | 1250 | 7259 | 91.2 | 1001 | 1250 | 8603 | 99.1 |
| 1101 | 1350 | 4521 | 86.1 | 1101 | 1350 | 7359 | 88.4 | 1101 | 1350 | 8703 | 90.0 |
| 1201 | 1450 | 4621 | 89.5 | | | | | 1201 | 1450 | 8803 | 95.1 |
| 1301 | 1550 | 4721 | 92.3 | | | | | 1301 | 1550 | 8903 | 91.1 |
| 1401 | 1650 | 4821 | 98.1 | | | | | 1401 | 1650 | 9003 | 81.1 |
| 1501 | 1750 | 4921 | 94.8 | | | | | 1501 | 1750 | 9103 | 97.9 |
| 1601 | 1850 | 5021 | 83.2 | | | | | 1601 | 1850 | 9203 | 97.4 |
| | | | | | | | | 1701 | 1950 | 9303 | 91.4 |
A 250-bp window was tiled through NS3, NS5A and NS5B to identify the regions with the best ability to predict HCV genotype using the validation set. Consecutive windows start 100 bp apart, so adjacent windows overlap by 150 bp. See Materials and Methods for details.
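The window coordinates in the table can be reproduced with a short Python sketch. The function names are illustrative, and the gene start positions on the HCV genome are simply read off the first row of the table (3421, 6259, 7603). The gene lengths passed in the usage example are the last window ends shown in the table, not full gene lengths.

```python
# Genome coordinate of gene position 1, taken from the first row of the table.
GENE_START_ON_HCV = {"NS3": 3421, "NS5A": 6259, "NS5B": 7603}

def tile_windows(region_length, size=250, step=100):
    """Return (start, end) pairs, 1-based and inclusive, for every full window."""
    windows = []
    start = 1
    while start + size - 1 <= region_length:
        windows.append((start, start + size - 1))
        start += step
    return windows

def to_genome_start(gene, window_start):
    """Convert a 1-based window start on a gene to its HCV genome coordinate."""
    return GENE_START_ON_HCV[gene] + window_start - 1

if __name__ == "__main__":
    # Last window ends from the table: NS3 1850, NS5A 1350, NS5B 1950.
    for gene, last_end in [("NS3", 1850), ("NS5A", 1350), ("NS5B", 1950)]:
        for start, end in tile_windows(last_end):
            print(gene, start, end, to_genome_start(gene, start))
```

With a 250-bp window and a 100-bp step, this yields 17 windows for NS3, 12 for NS5A and 18 for NS5B, matching the number of populated rows per gene in the table.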
Fig 2. HCV genome conservation map.
The entire HCV genome is displayed along the x-axis. The locations of the NS3, NS5A and NS5B genes are indicated. The highest conservation score is 2 (100% conservation), as described in the Materials and Methods. Subregions with the best predictive power tend to lie in hypervariable regions; subregions with the worst predictive power tend to lie in relatively conserved regions (green arrows mark good predictors; red arrows mark poor predictors).
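A maximum score of 2 for a fully conserved position is what one would get from a per-column information content measured in bits over the four nucleotides (2 minus the Shannon entropy). The Materials and Methods are not reproduced here, so the sketch below is only an assumption about how such a score could be computed, not the authors' definition; the function name `conservation_score` is hypothetical.

```python
import math
from collections import Counter

def conservation_score(column):
    """Information content (bits) of one alignment column.

    ASSUMPTION: score = 2 - Shannon entropy over A/C/G/T, so that a fully
    conserved column scores 2 and a uniformly mixed column scores 0.
    Gaps and ambiguity codes are ignored.
    """
    bases = [b for b in column.upper() if b in "ACGT"]
    if not bases:
        return 0.0
    n = len(bases)
    entropy = -sum((c / n) * math.log2(c / n) for c in Counter(bases).values())
    return 2.0 - entropy

def conservation_map(aligned_sequences):
    """Score every column of a list of equal-length aligned sequences."""
    return [conservation_score("".join(col)) for col in zip(*aligned_sequences)]
```

Under this assumption, a column that reads `AAAA` scores 2, `AACC` scores 1, and `ACGT` scores 0.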
Fig 3. Workflow for HCV genotyping on NGS reads.
Prediction of HCV genotypes from NGS data generated from HCV-containing plasma.
| Sample | Total Reads | Total Combined Reads | # of Reads Q>30, Length>250 | # of HCV Reads | Predictor Used | # of Reads Covering Predictor | # of GT1 Predicted | # of GT3 Predicted | Observed GT1 Freq by Predictor (%) | Observed GT1 Freq (%) | Expected GT1 Freq (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| HCV-S3 | 2063072 | 1409556 | 77167 | 4504 (5.8%) | NS3 (701–950) | 70 | 67 | 3 | 95.7 | 96.6 | 90 |
| HCV-S3 | | | | | NS5A (1–250) | 44 | 43 | 1 | 97.7 | | |
| HCV-S3 | | | | | NS5B (101–350) | 34 | 33 | 1 | 97.1 | | |
| HCV-S4 | 1218965 | 887188 | 63458 | 5533 (8.7%) | NS3 (701–950) | 69 | 64 | 5 | 92.8 | 92.4 | 90 |
| HCV-S4 | | | | | NS5A (1–250) | 171 | 159 | 12 | 93.0 | | |
| HCV-S4 | | | | | NS5B (101–350) | 101 | 92 | 9 | 91.1 | | |
| HCV-S5 | 2014690 | 1466363 | 73403 | 6999 (9.5%) | NS3 (701–950) | 60 | 33 | 27 | 55.0 | 67.4 | 50 |
| HCV-S5 | | | | | NS5A (1–250) | 159 | 123 | 36 | 77.4 | | |
| HCV-S5 | | | | | NS5B (101–350) | 124 | 75 | 49 | 60.5 | | |
| HCV-S6 | 2930073 | 2309270 | 153118 | 7914 (5.2%) | NS3 (701–950) | 88 | 51 | 37 | 58.0 | 69.2 | 50 |
| HCV-S6 | | | | | NS5A (1–250) | 131 | 96 | 35 | 73.3 | | |
| HCV-S6 | | | | | NS5B (101–350) | 47 | 37 | 10 | 78.7 | | |
HCV plasma samples of genotypes 1b and 3a were mixed at a 90:10 ratio for samples HCV-S3 and HCV-S4, and at a 50:50 ratio for samples HCV-S5 and HCV-S6.