Ping Qiu, Xiao-Yan Cai, Wei Ding, Qing Zhang, Ellie D Norris, Jonathan R Greene.
Abstract
The genotype of Hepatitis C Virus (HCV) strains is an important determinant of the severity and aggressiveness of liver infection as well as of patient response to antiviral therapy. Fast and accurate determination of viral genotype could provide direction in the clinical management of patients with chronic HCV infections. Using publicly available HCV nucleotide sequences, we built a global Position Weight Matrix (PWM) for the HCV genome. Based on the PWM, a set of genotype-specific nucleotide sequence "signatures" was selected from the 5' NCR, CORE, E1, and NS5B regions of the HCV genome. We evaluated the power of these signatures to predict the most common HCV genotypes and subtypes. Signatures selected from the NS5B and E1 regions generally discriminated the major HCV genotypes and subtypes better than those from the 5' NCR and CORE regions. Two discriminant methods were used to build predictive models. Through 10-fold cross-validation, over 99% prediction accuracy was achieved using both support vector machine (SVM) and random forest based classification methods in a dataset of 1134 sequences for NS5B and 947 sequences for E1. Prediction accuracy for each genotype is also reported.
Year: 2009 PMID: 19586537 PMCID: PMC2720937 DOI: 10.1186/1423-0127-16-62
Source DB: PubMed Journal: J Biomed Sci ISSN: 1021-7770 Impact factor: 8.410
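The abstract describes building a position weight matrix from aligned HCV sequences. As a minimal sketch of that idea, the snippet below computes per-position base frequencies from a toy alignment and derives a consensus string. The function names (`position_weight_matrix`, `consensus`), the handling of non-ACGT characters, and the toy sequences are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter

BASES = "ACGT"


def position_weight_matrix(aligned_seqs):
    """Return one {base: frequency} dict per alignment column.

    Assumes all sequences are already aligned to the same length.
    Characters other than A, C, G, T (gaps, ambiguity codes) are ignored.
    """
    if not aligned_seqs:
        raise ValueError("need at least one sequence")
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("sequences must share the same aligned length")

    pwm = []
    for column in zip(*aligned_seqs):
        counts = Counter(b.upper() for b in column if b.upper() in BASES)
        total = sum(counts.values())
        pwm.append({b: (counts[b] / total if total else 0.0) for b in BASES})
    return pwm


def consensus(pwm):
    """Most frequent base at each position (ties resolved in ACGT order)."""
    return "".join(max(BASES, key=lambda b: col[b]) for col in pwm)


if __name__ == "__main__":
    toy_alignment = ["ACGT", "ACGA", "ACCT", "AGGT"]  # hypothetical data
    matrix = position_weight_matrix(toy_alignment)
    print(consensus(matrix))
    print(matrix[1])
```

Positions where one base dominates in some genotypes but not in others are natural candidates for the genotype-specific signatures the abstract mentions.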
Sub regions selected for analysis in this study.
| Genome Region | Range on D90208 | Region Selected | # of Sequences |
| 5' NCR | 1–329 | 73–298 | 611 |
| CORE | 330–889 | 330–700 | 498 |
| E1 | 900–1475 | 900–1475 | 947 |
| NS5B | 7587–9413 | 8200–8600 | 1134 |
The sub regions were selected to maximize the sequence record coverage of each genotype and the sizes were limited to the length of one sequencing read (~500 bp).
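The selected ranges in the table above are positions on the D90208 reference. A small sketch of how such 1-based, inclusive coordinates map to Python slices follows. It assumes the query sequence is already in D90208 coordinates with no gaps, and the helper names are hypothetical.

```python
# Selected sub regions from the table, as 1-based inclusive (start, end)
# positions on the D90208 reference.
SELECTED_REGIONS = {
    "5' NCR": (73, 298),
    "CORE": (330, 700),
    "E1": (900, 1475),
    "NS5B": (8200, 8600),
}


def region_length(start, end):
    """Length in bases of a 1-based inclusive range."""
    return end - start + 1


def extract_region(sequence, start, end):
    """Slice a sequence using 1-based inclusive coordinates.

    Assumes `sequence` is in reference coordinates (no gaps or offsets).
    """
    return sequence[start - 1:end]


if __name__ == "__main__":
    for name, (start, end) in SELECTED_REGIONS.items():
        print(name, region_length(start, end))
```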
Number of HCV sequences used in the study for each genotype.
| Genotype | # of Sequences |
| 1a | 1667 |
| 1b | 5845 |
| 2a | 198 |
| 2b | 406 |
| 2c | 222 |
| 3a | 591 |
| 3b | 168 |
| 4 | 542 |
| 5 | 148 |
| 6 | 227 |
Average error rates over 100 runs on features from four HCV genome regions using two different classification algorithms.
| Classification Method | 5' NCR | CORE | E1 | NS5B |
| SVM | 21.98 | 19.66 | 1.60 | 0.21 |
| Random Forest | 24.28 | 3.98 | 0.56 | 0.19 |
Each error rate is an average over 100 runs: a cross-validation procedure that trains on 90% of the data and tests on the remaining 10% was repeated 100 times, and the resulting errors were averaged.
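The averaging procedure described in this note can be sketched as a repeated 90/10 hold-out loop. The example below is only an illustration: it swaps in a simple nearest-neighbour classifier on Hamming distance as a stand-in for the SVM and random forest models, and the function names, seed, and toy data are assumptions.

```python
import random


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def nearest_neighbor_predict(train, query):
    """Stand-in classifier (NOT SVM or random forest): label of closest sample."""
    return min(train, key=lambda item: hamming(item[0], query))[1]


def repeated_holdout_error(samples, runs=100, test_fraction=0.10, seed=0):
    """Average percent error over `runs` random train/test splits.

    `samples` is a list of (signature_string, genotype_label) pairs.
    """
    rng = random.Random(seed)
    n_test = max(1, round(len(samples) * test_fraction))
    errors = []
    for _ in range(runs):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        test, train = shuffled[:n_test], shuffled[n_test:]
        wrong = sum(nearest_neighbor_predict(train, x) != y for x, y in test)
        errors.append(100.0 * wrong / len(test))
    return sum(errors) / len(errors)


if __name__ == "__main__":
    # Hypothetical signatures: two perfectly separable toy classes.
    toy = [("AAAA", "1a")] * 10 + [("CCCC", "1b")] * 10
    print(repeated_holdout_error(toy))
```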
Figure 1. Average classification error rate (percent) over 100 runs on different genotypes from 10-fold cross-validation.
HCV genotype prediction accuracy on an independent data set (results are reported only for models built on NS5B and E1).
| Genotype | E1 SVM SN | E1 SVM SP | E1 SVM AC | E1 RF SN | E1 RF SP | E1 RF AC | NS5B SVM SN | NS5B SVM SP | NS5B SVM AC | NS5B RF SN | NS5B RF SP | NS5B RF AC |
| 1a | 98.9 | 98.3 | 98.8 | 98.4 | 96.7 | 97.4 | 100 | 100 | 100 | 100 | 100 | 100 |
| 1b | 94.8 | 99.7 | 98.8 | 100 | 99.7 | 98.2 | 99.4 | 100 | 99.8 | 99.4 | 99.3 | 99.3 |
| 2a | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 75 | 100 | 99.8 |
| 2b | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| 2c | 100 | 100 | 100 | 55.6 | 99.8 | 99 | 100 | 100 | 100 | 93.3 | 100 | 99.8 |
| 3a | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99.8 | 99.8 | 100 | 99.8 | 99.9 |
| 3b | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| 4 | 100 | 99.8 | 99.8 | 90.4 | 100 | 99 | 100 | 100 | 100 | 100 | 100 | 100 |
| 5 | 100 | 100 | 100 | 100 | 100 | 100 | 96.3 | 100 | 99.8 | 96.3 | 100 | 99.8 |
| 6 | 100 | 100 | 100 | 84.6 | 100 | 98.4 | 100 | 99.8 | 99.8 | 80 | 100 | 99.8 |
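The table does not expand SN, SP, and AC. Assuming they stand for sensitivity, specificity, and accuracy computed per genotype in a one-vs-rest fashion, the sketch below shows how such percentages could be derived from true and predicted labels. The function name and the toy labels are hypothetical.

```python
def one_vs_rest_metrics(true_labels, predicted_labels, genotype):
    """Return (sensitivity, specificity, accuracy) in percent for one genotype.

    Assumption: every other genotype is treated as the negative class.
    """
    tp = fp = tn = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == genotype and pred == genotype:
            tp += 1
        elif truth == genotype:
            fn += 1
        elif pred == genotype:
            fp += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    sensitivity = 100.0 * tp / (tp + fn) if tp + fn else float("nan")
    specificity = 100.0 * tn / (tn + fp) if tn + fp else float("nan")
    accuracy = 100.0 * (tp + tn) / total if total else float("nan")
    return sensitivity, specificity, accuracy


if __name__ == "__main__":
    truth = ["1a", "1a", "1b", "1b", "2a"]      # hypothetical labels
    predicted = ["1a", "1b", "1b", "1b", "2a"]  # hypothetical predictions
    print(one_vs_rest_metrics(truth, predicted, "1a"))
```

Under this reading, a rare genotype can show low sensitivity while specificity and accuracy stay high, because the negative class is much larger than the positive class.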
Suggested primer stretches (for sequencing and PCR) based on HCV whole genome PWM for analyzing signature nucleotides selected for NS5B and E1 region.
| Region | Forward Start | Forward End | Forward Conservation Score (%) | Forward Primer Sequence | Reverse Start | Reverse End | Reverse Conservation Score (%) | Reverse Primer Sequence |
| NS5B | 8050 | 8074 | 93.2 | AGCCAGCTCGCCTTATCGTATTCCC | 8629 | 8605 | 94.5 | GCGGAATACCTGGTCATAGCCTCCG |
| NS5B | 8083 | 8107 | 89.1 | GGGTTCGTGTGTGCGAGAAGATGGC | 8800 | 8776 | 91.1 | ACTGGAGTGTGTCTAGCTGTCTCCC |
| NS5B | 8082 | 8106 | 89.0 | GGGGTTCGTGTGTGCGAGAAGATGG | 8634 | 8610 | 89.7 | GGGGGGCGGAATACCTGGTCATAGC |
| NS5B | 8125 | 8149 | 85.9 | CCACCCTTCCTCAGGCCGTGATGGG | 8633 | 8609 | 89.7 | GGGGGCGGAATACCTGGTCATAGCC |
| NS5B | 8124 | 8148 | 84.3 | TCCACCCTTCCTCAGGCCGTGATGG | | | | |
| E1 | 709 | 733 | 94.1 | CATGCGGCTTCGCCGACCTCATGGG | 1612 | 1588 | 89.3 | TTCAGGGCAGTCCTGTTGATGTGCC |
| E1 | 708 | 732 | 94.0 | ACATGCGGCTTCGCCGACCTCATGG | 1605 | 1581 | 89.3 | CAGTCCTGTTGATGTGCCAGCTGCC |
| E1 | 733 | 757 | 93.0 | GGTACATTCCGCTCGTCGGCGCCCC | 1629 | 1605 | 83.2 | TGAGGCTGTCATTGCAGTTCAGGGC |
| E1 | 821 | 845 | 91.2 | TGCAACAGGGAACCTTCCTGGTTGC | | | | |
To ensure optimal polymerization, the 3' end and the penultimate position were each required to be G or C with a frequency of ≥0.98, and the next upstream position (3' -2) was required to be either G or C with a frequency of ≥0.90 or, alternatively, A or T with a frequency of ≥0.95.
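The 3'-end rule in this note can be expressed as a filter over PWM columns. The sketch below takes one possible reading of the rule, in which the most frequent base at each position must meet the stated threshold. The function name and the input format (a list of base-to-frequency dicts ordered 5' to 3' along the primer strand) are assumptions for illustration.

```python
def passes_three_prime_rule(pwm_window):
    """Check the 3'-end constraints on a candidate primer stretch.

    `pwm_window` holds one {base: frequency} dict per primer position,
    ordered 5' -> 3'. Assumed reading of the rule: the dominant base at the
    last two positions is G or C with frequency >= 0.98, and the dominant
    base at the third-from-last position is G/C (>= 0.90) or A/T (>= 0.95).
    """
    if len(pwm_window) < 3:
        raise ValueError("need at least three positions")

    def dominant(column):
        base = max(column, key=column.get)
        return base, column[base]

    last = dominant(pwm_window[-1])
    penultimate = dominant(pwm_window[-2])
    upstream = dominant(pwm_window[-3])

    strong_end = all(b in "GC" and f >= 0.98 for b, f in (last, penultimate))
    upstream_ok = (upstream[0] in "GC" and upstream[1] >= 0.90) or (
        upstream[0] in "AT" and upstream[1] >= 0.95
    )
    return strong_end and upstream_ok


def toy_column(base, freq):
    """Hypothetical helper: a column where `base` has frequency `freq`."""
    rest = (1.0 - freq) / 3
    return {b: (freq if b == base else rest) for b in "ACGT"}


if __name__ == "__main__":
    window = [toy_column("T", 0.96), toy_column("G", 0.99), toy_column("C", 0.99)]
    print(passes_three_prime_rule(window))
```

For reverse primers, the window would need to be taken from the reverse-complement strand before applying the same check.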