Xin Bai1,2,3, Jian-An Jia4,5, Meng Fang4, Shipeng Chen4, Xiaotao Liang2,6, Shanfeng Zhu6, Shuqin Zhang1,2,7, Jianfeng Feng1,2,8, Fengzhu Sun1,2,3, Chunfang Gao4.
Abstract
Hepatitis B virus (HBV) infection is common worldwide, especially in China. In regions with high HBV prevalence, 60-80% of hepatocellular carcinoma (HCC) cases can be attributed to HBV infection. Although traditional Sanger sequencing has been used extensively to investigate HBV sequences, next-generation sequencing (NGS) is becoming more common. However, it is unknown whether word pattern frequencies of HBV reads from NGS can be used to investigate HBV genotypes and predict HCC status. In this study, we used NGS to sequence the pre-S region of the HBV genome in 94 HCC patients and 45 individuals with chronic HBV (CHB) infection. Word pattern frequencies in the sequence data of each individual were calculated and compared between individuals using the Manhattan distance. The individuals were grouped using principal coordinate analysis (PCoA) and hierarchical clustering. Word pattern frequencies were also used to build prediction models for HCC status using both K-nearest neighbors (KNN) and support vector machine (SVM) classifiers. We show that word patterns are a powerful tool for analyzing HBV sequences. One key finding is that the first principal coordinate of the PCoA was highly associated with the fraction of genotype B (or C) sequences, and the second principal coordinate was significantly associated with the probability of having HCC. Hierarchical clustering first groups the individuals according to their major genotypes and then according to their HCC status. Using cross-validation, the area under the receiver operating characteristic curve (AUC) was around 0.88 for KNN and 0.92 for SVM. In an independent data set of 46 HCC patients and 31 CHB individuals, SVM achieved an AUC of 0.77. We further show that 3000 reads per individual yield stable prediction results for SVM. Thus, another key finding is that word patterns can be used to predict HCC status with high accuracy.
Therefore, our study shows clearly that word pattern frequencies of HBV sequences contain much information about the composition of different HBV genotypes and the HCC status of an individual.
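The word-pattern comparison described in the abstract can be illustrated with a minimal sketch. It assumes that a "word pattern" is a DNA substring of length k, that counts are pooled over all reads of one individual and normalized to relative frequencies, and that words containing ambiguous bases are skipped. The function names `word_frequencies` and `manhattan` are hypothetical, and the authors' exact counting scheme (for example, whether reverse complements are included) is not stated here.

```python
from collections import Counter
from itertools import product


def word_frequencies(reads, k):
    """Relative frequencies of all 4**k DNA words of length k, pooled over reads.

    Assumption: words containing characters other than A, C, G, T are skipped.
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            word = read[i:i + k]
            if set(word) <= set("ACGT"):
                counts[word] += 1
    total = sum(counts.values())
    # Fixed lexicographic word order so vectors from different samples align.
    words = ("".join(w) for w in product("ACGT", repeat=k))
    return [counts[w] / total if total else 0.0 for w in words]


def manhattan(p, q):
    """Manhattan (L1) distance between two equal-length frequency vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


if __name__ == "__main__":
    sample_a = word_frequencies(["ACGT"], k=1)
    sample_b = word_frequencies(["AAAA"], k=1)
    print(manhattan(sample_a, sample_b))
```

Computing `manhattan` for every pair of individuals gives the distance matrix that PCoA and hierarchical clustering take as input.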
Year: 2018 PMID: 29474353 PMCID: PMC5841821 DOI: 10.1371/journal.pgen.1007206
Source DB: PubMed Journal: PLoS Genet ISSN: 1553-7390 Impact factor: 5.917
Fig 1. Fraction of genotype B among the 94 HCC patients and 45 CHB patients.
(a) Histograms of the fraction of genotype B based on STAR. (b) The relationship between the ratio of the fraction of HCC individuals in the bin over that of the CHB individuals and the fraction of genotype B sequences based on STAR.
Fig 2. PCoA plot based on the 94 HCC patients and 45 CHB individuals.
The distance matrix is based on the Manhattan distance between the frequency vectors of word patterns of length (a) k = 6 and (b) k = 8, respectively. Color shows the fractions of genotypes B and C based on the STAR genotyping results: red represents 100% genotype B and blue represents 100% genotype C. Reference sequences for genotypes B and C are also plotted. (c, d) The relationship between the first principal coordinate and the fraction of genotype B for k = 6 and k = 8, respectively. (e, f) The relationship between the second coordinate and the ratio of the fraction of CHB individuals in the bin over that of the HCC individuals for k = 6 and k = 8, respectively.
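To make the PCoA step concrete, the following is a minimal pure-Python sketch of how a first principal coordinate can be obtained from a distance matrix: square the distances, apply Gower double-centering, and take the dominant eigenvector scaled by the square root of its eigenvalue. The function name `first_principal_coordinate` is hypothetical, and using power iteration is an assumption made to keep the example dependency-free; it is not necessarily how the authors computed their coordinates. The sign of a principal coordinate is arbitrary.

```python
import math


def first_principal_coordinate(dist, iterations=500):
    """First PCoA axis of a symmetric distance matrix given as a list of lists.

    Caveat: power iteration finds the eigenvalue of largest magnitude, which
    is assumed here to be positive (true for Euclidean-embeddable distances).
    """
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # Gower double-centering: B = -1/2 * J * D^2 * J (D symmetric).
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    # Non-constant start vector: the all-ones vector lies in the null space of B.
    v = [float(i + 1) for i in range(n)]
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
        eigenvalue = norm
    return [math.sqrt(eigenvalue) * x for x in v]


if __name__ == "__main__":
    # Three points on a line at positions 0, 1 and 3.
    toy = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
    print(first_principal_coordinate(toy))
```

For points that lie on a single line, as in the toy matrix, the first coordinate reproduces the original pairwise distances exactly.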
Spearman and Pearson correlation coefficients between the first principal coordinate and the fraction of genotype B for the 94 HCC patients and 45 CHB individuals.
Different word lengths are used for computing the Manhattan distance.
| Correlation | |||||||
|---|---|---|---|---|---|---|---|
| Spearman | -0.39 | 0.20 | 0.38 | 0.80 | 0.89 | 0.94 | 0.94 |
| Pearson | -0.37 | 0.19 | 0.42 | 0.92 | 0.97 | 0.97 | 0.97 |
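The two coefficients in the table above measure different things: Pearson captures linear association, while Spearman is the Pearson correlation of the ranks and so captures any monotone association. A minimal sketch follows, with hypothetical function names and average ranks for ties; the toy data are illustrative and are not the study's values.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    x = [1, 2, 3, 4]
    y = [1, 4, 9, 16]  # monotone but not linear in x
    print(spearman(x, y), pearson(x, y))
```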
Fig 3. Hierarchical clustering results of the HCC and CHB samples from the first data set.
Branches are shown in four colors: red for HCC samples with genotype C dominant, yellow for HCC samples with genotype B dominant, green for CHB samples with genotype C dominant, and blue for CHB samples with genotype B dominant. A genotype is dominant in a sample if its fraction is the highest among all genotypes.
Distribution of patients according to genotype fraction and clusters.
Number of overlaps between the clusters (I and II) and groups of individuals with dominant genotypes B and C, respectively.
| | Cluster I | Cluster II |
|---|---|---|
| Genotype B dominant | 38 | 1 |
| Genotype C dominant | 6 | 94 |
Prediction results from KNN using different word length k.
*CV: cross validation.
| Word length | |||||||
|---|---|---|---|---|---|---|---|
| CV mean AUC | 0.86 | 0.87 | 0.87 | 0.88 | 0.88 | 0.89 | 0.89 |
| Predicting AUC | 0.62 | 0.64 | 0.66 | 0.65 | 0.67 | 0.67 | 0.67 |
| Optimal K | 15 | 10 | 5 | 5 | 5 | 5 | 5 |
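As a rough illustration of the KNN classifier behind the table above, the sketch below scores a query sample by the fraction of its K nearest training samples that are labelled HCC, using the Manhattan distance between frequency vectors. The function name `knn_score`, the 0/1 label encoding, and the use of the neighbor fraction as the score are assumptions for illustration; the toy vectors are not real word frequencies.

```python
def knn_score(train_vectors, train_labels, query, k):
    """Fraction of the k nearest training samples whose label is 1.

    Distances are Manhattan (L1). Assumes k >= 1 and a non-empty training set.
    """
    ranked = sorted(
        (sum(abs(a - b) for a, b in zip(vec, query)), label)
        for vec, label in zip(train_vectors, train_labels)
    )
    nearest = ranked[:k]
    return sum(label for _, label in nearest) / len(nearest)


if __name__ == "__main__":
    vectors = [[0, 0], [0, 1], [5, 5], [6, 5]]
    labels = [0, 0, 1, 1]  # hypothetical encoding: 1 = HCC, 0 = CHB
    print(knn_score(vectors, labels, [5, 6], k=3))
```

Such a score can be thresholded to give a class call, or ranked across individuals to compute an AUC.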
Prediction results from SVM using different word length k.
*CV: cross validation.
| Word length | |||||||
|---|---|---|---|---|---|---|---|
| CV mean AUC | 0.86 | 0.90 | 0.91 | 0.93 | 0.93 | 0.93 | 0.92 |
| Predicting AUC | 0.65 | 0.77 | 0.72 | 0.70 | 0.70 | 0.70 | 0.70 |
| Optimal | 16384 | 16384 | 32768 | 32768 | 32768 | 32768 | 16384 |
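The AUC values reported for KNN and SVM have a simple interpretation: the AUC is the probability that a randomly chosen positive (here, HCC) individual receives a higher score than a randomly chosen negative (CHB) individual, with ties counted as one half. A minimal sketch of that pairwise definition follows; the function name `auc` and the 0/1 label encoding are assumptions, and the toy scores are illustrative only.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for positive, 0 for negative. Ties contribute 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    print(auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0]))
```

The pairwise form is quadratic in the number of samples, which is adequate for cohorts of the size described here.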
Fig 4. Boxplots of the relationship between AUC values and the number of reads using different word length k for SVM.
For each word length k and number of reads N, there are 200 random replicates and AUC values.
Fig 5. Boxplots of the relationship between AUC values and the number of reads using different word length k for KNN.
For each word length k and number of reads N, there are 200 random replicates and AUC values.
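The read-depth experiment summarized in Figs 4 and 5 can be sketched as repeated random subsampling: for a given N, draw N reads from an individual's read set, repeat for a number of replicates, and evaluate the classifier on each replicate. The sketch below covers only the subsampling step. The function name `subsample_replicates`, sampling without replacement, and the fixed seed are assumptions; the source does not say how the replicates were drawn.

```python
import random


def subsample_replicates(reads, n_reads, n_replicates=200, seed=0):
    """Draw n_replicates random subsets of n_reads reads each.

    Assumption: sampling is without replacement within a replicate, so
    n_reads must not exceed len(reads).
    """
    rng = random.Random(seed)  # fixed seed makes the replicates reproducible
    return [rng.sample(reads, n_reads) for _ in range(n_replicates)]


if __name__ == "__main__":
    toy_reads = [f"read{i}" for i in range(10)]
    replicates = subsample_replicates(toy_reads, n_reads=4, n_replicates=3)
    for replicate in replicates:
        print(replicate)
```

Each replicate would then be converted to a word frequency vector and scored, giving one AUC value per replicate for the boxplots.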