Roberto Bruni, Mattia Prosperi, Cinzia Marcantonio, Alessandra Amadori, Umbertina Villano, Elena Tritarelli, Alessandra Lo Presti, Massimo Ciccozzi, Anna R Ciccaglione.
Abstract
BACKGROUND: Occult Hepatitis B Infection (OBI) is characterized by the absence of serum HBsAg and the persistence of HBV-DNA in liver tissue, with low to undetectable serum HBV-DNA. The mechanisms underlying OBI remain to be clarified. To evaluate whether specific point mutations of the HBV genome may be associated with OBI, we applied an approach based on bioinformatics analysis of complete HBV genome sequences. In addition, we evaluated the feasibility of bioinformatics prediction models for classifying HBV infections as OBI or non-OBI from molecular data.
Year: 2011 PMID: 21824402 PMCID: PMC3170640 DOI: 10.1186/1743-422X-8-394
Source DB: PubMed Journal: Virol J ISSN: 1743-422X Impact factor: 4.099
Summary of sequences included in OBI and non-OBI datasets, by country and genotype
| Country | OBI: total patients | OBI: genotype A | OBI: genotype C | OBI: genotype D | OBI: genotype E | non-OBI: total patients | non-OBI: genotype A | non-OBI: genotype C | non-OBI: genotype D | non-OBI: genotype E |
|---|---|---|---|---|---|---|---|---|---|---|
| China | 8 | | 8 | | | 13 | | 13 | | |
| Spain | 2 | | | 2 | | 1 | | | 1 | |
| Ghana | 9 | | | | 9 | 9 | | | | 9 |
| France | 1 | | | 1 | | 3 | 2 | | 1 | |
| Italy | 13 | 1 | | 12 | | 7 | | | 7 | |
| India | 8 | 4 | | 4 | | 7 | 4 | | 3 | |
| others | | | | | | 122 | 19 | 27 | 58 | 18 |
| Total | 41 | 5 | 8 | 19 | 9 | 162 | 25 | 40 | 70 | 27 |
Nucleotide positions significantly enriched in OBI dataset by univariable analysis
| Nucleotide positions* | Adjusted p-value | Significant nt changes in OBI | Amino acid change: polymerase | Amino acid change: surface (pre-S2) | Amino acid change: surface (S) | Amino acid change: core | Affected HBV protein** |
|---|---|---|---|---|---|---|---|
| 78 | 0.0144 | G → A | no change | Gly → Glu | | | L, M |
| 233 | 0.0367 | A → G | His → Arg (RT domain) | | Thr → Ala | | P, L, M, S |
| 418 | 0.0367 | G → T | Ala → Ser (RT domain) | | no change | | P |
| 2240 *** | 0.0097 | A → G | | | | Thr → Val (T helper epitope) *** | C |
| 2241 *** | 0.0144 | C → T | | | | Thr → Val (T helper epitope) *** | C |
| 2435 | 0.0367 | C → A | no change | | | Gln → Lys (Arg-rich C-terminus) | C |
| 2485 | 0.0367 | A → G | Asn → Ser (TP domain) | | | | P |
* positions in the well-annotated reference HBV sequence Acc.No. AM282986
** L, M, S: Large, Middle and Small S, respectively; P: Polymerase; C: Core
*** nucleotides 2240 and 2241 co-varied in all observed cases, substituting ACT with GTT or ACA with GTA codons.
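The co-variation at nucleotides 2240 and 2241 matters because the two substitutions fall in the same codon. The sketch below is only an illustration: it uses a small excerpt of the standard genetic code, and the helper names `substitute` and `single_vs_double` are hypothetical. It shows that either substitution alone would encode a different amino acid from the Thr → Val change produced by both together.

```python
# Excerpt of the standard genetic code, limited to the codons needed here.
CODON_TO_AA = {
    "ACT": "Thr", "ACA": "Thr",
    "GCT": "Ala", "GCA": "Ala",
    "ATT": "Ile", "ATA": "Ile",
    "GTT": "Val", "GTA": "Val",
}


def substitute(codon, changes):
    """Return the codon after applying {codon_index: new_base} substitutions."""
    bases = list(codon)
    for index, base in changes.items():
        bases[index] = base
    return "".join(bases)


def single_vs_double(codon):
    """Compare each single substitution with the observed double substitution.

    Codon index 0 corresponds to nt 2240 (A -> G) and index 1 to nt 2241
    (C -> T), as implied by the ACT -> GTT and ACA -> GTA changes.
    """
    variants = {
        "wild_type": codon,
        "nt2240_only": substitute(codon, {0: "G"}),
        "nt2241_only": substitute(codon, {1: "T"}),
        "both": substitute(codon, {0: "G", 1: "T"}),
    }
    return {name: (c, CODON_TO_AA[c]) for name, c in variants.items()}


if __name__ == "__main__":
    for wild_type in ("ACT", "ACA"):
        print(wild_type, single_vs_double(wild_type))
```

Under the standard code, the A → G change alone would give Ala and the C → T change alone would give Ile; only the combination gives Val.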
Figure 1. Average Gini index as a measure of variable importance. Averaged Gini variable importance extracted from 30 random forest models trained on the full data set, compared with the average Gini values obtained by re-shuffling the class variable. Variables are ranked in decreasing order of importance. Only variables with a p-value < 0.05 are depicted.
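The idea of comparing an observed Gini value against values obtained after re-shuffling the class variable can be sketched in a simplified form. The example below is not the random forest procedure behind the figure: it computes the Gini impurity decrease of a single split on one alignment position and estimates a permutation p-value. The function names, the toy data, and the number of permutations are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def gini_impurity(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(feature, labels):
    """Impurity decrease obtained by splitting the samples on the feature value."""
    groups = defaultdict(list)
    for value, label in zip(feature, labels):
        groups[value].append(label)
    n = len(labels)
    weighted = sum(len(g) / n * gini_impurity(g) for g in groups.values())
    return gini_impurity(labels) - weighted


def permutation_p_value(feature, labels, n_permutations=1000, seed=0):
    """Observed decrease and the fraction of label shuffles that match or exceed it."""
    rng = random.Random(seed)
    observed = gini_decrease(feature, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        # Small tolerance guards against floating-point noise in ties.
        if gini_decrease(feature, shuffled) >= observed - 1e-12:
            hits += 1
    return observed, (hits + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Made-up toy data: base at one position and class of each sequence.
    bases = ["A"] * 10 + ["G"] * 10
    classes = ["OBI"] * 10 + ["non-OBI"] * 10
    print(permutation_p_value(bases, classes))
```

A random forest averages such decreases over many trees and splits, so this single-split version only conveys the logic of the shuffled-label baseline.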
Performance of different classification models
| input | model | feature selection | average AUC | average accuracy (%) | average TPR | average TNR |
|---|---|---|---|---|---|---|
| bases | | none | 0.847 (0.100) | 83.948 (5.024) | 0.220 (0.230) | 0.992 (0.021) |
| triplet | | none | 0.847 (0.097) | 83.952 (4.820) | 0.224 (0.222) | 0.990 (0.022) |
| bases | | Fisher | 0.699 (0.160) * | 81.462 (5.812) | 0.215 (0.226) | 0.962 (0.055) |
| triplet | | Fisher | 0.759 (0.127) | 81.781 (5.306) | 0.234 (0.219) | 0.961 (0.047) * |
| bases | | Fisher/AIC | 0.670 (0.134) * | 81.310 (6.036) | 0.283 (0.220) | 0.943 (0.054) * |
| triplet | | Fisher/AIC | 0.680 (0.147) * | 81.343 (6.878) | 0.324 (0.226) | 0.934 (0.058) * |
| bases | | none | 0.570 (0.137) * | 79.862 (5.904) | 0.143 (0.217) | 0.960 (0.059) |
| triplet | | none | 0.549 (0.106) * | 80.136 (4.850) * | 0.130 (0.203) | 0.967 (0.059) |
| bases | | none | 0.574 (0.094) * | 80.662 (5.054) | 0.190 (0.219) | 0.958 (0.072) |
| triplet | | none | 0.579 (0.110) * | 79.943 (5.442) | 0.215 (0.249) | 0.943 (0.074) |
* worse than RF on whole set of bases at p < 0.05
AUC: area under the receiver operating characteristic curve; RF: random forest; LR: logistic regression; DT: decision tree; RI: rule induction; TNR: true negative rate; TPR: true positive rate; AIC: Akaike information criterion.
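The relation between TPR, TNR and overall accuracy on an imbalanced data set can be made concrete with a short sketch. The function names are hypothetical, and the example assumes OBI is the positive class and uses the overall class sizes of 41 OBI and 162 non-OBI sequences; the per-fold sizes used in any cross-validation would differ.

```python
def confusion_metrics(true_labels, predicted_labels, positive="OBI"):
    """Compute TPR, TNR and accuracy from paired true and predicted labels."""
    tp = fp = tn = fn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == positive:
            if predicted == positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted == positive:
                fp += 1
            else:
                tn += 1
    total = tp + fp + tn + fn
    return {
        "TPR": tp / (tp + fn) if (tp + fn) else 0.0,
        "TNR": tn / (tn + fp) if (tn + fp) else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
    }


def accuracy_from_rates(tpr, tnr, n_positive, n_negative):
    """Accuracy implied by the two class-wise rates and the class sizes."""
    return (tpr * n_positive + tnr * n_negative) / (n_positive + n_negative)


if __name__ == "__main__":
    # Rates taken from the first row of the table; class sizes are the dataset totals.
    print(round(accuracy_from_rates(0.220, 0.992, 41, 162), 3))
```

With a TPR near 0.22 and a TNR near 0.99, the implied accuracy is roughly 0.84, which shows how a high accuracy can coexist with a low sensitivity when the negative class dominates.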
Figure 2. True positive . A higher area under the curve (AUC) means that a random isolate from the occult group is more likely to be classified as occult hepatitis than a random isolate from the wild-type group. RF: Random Forest; LR: Logistic Regression; DT: Decision Tree.
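The probabilistic reading of the AUC given in the caption can be computed directly from classifier scores. The sketch below is a minimal illustration with made-up scores: it counts, over all pairs of one occult and one wild-type isolate, how often the occult isolate receives the higher score, counting ties as one half. The function name `auc_from_scores` is an assumption.

```python
def auc_from_scores(positive_scores, negative_scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


if __name__ == "__main__":
    # Made-up scores: higher means "more likely occult" according to some model.
    occult = [0.9, 0.8, 0.4]
    wild_type = [0.7, 0.3, 0.2]
    print(auc_from_scores(occult, wild_type))
```

An AUC of 0.5 corresponds to scores that carry no ranking information, and 1.0 to a perfect separation of the two groups.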
Figure 3. Variation of the Small S protein of genotype D OBI and non-OBI sequences. A: variation along the complete Small S protein of genotype D OBI and non-OBI sequences, shown by entropy plots (see Materials and Methods). Entropy values (Hx) measure the variation at each amino acid position in the set of aligned sequences; they range from 0 (i.e. no variation) to 3.04 (i.e. all 20 possible amino acids and a gap occur with equal frequency at a position). B: amino acid variations in the MHR of HBsAg in 19 genotype D OBI sequences vs. a consensus from genotype D non-OBI sequences. Accession number and country of origin of each sequence are reported.
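The per-position entropy described in the caption can be sketched as a Shannon entropy over the symbols observed in each alignment column. The example assumes the natural logarithm and treats a gap as a 21st symbol, which is consistent with the stated maximum of 3.04 (approximately ln 21); the function names and the toy alignment are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (natural log) of the symbols in one alignment column."""
    n = len(column)
    return sum(-(count / n) * math.log(count / n) for count in Counter(column).values())


def entropy_profile(aligned_sequences):
    """Entropy at each position of a set of equal-length aligned sequences."""
    return [column_entropy(column) for column in zip(*aligned_sequences)]


if __name__ == "__main__":
    # Toy alignment: position 1 is conserved, position 2 varies, position 3 has a gap.
    alignment = ["MEN", "MDN", "ME-", "MEN"]
    print([round(h, 2) for h in entropy_profile(alignment)])

    # Upper bound: 20 amino acids plus a gap, each occurring once.
    all_symbols = list("ACDEFGHIKLMNPQRSTVWY-")
    print(round(column_entropy(all_symbols), 2))
```

A fully conserved column gives an entropy of 0, and a column in which all 21 symbols are equally frequent gives ln 21.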