Alexandre Lomsadze1, Tengguo Li2, Mangalathu S Rajeevan2, Elizabeth R Unger2, Mark Borodovsky1,3.
Abstract
We recently developed a test based on the Agilent SureSelect target enrichment system that captures genomic fragments from 191 human papillomavirus (HPV) types for Illumina sequencing. This enriched whole genome sequencing (eWGS) assay provides an approach to identify all HPV types in a sample. Here we present a machine learning algorithm that calls HPV types based on the eWGS output. The algorithm, based on the support vector machine (SVM) technique, was trained on eWGS data from 122 control samples with known HPV types. The new algorithm performed well in HPV type detection for designed samples with 25 or more HPV plasmid copies per sample. We compared the HPV typing results of the new algorithm for 261 residual epidemiologic samples with the results of the standard HPV Linear Array (LA). The agreement between methods (97.4%) was substantial (kappa = 0.783). However, the new algorithm additionally identified 428 instances of HPV types that the LA assay is not designed to detect. Overall, we have demonstrated that the bioinformatics pipeline is an accurate tool for calling HPV types by analyzing data generated by eWGS processing of DNA fragments extracted from control and epidemiological samples.
Keywords: HPV typing; HPV whole genome sequencing; bioinformatics pipeline; h classification; target enrichment
Year: 2020 PMID: 32629900 PMCID: PMC7412107 DOI: 10.3390/v12070710
Source DB: PubMed Journal: Viruses ISSN: 1999-4915 Impact factor: 5.048
Description of control/designed samples (shaded cells) and epidemiological samples used to generate Data Sets 1–4.
| Data Set 1 ( | | | Data Set 2 ( | | |
|---|---|---|---|---|---|
| Sample ID | Sample Description | Input/Reaction | Experiment 1 Sample ID | Sample Description | Input/Reaction |
| 1 | HPV-plasmid-45 | 50,000 copies | 1 | Pool of HPV plasmid-11,16,31,45,52 | 625 copies |
| 2 | HPV-plasmid-58 | 50,000 copies | 2 | Pool of HPV plasmid-11,16,31,45,52 | 125 copies |
| 3 | HPV-plasmid-31 | 50,000 copies | 3 | Pool of HPV plasmid-11,16,31,45,52 | 25 copies |
| 4 | HPV-plasmid-33 | 50,000 copies | 4 | Pool of HPV plasmid-11,16,31,45,52 | 5 copies |
| 5 | HPV-plasmid-52 | 50,000 copies | 5 | Pool of HPV plasmid-11,16,31,45,52 | 1 copy |
| 6 | HPV-plasmid-6 | 50,000 copies | 6 | Pool of HPV plasmid-6,18,33,58 | 625 copies |
| 7 | HPV-plasmid-18 | 50,000 copies | 7 | Pool of HPV plasmid-6,18,33,58 | 125 copies |
| 8 | HPV-plasmid-11 | 50,000 copies | 8 | Pool of HPV plasmid-6,18,33,58 | 25 copies |
| 9 | H2O (HPV negative) | 0 ng | 9 | Pool of HPV plasmid-6,18,33,58 | 5 copies |
| 10 | Placenta (HPV negative) | 100 ng | 10 | Pool of HPV plasmid-6,18,33,58 | 1 copy |
| 11 | CaSki (HPV-16) | 100 ng | 11 | HPV plasmid-16 | 10,000 copies |
| 12 | CaSki (HPV-16) | 10 ng | 12 | HPV plasmid-18 | 10,000 copies |
| 13 | SiHa (HPV-16) | 100 ng | 13 | H2O (HPV negative) | 0 ng |
| 14 | SiHa (HPV-16) | 10 ng | 14 | Placenta (HPV negative) | 100 ng |
| 15 | HeLa (HPV-18) | 100 ng | 15 | SiHa (HPV-16) | 10 ng |
| 16 | HeLa (HPV-18) | 10 ng | 16 | HeLa (HPV-18) | 10 ng |
| 17–31 | Epidemiological samples (50 extracts from male genital swab) | 100 ng | 17–32 | Replicate of 1–16 | |
| 32 | HPV-plasmid-16 | 50,000 copies | Experiment 2 Sample ID | Repeated as in Experiment 1 | |
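
| Data Set 3 | | | Data Set 4 | | |
|---|---|---|---|---|---|
| Sample ID | Sample Description | Input/Reaction | Sample ID | Sample Description | Input/Reaction |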
| 15 | H2O (HPV negative) | 0 ng | 15 | H2O (HPV negative) | 0 ng |
| 16 | SiHa (HPV-16) | 10 ng | 16 | SiHa (HPV-16) | 10 ng |
| 31 | Placenta (HPV negative) | 100 ng | 31 | Placenta (HPV negative) | 100 ng |
| 32 | Pool of HPV plasmid-11,16,31,45,52 | 625 copies | 32 | HeLa (HPV-18) | 10 ng |
| 44 | Pool of HPV plasmid-5,8,23,36 | 625 copies | 47 | H2O (HPV negative) | 0 ng |
| 45 | Pool of HPV plasmid-6,16,20,24,36,58 | 625 copies | 48 | SiHa (HPV-16) | 10 ng |
| 46 | Pool of HPV plasmid-5,11,15,45,52 | 625 copies | 63 | Placenta (HPV negative) | 100 ng |
| 47 | H2O (HPV negative) | 0 ng | 64 | HeLa (HPV-18) | 10 ng |
| 48 | SiHa (HPV-16) | 10 ng | IDs: 1–14, 17–30, 33–46, 49–62 | Epidemiological samples (56 extracts from cervical cells in PreservCyt) | Total 196 epidemiological samples in data set 4; sample input ranged from 25–100 ng |
| 61 | Pool of HPV plasmid-15,20,24,48 | 625 copies | IDs 65–128 * | The same order as in 1–64 | |
| 62 | Pool of HPV plasmid-8,18,23,31,33,48,53 | 625 copies | IDs 129–192 * | The same order as in 1–64 | |
| 63 | Placenta (HPV negative) | 100 ng | IDs 193–224 * | The same order as in 1–32 | |
| 64 | Pool of HPV plasmid-6,18,33,53,58 | 625 copies | |||
| IDs: 1–14, 17–30, 33–43, 49–60 | Epidemiological samples (50 extracts from male genital swab) | 10–100 ng | |||
* Control samples included as part of these replicates.
Figure 1. Flowchart illustrating logical steps of the human papillomavirus (HPV) typing pipeline.
Figure 2. Distribution of the rate of distinct read pairs vs percentage of HPV genome coverage observed for 1286 instances of HPV types from 122 control samples in Sets 1–4.
Figure 3. Dependence between the number of read pairs and the depth value observed for 256 instances of HPV types from 122 control samples in Sets 1–4. The dependence of depth on the number of reads becomes linear when more than 100 read pairs are mapped to the genome of a particular HPV type.
Figure 4. Dependence between the number of read pairs and the coverage percentage observed for 256 instances of HPV types from 122 control samples in Sets 1–4. If more than 100 read pairs are mapped to the genome of a particular HPV type, the coverage reaches its maximum value, 100%, and does not change. The points observed near 10,000 read pairs, showing ~65% coverage, may correspond to integration of HPV into the human genome in HeLa-derived samples.
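
The coverage percentage and depth plotted in Figures 3 and 4 can be illustrated with a minimal sketch. The source does not define how the pipeline computes either quantity, so this assumes that coverage is the share of genome positions touched by at least one mapped fragment and that depth is the mean per-base depth over the whole genome. The function name `coverage_and_depth` and the interval representation are hypothetical.

```python
def coverage_and_depth(intervals, genome_length):
    """Return (coverage_percent, mean_depth) for one HPV genome.

    intervals: iterable of (start, end) pairs, 0-based and half-open,
    each describing the span of one mapped read pair (an assumption;
    the source does not specify the representation).
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")

    # Count how many fragments overlap each genome position.
    per_base = [0] * genome_length
    for start, end in intervals:
        for pos in range(max(start, 0), min(end, genome_length)):
            per_base[pos] += 1

    covered = sum(1 for depth in per_base if depth > 0)
    coverage_percent = 100.0 * covered / genome_length
    mean_depth = sum(per_base) / genome_length
    return coverage_percent, mean_depth


if __name__ == "__main__":
    # Two overlapping fragments on a toy 100-base genome.
    print(coverage_and_depth([(0, 50), (25, 75)], 100))
```

Under these assumptions, coverage saturates at 100% once fragments tile the whole genome, while mean depth keeps growing with the number of mapped read pairs, which is in line with the behavior described in the two captions.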
Figure 5. The values of two SVM features derived for 1286 HPV types present in control/designed samples. There were 186 types with >100 (de-duplicated) reads mapped to the genome, 400 types with >10 read pairs mapped to the genome, and 886 types with <10 read pairs mapped to the genome. The vertical red line separates the zones of operation of the four-feature SVM and the two-feature SVM. The horizontal dashed line is the separation line defined by the two-feature linear-kernel SVM for classifying true and false HPV types. The right part of the graph, where the number of read pairs is >100, shows the separation of HPV types classified as true and false. In the left part of the graph the separation cannot be shown in a 2D plane, as it requires a four-dimensional space.
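
The two zones of operation described in the Figure 5 caption can be sketched as a simple routing rule. This is an illustration only: it assumes the vertical line sits at 100 read pairs and that the two-feature model handles the zone above it, which is how the caption reads. The class `LinearModel`, the function `call_hpv_type`, and all weights and feature values are hypothetical placeholders, not the trained SVMs from the paper.

```python
from dataclasses import dataclass

# Assumed position of the boundary between the two zones (from the caption).
READ_PAIR_THRESHOLD = 100


@dataclass
class LinearModel:
    """Stand-in for a trained linear-kernel SVM decision function."""
    weights: tuple
    bias: float

    def decision(self, features):
        if len(features) != len(self.weights):
            raise ValueError("feature vector has the wrong length")
        return sum(w * x for w, x in zip(self.weights, features)) + self.bias


def call_hpv_type(read_pairs, features2, features4, model2, model4):
    """Return True if the candidate HPV type is called as present."""
    if read_pairs > READ_PAIR_THRESHOLD:
        # High-read zone: the two-feature model decides.
        return model2.decision(features2) > 0
    # Low-read zone: the four-feature model decides.
    return model4.decision(features4) > 0


if __name__ == "__main__":
    # Placeholder weights chosen only to make the example run.
    model2 = LinearModel(weights=(0.0, 1.0), bias=-0.5)
    model4 = LinearModel(weights=(0.0, 0.0, 0.0, 1.0), bias=-1.0)
    print(call_hpv_type(500, (500, 0.9), (500, 0.9, 0.0, 0.0), model2, model4))
```

Keeping the routing separate from the models makes it easy to swap in real trained classifiers without touching the threshold logic.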
Typing accuracy for replicates of control samples from Set 2. The numbers in the table show how many times a particular HPV type was correctly identified in the four replicas of the experiment. There were no false positive predictions.
| HPV Types in Samples | 625 copies | 125 copies | 25 copies | 5 copies | 1 copy |
|---|---|---|---|---|---|
| HPV-11 | 4 | 4 | 4 | 2 | 0 |
| HPV-16 | 4 | 4 | 4 | 3 | 0 |
| HPV-31 | 4 | 4 | 4 | 2 | 0 |
| HPV-45 | 4 | 4 | 4 | 4 | 0 |
| HPV-52 | 4 | 4 | 4 | 2 | 0 |
| False positives | − | − | − | − | − |
| HPV-6 | 4 | 4 | 4 | 0 | 0 |
| HPV-18 | 4 | 4 | 4 | 1 | 0 |
| HPV-33 | 4 | 4 | 4 | 3 | 0 |
| HPV-58 | 4 | 4 | 4 | 3 | 0 |
| False positives | − | − | − | − | − |
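
The per-type counts in this table can be summarized as a detection rate for each copy number. The sketch below copies the counts from the table; the function name `detection_rates` and the choice to pool all nine HPV types into one rate per copy number are illustrative assumptions, not part of the published analysis.

```python
# Copy numbers per sample, in the column order of the table.
COPY_NUMBERS = (625, 125, 25, 5, 1)
# Each HPV type was tested in four replicates at every copy number.
REPLICATES = 4

# Number of replicates (out of four) in which each type was correctly called.
DETECTIONS = {
    "HPV-11": (4, 4, 4, 2, 0),
    "HPV-16": (4, 4, 4, 3, 0),
    "HPV-31": (4, 4, 4, 2, 0),
    "HPV-45": (4, 4, 4, 4, 0),
    "HPV-52": (4, 4, 4, 2, 0),
    "HPV-6": (4, 4, 4, 0, 0),
    "HPV-18": (4, 4, 4, 1, 0),
    "HPV-33": (4, 4, 4, 3, 0),
    "HPV-58": (4, 4, 4, 3, 0),
}


def detection_rates(detections, copy_numbers, replicates):
    """Return {copy_number: fraction of type instances detected}."""
    possible = replicates * len(detections)
    rates = {}
    for column, copies in enumerate(copy_numbers):
        detected = sum(row[column] for row in detections.values())
        rates[copies] = detected / possible
    return rates


if __name__ == "__main__":
    for copies, rate in detection_rates(DETECTIONS, COPY_NUMBERS, REPLICATES).items():
        print(f"{copies} copies: {rate:.1%}")
```

Pooled this way, every type instance is detected at 25 copies and above, which matches the detection limit stated in the abstract.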
HPV typing accuracy in control samples in Sets 1–4.
| Dataset ID | True HPV Instances | Correctly Detected | False Positives |
|---|---|---|---|
| Set 1 | 15 | 15 | 0 |
| Set 2 | 196 | 144 | 0 |
| Set 3 | 38 | 38 | 2 |
| Set 4 | 14 | 14 | 9 |
| Total | 263 | 211 | 11 |
Figure 6. The values of two SVM features derived for 6848 HPV types present in all epidemiological samples. There were 1124 types with >100 (de-duplicated) reads mapped to the genome, 2631 types with >10 read pairs mapped to the genome, and 4217 types with <10 read pairs mapped to the genome. The vertical red line separates the zones of operation of the four-feature SVM and the two-feature SVM. The horizontal dashed line is the separation line defined by the two-feature linear-kernel SVM.
Type-specific concordance between the results of HPV identification by the support vector machine (SVM) pipeline and the LA method for epidemiologic samples (the analysis is restricted to the 37 LA types).
| Data Set | NGS Results | LA + | LA − | Total | Agreement (%, K) | Sensitivity (%) | Specificity (%) |
|---|---|---|---|---|---|---|---|
| Set 1 ( | + | 48 | 11 | 59 | 93.33 (518/555); | 65 (48/74) | 97.7 (470/481) |
| | − | 26 | 470 | 496 | | | |
| | Total | 74 | 481 | 555 | | | |
| Set 3 ( | + | 43 | 21 | 64 | 98.2 (1817/1850); | 78 (43/55) | 98.8 (1774/1795) |
| | − | 12 | 1774 | 1786 | | | |
| | Total | 55 | 1795 | 1850 | | | |
| Set 4 ( | + | 399 | 101 | 500 | 97.5 (7072/7252); | 83.4 (399/478) | 98.5 (6673/6774) |
| | − | 79 | 6673 | 6752 | | | |
| | Total | 478 | 6774 | 7252 | | | |
| All epidemiological samples combined ( | + | 490 | 133 | 623 | 97.4 (9407/9657); K = 0.783 | 80.7 (490/607) | 98.5 (8917/9050) |
| | − | 117 | 8917 | 9034 | | | |
| | Total | 607 | 9050 | 9657 | | | |
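
The agreement, sensitivity, specificity, and kappa values reported for this comparison follow from the four cells of each 2×2 table. The sketch below uses the standard formulas, with LA treated as the reference method and Cohen's kappa as the chance-corrected agreement. The function name `concordance` and its argument names are hypothetical; the input counts are those of the combined row.

```python
def concordance(both_pos, ngs_only, la_only, both_neg):
    """Compute agreement statistics for NGS calls against LA results.

    both_pos: NGS + / LA +    ngs_only: NGS + / LA −
    la_only:  NGS − / LA +    both_neg: NGS − / LA −
    """
    total = both_pos + ngs_only + la_only + both_neg
    observed = (both_pos + both_neg) / total

    # Agreement expected by chance, from the marginal totals.
    ngs_pos, ngs_neg = both_pos + ngs_only, la_only + both_neg
    la_pos, la_neg = both_pos + la_only, ngs_only + both_neg
    expected = (ngs_pos * la_pos + ngs_neg * la_neg) / (total * total)

    return {
        "agreement_pct": 100.0 * observed,
        "sensitivity_pct": 100.0 * both_pos / la_pos,
        "specificity_pct": 100.0 * both_neg / la_neg,
        "kappa": (observed - expected) / (1.0 - expected),
    }


if __name__ == "__main__":
    # Counts for all epidemiological samples combined.
    combined = concordance(both_pos=490, ngs_only=133, la_only=117, both_neg=8917)
    for name, value in combined.items():
        print(f"{name}: {value:.3f}")
```

Because HPV-negative results dominate the table, raw agreement stays high even when sensitivity is moderate, which is why kappa is reported alongside it.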