Hezhao Ji, Eric Enns, Chanson J Brumme, Neil Parkin, Mark Howison, Emma R Lee, Rupert Capina, Eric Marinier, Santiago Avila-Rios, Paul Sandstrom, Gary Van Domselaar, Richard Harrigan, Roger Paredes, Rami Kantor, Marc Noguera-Julian.
Abstract
INTRODUCTION: Next-generation sequencing (NGS) has several advantages over conventional Sanger sequencing for HIV drug resistance (HIVDR) genotyping, including detection and quantitation of low-abundance variants bearing drug resistance mutations (DRMs). However, the high HIV genomic diversity, unprecedented large volume of data, complexity of analysis and potential for error pose significant challenges for data processing. Several NGS analysis pipelines have been developed and used in HIVDR research; however, the absence of uniformity in data processing strategies results in lack of consistency and comparability of outputs from different pipelines. To fill this gap, an international symposium on bioinformatic strategies for NGS-based HIVDR testing was held in February 2018 in Winnipeg, Canada, convening laboratory scientists, bioinformaticians and clinicians involved in four recently developed, publicly available NGS HIVDR pipelines. The goal of this symposium was to establish a consensus on effective bioinformatic strategies for NGS data management and its use for HIVDR reporting.

DISCUSSION: Essential functionalities of an NGS HIVDR pipeline were divided into five analytic blocks: (1) NGS read quality control (QC)/quality assurance (QA); (2) NGS read alignment and reference mapping; (3) HIV variant calling and variant QC; (4) NGS HIVDR reporting; and (5) extended data applications and additional considerations for data management. The consensuses reached among the participants on all major aspects of these blocks are summarized here. They encompass not only recommended data management and analysis strategies, but also detailed bioinformatic approaches that help ensure accuracy of the derived HIVDR analysis outputs for both research and potential clinical use.
Keywords: HIV drug resistance test; Winnipeg Consensus; bioinformatics; guideline; next-generation sequencing; pipeline
Year: 2018 PMID: 30350345 PMCID: PMC6198166 DOI: 10.1002/jia2.25193
Source DB: PubMed Journal: J Int AIDS Soc ISSN: 1758-2652 Impact factor: 5.396
Currently available pipeline/software for automated NGS‐based HIVDR data analysis
Column groups, left to right: Reference information; Resources; Technical characteristics; HIV drug resistance; HIVDR data analysis features.

| Pipeline/software | URL | Year | Cost | Time | Bioinformatic IT needs | Compatible NGS platform | Cloud based | Web interface | Designed for HIVDR | Ref DB and algorithm | Output (nt/aa) | QA checks | InDel |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| V-Phaser 2 | https://www.broadinstitute.org/viral-genomics/v-phaser-2 | 2013 | Free | N/A | Yes | N/A | No | No | No | No | csv/ | E | Yes |
| ShoRah | https://github.com/cbg-ethz/shorah | 2013 | Free | N/A | Yes | N/A | No | No | No | No | csv/ | E | N/A |
| VirVarSeq | https://sourceforge.net/projects/virtools/ | 2015 | Free | N/A | Yes | Illumina | No | No | No | No | fasta/ | Q/E | Yes |
| MinVar | | 2016 | Free | <1 hour | Yes | Illumina | Yes | No | Yes | HIVdb | vcf/ | Q | Yes |
| V-pipe | https://cbg-ethz.github.io/V-pipe/ | 2017 | Free | N/A | Yes | Illumina | No | No | No | No | fasta/csv | Q | Yes |
| Hivmmer | https://github.com/kantorlab/hivmmer | 2017 | Free | <1 hour | Yes | Illumina | No | No | Yes | No | csv/ | L | Yes |
| Geno2Pheno[ngs-freq] | | 2018 | Free | <1 minute | Yes | N/A | No | Yes | Yes | g2p[res] | fasta/ | N/A | Yes |
| MiCall | https://github.com/cfe-lab/MiCall | 2016 | Free | <1 hour | No | Illumina | Yes | Yes | Yes | HIVdb | csv/ | Q/E | Yes |
| HyDRA | | 2016 | Free | <1 hour | No | Illumina, Ion Torrent | No | Yes | Yes | HIVdb | fasta,vcf/ | Q/E | Yes |
| PASeq.org | | 2016 | Free | <1 hour | No | Illumina | Yes | Yes | Yes | HIVdb | fasta/ | Q/C/A | Yes |
| DeepChek HIV | | 2014 | $65 | <1 hour | No | Illumina, Ion Torrent | Yes | Yes | Yes | HIVdb | csv/ | Q | Yes |
| Smart GeneHIV | | 2016 | N/A | N/A | No | N/A | No | Yes | Yes | HIVdb | N/A | N/A | Yes |
| Vela sentosa HIV | | 2016 | $200 | N/A | No | Ion Torrent | No | Yes | Yes | HIVdb | fasta/ | Q | Yes |
| Hyrax Exatype | https://exatype.com/ | 2018 | N/A | <1 hour | No | Illumina, Ion Torrent | Yes | Yes | Yes | HIVdb | fasta/ | Q | Yes |
The pipelines/software are categorized as: (1) freely available software for bioinformaticians (top block); (2) freely available software suitable for non‐bioinformaticians (middle block); and (3) commercial software (bottom block). Within each block, the chronological order was followed. N/A, Information not available or not applicable. NGS, next‐generation sequencing; HIVDR, HIV drug resistance.
(a) The Geno2pheno[ngs-freq] pipeline can only use a codon frequency table as input, which needs to be obtained separately. (b) Pending approval and release on Illumina BaseSpace Sequence Hub; for early access, please contact micalldev@cfenet.ubc.ca. (c) Year of publication/public availability. (d) Approximate per-sample cost of bioinformatic data analysis only. (e) Time range for single-sample data analysis (data transfer time excluded). (f) Refers to the need for on-site computational infrastructure or expert staff. (g) Ref DB and algorithm: reference HIV resistance database and/or algorithms for HIV resistance interpretation (HIVdb: Stanford HIV Database; g2p[res]: the Geno2pheno[resistance] statistical engine). (h) Output: format of output files reporting nucleotide (nt) and amino acid (aa) variations. (i) QA check strategies incorporated for NGS read quality assurance (Q: Quality Control; C: Contamination Control; E: Sequencing Error Model; L: Alignment Quality Filter; A: APOBEC Hypermutation Detection). (j) Indels are recognizable by default, but no codon-aware strategies are implemented for reporting insertion/deletion mutations specifically associated with HIV resistance. (k) Can be ported to Cloud. (l) Cost based on general access through Illumina BaseSpace. (m) Approximate cost of whole-sample analysis (sample preparation, sequencing, data analysis).
Outcomes of the Winnipeg Consensus for recommended bioinformatic strategies for an NGS‐based HIVDR data analysis pipeline
| Functional blocks | Objectives | Consensus on strategies | Notes and comments |
|---|---|---|---|
| 1. NGS read quality control/quality assurance | To ensure only quality reads are applied in downstream data processing | 1. Average quality score (QS) of the read: 25 | A QS of 25 corresponds to an estimated sequencing error rate of 0.3% |
| | | 2. Minimum read length: 75 bases | This is based on Illumina 300-cycle paired-end sequencing and may vary if another NGS platform or sequencing protocol is applied |
| | | 3. Contamination check: recommended | External non-viral contamination may interfere with HIV NGS efficiency. HIV cross-sample contamination or "index hopping" implies errors in laboratory sample processing, which may lead to erroneous minority variant detection (see strategies implemented in V-Pipe and ViCroSeq[ |
| | | 4. APOBEC mutation check: recommended | The presence of APOBEC-edited DNA templates in the sequenced sample may result in the artefactual detection of minority variants related to APOBEC activity. Filtering this non-viable sequence content is beneficial, especially when significant amounts of HIV proviral DNA may be present in the input specimen (e.g. PBMC, dried blood spots) |
| 2. NGS read alignment and reference mapping | To ensure the efficiency and accuracy of NGS read alignment to reference sequences | 1. HIV-1 reference: HXB-2 | For conserved regions such as HIV-1 |
| | | 2. NGS read aligner: short-read aligner is recommended | Bowtie2 is thus far the most commonly used NGS short-read aligner due to its speed, availability, documentation, ease of installation and active maintenance |
| | | 3. Analysis of whole | Coverage of the entire |
| | | 4. Indel management: required | An effective indel management strategy (i.e. codon-aware alignment) is not available in existing pipelines. However, because several indel variations contribute to HIVDR, full-codon indels should be properly identified and reported |
| 3. HIV variant calling and variant quality control | To ensure the accuracy of variant calling | 1. QC/QA of nucleotide variant calling: recommended | Additional QA/QC procedures may be incorporated to further ensure variant call accuracy. For instance, HyDRA calls a variant only when the minimum allele count is ≥5, the minimum QS of the variant allele is ≥30 and the read depth at the relevant variation site is ≥100 |
| | | 2. Amino acid variation calling based on NGS reads, but NOT consensus sequence: required | Consensus sequence-based DRM analysis inevitably requires assumptions when ≥2 mixed bases are present in a codon, which diminishes the value of NGS for quantitative DRM detection |
| | | 3. Secondary QC for minority variant calling: optional | It helps to exclude erroneous variant calls via statistical estimation based on the gross platform-specific error rate, as determined by parallel sequencing of a pedigreed plasmid in the NGS run, which sums all potential errors from any involved assay procedures |
| 4. HIVDR interpretation and reporting | HIVDR interpretation | Query reference database and algorithms: HIVdb ( | Although minor discrepancies exist with other similar databases, the Stanford database (HIVdb) is recommended for better general adoption |
| | Concise report | The report should contain the following: (1) patient and sample information if provided (optional); (2) exportable/printable HIVDR report with DRMs and colour-coded resistance interpretations; (3) two-column summary of DRMs at reporting thresholds of 5% and 15% respectively, with no detailed frequencies; (4) pipeline and software version applied; (5) comment on the accreditation status of the assay for clinical use | Concise HIVDR reports from NGS data should simulate Sanger sequencing output for easier adoption and interpretation by clinicians, to be used for clinical care |
| | Comprehensive report | The report should contain the following: (1) all contents included in clinical reports; (2) summary of filtering statistics, quality metrics and coverage plots; (3) quantitative report on all HIV-1 DRMs with exact frequencies; (4) consensus sequence with a threshold of 15% for mixed base calls | Comprehensive reports should contain all NGS-derived data that researchers may utilize for various application purposes |
| | Other exportable data | (1) Consensus sequences with user-defined threshold: recommended; (2) variant reports on all nucleotide loci: recommended; (3) variant reports on all amino acid loci: recommended; (4) codon usage at all amino acid variation loci: recommended | Standard VCF/BCF format is recommended for nucleotide variant reports. To facilitate comparisons and merging of data from different pipelines, a new standard .aavf reporting format is proposed (Appendix |
| 5. General analysis data management | Data storage | (1) Raw NGS data: to be stored by the data generator, while data analysis providers may transiently store it for reviewing purposes; (2) intermediate files; (3) versioning data files for the applied pipeline: recommended | Automated versioning of all analysis results, reports and intermediate data files is required for retroactive data assessments when necessary |
| | Data disposal | Analysis provider disposes of data after a defined holding period | Deposition of data into public archives (e.g. NCBI Sequence Read Archive) requires informed consent from the data generator |
| | Data ownership | (1) Policy varies among different institutes and countries; (2) a clear data ownership statement should be included in the Terms of Service and Conditions | In general, data generators own all the data, while unidentified data may be used by the data analysis provider for evaluation or research purposes, provided a mutual agreement is in place |
DRM, drug resistance mutation; NGS, next‐generation sequencing; PBMC, peripheral blood mononuclear cell.
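
The numeric thresholds in the table above can be read as simple filters. The following Python sketch is illustrative only and is not taken from any of the listed pipelines: all function and class names are hypothetical, the "average QS" is assumed to be the arithmetic mean of per-base Phred scores, and the variant-level cut-offs are the HyDRA example values quoted in the table. It shows the Phred-to-error-rate conversion behind the "QS 25 ≈ 0.3%" note, the read-level QC filter, the variant-level QC filter, and the two-column DRM summary at the 5% and 15% reporting thresholds.

```python
from dataclasses import dataclass


def phred_to_error_rate(qs: float) -> float:
    """Estimated per-base error probability for a Phred quality score."""
    return 10 ** (-qs / 10)


def read_passes_qc(qualities, min_mean_qs=25.0, min_length=75):
    """Read-level filter: minimum read length and minimum average QS.

    Assumption: "average QS" is the arithmetic mean of per-base scores.
    """
    if len(qualities) < min_length:
        return False
    return sum(qualities) / len(qualities) >= min_mean_qs


@dataclass
class VariantCall:
    """Hypothetical container for one amino acid variant observed in reads."""
    mutation: str       # e.g. "K103N" (illustrative label only)
    allele_count: int   # reads supporting the variant
    allele_qs: float    # quality score of the variant allele
    depth: int          # total read depth at the site

    @property
    def frequency(self) -> float:
        return self.allele_count / self.depth


def passes_variant_qc(call, min_count=5, min_qs=30.0, min_depth=100):
    """Variant-level filter using the HyDRA example cut-offs from the table."""
    return (
        call.allele_count >= min_count
        and call.allele_qs >= min_qs
        and call.depth >= min_depth
    )


def two_column_summary(calls, thresholds=(0.05, 0.15)):
    """List mutations at or above each reporting threshold, after variant QC."""
    kept = [c for c in calls if passes_variant_qc(c)]
    return {t: [c.mutation for c in kept if c.frequency >= t] for t in thresholds}
```

With made-up inputs, a variant seen in 12 of 200 reads (6%) would appear only in the 5% column, one seen in 90 of 300 reads (30%) would appear in both columns, and one supported by only 3 reads would be dropped by the variant QC step regardless of depth.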