Chun Hang Au1, Dona N Ho1, Ava Kwong2,3,4, Tsun Leung Chan1, Edmond S K Ma5.
Abstract
Amplicon-based next-generation sequencing (NGS) has been widely adopted for detecting genetic variation in humans and other organisms. The conventional data analysis paradigm trims primers before read mapping. Here we introduce BAMClipper, which maps the original sequencing reads first and then removes primer sequences by soft-clipping the SAM/BAM alignments. The choice of primer handling approach affected mutation detection accuracy, as shown with real NGS datasets of 7 human peripheral blood or breast cancer tissue samples with known BRCA1/BRCA2 mutations and with >130000 simulated NGS datasets carrying unique mutations. The BAMClipper approach detected a BRCA1 deletion (c.1620_1636del) that was otherwise missed due to an edge effect. Simulation showed a high false-negative rate when primers were perfectly trimmed, as in conventional practice. Among the other 6 samples, the variant allele frequencies of 5 BRCA1/BRCA2 mutations (indels or single-nucleotide variants) were diluted by apparently wild-type primer sequences from an overlapping amplicon (17 to 82% underestimation). BAMClipper was robust in both situations, and all 7 mutations were detected. Compared with Cutadapt, BAMClipper was faster and maintained equally high primer removal effectiveness. BAMClipper is implemented in Perl and is available under an open source MIT license at https://github.com/tommyau/bamclipper.
Year: 2017 PMID: 28484262 PMCID: PMC5431517 DOI: 10.1038/s41598-017-01703-6
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Amplicon library design and bioinformatics approaches for handling gene-specific primers. (A) Gene-specific primer sequences are present as part of NGS reads. The observed primer sequences are usually identical to the reference genome sequence but may differ slightly due to errors in primer synthesis and/or sequencing. Common read mapping tools are not aware of the amplicon library design and thus map the primers as if they were part of the region of interest that lies between them. Although sequencing adapters may also be present in NGS reads (depending on amplicon length and sequencing read length), they become soft-clipped after mapping because, by design, they lack similarity to the reference genome. The soft-clipped part of an alignment is usually ignored by downstream processing. (B) In primer handling approach 1, primer sequences are trimmed from the sequencing reads (in FASTQ format), and the shorter trimmed reads are mapped to give BAM alignments for downstream variant calling and quality control, such as sequencing depth statistics. In approach 2, the original reads are mapped directly, so primers remain in the BAM alignments as if they were part of the region of interest. In approach 3, represented by BAMClipper, reads are also mapped directly, but the BAM alignments are further processed to soft-clip primer sequences as if they were sequencing adapters.
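The soft-clipping step of approach 3 can be illustrated with a minimal Python sketch. It is not BAMClipper's Perl implementation; the function names `parse_cigar` and `soft_clip_left` are hypothetical, and the sketch assumes a single alignment whose left end overlaps a primer that ends at a known 1-based reference coordinate. Aligned bases inside the primer are converted to a leading soft clip (`S`), and the alignment start is moved just past the primer.

```python
import re


def parse_cigar(cigar):
    """Split a CIGAR string such as '5S95M' into [(length, op), ...]."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def soft_clip_left(pos, cigar, primer_end):
    """Soft-clip aligned bases at reference positions <= primer_end.

    pos and primer_end are 1-based reference coordinates, as in SAM.
    Returns the adjusted (pos, cigar) pair. Illustrative sketch only.
    """
    ops = parse_cigar(cigar)
    prefix = ""
    if ops and ops[0][1] == "H":          # a hard clip stays outermost
        prefix = f"{ops[0][0]}H"
        ops = ops[1:]
    clipped = 0                           # read bases in the leading soft clip
    ref = pos                             # reference position of the next base
    while ops:
        n, op = ops[0]
        if op in "M=X":                   # consumes read and reference
            if ref > primer_end:
                break                     # first base outside the primer
            take = min(n, primer_end - ref + 1)
            clipped += take
            ref += take
            if take < n:
                ops[0] = (n - take, op)   # keep the part past the primer
                break
            ops.pop(0)
        elif op in "SI":                  # consumes read bases only
            clipped += n
            ops.pop(0)
        elif op in "DN":                  # consumes reference only
            ref += n
            ops.pop(0)
        elif op == "P":                   # padding carries no bases
            ops.pop(0)
        else:                             # trailing hard clip: stop here
            break
    new_cigar = prefix + (f"{clipped}S" if clipped else "")
    new_cigar += "".join(f"{n}{op}" for n, op in ops)
    return ref, new_cigar


print(soft_clip_left(100, "5S95M", 119))
print(soft_clip_left(100, "10M2I88M", 119))
```

The read sequence and base qualities are left untouched; only the start position and CIGAR change, which is why downstream tools that ignore soft-clipped bases no longer count the primer. Clipping a primer at the right end of an alignment would follow the same logic in mirror image.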
Figure 2. A BRCA1 deletion escaped variant calling when primers were trimmed before mapping. NGS read alignments of the BRCA1 c.1620_1636del allele from the three primer handling approaches are shown together with the amplicon design and the reference genome sequence. Individual forward and reverse sequencing reads, after any soft-clipping, are represented by red and purple horizontal lines, respectively. The expected deletion event (black box) was present only in the alignments from approaches 2 and 3.
Figure 3. Indels are susceptible to variant calling edge effects, as shown by simulation. (A) Simulation scheme of 420 insertions and 420 deletions, with 20 different lengths at 21 different positions. (B) Venn diagram of the insertions or deletions called under the 3 primer handling approaches. (C) Heat map of the length and position of called insertions or deletions.
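The size of the simulation grid in panel (A) follows from 20 lengths × 21 positions = 420 events per indel type. The Python sketch below enumerates such a grid and applies one event to a toy sequence. The concrete lengths (1 to 20 bp), the 0-based offsets, the poly-A inserted bases, and the names `simulation_grid` and `apply_indel` are all assumptions made for illustration; the caption states only the counts.

```python
from itertools import product

INDEL_LENGTHS = range(1, 21)   # assumption: 20 lengths, 1 to 20 bp
OFFSETS = range(21)            # assumption: 21 positions, as 0-based offsets


def simulation_grid():
    """Return every (kind, length, offset) combination in the scheme."""
    return list(product(("insertion", "deletion"), INDEL_LENGTHS, OFFSETS))


def apply_indel(sequence, kind, length, offset):
    """Return a copy of `sequence` carrying one simulated indel at `offset`."""
    if kind == "insertion":
        # The inserted bases are arbitrary; a poly-A run is used here.
        return sequence[:offset] + "A" * length + sequence[offset:]
    if kind == "deletion":
        return sequence[:offset] + sequence[offset + length:]
    raise ValueError(f"unknown indel kind: {kind}")


grid = simulation_grid()
insertions = [event for event in grid if event[0] == "insertion"]
deletions = [event for event in grid if event[0] == "deletion"]
print(len(insertions), len(deletions))  # 420 420
```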
Variant allele frequency underestimation due to overlapping primer site.
| Sample | Mutation | VAF (Approach 2) | VAF (Approach 3) | Overlap with other primer site? | VAF underestimation |
|---|---|---|---|---|---|
| NDH1 | BRCA1 c.4372C>T | 9% | 51% | Yes | 82% |
| PMH1 | | 19% | 49% | Yes | 61% |
| TWH1 | | 50% | 74% | Yes | 32% |
| TWH2 | | 51% | 51% | No | 0% |
| TWH3 | | 17% | 46% | Yes | 63% |
| QMH1 | | 60% | 72% | Yes | 17% |
Figure 4. Dilution of variant allele frequency (VAF) when primers are not clipped after mapping. NGS read alignments in the BRCA1 c.4372C>T region from primer handling approaches 2 and 3 are shown together with the amplicon design and the reference genome sequence. The c.4372C>T mutation is located in the region of interest of one amplicon and in the gene-specific primer site of another amplicon. Because the primer sequences retained in approach 2 contributed to the wild-type allele count, the VAF of c.4372C>T was underestimated by 82% (9% in approach 2 versus 51% in approach 3).
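The underestimation percentages reported with the VAF values appear consistent with a relative difference, 1 − VAF(approach 2) / VAF(approach 3). The source does not state this formula, so it is an assumption inferred from the rounded figures. The short Python check below, with a hypothetical helper `vaf_underestimation`, recomputes the percentages from the VAF pairs listed for the six samples.

```python
def vaf_underestimation(vaf_unclipped, vaf_clipped):
    """Relative VAF underestimation of approach 2 versus approach 3.

    Assumed formula (not stated in the source): 1 - unclipped / clipped.
    """
    if vaf_clipped <= 0:
        raise ValueError("clipped VAF must be positive")
    return 1 - vaf_unclipped / vaf_clipped


# (approach 2 VAF %, approach 3 VAF %) per sample, copied from the table.
vaf_pairs = {
    "NDH1": (9, 51),
    "PMH1": (19, 49),
    "TWH1": (50, 74),
    "TWH2": (51, 51),
    "TWH3": (17, 46),
    "QMH1": (60, 72),
}

percentages = {
    sample: round(100 * vaf_underestimation(a2, a3))
    for sample, (a2, a3) in vaf_pairs.items()
}
print(percentages)
```

Because the published VAFs are themselves rounded to whole percentages, agreement after rounding supports the assumed formula but does not prove that the authors computed it this way.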