| Literature DB >> 22965131 |
Michael Forster1, Peter Forster, Abdou Elsharawy, Georg Hemmrich, Benjamin Kreck, Michael Wittig, Ingo Thomsen, Björn Stade, Matthias Barann, David Ellinghaus, Britt-Sabina Petersen, Sandra May, Espen Melum, Markus B Schilhabel, Andreas Keller, Stefan Schreiber, Philip Rosenstiel, Andre Franke.
Abstract
Scientists working with single-nucleotide variants (SNVs), inferred by next-generation sequencing software, often need further information regarding true variants, artifacts and sequence coverage gaps. In clinical diagnostics, e.g. SNVs must usually be validated by visual inspection or several independent SNV-callers. We here demonstrate that 0.5-60% of relevant SNVs might not be detected due to coverage gaps, or might be misidentified. Even low error rates can overwhelm the true biological signal, especially in clinical diagnostics, in research comparing healthy with affected cells, in archaeogenetic dating or in forensics. For these reasons, we have developed a package called pibase, which is applicable to diploid and haploid genome, exome or targeted enrichment data. pibase extracts details on nucleotides from alignment files at user-specified coordinates and identifies reproducible genotypes, if present. In test cases pibase identifies genotypes at 99.98% specificity, 10-fold better than other tools. pibase also provides pair-wise comparisons between healthy and affected cells using nucleotide signals (10-fold more accurately than a genotype-based approach, as we show in our case study of monozygotic twins). This comparison tool also solves the problem of detecting allelic imbalance within heterozygous SNVs in copy number variation loci, or in heterogeneous tumor sequences.Entities:
Mesh:
Year: 2012 PMID: 22965131 PMCID: PMC3592472 DOI: 10.1093/nar/gks836
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1.Flow chart showing the standard NGS sequencing and bioinformatic analysis (gray). The ‘essential workflow’ of pibase (yellow): pibase_bamref reads a list of genomic coordinates from a tab-separated text file, a VCF file, a SAMtools pileup SNV file or a Bioscope gff3 SNV file. It then extracts data from a reference sequence file and a sequence alignment (BAM) file and outputs extracted, computed and filtered information as a tab-separated text file (output 1). pibase_consensus reads one single or several pibase_bamref files. For each coordinate, a ‘best genotype’ with quality and strand support, as well as two genotypes for each filter level are inferred, and these data are appended to the pibase_bamref data (output 2). pibase_fisherdiff is a tool for association testing or sample comparison, requiring a pair of pibase_consensus files as input data (e.g. case/control, germline/tumor or affected/unaffected twin). The tool appends P-values and a filter label to the pibase_consensus data (output 3). Further workflows address annotation and phylogenetic analysis.
Remaining reads after successive filtering at four positions in a public BAM file
| Genomic coordinate | Raw | Filter 0 | Filter 1b | Filter 2c | Filter 3d | Filter 4e | |||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| CV | CV | SP | CV | SP | CV | SP | CV | SP | CV | SP | |
| chr22:19969075 | 6 | 3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| chr22:19969495 | 14 | 11 | 8 | 8 | 6 | 3 | 2 | 3 | 2 | 3 | 2 |
| chr22:30857373 | 8 | 5 | 5 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 |
| chr22:31491295 | 17 | 7 | 7 | 4 | 4 | 3 | 3 | 2 | 2 | 2 | 2 |
aReads without indels; bFilter 0 and base quality ≥ 20; cFilter 1 and read length ≥ 34; dFilter 2 and mismatches ≤ 1; eFilter 3 and uniquely mappable reads. CV: number of (all) reads covering this genomic coordinate; SP: remaining reads after filtering away reads with the same start points.
Stable and instable genotypes resulting from the filtering in Table 1
| Genomic coordinate | Filter 0 | Filter 2 | Filter 4 | End resultb | Three platformse | ||||
|---|---|---|---|---|---|---|---|---|---|
| CV | SP | CV | SP | CV | SP | BGc | Qualityd | ||
| chr22:19969075 | aaa | aaa | AA | FAIL | AG | ||||
| chr22:19969495 | GG | GG | gg | gg | gg | gg | GG | PASS | GG |
| chr22:30857373 | ac | ac | cc | cc | cc | cc | AC | FAIL | AC |
| chr22:31491295 | cg | cg | cc | cc | CG | FAIL | CG | ||
aLow coverage; brule-based consensus over all filter levels; cpibase consensus genotype; dpibase PASS/FAIL tag; ethe 1000 Genomes Project’s consensus of three sequencing platforms (Illumina, SOLiD, FLX/454) is shown for comparison.
Discrimination of non-identical SNVs in BAM file pairs using Fisher’s exact test
| Genomic coordinate | Best genotype | ||
|---|---|---|---|
| NA12878 | NA12891 | ||
| chr22:19968971 | 0.0464 | AG | GG |
| chr22:30953295 | 8.4 × 10−6 | TT | CC |
| chr22:39440149 | 0.0161 | CT | TT |
| chr22:40417780 | 0.0009 | CC | CT |
aP-values obtained from Fisher’s exact test on the number of unique-start-points for each filter level, indicating the probability of the sample pair having the same genotype at this specific genomic coordinate.
False discovery rate of differences in exomes of identical, monozygotic twins
| Comparison method/stringency | Pair 1 | FDR | Pair 2 | FDR |
|---|---|---|---|---|
| SNVs called | 65654 | 67997 | ||
| bGenotype differences (?2) | 2047 | 3.12 | 1864 | 2.74 |
| bGenotype differences (?1) | 527 | 0.80 | 470 | 0.69 |
| bGenotype differences (stable) | 55 | 0.08 | 72 | 0.11 |
| cFET differences ( | 135 | 0.21 | 169 | 0.28 |
| cFET differences ( | 92 | 0.14 | 125 | 0.18 |
| cFET differences ( | 48 | 0.07 | 74 | 0.11 |
| cFET differences ( | 25 | 0.04 | 51 | 0.08 |
| cFET differences ( | 5 | 0.01 | 15 | 0.02 |
aNumber of computed differences divided by the number of SNVs called; bnumbers of SNV differences including instable genotype pairs (labels ‘?2’ and ‘?1’) and using only the stable genotype pairs; cFisher's exact test-based differences computed by pibase_fisherdiff and filtered for P-value thresholds of 0.01, 0.02, 0.03, 0.04. 0.05; *recommended setting.
Categorization of instable SNV-calls using SNV label (BestQual)
| Label | Explanation |
|---|---|
| ?1 | Mapping stringency versus reference sequence context class is good. Not all 10 genotyping filter stages lead to the same genotype. However, for the high mapping stringency filter stages, at least |
| ?2 | Mapping stringency versus reference sequence context class is good. This genotype is supported by less than five filter stages, but by at least two filter stages, of which one stage is in the unique start points category, and the other stage is in the coverage category. |
| ?3 | Poor quality. Low complex reference sequence context (homopolymeric run > 4, or STRs) and low mapping stringency, but at least one stringent filter supports this genotype. |
| ?4 | Very poor quality. Low complex reference sequence context (homopolymeric run > 4, or STRs) and mapping stringency was low. But at least one of the unique-start-point filters supports this genotype. |
| ?5 | Highly problematic quality. The best unique-start-point derived genotype is in conflict with the best coverage-derived genotype. |
| ?6 | Highly problematic quality. The best unique-start-point-derived genotype is in conflict to the best coverage-derived genotype, and the best coverage-derived genotype is ‘superior’ to the best unique-start-point-derived genotype. |
| ?7 | Low-coverage guess. The coverage is less than |
| ?8 | Low-coverage guess. The coverage is less than |
STR, short tandem repeats
Figure 2.Median joining network showing the differences between the five examples of BAM files of the CEU trio. Data are currently available at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/pilot2_high_cov_GRCh37_bams/ (23 August 2012, date last accessed). The daughter (NA12878) was sequenced on Illumina, SOLiD and FLX, the father (NA12891) and the mother (NA12892) on Illumina only. The links between the nodes show the IDs of the discriminating SNVs. As median-joining networks use discriminating differences to construct an evolutionary tree, they can often successfully work with just minimal data input, such as classically the mitochondrial DNA (mtDNA) control region. This is relevant for inexpensive targeted sequencing of populations.