| Literature DB >> 20147302 |
Abstract
MOTIVATION: Next-generation sequencing captures sequence differences in reads relative to a reference genome or transcriptome, including splicing events and complex variants involving multiple mismatches and long indels. We present computational methods for fast detection of complex variants and splicing in short reads, based on a successively constrained search process of merging and filtering position lists from a genomic index. Our methods are implemented in GSNAP (Genomic Short-read Nucleotide Alignment Program), which can align both single- and paired-end reads as short as 14 nt and of arbitrarily long length. It can detect short- and long-distance splicing, including interchromosomal splicing, in individual reads, using probabilistic models or a database of known splice sites. Our program also permits SNP-tolerant alignment to a reference space of all possible combinations of major and minor alleles, and can align reads from bisulfite-treated DNA for the study of methylation state.Entities:
Mesh:
Substances:
Year: 2010 PMID: 20147302 PMCID: PMC2844994 DOI: 10.1093/bioinformatics/btq057
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1.Examples of complex variants detected by GSNAP. (A) A 17 nt indel with mismatches in reads (below) relative to a genomic region (above), matching a known polymorphism in dbSNP. (B) Splicing found using probabilistic models reveals an intron within exon 1 of HOXA9, supported experimentally (Dintilhac et al., 2004). (C) Splicing found using known splice sites, despite low probabilistic model scores. Ellipses indicate ‘half intron’ alignments, where reads have insufficient sequence to determine the distal exon. (D) Interchromosomal splicing between BCAS4 and BCAS3 found in the MAQC universal human reference RNA sample and observed in MCF7 cell lines (Hampton et al., 2009). (E) SNP-tolerant alignment near a splice site allows both genotypes to align equally well.
Fig. 2.Representing a reference sequence and a reference space for genomic alignment. (A) A hash table consists of an offset file of possible 12mers and a position file containing a sorted list of genomic positions for each 12mer. (B) SNPs in a genomic 12mer are represented by duplicating the position in the lists for all combinations of major and minor alleles within the 12mer. (C) Major alleles are represented in one compressed genome, while minor alleles are represented in another compressed genome.
Fig. 3.Spanning set method for generating and filtering mismatch candidates. (A) A read of length 62 nt is analyzed at shifts of 0, 1 and 2 nt, with spanning sets each consisting of five elements. Elements at the ends may overhang the ends by 1 or 2 nt. Spanning set elements are shown in detail for the shift of 2 nt. (B) An overhanging 12mer is represented by a union of lists obtained from hash table lookups of all possible substitutions for the overhang. (C) Overlapping 12mers are represented by taking the intersection of their position lists. (D) Elements used for generating candidates (gray). (E) Elements used for filtering candidates (white). A candidate region (black) is supported by two of the generating elements, and is checked for support in the remaining filtering elements.
Fig. 4.Complete set method for generating and filtering mismatch candidates. (A) Patterns of supporting (gray) and non-supporting (white) 12mers induced by a single mismatch, by two close mismatches, and by two distant mismatches (crosses). These patterns indicate a lower bound of ⌊(Δp+6)/12⌋ mismatches, where Δp is the distance between start locations of consecutive supporting 12mers. (B) Pattern-based lower bound calculation for a read of 51 nt, shown on top with actual mismatches. Supporting 12mers (gray) start at read locations 5, 8, 11 and 29, with end locations at −3 and L−9=42. The lower bound formula is summed over successive supporting 12mers to give a total lower bound of four mismatches.
Fig. 5.Efficient detection of indels and splice pairs. (A) The complete set method generates candidate regions with supporting 12mers (gray). (B) Pairs of candidates within the allowed distance are tested for middle indels and short-distance splice pairs. The constraint on number of mismatches (shown for the value 1) determines a range of crossover points. (C) End indels are tested in the distal 14 nt when the long region of the read has a sufficiently low number of mismatches. (D) Long-distance splicing is detected by identifying known or novel splice sites in single candidate regions within areas defined by constraints on number of mismatches. Candidate regions with donor and acceptor splice sites are then paired to reveal splice junctions.
Results of read alignment algorithms on simulated reads
| (Percent misses) Time | ||||||
|---|---|---|---|---|---|---|
| Variant | GSNAP | BWA | Bowtie | SOAP2 | SOAP | MAQ |
| 36 nt reads | ||||||
| Exact | 51 | 17 | 9 | 70 | 869 | 2248 |
| 1 mm | 55 | 60 | 33 | 11 | 1157 | 2106 |
| 2 mm | (1.0) 304 | 64 | 46 | 39 | (2.9) 1470 | 6008 |
| 3 mm | (11.9) 405 | 551 | 544 | − | (15.6) 1369 | 19 523 |
| Ins, 1–3 | 640 | (4.9) 767 | − | − | (5.1) 5534 | − |
| Del, 1–3 | 653 | (3.3) 1016 | − | − | (3.7) 4308 | − |
| Ins, 4–9 | (0.1) 507 | − | − | − | 31420 | − |
| Del, 4–30 | (0.1) 887 | − | − | − | − | − |
| 70 nt reads | ||||||
| Exact | 15 | 23 | 9 | 13 | 1205 | 2180 |
| 1 mm | 23 | 25 | 15 | 12 | (0.1) 1564 | 2120 |
| 2 mm | 45 | 33 | 48 | 67 | (0.9) 2363 | 6175 |
| 3 mm | 95 | 83 | 542 | − | (3.3) 2272 | 20 316 |
| 4 mm | 325 | 373 | − | − | (7.8) 2098 | (2.4) 20 002 |
| Ins, 1–3 | 245 | (2.0) 323 | − | − | (4.3) 15 516 | − |
| Del, 1–3 | 263 | (1.3) 425 | − | − | (4.7) 14 645 | − |
| Ins, 4–9 | (0.1) 288 | − | − | − | − | − |
| Del, 4–30 | (0.1) 292 | − | − | − | − | − |
| 100 nt reads | ||||||
| Exact | 15 | 29 | 11 | 10 | − | 2211 |
| 1 mm | 21 | 30 | 16 | 13 | − | 2168 |
| 2 mm | 33 | 35 | 56 | 73 | − | 6330 |
| 3 mm | 50 | 52 | 620 | − | − | 20 697 |
| 4 mm | 82 | 137 | − | − | − | (0.5) 20 503 |
| 5 mm | 155 | 543 | − | − | − | (2.1) 20 283 |
| Ins, 1–3 | 269 | (1.3) 218 | − | − | − | − |
| Del, 1–3 | 273 | (0.8) 360 | − | − | − | − |
| Ins, 4–9 | (0.1) 335 | − | − | − | − | − |
| Del, 4–30 | (0.1) 312 | − | − | − | − | − |
Times (in seconds) are for each set of 100 000 reads. For BWA, times include conversion to genomic coordinates (∼8 s per dataset). For SOAP2, times exclude loading of indices (∼35 s per dataset). Sensitivity was computed over reads that were unique (mapping to one location in the genome) and non-upgradeable (not mapping to another genomic location with a better variant type than the expected alteration). Misses, if any, are represented by percentages in parentheses before the corresponding running time. Dashes indicate variant types that could not be detected by the corresponding program. Variants: mm, mismatch(es); ins, insertion; and del, deletion. Parameter flags used, where n is number of mm in dataset: GSNAP (mm): -t 1 -m n. GSNAP (indel): -t 1 -m 0 -i 0. BWA (mm): aln -o 0 -n n. BWA (indel): aln -n 3 -o 1 -O 1 -E 1. Bowtie: -f -k 10 –quiet -p 1 -v n. SOAP2: -r 2 -v n. SOAP (mm): -s 12 -r 2 -w 10 -v n. SOAP (indel): -s 12 -r 2 -w 10 -v 0 -g 3. MAQ: map -C 10 -e 200 -n n. For the 3-mismatch dataset, Bowtie was also run in its MAQ mode, by removing the -v flag for limiting the number of mismatches and adding ‘-e 200’ to permit more mismatches. In that mode, times for the 36, 70 and 100 nt datasets were 46, 142 and 750 s, but miss rates were 57.2, 13.4 and 6.4%.
Effect of splicing, indels and SNP tolerance on transcriptional dataset
| Alignment yield (%) | SNP effect on alignment (%) | Time (s) | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| Variants allowed | Non-SNP | SNP | New | Same | Superset | Subset | Diff | Total | Non-SNP | SNP |
| ≤2 mismatches | 69.7 | 70.2 | 0.5 | 1.7 | 4.9 | 0.4 | 0.1 | 7.5 | 52 | 57 |
| Above plus known splicing | 78.7 | 79.1 | 0.5 | 1.8 | 4.8 | 0.3 | 0.2 | 7.6 | 381 | 457 |
| Above plus novel splicing | 79.0 | 79.5 | 0.5 | 1.9 | 4.8 | 0.2 | 0.3 | 7.7 | 457 | 522 |
| Above plus indels | 80.2 | 80.7 | 0.5 | 1.8 | 4.9 | 0.2 | 0.5 | 8.0 | 725 | 905 |
| ≤3 mismatches | 73.9 | 74.3 | 0.4 | 2.0 | 5.2 | 0.4 | 0.1 | 8.1 | 161 | 214 |
| Above plus known splicing | 82.2 | 82.6 | 0.4 | 2.1 | 5.1 | 0.3 | 0.2 | 8.0 | 530 | 668 |
| Above plus novel splicing | 82.8 | 83.1 | 0.4 | 2.2 | 5.1 | 0.3 | 0.3 | 8.3 | 624 | 755 |
| Above plus indels | 83.8 | 84.2 | 0.4 | 2.1 | 5.1 | 0.3 | 0.6 | 8.5 | 949 | 1262 |
Effect of simulated BS treatment
| Alignment uniqueness (%) | ||||
|---|---|---|---|---|
| Length (nt) | Variant | Genomic | BS | Difference |
| 36 | Exact | 87.1 | 81.6 | 5.5 |
| 1 mismatch | 86.7 | 80.6 | 6.0 | |
| 2 mismatches | 85.3 | 78.7 | 6.5 | |
| 70 | Exact | 95.0 | 92.4 | 2.6 |
| 1 mismatch | 94.9 | 92.1 | 2.8 | |
| 2 mismatches | 94.7 | 91.7 | 3.0 | |
| 100 | Exact | 96.6 | 95.3 | 1.3 |
| 1 mismatch | 96.6 | 95.2 | 1.4 | |
| 2 mismatches | 96.5 | 95.1 | 1.4 | |