Timo Lassmann, Yoshihide Hayashizaki, Carsten O Daub.
Abstract
MOTIVATION: The sequence alignment/map (SAM) format is commonly used to store the alignments between millions of short reads and a reference genome. Certain positions within the reads are often inherently more likely to contain errors because of the protocols used to prepare the samples. Such biases can adversely affect both mapping rate and accuracy. To understand the relationship between potential protocol biases and poor mapping, we wrote SAMstat, a simple C program that plots nucleotide overrepresentation and other statistics for mapped and unmapped reads in a concise HTML page. Collecting such statistics also makes it easy to highlight problems in the data processing and enables non-experts to track data quality over time.
Year: 2010 PMID: 21088025 PMCID: PMC3008642 DOI: 10.1093/bioinformatics/btq614
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Overview of SAMstat output
| Reported statistics |
|---|
| Mapping rate |
| Read length distribution |
| Nucleotide composition |
| Mean base quality at each read position |
| Overrepresented 10mers |
| Overrepresented dinucleotides along read |
| Mismatch, insertion and deletion profile |
a Only reported for SAM files.
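Two of the statistics in the table, mapping rate and nucleotide composition along the read, can be illustrated with a short sketch. This is not SAMstat's C implementation; it is a minimal Python illustration that assumes plain-text SAM input and relies only on the standard SAM column layout (FLAG in column 2, SEQ in column 10) and the FLAG bits `0x4` (unmapped) and `0x10` (reverse strand). The function name `summarize_sam` is hypothetical.

```python
from collections import Counter

UNMAPPED = 0x4   # SAM FLAG bit: read is unmapped
REVERSE = 0x10   # SAM FLAG bit: SEQ is stored reverse-complemented
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def summarize_sam(lines):
    """Return (mapping_rate, per-position base counts) for SAM text lines.

    Hypothetical helper for illustration only; not part of SAMstat.
    """
    total = mapped = 0
    composition = []  # composition[i] counts the bases seen at read position i
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag, seq = int(fields[1]), fields[9].upper()
        total += 1
        if not flag & UNMAPPED:
            mapped += 1
        if seq == "*":
            continue  # sequence not stored in this record
        if flag & REVERSE:
            # Restore the orientation in which the read was sequenced.
            seq = seq.translate(COMPLEMENT)[::-1]
        for i, base in enumerate(seq):
            if i == len(composition):
                composition.append(Counter())
            composition[i][base] += 1
    rate = mapped / total if total else 0.0
    return rate, composition
```

Restoring the original read orientation matters here because protocol-related biases are tied to positions in the read as sequenced, not to positions on the reference strand.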
Fig. 1. A selection of SAMstat's HTML output. (a) Mapping statistics. More than half of the reads are mapped with a high mapping accuracy (red), while 9.9% of the reads remain unmapped (black). (b) Bar charts showing the distribution of mismatches and insertions along the read for alignments with the highest mapping accuracy [shown in red in (a)]. The colors indicate the mismatched nucleotides found in the read or the nucleotides inserted into the read. (c, d and e) Frequency of mismatches at the start of reads with mapping accuracies 1e−3 ≤ P < 1e−2, 1e−2 ≤ P < 0.5 and 0.5 ≤ P < 1, respectively [shown in orange, yellow and blue in (a)]. The fraction of mismatches involving Gs at positions 2–5 increases from (c) to (e). (f) Percentage of ‘GG’ dinucleotides at positions 1–5 in reads split up by mapping quality intervals. The background color highlights large percentages. The first and last rows for the dinucleotides ‘GT’ and ‘GC’ are shown for comparison.
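The intervals in the caption can be related to the MAPQ column of a SAM file. The sketch below assumes that P in the caption is the probability that the mapping position is wrong, which the SAM specification encodes as MAPQ = −10 log10(P). The function names, the label for the highest-accuracy class and the handling of MAPQ 0 and 255 are illustrative assumptions, not SAMstat's actual code.

```python
def mapq_to_error_probability(mapq):
    """Convert a Phred-scaled MAPQ into the probability of a wrong mapping."""
    return 10 ** (-mapq / 10)


def accuracy_bin(mapq):
    """Label a mapped read with one of the intervals named in the caption.

    Hypothetical helper for illustration only; not part of SAMstat.
    """
    if mapq == 255:
        return "unavailable"  # SAM reserves 255 for "mapping quality not available"
    p = mapq_to_error_probability(mapq)
    if p < 1e-3:
        return "P < 1e-3"  # assumed label for the highest-accuracy class
    if p < 1e-2:
        return "1e-3 <= P < 1e-2"
    if p < 0.5:
        return "1e-2 <= P < 0.5"
    if p < 1:
        return "0.5 <= P < 1"
    return "P = 1"  # MAPQ 0: the mapping carries no confidence
```

Under this assumption, a read with MAPQ 25 has P ≈ 0.003 and falls into the 1e−3 ≤ P < 1e−2 interval, while a read with MAPQ 2 has P ≈ 0.63 and falls into the 0.5 ≤ P < 1 interval.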