Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, Richard Durbin.
Abstract
SUMMARY: The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences, supporting short and long reads (up to 128 Mbp) produced by different sequencing platforms. It is flexible in style, compact in size, efficient in random access and is the format in which alignments from the 1000 Genomes Project are released. SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant caller and alignment viewer, and thus provides universal tools for processing read alignments. AVAILABILITY: http://samtools.sourceforge.net.
Year: 2009 PMID: 19505943 PMCID: PMC2723002 DOI: 10.1093/bioinformatics/btp352
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Example of extended CIGAR and the pileup output. (a) Alignments of one pair of reads and three single-end reads. (b) The corresponding SAM file. The ‘@SQ’ line in the header section gives the order of reference sequences. Notably, r001 is the name of a read pair. According to FLAG 163 (=1 + 2 + 32 + 128), the read mapped to position 7 is the second read in the pair (128) and regarded as properly paired (1 + 2); its mate is mapped to 37 on the reverse strand (32). Read r002 has three soft-clipped (unaligned) bases. The coordinate shown in SAM is the position of the first aligned base. The CIGAR string for this alignment contains a P (padding) operation which correctly aligns the inserted sequences. Padding operations can be absent when an aligner does not support multiple sequence alignment. The last six bases of read r003 map to position 9, and the first five to position 29 on the reverse strand. The hard clipping operation H indicates that the clipped sequence is not present in the sequence field. The NM tag gives the number of mismatches. Read r004 is aligned across an intron, indicated by the N operation. (c) Simplified pileup output by SAMtools. Each line consists of reference name, sorted coordinate, reference base, the number of reads covering the position and read bases. In the fifth field, a dot or a comma denotes a base identical to the reference; a dot or a capital letter denotes a base from a read mapped on the forward strand, while a comma or a lowercase letter on the reverse strand.
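The FLAG arithmetic in the caption (163 = 1 + 2 + 32 + 128) can be checked with a short sketch. This is an illustration only, not code from SAMtools: the function name `decode_flag` and the wording of the labels are assumptions, and only bits 1, 2, 32 and 128 are described in the caption. The remaining bit meanings follow the SAM specification.

```python
# Bits 1, 2, 32 and 128 are the ones explained in the caption.
# The other entries follow the SAM specification and are not stated in this text.
FLAG_BITS = {
    0x1: "read is paired",
    0x2: "mapped in a proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read on reverse strand",
    0x20: "mate on reverse strand",
    0x40: "first read in pair",
    0x80: "second read in pair",
}


def decode_flag(flag):
    """Return the labels of all bits set in a SAM FLAG value (hypothetical helper)."""
    return [label for bit, label in sorted(FLAG_BITS.items()) if flag & bit]


if __name__ == "__main__":
    # FLAG of the read mapped to position 7 in the caption.
    for label in decode_flag(163):
        print(label)
```

For FLAG 163 the sketch reports the four properties named in the caption: paired, properly paired, mate on the reverse strand, and second read in the pair.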
Mandatory fields in the SAM format
| No. | Name | Description |
|---|---|---|
| 1 | QNAME | Query NAME of the read or the read pair |
| 2 | FLAG | Bitwise FLAG (pairing, strand, mate strand, etc.) |
| 3 | RNAME | Reference sequence NAME |
| 4 | POS | 1-Based leftmost POSition of clipped alignment |
| 5 | MAPQ | MAPping Quality (Phred-scaled) |
| 6 | CIGAR | Extended CIGAR string (operations: |
| 7 | MRNM | Mate Reference NaMe (‘=’ if same as |
| 8 | MPOS | 1-Based leftmost Mate POSition |
| 9 | ISIZE | Inferred Insert SIZE |
| 10 | SEQ | Query SEQuence on the same strand as the reference |
| 11 | QUAL | Query QUALity (ASCII-33=Phred base quality) |
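As a rough illustration of how the eleven mandatory fields might be read, the sketch below splits one tab-separated alignment line and converts the QUAL string to Phred scores using the ASCII-33 rule from the table. The names `SamRecord`, `parse_sam_line` and `phred_scores` are hypothetical, the example line is made up for this sketch rather than taken from Fig. 1, and treating a QUAL of `*` as "not stored" is a convention from the SAM specification that is not stated in this text.

```python
from collections import namedtuple

# Field names mirror the capitalised letters in the table descriptions.
SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar mrnm mpos isize seq qual",
)


def parse_sam_line(line):
    """Split one alignment line into its 11 mandatory fields (hypothetical helper)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError("expected at least 11 tab-separated fields")
    qname, flag, rname, pos, mapq, cigar, mrnm, mpos, isize, seq, qual = cols[:11]
    # FLAG, POS, MAPQ, MPOS and ISIZE are integers; the rest stay as strings.
    return SamRecord(qname, int(flag), rname, int(pos), int(mapq),
                     cigar, mrnm, int(mpos), int(isize), seq, qual)


def phred_scores(qual):
    """Convert a QUAL string to Phred base qualities (ASCII code minus 33)."""
    if qual == "*":  # assumption from the SAM specification: qualities not stored
        return []
    return [ord(ch) - 33 for ch in qual]


if __name__ == "__main__":
    # Made-up example line, not copied from the paper's figure.
    line = "read1\t163\tref\t7\t30\t8M\t=\t37\t39\tTTAGATAA\tIIIIIIII"
    rec = parse_sam_line(line)
    print(rec.qname, rec.flag, rec.pos, rec.cigar)
    print(phred_scores(rec.qual))
```

Optional tag fields such as NM would follow the eleventh column; the sketch ignores them.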