Aakrosh Ratan, Yu Zhang, Vanessa M Hayes, Stephan C Schuster, Webb Miller.
Abstract
BACKGROUND: The most common application of next-generation sequencing technologies is resequencing, where short reads from the genome of an individual are aligned to a reference genome sequence for the same species. These mappings can then be used to identify genetic differences among individuals in a population, and perhaps ultimately to explain phenotypic variation. Many algorithms capable of aligning short reads to a reference and determining the differences between them have been reported. Much less has been reported on how to use these technologies to determine genetic differences among individuals of a species for which no reference sequence is available, a gap that drastically limits the number of species that can easily benefit from these new technologies.
Year: 2010 PMID: 20230626 PMCID: PMC2851604 DOI: 10.1186/1471-2105-11-130
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Outline of the DIAL pipeline to call SNPs.
Numbers of heterozygous positions correctly identified, and numbers of false-positive predictions from low-copy-number repeats, expressed as a percentage of the identified SNPs.
| λ (rows) / x (columns) | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|
| 0.5 | 473/54.8% | 552/65.0% | 561/68.3% | 561/69.0% | 561/69.1% |
| 1.0 | 4,598/28.6% | 6,131/37.9% | 6,450/43.5% | 6,501/45.9% | 6,508/46.7% |
| 1.5 | 14,119/14.9% | 21,179/21.4% | 23,386/26.8% | 23,915/30.3% | 24,021/32.0% |
| 2.0 | 27,067/7.8% | 45,111/11.8% | 52,630/16.0% | 55,036/19.5% | 55,675/21.8% |
| 2.5 | 40,080/4.1% | 73,481/6.5% | 90,877/9.4% | 97,835/12.2% | 100,145/14.6% |
The values are given as a function of fold coverage (λ, row labels) and the upper bound on the number of overlapping reads (x, column labels). For instance, at λ = 1.0 and x = 5, there are 6,131 correct SNP calls and 37.9% as many duplication-induced erroneous ones. This is a theoretical analysis based on informally fitting a model (see text) to data from the genome of Dr. James Watson.
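To make the cell format concrete, the following minimal Python sketch splits a cell such as `6,131/37.9%` into its two parts and converts the percentage into an approximate count of duplication-induced erroneous calls. It assumes, following the worked example in the caption, that the percentage is taken relative to the number of correct calls; the function names `parse_cell` and `estimated_false_positives` are illustrative and do not come from DIAL.

```python
def parse_cell(cell: str) -> tuple[int, float]:
    """Split a table cell like '6,131/37.9%' into (correct_calls, ratio)."""
    count_text, pct_text = cell.split("/")
    correct_calls = int(count_text.replace(",", ""))
    # Assumption: the percentage is relative to the number of correct calls.
    ratio = float(pct_text.rstrip("%")) / 100.0
    return correct_calls, ratio


def estimated_false_positives(cell: str) -> float:
    """Approximate number of duplication-induced erroneous calls for a cell."""
    correct_calls, ratio = parse_cell(cell)
    return correct_calls * ratio


# The caption's example: fold coverage 1.0, at most 5 overlapping reads.
correct, ratio = parse_cell("6,131/37.9%")
false_positives = estimated_false_positives("6,131/37.9%")
print(correct, round(false_positives))
```

Under that reading, the λ = 1.0, x = 5 cell corresponds to roughly 2,300 erroneous calls alongside the 6,131 correct ones.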
Figure 2. Variation with coverage (Watson genome dataset). A plot showing the variation of several results as a function of sequence coverage. "NumSNPs" is the number of SNPs called by DIAL. "NumMapped" is the number of SNPs and their assembled flanking regions that could be uniquely mapped with greater than 98% identity to NCBI Build 36 of the human genome. "NumVerified" is the number of DIAL SNP calls that were reported for the Watson genome by Cold Spring Harbor Lab, ENSEMBL, or dbSNP.
Figure 3. Magnification of the data in Figure 2 for 1-fold coverage or less.
Figure 4. Rates of erroneous SNP calls associated with various causes (Watson genome dataset). "Not verified" means that the SNP was not found in the databases of Watson's heterozygous positions that we consulted. "Not alignable" means that the assembled neighborhood of the SNP did not align to the human reference at 98% identity or higher over at least 200 bp, whereas "Not unique" means that it aligned more than once.
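The "Not alignable" and "Not unique" categories can be read as a simple rule over the alignment hits of a SNP's assembled neighborhood. The sketch below is one possible reading, not the authors' code: the `Hit` record, the function `classify_neighborhood`, and the label `"uniquely mapped"` are hypothetical, and only the 98% identity and 200 bp thresholds come from the caption. The "Not verified" category depends on a separate database lookup and is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One alignment of an assembled SNP neighborhood to the reference."""
    identity: float  # fraction of identical bases, between 0 and 1
    length: int      # aligned length in base pairs


def classify_neighborhood(hits, min_identity=0.98, min_length=200):
    """Label a neighborhood using the thresholds quoted in the caption."""
    # Keep only hits that meet both the identity and the length threshold.
    qualifying = [
        h for h in hits
        if h.identity >= min_identity and h.length >= min_length
    ]
    if not qualifying:
        return "not alignable"
    if len(qualifying) > 1:
        return "not unique"
    return "uniquely mapped"


# Example: one strong hit plus one short, weak hit counts as a unique mapping.
example_hits = [Hit(identity=0.995, length=350), Hit(identity=0.90, length=120)]
print(classify_neighborhood(example_hits))
```

The thresholds are keyword parameters so that the same rule could be applied with the 96% identity cutoff quoted for the orangutan dataset.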
Figure 5. Variation with coverage (Orangutan dataset). A plot showing the variation of several results as a function of sequence coverage. "NumSNPs" is the number of SNPs called by DIAL. "NumMapped" is the number of SNPs and their assembled flanking regions that could be uniquely mapped with greater than 96% identity to the Pongo_abelii-2.0 assembly of the orangutan genome. "NumVerified" is the number of DIAL SNP calls that were verified as present in the read mappings (see text).