Pablo Cingolani, Viral M Patel, Melissa Coon, Tung Nguyen, Susan J Land, Douglas M Ruden, Xiangyi Lu.
Abstract
This paper describes a new program, SnpSift, for filtering differential DNA sequence variants between two or more experimental genomes after genotoxic chemical exposure. Here, we illustrate how SnpSift can be used to identify candidate phenotype-relevant variants including single nucleotide polymorphisms, multiple nucleotide polymorphisms, insertions, and deletions (InDels) in mutant strains isolated from genome-wide chemical mutagenesis of Drosophila melanogaster. First, the genomes of two independently isolated mutant fly strains that are allelic for a novel recessive male-sterile locus generated by genotoxic chemical exposure were sequenced using the Illumina next-generation DNA sequencer to obtain 20- to 29-fold coverage of the euchromatic sequences. The sequencing reads were processed and variants were called using standard bioinformatic tools. Next, SnpEff was used to annotate all sequence variants and their potential mutational effects on associated genes. Then, SnpSift was used to filter and select differential variants that potentially disrupt a common gene in the two allelic mutant strains. The potential causative DNA lesions were partially validated by capillary sequencing of polymerase chain reaction-amplified DNA in the genetic interval as defined by meiotic mapping and deletions that remove defined regions of the chromosome. Of the five candidate genes located in the genetic interval, the Pka-like gene CG12069 was found to carry a separate premature stop codon mutation in each of the two allelic mutants, whereas the other four candidate genes within the interval have wild-type sequences. The Pka-like gene is therefore a strong candidate gene for the male-sterile locus. These results demonstrate that combining SnpEff and SnpSift can expedite the identification of candidate phenotype-causative mutations in chemically mutagenized Drosophila strains.
This technique can also be used to characterize the variety of mutations generated by genotoxic chemicals.
Keywords: Drosophila melanogaster; next-generation DNA sequencing; personal genomes; whole-genome SNP analysis
Year: 2012 PMID: 22435069 PMCID: PMC3304048 DOI: 10.3389/fgene.2012.00035
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Figure 1. Mapping X1 to the reference genome. The reference genome used was the latest FlyBase version (dm5.30). The quality score was arbitrarily set at 70 and above for this table. The numbers indicate the numbers of reads mapped to the indicated genomic region. U, unmapped regions. Het, heterochromatic regions.
Figure 2. Single nucleotide polymorphism calling for X1 SNPs with a quality score greater than or equal to 70. We performed SNP calling using Samtools, which produced 1,943,047 SNPs with a quality score > 1. Of these, 1,036,435 are homozygous SNPs. The low-quality SNPs were filtered out using an arbitrary threshold of 70 (the peak of the distribution), leaving 204,205 homozygous SNPs. A summary of the remaining homozygous SNPs found in each category is shown in the numbers above the bars.
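The filtering described in this caption reduces to two tests per call: a quality threshold and a homozygosity check. The following is a minimal Python sketch, assuming each call has been reduced to a position, a quality score, and a VCF-style genotype string. The `SnpCall` layout, the function names, and the toy values are illustrative assumptions; they are not taken from the paper's data or from Samtools output.

```python
from typing import List, NamedTuple


class SnpCall(NamedTuple):
    """Hypothetical, simplified record for one SNP call."""
    pos: int
    qual: float
    genotype: str  # VCF-style genotype, e.g. "0/1" or "1/1"


def is_homozygous_variant(genotype: str) -> bool:
    """True when all alleles are identical and not the reference ("0")."""
    alleles = genotype.replace("|", "/").split("/")
    return len(set(alleles)) == 1 and alleles[0] not in ("0", ".")


def filter_calls(calls: List[SnpCall], min_qual: float = 70.0) -> List[SnpCall]:
    """Keep homozygous variant calls with quality >= min_qual."""
    return [
        c for c in calls
        if c.qual >= min_qual and is_homozygous_variant(c.genotype)
    ]


# Toy input (made up for illustration only).
calls = [
    SnpCall(100, 85.0, "1/1"),  # kept
    SnpCall(200, 99.0, "0/1"),  # dropped: heterozygous
    SnpCall(300, 12.0, "1/1"),  # dropped: low quality
    SnpCall(400, 70.0, "1/1"),  # kept: threshold is inclusive
]
kept = filter_calls(calls)
print([c.pos for c in kept])  # [100, 400]
```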
Figure 3. Flowchart for finding the causative SNPs in X1 and X2. (A) SnpEff identified 16,921 “class 1” SNPs (see text) with a quality score > 1 in both X1 and X2 (zero quality scores usually result from reads mapping to multiple genomic regions). There are 558 SNPs that are only present in X1 and 447 SNPs that are only present in X2. (B) Since we know that the X1 and X2 mutations are on chromosome 3, we focused on the 141 strong SNPs on chromosome 3 that are present in X1 or X2 but not both. There are only eight genes that are commonly affected by unique SNPs in both X1 and X2 (note that the eight genes have at least two SNPs at different bases). (C) List of the eight genes with SNPs in both X1 and X2. See Table 1 for more details. (D) Only one gene, CG12069/Pka-like, contained SNPs with scores > 60. These SNPs were validated by capillary sequencing of PCR-amplified DNA from the genetic interval of the male-sterile locus as defined by meiotic and deletion mapping data (see text).
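Steps (A) to (C) of this flowchart are essentially set operations: find variants private to each strain, restrict to chromosome 3, then intersect the affected genes. The Python sketch below illustrates that logic under stated assumptions. Variants are represented as `(chrom, pos, ref, alt)` tuples, `gene_of` is a hypothetical lookup standing in for SnpEff annotation, and every value is toy data, not data from the paper.

```python
from typing import Dict, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)

CHR3_ARMS = {"3L", "3R"}  # assumption: chromosome 3 is reported as two arms


def genes_hit_privately_in_both(
    x1: Set[Variant],
    x2: Set[Variant],
    gene_of: Dict[Variant, str],
) -> Set[str]:
    """Genes on chromosome 3 carrying an X1-only AND an X2-only variant."""
    only_x1 = {v for v in x1 - x2 if v[0] in CHR3_ARMS}
    only_x2 = {v for v in x2 - x1 if v[0] in CHR3_ARMS}
    genes_x1 = {gene_of[v] for v in only_x1 if v in gene_of}
    genes_x2 = {gene_of[v] for v in only_x2 if v in gene_of}
    return genes_x1 & genes_x2


# Toy data (hypothetical gene names and coordinates).
x1 = {("3R", 100, "G", "A"), ("3R", 250, "C", "T"),
      ("3R", 900, "T", "C"), ("2L", 50, "A", "G")}
x2 = {("3R", 130, "C", "T"), ("3R", 250, "C", "T"),
      ("3R", 700, "G", "A"), ("2L", 60, "A", "G")}
gene_of = {
    ("3R", 100, "G", "A"): "geneA",
    ("3R", 130, "C", "T"): "geneA",
    ("3R", 250, "C", "T"): "geneB",  # shared by both strains, so ignored
    ("3R", 900, "T", "C"): "geneC",
    ("3R", 700, "G", "A"): "geneD",
    ("2L", 50, "A", "G"): "geneE",   # off chromosome 3, so ignored
    ("2L", 60, "A", "G"): "geneE",
}

shared_genes = genes_hit_privately_in_both(x1, x2, gene_of)
print(sorted(shared_genes))  # ['geneA']
```

The design rests on the premise stated in the abstract: two independently isolated allelic mutants should each carry a different lesion in the same gene, so variants common to both strains are treated as background.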
Gene candidates for X1 and X2.
| Gene Name | X1 SNPs | Score | X2 SNPs | Score |
|---|---|---|---|---|
| Ank2 | 15 | All < 5 | 14 | All < 5 |
| Hsromega | 4 | All < 5 | 4 | All < 5 |
| CG12069 (Pka-like) | 1 | 102 (W308/*) | 1 | 66 (Q9/*) |
| prc | 2 | 1, 10 | 2 | 2, 21 |
| CG13826 | 1 | 36 (I70/F) | 1 | 30 (I70/L) |
| Muc68Ca | 1 | 1 | 1 | 2 |
| Rgl | 1 | 30 (N8/T) | 1 | 33 (N8/S) |
| sls | 1 | 1 | 1 | 1 |
X1 SNPs and X2 SNPs: the number of SNPs in the indicated gene in X1 and X2. Score: the SNP quality score produced by the alignment and variant-calling software (e.g., Samtools and BCFtools).
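Step (D) of the Figure 3 flowchart, which keeps genes whose SNP scores exceed 60 in both strains, can be checked directly against the table above. In this small Python sketch, each gene is reduced to its best score per strain. Rows reported only as "All < 5" are stored as the exclusive upper bound 5, which is an encoding assumption made here; the dictionary and function names are likewise illustrative.

```python
from typing import Dict, List, Tuple

# gene -> (best X1 score, best X2 score), transcribed from the table.
# "All < 5" rows use 5 as an exclusive upper bound (assumption of this sketch).
BEST_SCORES: Dict[str, Tuple[int, int]] = {
    "Ank2": (5, 5),
    "Hsromega": (5, 5),
    "CG12069 (Pka-like)": (102, 66),
    "prc": (10, 21),
    "CG13826": (36, 30),
    "Muc68Ca": (1, 2),
    "Rgl": (30, 33),
    "sls": (1, 1),
}


def strong_in_both(scores: Dict[str, Tuple[int, int]], threshold: int) -> List[str]:
    """Genes whose best score exceeds the threshold in both X1 and X2."""
    return [g for g, (s1, s2) in scores.items() if s1 > threshold and s2 > threshold]


strong = strong_in_both(BEST_SCORES, threshold=60)
print(strong)  # ['CG12069 (Pka-like)']
```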
Figure 4. The candidate gene mutated in X1 and X2 is CG12069/Pka-like. (A) Map of the CG12069/Pka-like region on chromosome 3R. The image is adapted from the FlyBase genome browser. The genomic location (26,520 k) is indicated in kilobase pairs. (B) Location of X1 and X2 SNPs. (C) Conserved domains in CG12069/Pka-like.
Operators allowed in SnpSift filter.
| Operator | Description | Data type | Example |
|---|---|---|---|
| = | Equality test | FLOAT, INT or STRING | (REF = ‘A’) |
| > | Greater than | FLOAT or INT | (DP > 20) |
| ≥ | Greater than or equal to | FLOAT or INT | (DP ≥ 20) |
| < | Less than | FLOAT or INT | (DP < 20) |
| ≤ | Less than or equal to | FLOAT or INT | (DP ≤ 20) |
| =~ | Match regular expression | STRING | (REL =~ ‘AC’) |
| !~ | Does not match regular expression | STRING | (REL !~ ‘AC’) |
| & | AND operator | Boolean | (DP > 20) & (REF = ‘A’) |
| \| | OR operator | Boolean | (DP > 20) \| (REF = ‘A’) |
| ! | NOT operator | Boolean | ! (DP > 20) |
| exists | The variable exists (not missing) | Any | (exists INDEL) |
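To make the semantics of these operators concrete, the sketch below restates several of the table's example expressions as Python predicates over a single record. This is an illustration of the logic only, not SnpSift's implementation: the record is a plain dictionary with made-up field values, `=~` is assumed to match anywhere in the string (a search rather than a full match), and `exists` is assumed to mean "present and not missing".

```python
import re
from typing import Any, Callable, Dict

Record = Dict[str, Any]


def exists(rec: Record, name: str) -> bool:
    """Field is present and not missing."""
    return rec.get(name) is not None


def regex_match(rec: Record, name: str, pattern: str) -> bool:
    """Assumed '=~' semantics: pattern found anywhere in the field value."""
    return exists(rec, name) and re.search(pattern, str(rec[name])) is not None


# Filter expression (as written in the table) -> equivalent Python predicate.
PREDICATES: Dict[str, Callable[[Record], bool]] = {
    "(REF = 'A')": lambda r: r["REF"] == "A",
    "(DP > 20)": lambda r: r["DP"] > 20,
    "(REL =~ 'AC')": lambda r: regex_match(r, "REL", "AC"),
    "(REL !~ 'AC')": lambda r: not regex_match(r, "REL", "AC"),
    "(DP > 20) & (REF = 'A')": lambda r: (r["DP"] > 20) and (r["REF"] == "A"),
    "(DP > 20) | (REF = 'A')": lambda r: (r["DP"] > 20) or (r["REF"] == "A"),
    "!(DP > 20)": lambda r: not (r["DP"] > 20),
    "(exists INDEL)": lambda r: exists(r, "INDEL"),
}

# Toy record (values are illustrative, not from the paper).
record: Record = {"REF": "A", "DP": 25, "REL": "GACT"}

results = {expr: pred(record) for expr, pred in PREDICATES.items()}
for expr, value in results.items():
    print(f"{expr:26} -> {value}")
```

For this record, every expression evaluates to true except `(REL !~ 'AC')`, `!(DP > 20)`, and `(exists INDEL)`.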
Functions implemented in SnpSift filter.
| Function | Description | Data type | Example |
|---|---|---|---|
| countHom | Count number of homozygous genotypes | No arguments | ( |
| countHet | Count number of heterozygous genotypes | No arguments | ( |
| countVariant | Count number of genotypes that are variants (i.e., not reference 0/0) | No arguments | ( |
| countRef | Count number of genotypes that are NOT variants (i.e., reference 0/0) | No arguments | ( |