SNPAAMapperT2K: A genome-wide SNP downstream analysis and annotation pipeline for species annotated with NCBI.tbl data files.

Abstract

SNPAAMapper, a genome-wide SNP downstream analysis and annotation pipeline, was designed to classify detected variants by genomic region and report their mutation class from whole-genome and/or whole-exome sequencing data. The widely used annotation table format "knownGene.txt" has not yet been created for many popular model organisms (e.g. Arabidopsis); instead, NCBI .tbl annotation files are provided for these species. We therefore describe SNPAAMapperT2K, a genome-wide SNP downstream analysis and annotation pipeline for species annotated with NCBI .tbl data files (e.g. Arabidopsis). The pipeline was tested with a deeply sequenced Arabidopsis thaliana strain (Seattle-0). SNPAAMapperT2K can also annotate and report SNP classes for other species whose chromosomes are annotated in NCBI .tbl format but for which no knownGene.txt file is available. Availability: Perl scripts and required input files are available on the web at http://isu.indstate.edu/ybai2/SNPAAMapperT2K.

Year: 2014 PMID: 25512690 PMCID: PMC4261118 DOI: 10.6026/97320630010711

Source DB: PubMed Journal: Bioinformation ISSN: 0973-2063

Background

Exome sequencing technology is being employed to identify single nucleotide polymorphisms (SNPs) and/or insertions and deletions (INDELs) in genetic disease research. The schema for UCSC Genes (knownGene.txt) [1] is widely used in both standard and customized downstream analysis tools and scripts. However, for many popular model organisms (e.g. Arabidopsis), sequence and annotation data tables (including knownGene.txt) have not yet been made publicly available. SNPAAMapper [2], a genome-wide SNP analysis and annotation pipeline for whole-genome and/or whole-exome sequencing data, was developed to perform downstream annotation of detected variants; it classifies variants by region and reports the hit class, and it requires knownGene.txt as one of its input files. We have developed Tbl2KnownGene [3], a .tbl file parser that processes the contents of a National Center for Biotechnology Information (NCBI) .tbl file (e.g. the one for the Arabidopsis genome (TAIR10)) [4, 5] and produces a UCSC Known Genes annotation feature table. Arabidopsis chromosomes are annotated as .tbl files by TAIR, so knownGene.txt files are not available for them. In this study, we developed SNPAAMapperT2K, a genome-wide SNP analysis pipeline for species that have .tbl but not knownGene.txt files available. We have generated annotation files for Arabidopsis; users can download them and run their sequence read files against these supporting files. The pipeline can be extended to annotate SNPs for other species that were annotated using .tbl files but lack knownGene.txt files.
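To make the .tbl input concrete, the following is a minimal Python sketch of reading an NCBI feature table into feature records. It is not the Tbl2KnownGene implementation; the function name parse_tbl, the record layout, and the sample lines (chr_example, GENE1) are illustrative assumptions. It relies only on the general layout of the feature table format: a ">Feature" header line, tab-separated coordinate lines (start, stop, feature key), continuation lines carrying only start and stop, and qualifier lines indented by three tabs.

```python
# Hypothetical sample input; the names and coordinates are made up for illustration.
SAMPLE_TBL = (
    ">Feature chr_example\n"
    "100\t900\tgene\n"
    "\t\t\tlocus_tag\tGENE1\n"
    "100\t300\tmRNA\n"
    "400\t900\n"
    "\t\t\tproduct\thypothetical protein\n"
)


def parse_tbl(lines):
    """Yield one dict per feature found in an iterable of .tbl lines.

    Each dict has the keys: seq_id, key, intervals, qualifiers.
    Intervals keep the 1-based (start, stop) order of the file, so
    start > stop still signals a minus-strand feature.
    """
    seq_id = None
    feature = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue
        if line.startswith(">Feature"):
            parts = line.split()
            seq_id = parts[1] if len(parts) > 1 else None
            continue
        cols = line.split("\t")
        if cols[0]:
            # Coordinate line; "<" and ">" mark partial ends and are dropped here.
            start = int(cols[0].lstrip("<>"))
            stop = int(cols[1].lstrip("<>"))
            if len(cols) >= 3 and cols[2]:
                # A feature key in column 3 starts a new feature.
                if feature is not None:
                    yield feature
                feature = {
                    "seq_id": seq_id,
                    "key": cols[2],
                    "intervals": [(start, stop)],
                    "qualifiers": {},
                }
            elif feature is not None:
                # Continuation line: an extra interval (e.g. another exon).
                feature["intervals"].append((start, stop))
        elif len(cols) >= 4 and feature is not None:
            # Qualifier line: three empty columns, then key and optional value.
            feature["qualifiers"][cols[3]] = cols[4] if len(cols) > 4 else ""
    if feature is not None:
        yield feature


if __name__ == "__main__":
    for feat in parse_tbl(SAMPLE_TBL.splitlines()):
        print(feat["key"], feat["intervals"], feat["qualifiers"])
```

Keeping the file's own coordinate order in each record postpones the strand decision to whatever step builds the output table, which keeps the reader itself format-only.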

Methodology

The SNPAAMapperT2K algorithm consists of two major modules: the first converts an NCBI .tbl file to the UCSC knownGene.txt file format, and the second uses the converted knownGene files and calls BWA [6], SAMtools [7], and custom scripts to report the hit class. The workflow of SNPAAMapperT2K is shown in Figure 1.

Figure 1. The workflow of SNPAAMapperT2K.
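The core of the first module, turning .tbl-style intervals into one knownGene.txt row, can be sketched in Python as follows. This is an illustration under stated assumptions, not the pipeline's own Perl code: the function name known_gene_row and the example identifiers are hypothetical, the proteinID column is left empty, and alignID is simply set to the transcript name. The sketch assumes the usual UCSC column order (name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds, proteinID, alignID) with 0-based start coordinates, and 1-based inclusive .tbl intervals in which start > stop indicates the minus strand. Only protein-coding transcripts are handled.

```python
def known_gene_row(name, chrom, mrna_intervals, cds_intervals):
    """Build one tab-separated knownGene-style row for a coding transcript.

    mrna_intervals / cds_intervals: lists of (start, stop) pairs taken
    from a .tbl file (1-based, inclusive; start > stop on the minus strand).
    """
    if not mrna_intervals or not cds_intervals:
        raise ValueError("this sketch only handles coding transcripts")

    first_start, first_stop = mrna_intervals[0]
    strand = "-" if first_start > first_stop else "+"

    # Normalise every interval to (low, high) and order along the chromosome.
    exons = sorted((min(a, b), max(a, b)) for a, b in mrna_intervals)
    coding = sorted((min(a, b), max(a, b)) for a, b in cds_intervals)

    # Convert 1-based inclusive starts to 0-based; ends stay unchanged.
    tx_start, tx_end = exons[0][0] - 1, exons[-1][1]
    cds_start, cds_end = coding[0][0] - 1, coding[-1][1]

    # knownGene stores exon boundaries as comma-terminated lists.
    exon_starts = "".join(f"{low - 1}," for low, _ in exons)
    exon_ends = "".join(f"{high}," for _, high in exons)

    fields = [
        name, chrom, strand,
        str(tx_start), str(tx_end),
        str(cds_start), str(cds_end),
        str(len(exons)), exon_starts, exon_ends,
        "",    # proteinID: left empty in this sketch (assumption)
        name,  # alignID: reuses the transcript name (assumption)
    ]
    return "\t".join(fields)


if __name__ == "__main__":
    # Hypothetical two-exon transcript on the plus strand.
    print(known_gene_row("GENE1.1", "chr1",
                         [(100, 300), (400, 900)],
                         [(150, 300), (400, 800)]))
```

Sorting the normalised intervals means minus-strand transcripts, whose exons are listed from high to low coordinates in a .tbl file, end up in the ascending order that knownGene.txt consumers expect.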

SNPAAMapperT2K Input and Output:

The inputs are NCBI .tbl files (e.g. the chromosome files of Arabidopsis), TAIR10 sequence annotation files, and short-read sequence files. The outputs are annotated variant files. A subset of the variants annotated by SNPAAMapperT2K (non-synonymous SNPs) is shown in Table 1 (see supplementary material).
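As an illustration of what separates a non-synonymous SNP from a synonymous one in such output, the Python sketch below classifies a single-base substitution inside a coding sequence. It is a simplified stand-in, not the pipeline's classification script: the function name classify_coding_snp and the class labels are hypothetical and do not reproduce the vocabulary of the SNPAAMapperT2K output files. It assumes the standard genetic code and that both the coding sequence and the alternate base are given on the transcript strand.

```python
from itertools import product

# Standard genetic code, listed in TCAG order ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(cds_seq, pos, alt):
    """Classify a substitution at 0-based offset `pos` of a coding sequence.

    Returns (label, reference amino acid, alternate amino acid).
    The labels are illustrative, not the pipeline's own hit classes.
    """
    frame = pos % 3
    codon_start = pos - frame
    ref_codon = cds_seq[codon_start:codon_start + 3].upper()
    if len(ref_codon) != 3:
        raise ValueError("position falls in an incomplete codon")

    # Swap in the alternate base at its position within the codon.
    alt_codon = ref_codon[:frame] + alt.upper() + ref_codon[frame + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]

    if ref_aa == alt_aa:
        label = "synonymous"
    elif alt_aa == "*":
        label = "nonsense"
    elif ref_aa == "*":
        label = "stop_lost"
    else:
        label = "missense"
    return label, ref_aa, alt_aa


if __name__ == "__main__":
    cds = "ATGGCTTGG"  # hypothetical coding sequence: Met-Ala-Trp
    print(classify_coding_snp(cds, 5, "C"))
    print(classify_coding_snp(cds, 3, "A"))
    print(classify_coding_snp(cds, 8, "A"))
```

In a full pipeline the offset within the coding sequence would first have to be derived from the variant's genomic position and the exon boundaries in the knownGene row; that mapping is omitted here.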

Conclusions

Efficient pipelines and tools are needed for downstream genome-wide variant analysis of next-generation sequencing data. We developed a bioinformatics pipeline, SNPAAMapperT2K, that parses the contents of an NCBI .tbl annotation table, produces a UCSC Known Genes annotation table, and finally calls customized scripts to classify variants and annotate their hit classes. The pipeline was tested with a deeply sequenced Arabidopsis thaliana strain (Seattle-0) from the 1001 Genomes Data Center [8].

1. The human genome browser at UCSC.

2. The Sequence Alignment/Map format and SAMtools.

3. SNPAAMapper: An efficient genome-wide SNP variant analysis pipeline for next-generation sequencing data.

4. Tbl2KnownGene: A command-line program to convert NCBI.tbl to UCSC knownGene.txt data file.

5. Fast and accurate short read alignment with Burrows-Wheeler transform.