SNPAAMapperT2K: A genome-wide SNP downstream analysis and annotation pipeline for species annotated with NCBI.tbl data files.

Yongsheng Bai1.   

Abstract

SNPAAMapper, a genome-wide SNP downstream analysis and annotation pipeline, was designed to classify detected variants according to genomic regions and to report the mutation class by processing whole-genome and/or whole-exome sequencing data. The widely used annotation table format "knownGene.txt" has not yet been created for many popular model organisms (e.g. Arabidopsis); instead, NCBI .tbl annotation files are provided for these species. We therefore describe SNPAAMapperT2K, a genome-wide SNP downstream analysis and annotation pipeline for species annotated with NCBI .tbl data files. The pipeline was tested with a deeply sequenced Arabidopsis thaliana strain (Seattle-0). SNPAAMapperT2K can also annotate and report SNP classes for other species whose chromosomes are annotated in NCBI .tbl format but for which no knownGene.txt files are available. AVAILABILITY: Perl scripts and required input files are available on the web at http://isu.indstate.edu/ybai2/SNPAAMapperT2K.

Year:  2014        PMID: 25512690      PMCID: PMC4261118          DOI: 10.6026/97320630010711

Source DB:  PubMed          Journal:  Bioinformation        ISSN: 0973-2063


Background

Exome sequencing technology is being employed to identify single nucleotide polymorphisms (SNPs) and/or insertions and deletions (INDELs) in genetic disease research. The schema for UCSC Genes (knownGene.txt) [1] has been widely adopted by both standard and customized downstream analysis tools and scripts. However, even for many popular model organisms (e.g. Arabidopsis), sequence and annotation data tables (including knownGene.txt) have not yet been made publicly available. SNPAAMapper [2], a genome-wide SNP analysis and annotation pipeline for whole-genome and/or whole-exome sequencing data, was developed to perform downstream annotation of detected variants; it classifies variants by region, reports the hit class, and requires knownGene.txt as one of its input files. We have developed Tbl2KnownGene [3], a .tbl file parser that processes the contents of a National Center for Biotechnology Information (NCBI) .tbl file (e.g. the one for the Arabidopsis genome (TAIR10)) [4, 5] and produces a UCSC Known Genes annotation feature table. Arabidopsis chromosomes are annotated as .tbl files by TAIR, so knownGene.txt files are not available for them. In this study, we developed SNPAAMapperT2K, a genome-wide SNP analysis pipeline for species that have .tbl but not knownGene.txt files available. We have generated annotation files for Arabidopsis; users can download them and run their sequence read files against these supporting files. The pipeline can be readily extended to annotate SNPs for other species that were annotated using .tbl files but lack knownGene.txt files.

Methodology

The SNPAAMapperT2K algorithm consists of two major modules: the first converts an NCBI .tbl file to the UCSC knownGene.txt file format, and the second uses the converted knownGene files, together with BWA [6], SAMtools [7], and custom scripts, to report the hit class. The workflow of SNPAAMapperT2K is shown in Figure 1.
Figure 1. The workflow of SNPAAMapperT2K.
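
The first module's conversion step can be illustrated with a minimal Python sketch. It is not the Tbl2KnownGene Perl code. It assumes a simplified .tbl layout (tab-separated interval lines, qualifier lines with three leading tabs, start > stop marking the minus strand) and the 12-column knownGene.txt layout with 0-based, half-open coordinates. The function names, the toy feature table, the use of the `transcript_id` and `protein_id` qualifiers, and the reuse of the transcript name as alignID are all hypothetical choices made for illustration.

```python
# Toy sketch: turn simplified NCBI .tbl features into one knownGene-style row.
# All names and the toy data below are hypothetical, not taken from the pipeline.

def parse_tbl(text):
    """Parse a simplified .tbl feature table into a list of feature dicts."""
    features, current = [], None
    for raw in text.splitlines():
        if not raw.strip() or raw.startswith(">Feature"):
            continue  # skip blank lines and the per-sequence header
        cols = raw.split("\t")
        if cols[0]:
            # Interval line: start, stop, plus a feature key on the first line.
            # "<" and ">" mark partial ends and are dropped here.
            start, stop = (int(c.strip("<>")) for c in cols[:2])
            if len(cols) > 2 and cols[2]:
                current = {"key": cols[2], "intervals": [], "quals": {}}
                features.append(current)
            current["intervals"].append((start, stop))
        elif current is not None and len(cols) >= 4:
            # Qualifier line: three leading tabs, then name and optional value.
            current["quals"][cols[3]] = cols[4] if len(cols) > 4 else ""
    return features


def to_zero_based(intervals):
    """1-based inclusive intervals (either orientation) -> sorted 0-based half-open."""
    return sorted((min(a, b) - 1, max(a, b)) for a, b in intervals)


def to_known_gene_row(chrom, mrna, cds):
    """Build a tab-separated row in the 12-column knownGene.txt layout."""
    first_start, first_stop = mrna["intervals"][0]
    strand = "-" if first_start > first_stop else "+"  # start > stop: minus strand
    exons = to_zero_based(mrna["intervals"])
    coding = to_zero_based(cds["intervals"])
    name = mrna["quals"].get("transcript_id", "unknown")  # assumed qualifier
    fields = [
        name, chrom, strand,
        exons[0][0], exons[-1][1],            # txStart, txEnd
        coding[0][0], coding[-1][1],          # cdsStart, cdsEnd
        len(exons),                           # exonCount
        "".join(f"{s}," for s, _ in exons),   # exonStarts (trailing comma)
        "".join(f"{e}," for _, e in exons),   # exonEnds (trailing comma)
        cds["quals"].get("protein_id", ""),   # proteinID
        name,                                 # alignID (assumption: reuse name)
    ]
    return "\t".join(str(f) for f in fields)


# Made-up two-exon gene on the plus strand.
TOY_TBL = "\n".join([
    ">Feature chr1",
    "101\t500\tgene",
    "\t\t\tlocus_tag\tTOY1",
    "101\t200\tmRNA",
    "301\t500",
    "\t\t\ttranscript_id\tTOY1.1",
    "151\t200\tCDS",
    "301\t450",
])

feats = {f["key"]: f for f in parse_tbl(TOY_TBL)}
print(to_known_gene_row("chr1", feats["mRNA"], feats["CDS"]))
```

A real .tbl file holds many genes and transcripts per chromosome, so a full converter would also need to pair each mRNA with its CDS and handle non-coding features; the sketch sidesteps both by using a single toy gene.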

SNPAAMapperT2K Input and Output:

The inputs are NCBI .tbl files (e.g. the chromosome files of Arabidopsis), TAIR10 sequence annotation files, and short read sequence files. The outputs are annotated variant files. A subset of the variants annotated by SNPAAMapperT2K (non-synonymous SNPs) is shown in Table 1 (see supplementary material).
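
To make the notion of a non-synonymous SNP concrete, the following Python sketch labels a single-base change inside a coding sequence by comparing the reference and altered codons under the standard genetic code. It is an illustration only, not the pipeline's Perl logic: the function name, the class labels ("synonymous", "missense", "nonsense"), and the toy sequence are assumptions, and the sketch handles only a plus-strand coding sequence given as a contiguous string.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_coding_snp(cds_seq, cds_offset, alt_base):
    """Label a SNP at a 0-based offset within a plus-strand coding sequence.

    Returns (label, ref_aa, alt_aa). The labels are illustrative names,
    not necessarily the hit classes reported by SNPAAMapperT2K.
    """
    codon_start = (cds_offset // 3) * 3
    ref_codon = cds_seq[codon_start:codon_start + 3].upper()
    pos = cds_offset % 3  # position of the SNP within its codon
    alt_codon = ref_codon[:pos] + alt_base.upper() + ref_codon[pos + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        label = "synonymous"
    elif alt_aa == "*":
        label = "nonsense"
    else:
        label = "missense"
    return label, ref_aa, alt_aa


# Toy coding sequence: ATG GCT TGG TAA (Met, Ala, Trp, stop).
TOY_CDS = "ATGGCTTGGTAA"
for offset, alt in [(5, "C"), (3, "A"), (7, "A")]:
    print(offset, alt, classify_coding_snp(TOY_CDS, offset, alt))
```

Under this scheme, the missense and nonsense cases together make up the non-synonymous variants. Minus-strand transcripts would need the coding sequence reverse-complemented before the codon lookup.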

Conclusions

Efficient pipelines and tools are needed for downstream genome-wide variant analysis of next-generation sequencing data. We developed a bioinformatics pipeline, SNPAAMapperT2K, that parses the contents of an NCBI .tbl annotation table, produces a UCSC Known Genes annotation table, and finally calls customized scripts to classify variants and annotate their hit classes. The pipeline was tested with a deeply sequenced Arabidopsis thaliana strain (Seattle-0) from the 1001 Genomes Data Center [8].

1.  The human genome browser at UCSC.

Authors:  W James Kent; Charles W Sugnet; Terrence S Furey; Krishna M Roskin; Tom H Pringle; Alan M Zahler; David Haussler
Journal:  Genome Res       Date:  2002-06       Impact factor: 9.043

2.  The Sequence Alignment/Map format and SAMtools.

Authors:  Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal:  Bioinformatics       Date:  2009-06-08       Impact factor: 6.937

3.  SNPAAMapper: An efficient genome-wide SNP variant analysis pipeline for next-generation sequencing data.

Authors:  Yongsheng Bai; James Cavalcoli
Journal:  Bioinformation       Date:  2013-10-16

4.  Tbl2KnownGene: A command-line program to convert NCBI.tbl to UCSC knownGene.txt data file.

Authors:  Yongsheng Bai
Journal:  Bioinformation       Date:  2014-08-30

5.  Fast and accurate short read alignment with Burrows-Wheeler transform.

Authors:  Heng Li; Richard Durbin
Journal:  Bioinformatics       Date:  2009-05-18       Impact factor: 6.937
