In-depth annotation of SNPs arising from resequencing projects using NGS-SNP.

Jason R Grant, Adriano S Arantes, Xiaoping Liao, Paul Stothard.

Abstract

SUMMARY: NGS-SNP is a collection of command-line scripts for providing rich annotations for SNPs identified by the sequencing of whole genomes from any organism with reference sequences in Ensembl. Included among the annotations, several of which are not available from any existing SNP annotation tools, are the results of detailed comparisons with orthologous sequences. These comparisons can, for example, identify SNPs that affect conserved residues, or alter residues or genes linked to phenotypes in another species.

AVAILABILITY: NGS-SNP is available both as a set of scripts and as a virtual machine. The virtual machine consists of a Linux operating system with all the NGS-SNP dependencies pre-installed. The source code and virtual machine are freely available for download at http://stothard.afns.ualberta.ca/downloads/NGS-SNP/.

CONTACT: stothard@ualberta.ca

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2011 PMID: 21697123 PMCID: PMC3150039 DOI: 10.1093/bioinformatics/btr372

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

The latest sequencing instruments, in conjunction with SNP discovery tools, can be used to identify huge numbers of putative SNPs. Whether the SNPs are discovered through genome or transcriptome sequencing, the next problem after identification is often annotating the SNPs and choosing those that are functionally important. Here, we describe a collection of scripts called NGS-SNP (next-generation sequencing SNP) for performing in-depth annotation of SNPs identified by popular SNP discovery programs such as Maq (Li et al.) and SAMtools (Li et al.). NGS-SNP can be applied to data from any organism with reference sequences in Ensembl, and provides numerous annotation fields, several of which are not available from other tools.

2 IMPLEMENTATION

The main component of NGS-SNP is a Perl script called ‘annotate_SNPs.pl’ that accepts a SNP list as input and generates as output a SNP list with annotations added (Table 1). Information used for SNP annotation is retrieved from Ensembl (Hubbard et al.), NCBI (Maglott et al.) and UniProt (UniProt Consortium, 2011). Using a locally installed version of Ensembl, the annotation script can process 4 million SNPs in about 2 days on a standard desktop system. Users analyzing many SNP lists, for example from different individuals of the same species, can take advantage of the script's ability to create a local database of annotation results. This database allows all the annotations and the flanking sequence for any previously processed SNP to be obtained much more quickly. Additional components of NGS-SNP include a script for merging, filtering and sorting SNP lists, as well as scripts for obtaining reference chromosome and transcript sequences from Ensembl that can be used with SNP discovery tools such as Maq.
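The benefit of a local database of annotation results can be illustrated with a small cache. The sketch below is not NGS-SNP's code (the annotation script is written in Perl, and its database layout is not described here). It assumes a hypothetical `annotate` callable standing in for the slow lookup, and an SQLite table keyed on chromosome, position and alleles, so that a SNP seen in an earlier list is returned without being annotated again.

```python
import json
import sqlite3
from typing import Callable, Dict, Tuple

# (chromosome, position, reference base, base supported by the reads)
SnpKey = Tuple[str, int, str, str]


class AnnotationCache:
    """Store annotation results so that repeated SNPs are annotated only once."""

    def __init__(self, path: str = ":memory:") -> None:
        # Hypothetical schema: one row per distinct SNP, result stored as JSON.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS annotations ("
            "chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, result TEXT, "
            "PRIMARY KEY (chrom, pos, ref, alt))"
        )

    def get_or_annotate(
        self, key: SnpKey, annotate: Callable[[SnpKey], Dict[str, str]]
    ) -> Dict[str, str]:
        row = self.conn.execute(
            "SELECT result FROM annotations "
            "WHERE chrom = ? AND pos = ? AND ref = ? AND alt = ?",
            key,
        ).fetchone()
        if row is not None:
            # Cache hit: skip the expensive annotation step entirely.
            return json.loads(row[0])
        result = annotate(key)
        self.conn.execute(
            "INSERT INTO annotations VALUES (?, ?, ?, ?, ?)",
            (*key, json.dumps(result)),
        )
        self.conn.commit()
        return result
```

At the quoted rate of 4 million SNPs in about 2 days (roughly 23 SNPs per second), avoiding repeated lookups for SNPs shared between individuals is where most of the saving would come from.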

Table 1. Annotation fields provided by the NGS-SNP annotation script

Field	Description
Functional_Class	Type of SNP (e.g. nonsynonymous)
Chromosome	Chromosome containing the SNP
Chromosome_Position	Position of the SNP on the chromosome
Chromosome_Strand	Strand corresponding to the reported alleles
Chromosome_Reference	Base found in the reference genome
Chromosome_Reads	Base in genome supported by the reads
Gene_Description	Short description of the relevant gene
Ensembl_Gene_ID	Ensembl Gene ID of the relevant gene
Entrez_Gene_Name	Entrez Gene name of the relevant gene
Entrez_Gene_ID	Entrez Gene ID of the relevant gene
Ensembl_Transcript_ID	Ensembl Transcript ID of the transcript
Transcript_SNP_Position	Position of the SNP on the transcript
Transcript_SNP_Reference	Base found in the reference transcript
Transcript_SNP_Reads	Base in transcript according to the reads
Transcript_To_Chr_Strand	Chromosome strand matching transcript
Ensembl_Protein_ID	Ensembl Protein ID of the affected protein
UniProt_ID	UniProt ID of the relevant protein
Amino_Acid_Position	Position of the affected amino acid
Overlapping_Protein_Features	Protein features, obtained from UniProt, that overlap with the affected amino acid
Amino_Acid_Reference	Amino acid encoded by the reference
Amino_Acid_Reads	Amino acid encoded by the reads
Amino_Acids_In_Orthologues	Amino acids from orthologous sequences that align with the reference amino acid
Alignment_Score_Change	Effect of SNP on protein conservation
C_blosum	Conservation score when reference amino acid compared to orthologues using an amino acid scoring matrix
Context_Conservation	Average percent identity of the SNP region
Orthologue_Species	Source species of the orthologues used for previous four columns
Gene_Ontology	GO slim IDs and terms for the transcript
Model_Annotations	Functional information obtained from a model species, in the form of key-value pairs
Comments	Various annotations in the form of key-value pairs, such as protein sequence lost because of stop codon
Ref_SNPs	rs IDs of known SNPs sharing alleles with this SNP
Is_Fully_Known	Whether existing SNP records completely describe this SNP

Fields present in the input SNP list are also included in the output, preceding the fields described above.
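As an illustration of how these fields might be used downstream, the following sketch ranks amino acid-changing SNPs by the magnitude of `Alignment_Score_Change`. It assumes the annotated SNP list is tab-delimited with a header row that uses the field names from Table 1; the actual output layout should be checked against the NGS-SNP documentation. The function name and the example rows are contrived for illustration.

```python
import csv
import io
from typing import Dict, List


def rank_by_score_change(
    handle, functional_class: str = "nonsynonymous"
) -> List[Dict[str, str]]:
    """Return rows of the given class, largest |Alignment_Score_Change| first."""
    reader = csv.DictReader(handle, delimiter="\t")
    kept = []
    for row in reader:
        if row.get("Functional_Class") != functional_class:
            continue
        try:
            float(row["Alignment_Score_Change"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows without a numeric score
        kept.append(row)
    kept.sort(key=lambda r: abs(float(r["Alignment_Score_Change"])), reverse=True)
    return kept


# Contrived rows for illustration only; not real NGS-SNP output.
EXAMPLE = (
    "Functional_Class\tChromosome\tChromosome_Position\tAlignment_Score_Change\n"
    "nonsynonymous\t1\t1000\t0.12\n"
    "synonymous\t1\t2000\t\n"
    "nonsynonymous\t2\t3000\t-0.58\n"
)

if __name__ == "__main__":
    for row in rank_by_score_change(io.StringIO(EXAMPLE)):
        print(row["Chromosome"], row["Chromosome_Position"],
              row["Alignment_Score_Change"])
```

Sorting on the absolute value keeps both strongly negative and strongly positive scores at the top of the list.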

When the annotation script identifies an amino acid-changing SNP, it calculates an ‘alignment score change’ value a. This process involves comparing the reference amino acid and the non-reference amino acid to each orthologue. Briefly, the amino acid encoded by the variant (i.e. non-reference) allele v is compared to each available orthologous amino acid o using a log-odds scoring matrix (BLOSUM62 by default). This provides a score s(v,o) for each of the n orthologues. Similarly, the amino acid encoded by the reference allele r is compared to the orthologues. Any set of species in Ensembl can be used as the source of orthologous sequences. The average score for the reference amino acid is subtracted from the average score for the variant amino acid (1), and the result is scaled to between −1 and 1 by dividing by the maximum possible value for the scoring matrix, written here as s_max:

$$a = \frac{\frac{1}{n}\sum_{i=1}^{n} s(v, o_i) - \frac{1}{n}\sum_{i=1}^{n} s(r, o_i)}{s_{\max}} \qquad (1)$$

A positive value indicates that the variant amino acid is more similar to the orthologues than the reference amino acid, whereas a negative value indicates that the reference amino acid is more similar to the orthologues. SNPs with large positive or negative values may be of more initial interest as candidates for further study.

The annotation script includes a ‘model’ option that can be used to specify a well-studied species to use as an additional annotation source. When a SNP is located near or within a gene, annotations describing the model species orthologue of the gene are obtained from Ensembl, Entrez Gene and UniProt. These annotations are used to generate values that appear in a ‘Model_Annotations’ field, in the form of key-value pairs. Examples of information provided in this field include KEGG pathway names (Kanehisa et al.), the number of interacting proteins, phenotypes associated with the orthologue, the names of protein features overlapping with the SNP site in the orthologue, and phenotypes associated with mutations affecting the SNP site in the orthologue. The sample output given in Supplementary File 1 begins with the results for a contrived SNP designed to change a residue in the bovine HBB protein, to resemble a mutation responsible for sickle-cell disease in humans.

The annotation script can optionally provide the genomic flanking sequence for each SNP, for use in the design of validation assays. Known SNP sites in the flanking sequence and at the SNP position can be included in the output, as lowercase IUPAC characters in the flanking sequence, and as potentially additional alleles at the SNP site. Supplementary File 2 contains the flanking sequences provided by the annotation script (with known SNPs indicated in lowercase) for the 10 SNPs described in Supplementary File 1.
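The alignment score change described earlier in this section can be sketched in a few lines. This is a minimal illustration rather than the script's own Perl implementation. The function name is hypothetical, the matrix is a small excerpt of BLOSUM62 covering only the residues used in the example, and the scaling constant is taken to be the largest entry in BLOSUM62, which is one reading of "the maximum possible value for the scoring matrix".

```python
from typing import Dict, FrozenSet, Iterable

# Excerpt of BLOSUM62 limited to E, V and D; the matrix is symmetric,
# so each unordered residue pair is stored once.
BLOSUM62_EXCERPT: Dict[FrozenSet[str], int] = {
    frozenset("E"): 5,
    frozenset("V"): 4,
    frozenset("D"): 6,
    frozenset("EV"): -2,
    frozenset("ED"): 2,
    frozenset("VD"): -3,
}
BLOSUM62_MAX = 11  # largest BLOSUM62 entry (W aligned with W); assumed scaling value


def alignment_score_change(
    reference: str,
    variant: str,
    orthologues: Iterable[str],
    matrix: Dict[FrozenSet[str], int] = BLOSUM62_EXCERPT,
    max_score: int = BLOSUM62_MAX,
) -> float:
    """Mean variant score minus mean reference score, divided by max_score."""
    residues = list(orthologues)
    if not residues:
        raise ValueError("at least one orthologous residue is required")

    def mean_score(amino_acid: str) -> float:
        # Average s(amino_acid, o) over all aligned orthologous residues o.
        total = sum(matrix[frozenset((amino_acid, o))] for o in residues)
        return total / len(residues)

    return (mean_score(variant) - mean_score(reference)) / max_score


if __name__ == "__main__":
    # Contrived example: a Glu-to-Val change with made-up orthologous residues.
    print(round(alignment_score_change("E", "V", ["E", "E", "D"]), 3))
```

In this contrived case the reference residue matches the orthologues far better than the variant does, so the value is negative; swapping the reference and variant residues gives the same magnitude with a positive sign.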

3 DISCUSSION

Many existing SNP annotation tools work only for human SNPs or SNPs already present in dbSNP, or can only be used to process a few thousand SNPs at a time (Chelala et al.; Johnson et al.; Schmitt et al.). Apart from NGS-SNP, we are aware of two tools designed to annotate the very large SNP lists generated by whole-genome resequencing of humans and non-human species. ANNOVAR (Wang et al.) is a command-line tool that uses information from the UCSC Genome Browser to provide annotations. SeqAnt (Shetty et al.) is web-based but can also be downloaded, and likewise relies on resources from the UCSC Genome Browser. Both can place SNPs into functional classes, describe nearby genes, and indicate which SNPs are already described in dbSNP. Neither compares affected residues to orthologous sequences, reports overlapping protein features or domains, provides gene ontology information, or provides flanking sequence. Neither supports mapping SNP-altered residues to a protein in another species to retrieve additional information. However, ANNOVAR and SeqAnt provide a measure of DNA conservation at the SNP site, can handle indels, and return annotations much more quickly than NGS-SNP. These features and others give each tool some unique advantages. The option to submit SNPs to SeqAnt online may be particularly appealing to some users.

In summary, NGS-SNP can be used to annotate the SNP lists returned from programs such as Maq and SAMtools. SNPs are classified as synonymous, non-synonymous, 3′-UTR, etc., regardless of whether or not they match existing SNP records. Numerous additional fields of information are provided, several of which are not available from other tools.

Funding: Alberta Livestock and Meat Agency; the Natural Sciences and Engineering Research Council of Canada.

Conflict of Interest: none declared.

REFERENCES

1. CandiSNPer: a web tool for the identification of candidate SNPs for causal variants.

Authors: Armin O Schmitt; Jens Assmus; Ralf H Bortfeldt; Gudrun A Brockmann
Journal: Bioinformatics Date: 2010-02-19 Impact factor: 6.937

2. Mapping short DNA sequencing reads and calling variants using mapping quality scores.

Authors: Heng Li; Jue Ruan; Richard Durbin
Journal: Genome Res Date: 2008-08-19 Impact factor: 9.043

3. SNAP: a web-based tool for identification and annotation of proxy SNPs using HapMap.

Authors: Andrew D Johnson; Robert E Handsaker; Sara L Pulit; Marcia M Nizzari; Christopher J O'Donnell; Paul I W de Bakker
Journal: Bioinformatics Date: 2008-10-30 Impact factor: 6.937

4. The Sequence Alignment/Map format and SAMtools.

Authors: Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal: Bioinformatics Date: 2009-06-08 Impact factor: 6.937

5. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data.

Authors: Kai Wang; Mingyao Li; Hakon Hakonarson
Journal: Nucleic Acids Res Date: 2010-07-03 Impact factor: 16.971

6. SeqAnt: a web service to rapidly identify and annotate DNA sequence variations.

Authors: Amol Carl Shetty; Prashanth Athri; Kajari Mondal; Vanessa L Horner; Karyn Meltz Steinberg; Viren Patel; Tamara Caspary; David J Cutler; Michael E Zwick
Journal: BMC Bioinformatics Date: 2010-09-20 Impact factor: 3.169

7. Ongoing and future developments at the Universal Protein Resource.

Authors: UniProt Consortium
Journal: Nucleic Acids Res Date: 2010-11-04 Impact factor: 16.971

8. Entrez Gene: gene-centered information at NCBI.

Authors: Donna Maglott; Jim Ostell; Kim D Pruitt; Tatiana Tatusova
Journal: Nucleic Acids Res Date: 2010-11-28 Impact factor: 16.971

9. SNPnexus: a web database for functional annotation of newly discovered and public domain single nucleotide polymorphisms.

Authors: Claude Chelala; Arshad Khan; Nicholas R Lemoine
Journal: Bioinformatics Date: 2008-12-19 Impact factor: 6.937

10. Ensembl 2009.

Authors: T J P Hubbard; B L Aken; S Ayling; B Ballester; K Beal; E Bragin; S Brent; Y Chen; P Clapham; L Clarke; G Coates; S Fairley; S Fitzgerald; J Fernandez-Banet; L Gordon; S Graf; S Haider; M Hammond; R Holland; K Howe; A Jenkinson; N Johnson; A Kahari; D Keefe; S Keenan; R Kinsella; F Kokocinski; E Kulesha; D Lawson; I Longden; K Megy; P Meidl; B Overduin; A Parker; B Pritchard; D Rios; M Schuster; G Slater; D Smedley; W Spooner; G Spudich; S Trevanion; A Vilella; J Vogel; S White; S Wilder; A Zadissa; E Birney; F Cunningham; V Curwen; R Durbin; X M Fernandez-Suarez; J Herrero; A Kasprzyk; G Proctor; J Smith; S Searle; P Flicek
Journal: Nucleic Acids Res Date: 2008-11-25 Impact factor: 16.971
