Helen I Field, Serena A Scollen, Craig Luccarini, Caroline Baynes, Jonathan Morrison, Alison M Dunning, Douglas F Easton, Paul D P Pharoah.
Abstract
BACKGROUND: In moderate-throughput SNP genotyping there was a gap in the workflow, between choosing a set of SNPs and submitting their sequences to proprietary assay design software, which was not met by existing software. Retrieval and formatting of sequences flanking each SNP, prior to assay design, becomes rate-limiting for more than about ten SNPs, especially if the sequences are annotated for repetitive regions and adjacent variations. We routinely process up to 50 SNPs at once.
IMPLEMENTATION: We created Seq4SNPs, a web-based, walk-away software tool that can process one to several hundred SNPs given rs numbers as input. It outputs a file of fully annotated sequences formatted for one of three proprietary design software packages: TaqMan's Primer-By-Design FileBuilder, Sequenom's iPLEX or SNPstream's Autoprimer, as well as unannotated fasta sequences. We found genotyping assays to be inhibited by repetitive sequences or by the presence of additional variations flanking the SNP under test; in multiplexes, repetitive sequence flanking one SNP adversely affects multiple assays. Assay design software programs avoid such regions if the input sequences are appropriately annotated, so we used Seq4SNPs to provide suitably annotated input sequences, and improved our genotyping success rate. Adjacent SNPs can also be avoided by annotating the sequences used as input for primer design.
Year: 2009 PMID: 19523221 PMCID: PMC2711078 DOI: 10.1186/1471-2105-10-180
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Formatting comparison for three medium-throughput genotyping platforms
| File Builder Assay-by-Design | Autoprimer | iPLEX | |
| Letters, numbers, – (hyphen) only | Letters, numbers, – only | Letters, numbers, – only | |
| Tab | Space | Space | |
| 200–600 | 200 | 200 | |
| Common first | - | - | |
| 1 = 201 (assay position) | Masked sequence | - | |
| N | IUPAC | Lower case | |
| N | N | Lower case | |
| First sequence | Second sequence, not 25 bases either side of SNP | First sequence | |
The low- to medium-throughput genotyping platforms Taqman®, Beckman SNPstream® and Sequenom iPLEX™ come with assay design software packages that have different input requirements. Each takes a file containing multiple SNP sequences, one per line. Each line begins with a name, followed by a sequence separated from it by a space or tab, and ends with either the position of the assay SNP (Taqman®), any masked sequence (SNPstream®) or nothing (iPLEX™). Additional SNPs flanking the SNP to be assayed can also cause suboptimal assay performance if designed into an oligonucleotide binding site. Proprietary assay design software avoids additional variations if the input sequences are suitably annotated by changing the variant nucleotide in the sequence to N, to the IUPAC code or to lower case. Basic sequence formats are summarised in the table, and follow the convention:
Name
Examples for annotation of a synthetic SNP where the flanking sequence contains a masked sequence (repetitive element shown as NNNNNN/catgga distal to allele) and one adjacent SNP (N/Y/c before allele):
e.g. for Taqman:
SNP-01_rs14352008 ATGCANCAGG [A/G]ATGNNNNNNCAGG 1=11
e.g. for SNPstream:
SNP-01_rs14352008 ATGCAYCAGG [A/G]ATGCATGGACAGG ATGCAYCAGG [A/G]ATGNNNNNNCAGG
e.g. for Sequenom:
SNP-01_rs14352008 ATGCAcCAGG [A/G]ATGcatggaCAGG
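The three annotation conventions can be illustrated with a short Python sketch. This is not the Seq4SNPs code: the function names, the `STYLES` table and the coordinate scheme (0-based indices into the left flank followed by the right flank) are hypothetical, chosen only to mirror the table and examples above. It also assumes that the allele bracket is written directly against the left flank, and that all TaqMan fields are tab-separated.

```python
# Annotation style per platform, read from the formatting table:
# (adjacent-SNP style, repeat style, field separator). Hypothetical names.
STYLES = {
    "taqman": ("N", "N", "\t"),
    "snpstream": ("iupac", "N", " "),
    "sequenom": ("lower", "lower", " "),
}


def annotate(left, right, alleles, adjacent, repeats, snp_style, repeat_style):
    """Return left[alleles]right with adjacent SNPs and repeats marked.

    adjacent: dict mapping a 0-based index in left+right to an IUPAC code
    repeats:  list of half-open (start, end) intervals in left+right
    """
    seq = list((left + right).upper())
    for start, end in repeats:
        for i in range(start, end):
            if repeat_style == "N":
                seq[i] = "N"
            elif repeat_style == "lower":
                seq[i] = seq[i].lower()
            # any other value leaves the repeat unmasked
    for i, code in adjacent.items():
        if snp_style == "N":
            seq[i] = "N"
        elif snp_style == "iupac":
            seq[i] = code
        elif snp_style == "lower":
            seq[i] = seq[i].lower()
    cut = len(left)
    return "".join(seq[:cut]) + "[" + alleles + "]" + "".join(seq[cut:])


def format_line(platform, name, left, right, alleles, adjacent, repeats):
    """Build one input line for the chosen assay design package."""
    snp_style, repeat_style, sep = STYLES[platform]
    masked = annotate(left, right, alleles, adjacent, repeats,
                      snp_style, repeat_style)
    if platform == "taqman":
        # Third field: 1-based position of the assay SNP.
        return sep.join([name, masked, f"1={len(left) + 1}"])
    if platform == "snpstream":
        # Unmasked sequence first, then the repeat-masked copy.
        unmasked = annotate(left, right, alleles, adjacent, repeats,
                            snp_style, "none")
        return sep.join([name, unmasked, masked])
    return sep.join([name, masked])


# The synthetic SNP from the examples: adjacent C/T SNP (Y) at index 5,
# repeat CATGGA at indices 13-18 of left+right.
example = ("SNP-01_rs14352008", "ATGCACCAGG", "ATGCATGGACAGG",
           "A/G", {5: "Y"}, [(13, 19)])
for platform in STYLES:
    print(format_line(platform, *example))
```

Keeping the platform differences in one small table means a further design package could be supported by adding a row rather than another branch, although whether Seq4SNPs is organised this way is not stated in the source.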
Figure 1. Reduction of the cost of genotyping by reducing the number of sequences in a multiplex having repetitive elements. The cost of genotyping using SNPstream 48-plex assays was calculated for a 100% pass rate (1 on y-axis): this cost increases non-linearly, as the pass rate decreases. Here we show how the cost increases with the number of assays containing repetitive sequences, per multiplex. Some repetitive sequences clearly affect the pass rate in the whole multiplex. We now use Seq4SNPs with Repeat Masker output, put masked sequences output from Seq4SNPs into SNP-IT primer design software for SNPstream, and replace SNPs rejected by SNP-IT. This improves assay success and reduces the cost of genotyping significantly.
Figure 2. Using Seq4SNPs software as part of the workflow in SNPstream genotyping sequence preparation. After choosing 48 or more SNPs to assay (START), input to Seq4SNPs software (grey box) is a text file (white, top) containing a list of rs numbers and assay names and the desired assay size (e.g. 200 nucleotides each side of the assay SNP). The first process (blue box, top) outputs a file of sequences in fasta format (white, centre). The user submits this to Repeat Masker, which quickly produces a masked fasta file (white, right); at this point the user might reject some assays, and start again if insufficient assays remain. In the second process (blue box, bottom) the user inputs the masked file, and the Seq4SNPs fasta file is automatically used to annotate and reformat it. Final outputs (white, bottom) are a formatted sequence file for assay design software, and a report for Excel that contains warnings, error flags, minisequences for adjacent SNPs and links to dbSNP, and is used for assisted error checking. User time is mainly confined to the delay points (green). The formatted assays are submitted to the SNP-IT at Autoprimer (Assay Designer), which rejects sequences that are insufficiently repeat-free; the user may then choose alternatives again (upward arrows). Legend: Seq4SNPs (grey box): times (in seconds, marked ″) are per SNP and are improving. Start and end points (circles); processes (blue rectangles); file/stored output (white rectangles); additional software decision aids (yellow diamonds).
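The second process in this workflow combines an unmasked fasta sequence with its masked counterpart. One simple way to recover the repeat intervals, sketched here in Python under the assumption that the two sequences have equal length and differ only where masking replaced a base with N or with lower case, is to compare them position by position. The function name `masked_intervals` is hypothetical and is not taken from Seq4SNPs.

```python
def masked_intervals(original, masked):
    """Return half-open (start, end) intervals where masking changed a base.

    Assumes `original` is upper case and that the masking step only
    replaces bases (with N or lower case) and never inserts or deletes.
    """
    if len(original) != len(masked):
        raise ValueError("sequences must have the same length")
    intervals = []
    start = None
    for i, (o, m) in enumerate(zip(original, masked)):
        changed = o != m
        if changed and start is None:
            start = i                      # a masked run begins
        elif not changed and start is not None:
            intervals.append((start, i))   # the run ended at i - 1
            start = None
    if start is not None:
        intervals.append((start, len(original)))  # run reaches the end
    return intervals


print(masked_intervals("ATGCATGGACAGG", "ATGNNNNNNCAGG"))
```

A base that is already N in the unmasked sequence cannot be detected this way, so the result is a lower bound on the masked regions.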
Data sources
| SNP rs or ss number* | User input | File or text input | 1 |
| Trivial name | User input | Same file as above | 1 |
| Size of assay sequence | User input | e.g. 200 specifies 200 nucleotides each side of assay SNP (401 altogether) | 1 |
| New rs number | NCBI dbSNP cluster page* | New rs retrieved when rs no longer in use** or if ss number submitted*** | 2 |
| Fasta sequence, allele, | ditto | Fasta output with allele in header (major allele first) | 2 |
| Major allele, validation of assay, heterozygosity | ditto | 'Allele' report. | |
| Fasta sequence (second attempt) | NCBI contig fasta sequence**** | If sequence in cluster page too short: contig reference from cluster page* | 2 |
| Gene, chromosome | NCBI cluster page* | 'Gene' report | 2 |
| Masked sequences | RepeatMasker (see text) | Takes fasta output above and produces fasta for next step. | 3 |
| Platform | User input | Choose TaqMan, SNPstream or Sequenom | 3 |
| Chromosome position, adjacent SNP list, with 21 nucleotide sequence etc. | MySQL local database with dbSNP data | Annotation of assay sequence using Seq4SNPs algorithm | 4 |
| Validation, heterozygosity | Ditto | Part of | 4 |
| SNP assay sequences | | Final output compatible with assay designers | 4 |
Data used by Seq4SNPs are drawn from various sources, listed here: Seq4SNPs inputs (italics), outputs (bold). Some items are taken from web pages accessed by uniform resource locator (URL), or from FTP download sites, shown below.
Example URLs:
*dbSNP rs cluster page:
**New rs number: (if cluster page not available)
***rs number for ss:
****NCBI contig download:
dbSNP downloads (human): fasta sequences and chromosome positions respectively from
Note that to extend from the organisms folder and put into the MySQL database with the human SNPs
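The relationship between the requested assay size, the chromosome position of the assay SNP and the relative position of an adjacent SNP can be made concrete with a small sketch. It assumes forward-strand, 1-based chromosome coordinates and single-nucleotide variants; the function names are hypothetical and do not come from Seq4SNPs.

```python
def assay_length(flank):
    """Total length for `flank` nucleotides each side of the assay SNP."""
    return 2 * flank + 1


def assay_window(snp_pos, flank):
    """First and last chromosome positions covered by the assay sequence."""
    return snp_pos - flank, snp_pos + flank


def relative_position(pos, snp_pos, flank):
    """1-based position of `pos` within the assay sequence, or None if the
    position lies outside the window."""
    start, end = assay_window(snp_pos, flank)
    if not start <= pos <= end:
        return None
    return pos - start + 1


print(assay_length(200))                      # 200 each side plus the SNP
print(relative_position(100500, 100500, 200))  # the assay SNP itself
print(relative_position(100450, 100500, 200))  # an adjacent SNP upstream
```

With a size of 200 the assay SNP falls at relative position 201 of 401, which matches the TaqMan position field given in the formatting table (1 = 201).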
Figure 3. Final output: web page from . (1) Link to file containing the formatted, linear assay sequence (not shown, as Table 1 legend). (2) Link to the fasta file (not shown, 50 nucleotides per line). (3) Report of SNPs for this assay, showing rs number, chromosome and name (if given). For each adjacent/assay SNP: (4) rs with link to dbSNP (NCBI); (5) chromosome position (and relative position in the assay sequence, numbered 1 – 401); (6) allele, IUPAC code for the allele and 21 nucleotide sequence; validation, heterozygosity and standard error. (7) Flag indicating how SNP sequences were matched, here indicating that both adjacent and assay SNP sequences were reversed. Flags indicate complications in sequence matching, reporting both of the sequences which it attempted to match. (8) When a single SNP is requested, the assay sequence is shown in a format useful for checking positions. (9) GC content is given.
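Two of the reported values, the IUPAC code for an allele and the GC content, are simple to compute. The Python sketch below uses the standard IUPAC nucleotide ambiguity codes; how Seq4SNPs itself handles masked or ambiguous bases when calculating GC content is not stated, so counting only A, C, G and T (case-insensitively) is an assumption, and the function names are hypothetical.

```python
# Standard IUPAC ambiguity codes, keyed by the set of bases they represent.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def iupac_code(alleles):
    """Return the IUPAC code for an allele string such as 'A/G'."""
    bases = frozenset(alleles.upper().split("/"))
    if len(bases) == 1 and bases <= set("ACGT"):
        return next(iter(bases))           # a single unambiguous base
    try:
        return IUPAC[bases]
    except KeyError:
        raise ValueError(f"no IUPAC code for alleles {alleles!r}")


def gc_content(seq):
    """Fraction of G and C among the A, C, G and T bases of `seq`.

    Assumption: N and other ambiguity codes are ignored, and lower-case
    (soft-masked) bases are counted like upper-case ones.
    """
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        return 0.0
    return sum(b in "GC" for b in bases) / len(bases)


print(iupac_code("A/G"), iupac_code("C/T"))
print(gc_content("ATGCACCAGG" + "ATGCATGGACAGG"))
```

Pass only the flanking sequence to `gc_content`, not the bracketed allele, since the letters inside `[A/G]` would otherwise be counted as bases.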