Peng Zhang, Bertrand Boisson, Peter D Stenson, David N Cooper, Jean-Laurent Casanova, Laurent Abel, Yuval Itan.
Abstract
Human whole-genome sequencing reveals about 4 000 000 genomic variants per individual. These data are mostly stored as VCF-format files. Although many variant analysis methods accept VCF as input, many other tools require DNA or protein sequences, particularly for splicing prediction, sequence alignment, phylogenetic analysis, and structure prediction. However, there is no existing webserver capable of extracting DNA/protein sequences for genomic variants from VCF files in a user-friendly and efficient manner. We developed the SeqTailor webserver to bridge this gap, by enabling rapid extraction of (i) DNA sequences around genomic variants, with customizable window sizes and options to annotate the splice sites closest to the variants and to consider the neighboring variants within the window; and (ii) protein sequences encoded by the DNA sequences around genomic variants, with a built-in SnpEff annotator and customizable window sizes. SeqTailor supports 11 species: human (GRCh37/GRCh38), chimpanzee, mouse, rat, cow, chicken, lizard, zebrafish, fruitfly, Arabidopsis and rice. Standalone programs are provided for command-line-based needs. SeqTailor streamlines the sequence extraction process, and accelerates the analysis of genomic variants with software requiring DNA/protein sequences. It will facilitate the study of genomic variation, by increasing the feasibility of sequence-based analysis and prediction. The SeqTailor webserver is freely available at http://shiva.rockefeller.edu/SeqTailor/.
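To make the idea of extracting DNA sequences around a variant with a customizable window concrete, the following is a minimal Python sketch. It is not SeqTailor's implementation: the function name `extract_window`, its parameters, and the toy reference sequence are all assumptions introduced for illustration, and the sketch ignores splice-site annotation and neighboring variants.

```python
from typing import Tuple


def extract_window(chrom_seq: str, pos: int, ref: str, alt: str,
                   window: int) -> Tuple[str, str]:
    """Return (reference, alternate) sequences around one variant.

    Hypothetical helper, not part of SeqTailor.
    chrom_seq: full sequence of the chromosome (toy string here).
    pos:       1-based position of the first REF base, as in VCF.
    window:    number of flanking bases kept on each side.
    """
    start = pos - 1            # 0-based index of the first REF base
    end = start + len(ref)     # index just past the last REF base

    # Guard against a VCF record that does not match the reference.
    if chrom_seq[start:end].upper() != ref.upper():
        raise ValueError(f"REF allele {ref!r} does not match reference at {pos}")

    left = chrom_seq[max(0, start - window):start]
    right = chrom_seq[end:end + window]
    return left + ref + right, left + alt + right


# Toy reference, assumed for illustration only.
toy_reference = "ACGTACGTAC"

# Single-nucleotide substitution A>G at position 5, window of 2 bases.
snv_ref, snv_alt = extract_window(toy_reference, 5, "A", "G", 2)

# Deletion TA>T at position 4, window of 2 bases.
del_ref, del_alt = extract_window(toy_reference, 4, "TA", "T", 2)
```

Because the flanks are taken relative to the REF allele, the alternate sequence of a deletion is shorter than the reference sequence by the number of deleted bases.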
Year: 2019 PMID: 31045209 PMCID: PMC6602489 DOI: 10.1093/nar/gkz326
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. The workflow of data collection and pre-processing in the SeqTailor webserver.
List of organisms supported by SeqTailor webserver, including the assembly version of each reference genome, the supported genome regions, and the number of genes, transcripts, splice sites and protein sequences for all transcripts and canonical transcripts respectively
| Organism | Genome assembly | Genome regions | # Genes (all transcripts) | # Transcripts (all transcripts) | # Splice sites (all transcripts) | # Proteins (all transcripts) | # Genes/transcripts/proteins (canonical transcripts) | # Splice sites (canonical transcripts) |
|---|---|---|---|---|---|---|---|---|
| Human | GRCh37 | Chr1–22, X, Y, MT | 35 091 | 138 613 | 2 018 124 | 95 304 | 20 676 | 413 814 |
| Human | GRCh38 | Chr1–22, X, Y, MT, 329 alternate loci | 36 100 | 155 341 | 2 259 292 | 107 498 | 20 528 | 421 390 |
| Chimpanzee | Pan-tro-3.0 | Chr1–22, X, Y, MT | 17 381 | 41 504 | 947 134 | 41 468 | 17 348 | 373 978 |
| Mouse | GRCm38 | Chr1–19, X, Y, MT | 36 148 | 94 462 | 1 350 332 | 65 679 | 22 504 | 420 830 |
| Rat | Rnor_6.0 | Chr1–20, X, Y, MT | 23 671 | 31 110 | 588 376 | 28 897 | 21 935 | 404 576 |
| Cow | ARS-UCD1.2 | Chr1–29, X, MT | 16 515 | 31 214 | 785 250 | 31 188 | 16 489 | 353 780 |
| Chicken | GRCg6a | Chr1–33, W, Z, MT | 12 303 | 22 158 | 573 360 | 22 158 | 12 303 | 278 736 |
| Lizard | AnoCar2.0 | Chr1–6, MT | 6105 | 6321 | 155 556 | 6321 | 6105 | 149 500 |
| Zebrafish | GRCz11 | Chr1–25, MT | 25 864 | 49 308 | 908 190 | 45 633 | 24 568 | 483 174 |
| Fruitfly | BDGP6 | Chr2–4, X, Y, MT | 14 226 | 30 804 | 363 118 | 30 478 | 13 926 | 119 752 |
| Arabidopsis | TAIR10 | Chr1–5, MT | 12 702 | 23 376 | 336 118 | 23 376 | 12 702 | 159 070 |
| Rice | IRGSP-1.0 | Chr1–12, MT | 2779 | 12 455 | 134 960 | 12 452 | 2776 | 36 886 |
Figure 2. The framework of the SeqTailor webserver for DNA sequence extraction (upper) and protein sequence extraction (lower).
Figure 3. An example showing the input, output and functionality in extracting DNA sequence for genomic variants.
Figure 4. An example showing the input, output and functionality in extracting protein sequence for genomic variants.
List of pathogenic genomic variants with different consequences used in the case study
| ClinVar submission ID | Gene | CHROM | POS | REF | ALT | Effects | Diseases |
|---|---|---|---|---|---|---|---|
| SCV000107433.2 | | chr2 | 47635062 | T | G | Intronic, new donor site | Lynch syndrome |
| SCV000616361.3 | | chr7 | 140481402 | C | T | Missense | Cardio-facio-cutaneous syndrome |
| SCV000840535.3 | | chr13 | 20763554 | AG | G | Deletion, frameshift | Nonsyndromic hearing loss and deafness |
| SCV000635728.2 | | chr13 | 32954282 | GG | TA | Essential splicing | Hereditary breast-ovarian cancer |
| SCV000637244.1 | | chrX | 70330553 | T | C | Intronic, new acceptor site | X-linked severe combined immunodeficiency |
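The case-study variants mix substitutions and a deletion, so a tool reading them from VCF has to interpret REF and ALT alleles of different lengths. The sketch below shows one simple way to read such records and label them by allele length alone. The names `Variant`, `parse_vcf` and `classify` are hypothetical, the ID, QUAL, FILTER and INFO fields are filled with `.` placeholders, and the labels are a coarse length-based classification, not the functional effects listed in the table.

```python
from typing import List, NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


# The five case-study variants as minimal tab-separated VCF records.
# Fields other than CHROM, POS, REF and ALT are placeholders.
VCF_TEXT = """\
##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
chr2\t47635062\t.\tT\tG\t.\t.\t.
chr7\t140481402\t.\tC\tT\t.\t.\t.
chr13\t20763554\t.\tAG\tG\t.\t.\t.
chr13\t32954282\t.\tGG\tTA\t.\t.\t.
chrX\t70330553\t.\tT\tC\t.\t.\t.
"""


def parse_vcf(text: str) -> List[Variant]:
    """Parse VCF data lines, skipping header lines and blank lines."""
    variants = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        variants.append(Variant(fields[0], int(fields[1]), fields[3], fields[4]))
    return variants


def classify(variant: Variant) -> str:
    """Label a variant by comparing REF and ALT allele lengths."""
    ref_len, alt_len = len(variant.ref), len(variant.alt)
    if ref_len == alt_len:
        return "SNV" if ref_len == 1 else "MNV"
    return "deletion" if ref_len > alt_len else "insertion"


variants = parse_vcf(VCF_TEXT)
labels = [classify(v) for v in variants]
```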
Figure 5. Runtime performance in extracting DNA sequences (left: online webserver, middle: standalone program), and protein sequences (right: online webserver), from varying sizes of input VCF data.
Comparison of SeqTailor with other existing relevant tools
| Tool name | Interface | Ref. genome | Window size | Input (format) | Output | Splice site annotation | Neighboring variation |
|---|---|---|---|---|---|---|---|
| SeqTailor | Webserver | Built-in 11 species, or user-defined in standalone | Scalable | Genomic variants (VCF), or genomic ranges (BED) | ref./alt. DNA or protein sequences, with user-defined window size, in browser and to a file | Yes | Yes |
| UCSC Genome Browser | Webserver | Built-in >50 species | Scalable | Genomic ranges (BED) | ref. DNA or protein sequences overlapped with defined genomic regions, in the browser and to a file | No | No |
| BEDTools | Script | User-defined | Scalable | Genomic variants (VCF), or genomic ranges (BED) | ref. DNA alleles or sequences to a file | No | No |
| samtools | Script | User-defined | Scalable | Genomic ranges (BED) | ref. DNA sequences to a file | No | No |
| IGV | Software | User-defined | Scalable | variant positions | Copy-paste to extract DNA or protein sequences from IGV, one at a time | No | No |
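Several tools in this comparison take genomic ranges in BED format rather than variants, so a variant plus a window size has to be converted to a range first. The sketch below illustrates that conversion under the usual conventions that VCF positions are 1-based and BED intervals are 0-based and half-open. The function name `variant_to_bed` is hypothetical and is not taken from any of the tools listed.

```python
from typing import Tuple


def variant_to_bed(chrom: str, pos: int, ref: str, window: int) -> Tuple[str, int, int]:
    """Convert a VCF-style variant to a BED interval padded by `window` bases.

    Hypothetical helper for illustration.
    pos is the 1-based position of the first REF base (VCF convention).
    The returned (chrom, start, end) is 0-based and half-open (BED convention).
    """
    ref_start = pos - 1                  # 0-based start of the REF allele
    ref_end = ref_start + len(ref)       # half-open end of the REF allele
    start = max(0, ref_start - window)   # do not run off the chromosome start
    end = ref_end + window
    return chrom, start, end


# Missense variant from the case study, with 10 flanking bases per side.
bed_snv = variant_to_bed("chr7", 140481402, "C", 10)

# Two-base REF allele of the deletion, with no flanking bases.
bed_del = variant_to_bed("chr13", 20763554, "AG", 0)

# A BED line is the three fields joined by tabs.
bed_line = "\t".join(str(field) for field in bed_snv)
```

The interval length is always `len(ref) + 2 * window` unless the window is clipped at the start of the chromosome; clipping at the chromosome end would additionally require the chromosome length, which this sketch does not take as input.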