Emanuel Maldonado, Imran Khan, Siby Philip, Vítor Vasconcelos, Agostinho Antunes.
Abstract
Rapid advances in genome sequencing technologies have increased the pace at which biological sequence databases become available to the broad scientific community. Obtaining and preparing an appropriate sequence dataset is therefore a crucial first step for all types of genomic analyses. Here, we present a script that makes downloading and preparing a proper biological sequence dataset for genomics studies easy and fast. The script retrieves Ensembl-defined genomic features associated with a given Ensembl identifier. Coding (CDS) and genomic sequences can be retrieved based on a relationship selected from a set of relationship types, considering either all available organisms or a user-specified subset of organisms. The script is user-friendly and, by default, starts in an interactive mode if no command-line options are specified.
Keywords: bioinformatics; data curation; databases; genomics; molecular evolution; sequence analysis
Year: 2013; PMID: 24324324; PMCID: PMC3855309; DOI: 10.4137/EBO.S11335
Source DB: PubMed; Journal: Evol Bioinform Online; ISSN: 1176-9343; Impact factor: 1.625
Available sequence feature types and corresponding options.
| SEQUENCE FEATURES | OPTION | SYNTAX | NOTES |
|---|---|---|---|
| Genomic | g | -d g | Complete gene |
| Coding (CDS) | c | -d c | Sequence feature used by default. Without stop codon |
| Peptide | p | -d p | – |
| Exon | e | -d e | These are numbered in their order from Ensembl |
| Intron | i | -d i | These are numbered in their order from Ensembl |
| UTR 5′ | u5 | -d u5 | – |
| UTR 3′ | u3 | -d u3 | – |
| UTR 5′ and 3′ | u53 | -d u53 | Both 5′ and 3′ UTRs |
| Flanking 5′ (upstream) | f5;YSIZE | -d f5;YSIZE | YSIZE is a positive integer chosen by the user: the length of the extended upstream region in bp |
| Flanking 3′ (downstream) | f3;ZSIZE | -d f3;ZSIZE | ZSIZE is a positive integer chosen by the user: the length of the extended downstream region in bp |
| Flanking 5′ and 3′ (up- and downstream) | f53;YSIZE;ZSIZE | -d f53;YSIZE;ZSIZE | Both upstream and downstream. YSIZE and ZSIZE are positive integers chosen by the user: YSIZE extends the upstream region and ZSIZE the downstream region, both in bp |
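The option syntax in the table above can be illustrated with a minimal Python sketch that assembles the value passed to `-d`. The helper names `feature_option` and `d_argument` are hypothetical and are not part of the published script. The sketch assumes that the flanking sizes are joined to the option code with semicolons and no spaces, as in the `f5;YSIZE` and `f3;ZSIZE` rows. Because a semicolon separates commands in POSIX shells, the sketch also quotes the value; whether the script itself requires this quoting is an assumption.

```python
import shlex

# Option codes copied from the sequence-feature table above.
SIMPLE_FEATURES = {"g", "c", "p", "e", "i", "u5", "u3", "u53"}
# Flanking options and the number of size values each one expects.
FLANK_SIZES_NEEDED = {"f5": 1, "f3": 1, "f53": 2}


def feature_option(code, *sizes):
    """Return the value to pass after -d (hypothetical helper)."""
    if code in SIMPLE_FEATURES:
        if sizes:
            raise ValueError(f"{code!r} takes no size values")
        return code
    if code not in FLANK_SIZES_NEEDED:
        raise ValueError(f"unknown feature code: {code!r}")
    if len(sizes) != FLANK_SIZES_NEEDED[code]:
        raise ValueError(
            f"{code!r} expects {FLANK_SIZES_NEEDED[code]} size value(s)"
        )
    for size in sizes:
        # The table requires positive integers, given in bp.
        if isinstance(size, bool) or not isinstance(size, int) or size <= 0:
            raise ValueError("flank sizes must be positive integers (bp)")
    # Assumption: code and sizes are joined with ';' and no spaces.
    return ";".join([code, *(str(size) for size in sizes)])


def d_argument(code, *sizes):
    """Return '-d VALUE' with VALUE quoted for a POSIX shell."""
    return "-d " + shlex.quote(feature_option(code, *sizes))


if __name__ == "__main__":
    print(d_argument("c"))                # default feature type (CDS)
    print(d_argument("f53", 1000, 500))   # 1000 bp upstream, 500 bp downstream
```

Validating the sizes before building the string keeps a malformed request, such as a zero or negative flank length, from reaching the command line at all.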
Ensembl defined genomic feature relationships and corresponding options.
| RELATIONSHIP | OPTION | SYNTAX | NOTES |
|---|---|---|---|
| Orthologs | |||
| apparent_ortholog_one2one | 0 | -R 0 | Single gene from each species, related to the duplication node |
| ortholog_one2one | 4 | -R 4 | Depending on the number of genes found in each species. Default option |
| ortholog_one2many | 3 | -R 3 | Depending on the number of genes found in each species |
| ortholog_many2many | 2 | -R 2 | Depending on the number of genes found in each species |
| possible_ortholog | 6 | -R 6 | When the duplication has a species-intersection-score ≤ 0.25 |
| Paralogs | |||
| within_species_paralog | 10 | -R 10 | Relation between two genes of the same species with ancestor duplication node. |
| other_paralog | 5 | -R 5 | Related as member of a broader “super-family” |
| Projection | |||
| projection_altered | 7 | -R 7 | Gene with one or more novel transcripts, with a known gene from Human or Mouse as ortholog |
| projection_unchanged | 8 | -R 8 | Gene with one or more novel transcripts, with a known gene from Human or Mouse as ortholog |
| Gene Split | |||
| contiguous_gene_split | 1 | -R 1 | Little or no overlap between the gene fragments, which lie on the same strand close to each other (<1 Mb) |
| putative_gene_split | 9 | -R 9 | Little or no overlap between the gene fragments present in different sequence regions in the assembly. |
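The mapping between relationship names and `-R` codes in the table above can be captured as a small lookup, sketched here in Python. The dictionary values are copied from the table; the names `RELATIONSHIP_CODES`, `DEFAULT_RELATIONSHIP`, and `relationship_option` are hypothetical and are not part of the published script.

```python
# Relationship names and -R codes copied from the table above.
RELATIONSHIP_CODES = {
    "apparent_ortholog_one2one": 0,
    "contiguous_gene_split": 1,
    "ortholog_many2many": 2,
    "ortholog_one2many": 3,
    "ortholog_one2one": 4,
    "other_paralog": 5,
    "possible_ortholog": 6,
    "projection_altered": 7,
    "projection_unchanged": 8,
    "putative_gene_split": 9,
    "within_species_paralog": 10,
}

# The table marks ortholog_one2one (code 4) as the default option.
DEFAULT_RELATIONSHIP = "ortholog_one2one"


def relationship_option(name=DEFAULT_RELATIONSHIP):
    """Return the '-R CODE' text for a relationship name (hypothetical helper)."""
    try:
        code = RELATIONSHIP_CODES[name]
    except KeyError:
        raise ValueError(f"unknown relationship type: {name!r}") from None
    return f"-R {code}"


if __name__ == "__main__":
    print(relationship_option())                          # default relationship
    print(relationship_option("within_species_paralog"))
```

As listed in the table, the numeric codes follow the alphabetical order of the relationship names, which can serve as a quick sanity check when transcribing them.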