Amanda K Gibson, Zach Smith, Clay Fuqua, Keith Clay, John K Colbourne.
Abstract
BACKGROUND: Genomic resources within the phylum Arthropoda are largely limited to the true insects but are beginning to include unexplored subphyla, such as the Crustacea and Chelicerata. Investigations of these understudied taxa uncover high frequencies of orphan genes, which lack detectable sequence homology to genes in pre-existing databases. The ticks (Acari: Chelicerata) are one such understudied taxon for which genomic resources are urgently needed. Ticks are obligate blood-feeders that vector major diseases of humans, domesticated animals, and wildlife. In analyzing a transcriptome of the lone star tick Amblyomma americanum, one of the most abundant disease vectors in the United States, we find a high representation of unannotated sequences. We apply a general framework for quantifying the origin and true representation of unannotated sequences in a dataset and for evaluating the biological significance of orphan genes.
Year: 2013 PMID: 23445305 PMCID: PMC3616916 DOI: 10.1186/1471-2164-14-135
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Summary of the annotation of the secondary EST assembly
| Input sequences | 14,310 |
| Short (< 33 amino acids) returns | 171 |
| Weak (e-value > 1E-5) returns | 145 |
| Total unannotated returns | 10,192 |
| Annotated singletons | 2,875 |
| Annotated contigs | 1,243 |
| Total annotated sequences | 4,118 |
| 2,842 | |
| 143 | |
| 208 | |
| 107 | |
| 225 | |
| 140 | |
| 218 | |
| 28 | |
| 49 | |
| 39 | |
| 46 | |
| 52 | |
| 4,099 (28.6%) |
Statistics given for annotation against (A) the UniProtKB protein database and (B) a collection of arthropod datasets. A BLAST search of A. americanum ESTs against a compilation of 12 arthropod peptide datasets was conducted. Species are arranged to reflect increasing phylogenetic distance from A. americanum.
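The filtering categories in the summary above (short returns under 33 amino acids, weak returns with an e-value above 1E-5) can be illustrated with a minimal sketch. The `Hit` record, its field names, the reading of "short" as an alignment length, and the order in which the two filters are applied are all assumptions made for illustration; the record does not describe the pipeline actually used.

```python
from dataclasses import dataclass
from typing import Optional

MIN_ALIGN_AA = 33   # returns shorter than this count as "short" (threshold from the table)
MAX_EVALUE = 1e-5   # returns with a larger e-value count as "weak" (threshold from the table)


@dataclass
class Hit:
    """Hypothetical best-hit record for one EST (field names are assumptions)."""
    align_len_aa: int
    evalue: float


def classify(hit: Optional[Hit]) -> str:
    """Assign one EST to a summary category based on its best BLAST return."""
    if hit is None:
        return "no_return"
    # Assumed order: the length filter is applied before the e-value filter.
    if hit.align_len_aa < MIN_ALIGN_AA:
        return "short"
    if hit.evalue > MAX_EVALUE:
        return "weak"
    return "annotated"
```

Under this scheme, every category other than `"annotated"` would count toward the unannotated total.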
Figure 1. Summary of the UniProtKB annotation of the secondary assembly of the EST library. (A) The e-value distribution of all annotated returns and the taxonomic distribution of (B) all annotated returns (n = 4,118) and (C) returns annotated as invertebrate (n = 3,627).
Comparison of quality measures for annotated and unannotated ESTs
| Measure | Set | Annotated | Unannotated |
| Mean ORF length (nts) | Sequences | 339.9 | 183.4 |
| | Contigs | 376.7 | 213.1 |
| Mean length (nts) | Sequences | 594.9 | 536.0 |
| | Contigs | 746.5 | 735.8 |
| Number w/o start codon | Sequences | 0 | 398 |
| | Contigs | 0 | 38 |
| Mean GC-content | Sequences | 50 | 45.2 |
| | Contigs | 49.7 | 45.1 |
Comparisons are included for mean open-reading frame (ORF) length, mean nucleotide length, number of ESTs lacking a start codon, and mean GC-content. Statistics were obtained using OrfPredictor (ORF length and start codons) and Geneious software (nucleotide length and GC-content).
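To make the two sequence-level measures concrete, the sketch below computes GC-content and a naive ORF length for a single nucleotide sequence. It is not OrfPredictor or Geneious: the function names are hypothetical, and the ORF scan is a simplified stand-in that only considers the three forward frames and only counts ORFs running from an ATG to an in-frame stop codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def gc_content(seq: str) -> float:
    """Percentage of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    gc = sum(1 for base in seq if base in "GC")
    return 100.0 * gc / len(seq)


def longest_orf_length(seq: str) -> int:
    """Length (nts, stop codon included) of the longest ATG-to-stop ORF.

    Simplifying assumption: forward strand only, complete ORFs only.
    """
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None  # index of the first ATG of the ORF currently open
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best
```

Averaging these values separately over the annotated and unannotated sets would give a comparison of the kind tabulated above. A tool designed for ESTs may also handle reverse-strand and partial ORFs, which this sketch ignores.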
Figure 2. Distribution of the nucleotide lengths of open reading frames for annotated and unannotated EST sequences. Annotated ESTs are shown in black and unannotated in white. Open-reading frame lengths were predicted using the OrfPredictor component of the Transcriptome Analysis Pipeline of the Integrative Services for Genomic Analysis (ISGA) at Indiana University’s Center for Genomics and Bioinformatics. Annotation was determined by a BLAST search of the A. americanum EST library against the UniProtKB protein database.
Summary of results from microarray validation of functionality of ESTs of A. americanum
| Total ESTs matching to microarray probes | 13,962 |
| ESTs retained: expression above 0.5% FDR threshold | 8,962 (64.2%) |
| Retained ESTs with UniProtKB annotation | 3,105 (34.6%) |
| Retained ESTs with I. scapularis match | 3,623 (40.4%) |
| Total retained ESTs with annotation | 3,710 (41.3%) |
A subset of ESTs were matched to microarray probes and expression was measured under 12 different conditions. Sequences with expression that fell below a 0.5% false discovery rate in 11 or 12 of these conditions were rejected. The remaining ESTs were characterized as annotated or unannotated according to matches to the UniProtKB protein database and to four I. scapularis genomic databases.
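The retention rule described above can be written as a short sketch. It assumes each EST comes with 12 boolean flags, one per condition, recording whether its expression cleared the 0.5% FDR threshold; how those flags are derived from the microarray data is not specified here, and the function names are hypothetical.

```python
from typing import Sequence

REJECT_MIN_FAILS = 11  # rejected when below threshold in 11 or 12 of the 12 conditions


def is_retained(above_threshold: Sequence[bool]) -> bool:
    """Keep an EST unless it falls below threshold in 11 or more conditions."""
    fails = sum(1 for ok in above_threshold if not ok)
    return fails < REJECT_MIN_FAILS


def percent(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place, as reported in the table."""
    return round(100.0 * part / whole, 1)
```

With the counts from the table, `percent(8962, 13962)` gives 64.2 and `percent(3105, 8962)` gives 34.6, matching the reported retention and UniProtKB annotation rates.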
Summary of BLAST searches against I. scapularis datasets
| (A) Dataset | Match status | Annotation | No annotation | Total |
| | Match | 2,997 | 457 | 3,454 (24.1%) |
| | No match | 1,121 | 9,735 | 10,856 |
| | Match | 3,413 | 179 | 3,592 (25.1%) |
| | No match | 705 | 10,013 | 10,718 |
| | Match | 1,724 | 642 | 2,365 (16.5%) |
| | No match | 2,394 | 9,550 | 11,944 |
| | Match | 3,279 | 845 | 4,124 (28.8%) |
| | No match | 839 | 9,347 | 10,186 |
| (B) Dataset | Match status | Total |
| | Match | 128,738 (66.2%) |
| | No match | 70,722 |
| | Match | 72,465 (37.3%) |
| | No match | 121,995 |
| | Match | 105,183 (54.1%) |
| | No match | 89,277 |
(A) Amblyomma americanum EST sequences matching Ixodes scapularis genomic datasets and (B) I. scapularis EST sequences matching other I. scapularis datasets. Ixodes scapularis datasets include ESTs, predicted peptides, singletons, and assembled contigs. (A) Amblyomma americanum results are categorized according to UniProtKB annotation status (annotation, no annotation) and I. scapularis match status (match, no match). The proportion of the A. americanum ESTs with a match is provided in parentheses for each I. scapularis dataset. (B) BLAST searches of I. scapularis ESTs (N = 194,460) against the three other I. scapularis datasets were then conducted to evaluate the quality of these datasets. These are classified according to match status. The proportion of I. scapularis ESTs with a match is provided in parentheses.
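The cross-classification in part (A) amounts to a two-by-two tally of annotation status against match status. A minimal sketch follows, assuming the EST identifiers are available as Python sets; the function and key names are hypothetical and do not come from the source.

```python
from typing import Dict, Set


def cross_tabulate(all_ids: Set[str], annotated: Set[str],
                   matched: Set[str]) -> Dict[str, int]:
    """Count ESTs in each annotation-status by match-status cell."""
    # Ignore identifiers that are not part of the EST library.
    annotated = annotated & all_ids
    matched = matched & all_ids
    return {
        "annotated_match": len(annotated & matched),
        "unannotated_match": len(matched - annotated),
        "annotated_no_match": len(annotated - matched),
        "unannotated_no_match": len(all_ids - annotated - matched),
    }


def match_percent(table: Dict[str, int]) -> float:
    """Percentage of all ESTs with a match, rounded to one decimal place."""
    total = sum(table.values())
    hits = table["annotated_match"] + table["unannotated_match"]
    return round(100.0 * hits / total, 1)
```

Feeding in the four counts from the first pair of rows in part (A) (2,997, 457, 1,121, and 9,735) gives a match percentage of 24.1, in agreement with the value in parentheses.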