Alfredo Rago, Donald G Gilbert, Jeong-Hyeon Choi, Timothy B Sackton, Xu Wang, Yogeshwar D Kelkar, John H Werren, John K Colbourne.
Abstract
BACKGROUND: Nasonia vitripennis is an emerging insect model system with haplodiploid genetics. It holds a key position within the insect phylogeny for comparative, evolutionary and behavioral genetic studies. The draft genomes for N. vitripennis and two sibling species were published in 2010, yet a considerable amount of transcriptome data has since been produced, enabling improvements to the original annotated gene set (OGS1.2). We describe and apply the EvidentialGene method used to produce an updated gene set (OGS2). We also carry out comparative analyses showcasing the usefulness of the revised annotated gene set.
Keywords: Alternative gene splicing; Gene duplication; Genome annotation; Histones; Hymenoptera; Parasitoid wasp; Protein evolution; Transcriptome
Year: 2016 PMID: 27561358 PMCID: PMC5000498 DOI: 10.1186/s12864-016-2886-9
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Gene evidence sources for Nasonia vitripennis OGS2. Mapping results of ESTs and RNA-Seq reads with >95 % coverage of length >100 bp to the assembled N. vitripennis genome (Nvit_1.0) using three mapping programs and six parameter settings. An average of 2.5 % of reads are multiply mapped by GSNAP, measured over 8 RNA-Seq libraries. The number of constructed transcript assemblies matching the final gene model by 10 % and 95 % sequence overlap is also indicated.
| RNA assemblies | Total | Mapped to genome | 10 % of gene | 95 % of gene |
|---|---|---|---|---|
| Cufflink 10 | 46,259 | 40,853 | 12,386 | 4,902 |
| Cufflink 08 | 71,761 | 56,640 | 14,317 | 5,287 |
| Velvet p2 | 121,672 | 95,360 | 16,190 | 7,706 |
| Velvet p3 | 151,038 | 116,591 | 17,556 | 7,851 |
| Velvet p4 | 242,217 | 122,194 | 16,406 | 6,874 |
| PASA | 69,805 | 69,805 | 13,099 | 6,253 |
| All genes | | | 21,601 | 10,426 |
| Alt. Transcripts | 7,837 | | | |
| RNA read counts | | | | |
| EST paired | 124,188 | | | |
| EST single | 51,665 | | | |
| RNA-Seq single | 187,823,326 | | | |
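
To make the 10 % and 95 % overlap thresholds concrete, the following is a minimal sketch of how the fraction of a gene model covered by a transcript assembly could be computed. The half-open interval representation and the function names `merge` and `overlap_fraction` are assumptions for illustration; this is not the EvidentialGene implementation.

```python
def merge(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlap_fraction(gene_exons, assembly_exons):
    """Fraction of gene exon bases that are covered by assembly exons."""
    gene = merge(gene_exons)
    assembly = merge(assembly_exons)
    gene_len = sum(end - start for start, end in gene)
    covered = 0
    for g_start, g_end in gene:
        for a_start, a_end in assembly:
            covered += max(0, min(g_end, a_end) - max(g_start, a_start))
    return covered / gene_len if gene_len else 0.0


# Hypothetical coordinates: a 400-base gene model and a partial assembly.
gene = [(100, 300), (500, 700)]
assembly = [(150, 300), (500, 650)]
frac = overlap_fraction(gene, assembly)
print(f"overlap={frac:.2f} passes_10pct={frac >= 0.10} passes_95pct={frac >= 0.95}")
```

Under this reading, an assembly counts toward the "10 % of gene" column once it covers at least a tenth of the model, and toward the "95 % of gene" column only when it is close to full length.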
Summary of the improved Official Gene Set (OGS2) comparing all gene constructions to good constructions having expression and/or homology evidence and to the previous OGS1.2 gene models. Percentages are of the total number of genes for the set
| Summary Statistics | OGS2 (all constructions) | OGS2 (good constructions) | OGS1.2 |
|---|---|---|---|
| Genes | 36,327 | 24,388 | 18,850 |
| Protein coding genes | 25,725 (71 %) | 24,388 | 15,566<sup>a</sup> |
| Non-coding genes | 3,997 (11 %) | 0 | 0 |
| Transposon protein genes | 6,605 (18 %) | 385<sup>a</sup> | 2,935<sup>a</sup> |
| Single transcript genes | 32,079 (88 %) | 20,243 (83 %) | 18,759 (99.5 %) |
| Genes assigned to ortholog<sup>b</sup> | 15,176 (42 %) | 15,173 (62 %) | -- |
| Transcripts | 44,164 | 32,101 | 18,941 |
| Alternative transcripts | 7837 | 7712 | 91 |
| Mean isoforms per gene | 1.22 | 1.32 | 1 |
| Complete proteins | 41,256 (93 %) | 30,521 (95 %) | 18,941 (100 %) |
| Median transcript length | 1571 bp | 1603 bp | 1176 bp |
| Median CDS length | 777 bp | 981 bp | 1032 bp |
| Transcripts with UTR | 41,313 (94 %) | 30,512 (95 %) | 5264 (28 %) |
<sup>a</sup> During OGS2 work, 2,935 OGS1.2 models were classified as having strong homology to transposon proteins. 385 models with expression and other insect homology, but also transposon homology, were retained in the OGS2 “good” model set.
<sup>b</sup> An additional 5,763 OGS2 genes have significant protein homology but are not assigned as orthologs in the OrthoMCL orthology analysis. 3,454 of the 24,388 “good” models lack significant homology but have expression evidence.
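
Some of the derived rows in the summary table can be recomputed from its gene and transcript counts. The short check below is a sketch that uses only numbers copied from the table; the dictionary keys are labels chosen here for illustration.

```python
# Counts copied from the summary table.
genes = {"OGS2 all": 36327, "OGS2 good": 24388, "OGS1.2": 18850}
transcripts = {"OGS2 all": 44164, "OGS2 good": 32101, "OGS1.2": 18941}
single_transcript = {"OGS2 all": 32079, "OGS2 good": 20243, "OGS1.2": 18759}

# Mean isoforms per gene = transcripts / genes.
mean_isoforms = {k: round(transcripts[k] / genes[k], 2) for k in genes}

# Share of genes with a single transcript, in percent.
single_pct = {k: round(100 * single_transcript[k] / genes[k], 1) for k in genes}

for k in genes:
    print(k, mean_isoforms[k], single_pct[k])
```

The recomputed values agree with the table's "Mean isoforms per gene" and "Single transcript genes" rows after rounding.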
The types of evidence and levels of support for Nasonia vitripennis gene sets (OGS2 and others). Sequence-level statistics for the different types of evidence are given as proportions of the gene sets that are validated. Gene structure level statistics (ESTgene, Progene, RNAgene) are counts of the number of models that reach three structure level agreements. Homology level statistics are counts of the number of models and proportions matching proteins of reference species and paralogous (same species) proteins. See Methods section for details on the evidence types and the statistics that were measured
| Evidence | Available evidence | Statistic | OGS1.2 | Evidence-prediction set | OGS2 | OGS2 Good genes | NCBI RefSeq | Full-length RNA-Seq assembly |
|---|---|---|---|---|---|---|---|---|
| EST | 18 Mb | Seq. Overlap | 0.506 | 0.814 | 0.768 | 0.715 | 0.672 | 0.724 |
| Protein | 26 Mb | Seq. Overlap | 0.674 | 0.696 | 0.729 | 0.693 | 0.616 | 0.612 |
| RNA | 46 Mb | Seq. Overlap | 0.381 | 0.551 | 0.599 | 0.54 | 0.468 | 0.571 |
| RefSeq | 17 Mb | Seq. Overlap | 1 | 0.934 | 0.958 | 0.908 | 0.857 | 0.839 |
| Intron | 66,593 | Splices Hit | 0.846 | 0.965 | 0.981 | 0.969 | 0.903 | 0.975 |
| TAR | 75 Mb | Seq. Overlap | 0.292 | 0.850 | 0.533 | 0.443 | 0.37 | 0.386 |
| Transposon | 28 Mb | Seq. Overlap | 0.168 | 0.282 | 0.406 | 0.099 | 0.009 | 0.039 |
| ESTgene | 10,194 | Perfect | 2737 | 3996 | 4952 | 4900 | 3631 | 4293 |
| ESTgene | 10,194 | Equal 66 % | 3491 | 5059 | 6283 | 6198 | 4284 | 5187 |
| ESTgene | 10,194 | Some | 6263 | 9940 | 11,313 | 11,157 | 7123 | 8373 |
| Progene | 44,040 | Perfect | 4808 | 6713 | 8048 | 8010 | 6215 | 4935 |
| Progene | 44,040 | Equal 66 % | 7759 | 12,217 | 14,046 | 13,837 | 9003 | 8567 |
| Progene | 44,040 | Some | 11,563 | 18,173 | 21,759 | 19,718 | 10,861 | 18,457 |
| RNAgene | 28,016 | Perfect | 6004 | 9531 | 14,899 | 13,804 | 8502 | 28,016 |
| RNAgene | 28,016 | Equal 66 % | 8173 | 13,552 | 18,829 | 17,608 | 10,202 | 28,016 |
| RNAgene | 28,016 | Some | 11,933 | 19,602 | 24,936 | 22,179 | 12,258 | 28,016 |
| Homolog | 11,683 | Matches | 16,174 | 16,669 | 23,994 | 17,341 | 11,950 | 13,187 |
| Homolog | 11,683 | Found | 10,426 | 10,593 | 11,683 | 11,683 | 9323 | 9650 |
| Homolog | 11,683 | Bits/Amino Acid | 0.449 | 0.424 | 0.416 | 0.455 | 0.562 | 0.558 |
| Paralog | | Matches | 12,843 | 14,503 | 19,423 | 12,576 | 7904 | 10,520 |
| Paralog | | Bits/Amino Acid | 0.459 | 0.45 | 0.564 | 0.517 | 0.554 | 0.635 |
| Genome | | Coding Seq. | 28 Mb | 31 Mb | 36 Mb | 29 Mb | 10 Mb | 16 Mb |
| Genome | | Exon Seq. | 29 Mb | 52 Mb | 70 Mb | 45 Mb | 24 Mb | 24 Mb |
| Genome | | Gene count | 18,941 | 23,605 | 36,327 | 24,388 | 12,989 | 20,926 |
Fig. 1 Number of genes with strong (>2/3 overlap) or medium (>1/3 overlap) support from sequence orthology, evidence of transcription, or both. Panels show the source of evidence for genes within the ortholog and paralog subsets and the whole OGS2
Number of insect genes classified to gene families (GF) that are common among the arthropods by OrthoMCL (ARP9, version arp11u11). Five out of nine insect species are summarized. Dupl and Singl designate the proportions of duplicated and singleton genes relative to the median found among insects (Dupl:5000, Singl:10000)
| | Gene Families (GF) | | | Gene Counts | | | | | Proportions | |
|---|---|---|---|---|---|---|---|---|---|---|
| Gene Sets | GF | Ortholog GF | GF missing genes | Genes | Species specific genes | Species specific paralogs | Single ortholog genes | Duplicated ortholog genes | Dupl | Singl |
| | 10,293 | 8983 | 92 | 24,296 | 5446 | 6686 | 8239 | 3925 | 2.1 | 1.4 |
| | 8591 | 8560 | 170 | 10,145 | 987 | 88 | 8182 | 888 | 0.2 | 0.9 |
| | 9633 | 9291 | 107 | 15,029 | 2943 | 1567 | 8710 | 1809 | 0.7 | 1.2 |
| | 8893 | 8388 | 116 | 16,985 | 4586 | 2163 | 7608 | 2628 | 1.0 | 1.2 |
| | 8464 | 7636 | 187 | 14,289 | 2824 | 2556 | 6994 | 1915 | 0.9 | 1.0 |
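
The Dupl and Singl columns are consistent with one simple reading of the caption: Dupl is the sum of species specific paralogs and duplicated ortholog genes divided by 5000, and Singl is the sum of species specific genes and single ortholog genes divided by 10000. This reading is inferred from the table values and is an assumption, not a formula stated in the source. The sketch below recomputes both columns for the five rows in table order.

```python
# Medians given in the caption (Dupl:5000, Singl:10000).
DUPL_MEDIAN, SINGL_MEDIAN = 5000, 10000

# Rows copied from the table, in table order:
# (species specific genes, species specific paralogs,
#  single ortholog genes, duplicated ortholog genes, Dupl, Singl)
rows = [
    (5446, 6686, 8239, 3925, 2.1, 1.4),
    (987, 88, 8182, 888, 0.2, 0.9),
    (2943, 1567, 8710, 1809, 0.7, 1.2),
    (4586, 2163, 7608, 2628, 1.0, 1.2),
    (2824, 2556, 6994, 1915, 0.9, 1.0),
]


def proportions(specific, paralogs, single, duplicated):
    """Assumed definitions of Dupl and Singl, rounded as in the table."""
    dupl = (paralogs + duplicated) / DUPL_MEDIAN
    singl = (specific + single) / SINGL_MEDIAN
    return round(dupl, 1), round(singl, 1)


computed = [proportions(*row[:4]) for row in rows]
for row, got in zip(rows, computed):
    print("computed:", got, "table:", row[4:])
```

The same four counts also sum to the Genes column in every row, which supports treating them as a partition of each gene set.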
Gene set quality measurements, including deviation of protein size from the group median, and maximal bit score per species in pairwise comparisons within the arthropod orthology groups. The bit score measures both gene model artefacts of alternative gene sets within species, and evolutionary divergence. Protein sizes may be more evolutionarily conserved, and may detect artefacts across and within species<sup>a</sup>
| Gene set | Average homology bitscore | Protein size deviation from median | Percent shorter than 2 standard deviations from median |
|---|---|---|---|
| | 727.6 | −7.7 | 3.2 |
| | 722.3 | −7.8 | 2.7 |
| | 683.5 | −12.7 | 4 |
| | 733.9 | −0.3 | 2.4 |
| | 694.3 | −30 | 7.3 |
| | 552 | −26.1 | 4.5 |
| | 508.7 | 54.5 | 1.3 |
<sup>a</sup> For each orthology group, the median protein size of all genes among the species within the group is determined. Then, for each species gene set, the maximal BLASTp bit score of a gene within that group is recorded as metric #1, and the protein size difference from the group median of that maximal match is recorded as metric #2. These metrics are averaged over all groups per species and reported as average bit score, average size deviation, and percentage of size outliers (2 standard deviations below median sizes). These gene set quality measurements are provided by the Evigene scripts “eval_orthogroup_genesets.pl” and “orthomcl_tabulate.pl”. Partial gene models are a common artefact of draft gene sets, indicated by both a negative deviation from group median sizes and a larger percentage of outliers. A similar calculation is part of the OrthoDB methodology [108]
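
The two metrics described in the footnote can be sketched as follows. This is an illustrative Python reimplementation under stated assumptions, not the Evigene Perl scripts: the input layout, the function name `geneset_quality`, and the choice to take the standard deviation over protein sizes within each orthology group are all assumptions, since the footnote does not specify them.

```python
from collections import defaultdict
from statistics import mean, median, pstdev


def geneset_quality(groups):
    """groups: {group_id: [(species, protein_length, bit_score), ...]}"""
    per_species = defaultdict(lambda: {"bits": [], "dev": [], "short": []})
    for members in groups.values():
        sizes = [length for _, length, _ in members]
        med = median(sizes)
        sd = pstdev(sizes)  # assumption: SD of sizes within the group
        # Keep, per species, the gene with the maximal bit score in the group.
        best = {}
        for species, length, bits in members:
            if species not in best or bits > best[species][1]:
                best[species] = (length, bits)
        for species, (length, bits) in best.items():
            rec = per_species[species]
            rec["bits"].append(bits)          # metric 1
            rec["dev"].append(length - med)   # metric 2
            rec["short"].append(length < med - 2 * sd)
    return {
        species: {
            "avg_bits": mean(rec["bits"]),
            "avg_size_dev": mean(rec["dev"]),
            "pct_short": 100 * sum(rec["short"]) / len(rec["short"]),
        }
        for species, rec in per_species.items()
    }


# Hypothetical toy input with two orthology groups and three species.
groups = {
    "G1": [("sp1", 500, 800.0), ("sp1", 200, 300.0),
           ("sp2", 480, 760.0), ("sp3", 510, 790.0)],
    "G2": [("sp1", 300, 400.0), ("sp2", 310, 420.0), ("sp3", 100, 150.0)],
}
result = geneset_quality(groups)
for species, stats in sorted(result.items()):
    print(species, stats)
```

In the toy data, the truncated `sp3` protein in group `G2` shows up both as a negative average size deviation and as a size outlier, which is the pattern the footnote associates with partial gene models.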
Fig. 2 Protein divergence of OGS2 genes against orthologs in other Hymenoptera. Every point represents a gene mapped on three coordinates originating from the corners. Each gene’s distance from a corner is proportional to the average amino-acid distance of orthologs between the two clades. AB = ant to bee distance; AN = ant to Nasonia distance; BN = bee to Nasonia distance. Diverging genes are highlighted in orange (fast) and blue (slow) as detected by the compound ratio (A) and intersection of ratios (B). See materials and methods for full description
Fig. 3 Alternatively spliced, expressed introns for the gene longitudinals lacking (lola) in Apis (blue) and Nasonia (red). The graph shows intron spans from a common hub exon, in bases on their genomes. The 181 observed introns in Nasonia cover 325 kilobases (kbp), and the 58 observed introns in Apis cover up to 200 kbp. These are regularly spaced 1400 bases apart, related by divergent 3′ exons (one or two) of 500 to 900 bp, which produce different coding sequences and protein isoforms. The tiny blue and red bars at the top of the figure are short introns that join pairs of 3′ end exons in the lola gene span. Introns are displayed in size order (y axis), except for a plotting mistake at the long end for Apis
Fig. 4 Number of genes with alternative isoforms in OGS2 (a) split by presence of paralogs and (b) split by methylation in adult females
Fig. 5 Effect of different factors on the probability of observing alternate isoforms of OGS2 gene models. Factors are ranked by relative importance (y axis). Factors with complete support and levels of the same factor were adjusted for plotting. Effect sizes are shown as the fold change in probability from the intercept (with 95 % confidence intervals). Numeric variables were log transformed prior to analysis
Genome tiling array expression gene evidence. TAR = Transcriptionally Active Regions representing runs of adjacently expressed 50 bp isothermal probes on a genome-wide tiling path microarray [4]
| Expression group | TAR exons | Unique TARs | Exonerate gene models |
|---|---|---|---|
| Adult female | 1,139,061 | 29,626 | 46,402 |
| Adult male | 1,165,881 | 20,625 | 49,344 |
| Embryo 10 h old female | 700,773 | 21,704 | 33,286 |
| Embryo 10 h old male | 677,712 | 6788 | 31,408 |
| Embryo 18 h old female | 781,163 | 13,268 | 31,342 |
| Embryo 18 h old male | 813,130 | 15,662 | 33,612 |
| Larva female | 670,292 | 7173 | 29,442 |
| Larva male | 667,030 | 3814 | 28,284 |
| Pupa female | 1,246,557 | 16,563 | 51,858 |
| Pupa male | 1,322,223 | 15,769 | 54,119 |
| Ovaries | 631,449 | 7113 | 27,483 |
| Testes | 658,960 | 21,449 | 30,348 |
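
The caption defines a TAR as a run of adjacently expressed probes on the tiling path. A minimal sketch of that idea is given below. The input layout, the function name `call_tars`, and the `min_probes` cutoff are hypothetical; how probes were scored as expressed is not described here, so the expressed flag is taken as given.

```python
def call_tars(probes, min_probes=2):
    """Group consecutive expressed probes into (start, end) regions.

    probes: list of (start, end, expressed) tuples for one scaffold,
    sorted by start position along the tiling path.
    min_probes: hypothetical minimum run length needed to report a region.
    """
    tars, run = [], []
    for start, end, expressed in probes:
        if expressed:
            run.append((start, end))
            continue
        # A non-expressed probe closes the current run.
        if len(run) >= min_probes:
            tars.append((run[0][0], run[-1][1]))
        run = []
    if len(run) >= min_probes:
        tars.append((run[0][0], run[-1][1]))
    return tars


# Hypothetical 50 bp probes along one scaffold.
probes = [
    (0, 50, True), (50, 100, True), (100, 150, False),
    (150, 200, True), (200, 250, True), (250, 300, True),
    (300, 350, False), (350, 400, True),
]
tars = call_tars(probes)
print(tars)
```

With the default cutoff, the isolated expressed probe at the end of the example is not reported as a region; setting `min_probes=1` would keep it.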