Jeanne Wilbrandt, Bernhard Misof, Kristen A Panfilio, Oliver Niehuis.
Abstract
BACKGROUND: The location and modular structure of eukaryotic protein-coding genes in genomic sequences can be automatically predicted by gene annotation algorithms. These predictions are often used for comparative studies on gene structure, gene repertoires, and genome evolution. However, automatic annotation algorithms do not yet correctly identify all genes within a genome, and manual annotation is often necessary to obtain accurate gene models and gene sets. As manual annotation is time-consuming, only a fraction of the gene models in a genome is typically manually annotated, and this fraction often differs between species. To assess the impact of manual annotation efforts on genome-wide analyses of gene structural properties, we compared the structural properties of protein-coding genes in seven diverse insect species sequenced by the i5k initiative.
Keywords: Gene prediction; exon-intron structure; insects; manual annotation; manual curation; structural annotation
Year: 2019; PMID: 31623555; PMCID: PMC6798390; DOI: 10.1186/s12864-019-6064-8
Source DB: PubMed; Journal: BMC Genomics; ISSN: 1471-2164; Impact factor: 3.969
Summary statistics of the genomes, automatically annotated and manually annotated gene sets, and gene model properties for the seven analyzed species
| Property | Set | A. glabripennis | L. decemlineata | A. rosae | O. abietinus | C. lectularius | O. fasciatus | F. occidentalis |
|---|---|---|---|---|---|---|---|---|
| Developmental mode | | Holometabolous | Holometabolous | Holometabolous | Holometabolous | Hemimetabolous | Hemimetabolous | Hemimetabolous |
| Order | | Coleoptera | Coleoptera | Hymenoptera | Hymenoptera | Hemiptera | Hemiptera | Thysanoptera |
| Assembly size [Mbp] (% determined nucleotides) | | 707.7 (85.1) | 1170.2 (58.0) | 163.8 (95.7) | 201.2 (92.7) | 650.5 (79.0) | 1098.7 (70.4) | 415.8 (63.4) |
| AUTO | | 22,253 | 24,732 | 11,956 | 10,966 | 14,085 | 19,587 | 18,021 |
| OGS | | 22,035 | 24,671 | 11,894 | 10,959 | 13,953 | 19,615 | 17,553 |
| AUTO-SUB | | 749 | 972 | 805 | 659 | 795 | 1013 | 1118 |
| AUTO-SUB % of AUTO | | 3.4 | 3.9 | 6.7 | 6.0 | 5.6 | 5.2 | 6.2 |
| MAN-SUB | | 770 | 933 | 825 | 670 | 778 | 945 | 1127 |
| MAN-SUB % of OGS | | 3.5 | 3.8 | 6.9 | 6.1 | 5.6 | 4.8 | 6.4 |
| MAN-ADD | | 216 | 98 | 50 | 30 | 221 | 161 | 381 |
| MAN-ADD % of OGS | | 1.0 | 0.4 | 0.4 | 0.3 | 1.6 | 0.8 | 2.2 |
| Median transcript length [bp] | AUTO-SUB | 6183 | 8562.5 | 4340 | 5200 | 4362 | 9324 | 5001.5 |
| Median transcript length [bp] | MAN-SUB | 5789.5 | 9280 | 3208 | 3996 | 4360 | 11,244 | 4064 |
| Median protein length [aa] | AUTO-SUB | 358 | 255 | 445 | 430 | 358 | 257 | 419.5 |
| Median protein length [aa] | MAN-SUB | 389 | 300 | 423 | 419 | 372.5 | 320 | 419 |
| Median exon count p.t. | AUTO-SUB | 4 | 4 | 6 | 5 | 5 | 4 | 6 |
| Median exon count p.t. | MAN-SUB | 4 | 4 | 5 | 5.5 | 5 | 4 | 6 |
| Median median exon length p.t. [bp] | AUTO-SUB | 1210 | 984 | 2220 | 2151 | 1200 | 1086 | 1807.5 |
| Median median exon length p.t. [bp] | MAN-SUB | 1345.5 | 1127 | 1786 | 1828 | 1194.5 | 1347 | 1755 |
| Median median intron length p.t. [bp] | AUTO-SUB | 354.75 | 1192 | 107.5 | 1278.25 | 75 | 126.75 | 108 |
| Median median intron length p.t. [bp] | MAN-SUB | 359 | 1363 | 100.5 | 1434 | 74 | 123 | 100.75 |
Summary statistics on assemblies and manual annotation actions for each species and selected set-wide property values of MAN-SUB and AUTO-SUB
aa: amino acids; bp: base pairs; det. nucs.: determined nucleotides (i.e., not N); Mbp: mega base pairs; OGS: official gene set; p.t.: per transcript
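The percentage rows of the table follow directly from the gene-model counts. The short Python sketch below recomputes "MAN-SUB % of OGS" and "MAN-ADD % of OGS" from the counts given above. The variable and function names are illustrative only, and the species order is an assumption based on matching the assembly sizes in the table to those quoted in the Fig. 2 legend.

```python
# Gene-model counts copied from the summary table, in column order.
# Species order is inferred from the assembly sizes quoted in the Fig. 2 legend.
species = ["A. glabripennis", "L. decemlineata", "A. rosae", "O. abietinus",
           "C. lectularius", "O. fasciatus", "F. occidentalis"]
ogs = [22035, 24671, 11894, 10959, 13953, 19615, 17553]
man_sub = [770, 933, 825, 670, 778, 945, 1127]
man_add = [216, 98, 50, 30, 221, 161, 381]


def percent_of(part, whole):
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


man_sub_pct = [percent_of(p, w) for p, w in zip(man_sub, ogs)]
man_add_pct = [percent_of(p, w) for p, w in zip(man_add, ogs)]

for name, sub, add in zip(species, man_sub_pct, man_add_pct):
    print(f"{name}: MAN-SUB {sub}% of OGS, MAN-ADD {add}% of OGS")
```

The recomputed values agree with the percentages listed in the table, which shows that in every species fewer than 7% of the OGS gene models were manually curated.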
Fig. 1 Comparison of property distributions: AUTO-SUB vs. MAN-SUB. Distributions (violin plots) of five gene structure properties per genome (semi-logarithmic), comparing AUTO-SUB (top, red) and MAN-SUB (bottom, blue). Facet columns show unspliced transcript length [bp], protein length [aa], exon count p.t., median exon length p.t. [bp], and median intron length p.t. [bp]. Box plots additionally indicate the quartiles of the data distributions; lower and upper hinges correspond to the first and third quartiles. Sample sizes are given (n; AUTO-SUB: red, MAN-SUB: blue). Values are derived from the longest predicted transcript per gene. Adjusted p-values of Bonferroni-corrected two-sample Kolmogorov-Smirnov (KS) tests (black) and two-sample Wilcoxon (W) tests (gray) are indicated for each AUTO-SUB vs. MAN-SUB comparison (per species and property) and are displayed on a gray background if either of them is significant. ns: not significant. Facet rows show the seven species (Anoplophora glabripennis [Coleoptera], Athalia rosae [Hymenoptera], Cimex lectularius [Hemiptera], Frankliniella occidentalis [Thysanoptera], Leptinotarsa decemlineata [Coleoptera], Oncopeltus fasciatus [Hemiptera], Orussus abietinus [Hymenoptera]). Taxonomic orders are color-coded: Coleoptera (yellow), Hymenoptera (orange), Hemiptera (burgundy), and Thysanoptera (brown). The tree on the left illustrates the order-level phylogenetic relationships (after [36]).
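To illustrate the two ingredients named in the legend, the following Python sketch computes the two-sample Kolmogorov-Smirnov statistic D (the largest gap between two empirical distribution functions) and applies a Bonferroni correction to a list of raw p-values. It is a minimal sketch, not the authors' analysis code: the function names and the toy inputs are hypothetical, the number of tests the authors corrected for is not stated in this excerpt, and the conversion of D into a p-value is omitted.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic D: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    # The gap can only change at observed values, so checking those suffices.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def bonferroni(p_values):
    """Multiply each raw p-value by the number of tests and cap the result at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Toy inputs (hypothetical, not data from the study).
d_toy = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
adjusted = bonferroni([0.001, 0.02, 0.5])
print(d_toy, adjusted)
```

Because the correction caps adjusted values at 1, any comparison with a large raw p-value is reported as p adj. = 1, which is why that value appears so often in the figure legends.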
Fig. 2 Comparison of AUTO-SUB and MAN-SUB subsets regarding correlations of... a) ... structural property medians (in rows from top to bottom): median unspliced transcript length [bp], median protein length [aa], median exon count p.t., median median exon length p.t. [bp], and median median intron length p.t. [bp] of AUTO-SUB (circles) and MAN-SUB (triangles) (semi-logarithmic). Notably, manual annotation of genes in the two genomes with the largest assemblies (L. decemlineata, 1170 Mbp; O. fasciatus, 1099 Mbp) led to an increase (from AUTO-SUB to MAN-SUB, W-test) of the median transcript length (L. decemlineata: +717.5 bp, p adj. = 1; O. fasciatus: +1920 bp, p adj. = 0.07) and of the median protein length (L. decemlineata: +45 aa, p adj. = 0.28; O. fasciatus: +63 aa, p adj. = 0.003). In the three species with the smallest assembly sizes in our sample (A. rosae, 163.8 Mbp; O. abietinus, 201.2 Mbp; F. occidentalis, 415.8 Mbp), manual annotation resulted in slight decreases of median transcript length (A. rosae: −1132 bp, p adj. = 0.008; O. abietinus: −1204 bp, p adj. = 1; F. occidentalis: −937.5 bp, p adj. = 1) and median protein length (A. rosae: −21 aa, p adj. = 1; O. abietinus: −11 aa, p adj. = 1; F. occidentalis: −0.5 aa, p adj. = 1). In the two species with intermediate assembly sizes (A. glabripennis, 707.7 Mbp; C. lectularius, 650.5 Mbp), manual annotation resulted in a negligible decrease in median transcript length (A. glabripennis: −393.5 bp, p adj. = 1; C. lectularius: −2 bp, p adj. = 1) and a slight increase in median protein length (A. glabripennis: +31 aa, p adj. = 1; C. lectularius: +14.5 aa, p adj. = 1). b) ... summary metrics (in rows from top to bottom): coding proportion [%] (i.e., the summed lengths of all exonic sequences in the annotation in relation to genome size), intronic proportion [%], total gene count, total exon count, and assembly GC content without ambiguity [%] of AUTO-SUB (circles) and MAN-SUB (triangles) (semi-logarithmic).
Values are derived from the longest predicted transcript per gene. Line types indicate the smoothed conditional mean for AUTO-SUB (solid) and MAN-SUB (dashed). aa: amino acids; bp: base pairs; Mbp: mega base pairs; p.t.: per transcript; W-test: Bonferroni-corrected two-sample Wilcoxon test
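The shifts in median transcript length quoted in the Fig. 2 legend are simple differences between the MAN-SUB and AUTO-SUB medians in the summary table. The Python sketch below recomputes them; the dictionary layout and the function name are illustrative assumptions, while the numbers are copied from the table.

```python
# Median transcript length [bp] from the summary table: (AUTO-SUB, MAN-SUB).
median_transcript_bp = {
    "A. glabripennis": (6183, 5789.5),
    "L. decemlineata": (8562.5, 9280),
    "A. rosae": (4340, 3208),
    "O. abietinus": (5200, 3996),
    "C. lectularius": (4362, 4360),
    "O. fasciatus": (9324, 11244),
    "F. occidentalis": (5001.5, 4064),
}


def median_shifts(table):
    """Return MAN-SUB minus AUTO-SUB for every species in the table."""
    return {sp: man - auto for sp, (auto, man) in table.items()}


transcript_shift = median_shifts(median_transcript_bp)

for sp, delta in transcript_shift.items():
    print(f"{sp}: {delta:+} bp")
```

A positive value means that the manually curated models are longer at the median than the automatic predictions they replaced; a negative value means they are shorter.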
Fig. 3 Selected gene structure property correlations in all sets. a) Exon GC vs. count: Logarithmic display of median exon GC content [%] vs. exon count per transcript. b) Exon length vs. count: Semi-logarithmic display of median exon length [bp] vs. exon count per transcript. c) Intron GC vs. count: Logarithmic display of median intron GC content [%] vs. intron count per transcript. d) Intron length vs. count: Semi-logarithmic display of median intron length [bp] vs. intron count per transcript. Facet columns show the two automatically generated sets (AUTO & AUTO-SUB; left) and three OGS-based sets (OGS & MAN-SUB & MAN-ADD; right). Values are given for the longest transcript per gene. Spearman’s rank correlation coefficients (r) of each property combination are given above each pair of plots (AUTO: orange, AUTO-SUB: red, MAN-ADD: dark green, MAN-SUB: dark blue, OGS: light blue). Facet rows show the seven species (Anoplophora glabripennis [Coleoptera], Athalia rosae [Hymenoptera], Cimex lectularius [Hemiptera], Frankliniella occidentalis [Thysanoptera], Leptinotarsa decemlineata [Coleoptera], Oncopeltus fasciatus [Hemiptera], Orussus abietinus [Hymenoptera]) with color coding according to Fig. 1.
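Spearman's rank correlation, as reported above the plots, is the Pearson correlation computed on ranks, with tied values sharing their mean rank. The following pure-Python sketch shows the calculation on toy data. The function names and the example values (exon counts and median exon lengths) are hypothetical and are not taken from the study.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data (hypothetical): exon count per transcript vs. median exon length [bp].
exon_count = [1, 2, 3, 5, 8, 13]
median_exon_length = [1500, 900, 400, 250, 180, 150]
r_toy = spearman(exon_count, median_exon_length)
print(r_toy)
```

Because only ranks enter the calculation, the coefficient is unaffected by the logarithmic or semi-logarithmic axis scaling used in the plots.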