Christine G Elsik, Kim C Worley, Anna K Bennett, Martin Beye, Francisco Camara, Christopher P Childers, Dirk C de Graaf, Griet Debyser, Jixin Deng, Bart Devreese, Eran Elhaik, Jay D Evans, Leonard J Foster, Dan Graur, Roderic Guigo, Katharina Jasmin Hoff, Michael E Holder, Matthew E Hudson, Greg J Hunt, Huaiyang Jiang, Vandita Joshi, Radhika S Khetani, Peter Kosarev, Christie L Kovar, Jian Ma, Ryszard Maleszka, Robin F A Moritz, Monica C Munoz-Torres, Terence D Murphy, Donna M Muzny, Irene F Newsham, Justin T Reese, Hugh M Robertson, Gene E Robinson, Olav Rueppell, Victor Solovyev, Mario Stanke, Eckart Stolle, Jennifer M Tsuruda, Matthias Van Vaerenbergh, Robert M Waterhouse, Daniel B Weaver, Charles W Whitfield, Yuanqing Wu, Evgeny M Zdobnov, Lan Zhang, Dianhui Zhu, Richard A Gibbs.
Abstract
BACKGROUND: The first generation of genome sequence assemblies and annotations have had a significant impact upon our understanding of the biology of the sequenced species, the phylogenetic relationships among species, the study of populations within and across species, and have informed the biology of humans. As only a few Metazoan genomes are approaching finished quality (human, mouse, fly and worm), there is room for improvement of most genome assemblies. The honey bee (Apis mellifera) genome, published in 2006, was noted for its bimodal GC content distribution that affected the quality of the assembly in some regions and for fewer genes in the initial gene set (OGSv1.0) compared to what would be expected based on other sequenced insect genomes.
Year: 2014 PMID: 24479613 PMCID: PMC4028053 DOI: 10.1186/1471-2164-15-86
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Additional sequence data for the improved honey bee genome
| Number of reads | 2.9 M | 5.24 M | 90 M |
| Read length | 290 bp | 56 bp | 50 bp |
| Pairs | No | Yes | Yes |
| Sequence coverage | 3.6× | 1.3× | 20× |
Assembly statistics for the improved honey bee genome
| Amel_4.5 | Anchored | 340 | 1,209 | 203,000 | 200,000 |
| | Scaffolds | 5,644 | 997 | 250,271 | 229,734 |
| | Contigs | 16,501 | 46 | 229,734 | 229,734 |
| Amel_4.0 | Anchored | 626 | 621/135 | 217,195 | 183,323 |
| | Scaffolds | 10,742 | 359 | 315,719 | 231,029 |
| | Contigs | 18,944 | 40 | 231,029 | 231,029 |
a. For Amel_4.0 Anchored scaffolds, N50 was calculated separately for 320 oriented and 306 non-oriented scaffolds.
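The N50 values in the table summarize contiguity. As a minimal sketch, assuming the common definition (the largest length L such that sequences of length at least L together contain at least half of all bases), N50 could be computed as follows. The record does not describe how the authors computed it, so the function name `n50` and the toy lengths are illustrative assumptions.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Assumes the common definition: the largest length L such that
    sequences of length >= L together hold at least half of all bases.
    """
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    # Walk from longest to shortest until half of the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy scaffold lengths (illustrative only, not taken from Amel_4.5).
toy_lengths = [100, 80, 60, 40, 20]
print(n50(toy_lengths))
```

Computing the statistic separately for two subsets, as the footnote describes for oriented and non-oriented scaffolds, would simply mean calling the function once per subset.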
Assembly comparison to 454 transcriptome data
| Abdomen | 14,614 | 13,980 | 95.6% | 13,987 | 95.7% |
| Brain + ovary | 27,412 | 26,341 | 96.0% | 26,342 | 96.0% |
| Embryo | 19,616 | 18,565 | 94.6% | 18,565 | 94.6% |
| Larvae | 18,050 | 9,061 | 50.1% | 9,041 | 50.0% |
| Mixed antennae | 14,891 | 13,868 | 93.1% | 13,865 | 93.1% |
| Ovary | 28,451 | 27,500 | 96.6% | 26,929 | 94.6% |
| Testes | 10,557 | 9,234 | 87.4% | 9,060 | 85.8% |
| Total | 133,591 | 118,549 | 88.7% | 117,789 | 88.2% |
Note: Assembled transcripts were aligned to the genome assemblies with BLAT using default parameters; matches of any length at 95% identity were counted.
Figure 1. Distribution of mapped 454 reads with respect to AT content. The genomic reads were mapped to the Amel_4.5 assembly (scaffolds and contigs) using BLAT. With relatively stringent filtering (at least 80% of total length matched and gap size < 30%), 242,284 reads (93% of all reads) were aligned to the assembly. Most reads (236,090, 93%) aligned to fewer than 10 locations, and 210,625 (87%) had unique alignments. The AT content for each alignment (adding a 10% extension on either end) was calculated for reads with ≤ 10 match locations.
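The filtering and AT-content steps in this caption can be sketched roughly as below. This is an interpretation, not the authors' code: it assumes that "gap size" means total gap bases as a fraction of read length, and that the 10% extension is 10% of the alignment length added on each side and clipped to the sequence bounds. The function names `passes_filter` and `at_content` are hypothetical.

```python
def passes_filter(read_len, matched_bases, gap_bases,
                  min_match_frac=0.80, max_gap_frac=0.30):
    """Apply the caption's thresholds to one alignment (interpretation assumed)."""
    return (matched_bases / read_len >= min_match_frac
            and gap_bases / read_len < max_gap_frac)


def at_content(sequence, start, end, extension=0.10):
    """AT fraction of sequence[start:end], widened by `extension` of the
    alignment length on each side and clipped to the sequence bounds."""
    pad = int(round((end - start) * extension))
    lo = max(0, start - pad)
    hi = min(len(sequence), end + pad)
    window = sequence[lo:hi].upper()
    if not window:
        raise ValueError("empty window")
    return sum(window.count(base) for base in "AT") / len(window)


# Toy example: a 10-base GC-rich alignment flanked by A/T bases.
toy_seq = "AA" + "GCGCGCGCGC" + "TT"
print(passes_filter(read_len=100, matched_bases=85, gap_bases=10))
print(at_content(toy_seq, start=2, end=12))
```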
Figure 2. GC content of genic regions and overall genome assemblies. For each gene, the GC content (percent G + C nucleotides) of genomic regions containing the gene was determined as described in methods. The cumulative distributions of GC content for overall genome assemblies (thin red line for Amel_4.5 and thin black line for Amel_2.0) show that the Amel_4.5 assembly has a higher fraction of low GC content regions than does the Amel_2.0 assembly (note the thin red line is to the left of the thin black line below about 28% GC). The cumulative distributions of GC content for the regions containing genes (thick red line for all OGSv3.2, thick green line for Previously Known genes, thick blue line for Type I New genes and thick pink line for Type II New genes) show that regions containing genes are lower in GC content than the overall genome. This trend applies for the complete set of OGSv3.2 genes, as well as the three subsets. The distribution for Type I New genes lies to the left of the other distributions, showing that Type I New genes are located in lower GC content regions than the other gene subsets. The distribution for Type II New genes is to the right of the distributions for the other gene subsets, showing that the Type II New genes are located in higher GC content regions.
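To make the "left of" and "right of" readings concrete, here is a small sketch of percent GC and of one point on an empirical cumulative distribution. It assumes ambiguous bases such as N are excluded from the denominator; the record does not state how the authors handled them. The function names and toy values are illustrative.

```python
def gc_percent(sequence):
    """Percent G + C among unambiguous A/C/G/T bases."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    if acgt == 0:
        raise ValueError("no unambiguous bases")
    return 100.0 * gc / acgt


def cumulative_fraction(values, threshold):
    """Fraction of values at or below threshold (one point on an empirical CDF)."""
    if not values:
        raise ValueError("need at least one value")
    return sum(v <= threshold for v in values) / len(values)


# Toy GC percentages for two hypothetical sets of regions.
set_a = [20.0, 25.0, 30.0, 40.0]
set_b = [26.0, 31.0, 35.0, 45.0]
print(cumulative_fraction(set_a, 28.0), cumulative_fraction(set_b, 28.0))
```

A curve that lies to the left has the larger cumulative fraction at a given GC threshold, which is how the caption infers that a set of regions is lower in GC content.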
Comparison of OGSv1.0 and OGSv3.2
| Feature | OGSv1.0 | OGSv3.2 |
| --- | --- | --- |
| Number of genes | 10,157 | 15,314 |
| Number of genes within mapped scaffolds (% of total no. of genes) | 5,973 (58.8%) | 13,285 (86.8%) |
| Number of genes within un-mapped scaffolds (% of total no. of genes) | 4,184 (41.2%) | 2,029 (13.2%) |
| Average coding sequence length (bp) | 1,623 | 1,266 |
| Average number of coding exons | 6.4 | 5.3 |
| Number of single coding exon genes (% of total no. of genes) | 795 (7.8%) | 2,059 (13.4%) |
| Number of multi-coding exon genes (% of total no. of genes) | 9,362 (92.2%) | 13,255 (86.6%) |
| Number of genes with spliced EST coverage (% of total no. of genes) | 3,039 (29.9%) | 12,172 (79.5%) |
| Number of genes with un-spliced EST coverage (% of total no. of genes) | 1,734 (17.1%) | 11,019 (72%) |
| Number of genes that overlap a protein alignment (% of total no. of genes) | 7,940 (78.2%) | 6,778 (44.3%) |
Figure 3. Distribution of coding sequence lengths in OGSv3.2 and OGSv1.0. Histogram plots showing the number of genes having “X” coding sequence length in bins of 20 nt are illustrated using points instead of lines to allow visualization of both distributions. The range in coding sequence length extends to 70,263 and 53,649 in OGSv3.2 (blue) and OGSv1.0 (red), respectively, but this figure zooms in to show lengths only up to 5,000 nt. There were 386 and 344 genes with coding sequences longer than 5,000 nt in OGSv3.2 and OGSv1.0, respectively. This figure shows that the increased number of genes in OGSv3.2 is largely due to increased numbers of short genes. The number of larger genes is not decreased, so gene splitting is not likely a major source of additional genes.
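The binning behind such a histogram can be sketched as follows. The bin edges are an assumption (bins start at 0 and are closed on the left, so a 35 nt coding sequence falls in the 20 to 39 bin), and the function name `cds_length_histogram` is hypothetical.

```python
from collections import Counter


def cds_length_histogram(lengths, bin_width=20, max_length=5000):
    """Count genes per CDS-length bin, keyed by the bin's lower edge.

    Lengths above max_length are tallied separately, mirroring the
    caption's cut-off at 5,000 nt.
    """
    bins = Counter()
    overflow = 0
    for length in lengths:
        if length > max_length:
            overflow += 1
        else:
            bins[(length // bin_width) * bin_width] += 1
    return dict(sorted(bins.items())), overflow


# Toy CDS lengths in nt (illustrative only).
hist, longer = cds_length_histogram([30, 35, 45, 330, 6000])
print(hist, longer)
```

Comparing two gene sets then amounts to computing one histogram per set and plotting both as points over the same bins.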
New and previously known OGSv3.2 genes
| Analysis | Metric | All OGSv3.2 | Type I New | Type II New | Previously known |
| --- | --- | --- | --- | --- | --- |
| | Number of genes (% of total OGSv3.2 genes) | 15,314 (100%) | 782 (5.1%) | 3,953 (25.8%) | 10,579 (69.1%) |
| Scaffold analysis | Number of genes within mapped scaffolds (% of no. of gene type) | 13,285 (86.8%) | 544 (69.6%) | 3,199 (80.9%) | 9,542 (90.2%) |
| | Number of genes within un-mapped scaffolds (% of no. of gene type) | 2,029 (13.2%) | 238 (30.4%) | 754 (19.1%) | 1,037 (9.8%) |
| CDS analysis | Average CDS length (bp) | 1,266 | 1,172 | 330 | 1,622 |
| | Average no. CDS exons | 5.3 | 5.6 | 2.1 | 6.5 |
| | Number of single CDS exon genes (% of no. of gene type) | 2,059 (13.4%) | 101 (12.9%) | 1,239 (31.3%) | 719 (6.8%) |
| | Number of multi-CDS exon genes (% of no. of gene type) | 13,255 (86.6%) | 681 (87.1%) | 2,714 (68.7%) | 9,860 (93.2%) |
| Intron analysis | Number of introns (% of total OGSv3.2 introns) | 66,212 (100%) | 3,585 (5.4%) | 4,333 (6.5%) | 58,294 (88%) |
| | Number of introns validated by EST intron coordinates (% of introns of gene type) | 54,514 (82.3%) | 2,573 (71.8%) | 1,930 (44.5%) | 50,011 (85.8%) |
| Peptide analysis | Number of genes with a peptide match (% of no. of gene type) | 3,631 (23.7%) | 132 (16.9%) | 82 (2.1%) | 3,417 (32.3%) |
| Protein analysis | No. of genes with overlap to at least one protein alignment (% of no. of gene type) | 6,778 (44.3%) | 270 (34.5%) | 186 (4.7%) | 6,322 (59.8%) |
| | No. of genes with overlap to a Dmel protein alignment (% of no. of gene type) | 1,205 (7.9%) | 38 (4.9%) | 13 (0.3%) | 1,154 (10.9%) |
| Total spliced and un-spliced expressed sequence support | No. of genes with overlap to at least one transcript alignment from any of the ten libraries (% of no. of gene type) | 13,517 (88.3%) | 704 (90.0%) | 2,771 (70.1%) | 10,042 (94.9%) |
| Spliced expressed sequence analysis | No. of genes with overlap to at least one spliced transcript alignment from each of the ten libraries (% of no. of gene type) | 1,062 (6.9%) | 32 (4.1%) | 15 (0.4%) | 1,015 (9.6%) |
| | No. of genes with overlap to at least one spliced transcript alignment from any of the ten libraries (% of no. of gene type) | 12,172 (79.5%) | 622 (79.5%) | 2,110 (53.4%) | 9,440 (89.2%) |
| | No. of genes without overlap to any spliced transcript alignments in any of the ten libraries (% of no. of gene type) | 3,142 (20.5%) | 160 (20.5%) | 1,843 (46.6%) | 1,139 (10.8%) |
| | Genes broadly expressed across four tissues (% of no. of gene type) | 2,326 (15.2%) | 60 (7.7%) | 95 (2.4%) | 2,171 (20.5%) |
| | Genes narrowly expressed in only a single tissue (% of no. of gene type) | 3,346 (21.8%) | 234 (29.9%) | 1,139 (28.8%) | 1,973 (18.7%) |
| | No. of genes without overlap to any spliced transcript alignments in any of the four tissues (% of no. of gene type) | 3,632 (23.7%) | 192 (24.6%) | 1,985 (50.2%) | 1,455 (13.8%) |
| Analysis of alignments to other bee genomes | No. of genes that align to Aflo_1.0 (% of no. of gene type) | 13,491 (88.1%) | 566 (72.4%) | 2,584 (65.4%) | 10,341 (97.8%) |
| | No. of genes that align to Bter_1.0 (% of no. of gene type) | 12,262 (80.1%) | 527 (67.4%) | 1,566 (39.6%) | 10,169 (96.1%) |
| Evidence-supported genes | No. of genes with overlap to at least one form of biological evidence (% of no. of gene type) | 14,084 (92.0%) | 713 (91.2%) | 2,930 (74.1%) | 10,441 (98.7%) |
| | No. of genes that align to Aflo_1.0 and/or Bter_1.0 and/or overlap at least one form of biological evidence (% of no. of gene type) | 14,836 (96.9%) | 734 (93.9%) | 3,555 (89.9%) | 10,547 (99.7%) |
| GC analysis | Number of genes on GC compositional domains >10 kb (% of OGSv3.2 total) | 15,224 (99.4%) | 777 (5.1%) | 3,923 (25.8%) | 10,524 (69.1%) |
| | Avg. GC content of compositional domain gene resides in | 29.60% | 26.40% | 32.00% | 28.90% |
| ENC analysis | Effective number of codons | 44.95 | 41.97 | 45.69 | 44.9 |
Genes were mapped to Amel_2.0 assembly with stringent mapping criteria of 80% gene coverage and 95% identity. Biological evidence includes transcript overlap (spliced or un-spliced), peptide hit, protein homolog alignment overlap, or InterPro domain presence.
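The two rules in this note can be expressed as simple predicates. The sketch below assumes the thresholds are inclusive and that each gene is represented by a dictionary of boolean flags. The key names and function names are hypothetical and do not come from the authors' pipeline.

```python
# Hypothetical flag names for the four evidence types listed in the note.
EVIDENCE_KEYS = (
    "transcript_overlap",         # spliced or un-spliced
    "peptide_hit",
    "protein_alignment_overlap",
    "interpro_domain",
)


def maps_to_assembly(coverage, identity, min_coverage=0.80, min_identity=0.95):
    """Stringent mapping rule: at least 80% gene coverage and 95% identity."""
    return coverage >= min_coverage and identity >= min_identity


def has_biological_evidence(gene):
    """True if the gene has at least one of the listed evidence types."""
    return any(gene.get(key, False) for key in EVIDENCE_KEYS)


# Toy gene record (illustrative only).
gene = {"transcript_overlap": False, "interpro_domain": True}
print(maps_to_assembly(0.85, 0.97), has_biological_evidence(gene))
```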
Figure 4. Insect orthologs in two gene sets (V2 and OGSv3.2). For each species, counts of near-universal orthologous groups that are missing an ortholog in that species, or in that species and one other species, are shown. Total counts are divided into groups with only single-copy orthologs and those with gene duplications, further divided into those with only one missing species and those with two missing species.
Repetitive elements in the genome
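| Group | No. of elements | No. of fragments | Cumulative length (bp) | % of genome | A | B | C | D | E |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | | | 22,134,229 | 9.46 | | | | | |
| | | | 9,486,745 | 4.05 | | | | | |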
| SSR | 29,697 | | 1,441,651 | 0.62 | | | | | |
| Low complexity | 31,728 | | 8,001,104 | 3.42 | | | | | |
| Satellite | 5 (0) | 75 (6) | 43,990 | 0.02 | 0 | na | 0 | 0 | 0 |
| | 881 (65) | 28,004 (1102) | 12,647,484 | 5.40 | 25 | 7 | 40 | 2 | 6 |
| | 758 (13) | 21,244 (903) | 9,790,204 | 4.18 | 4 | 1 | 4 | 0 | 0 |
| | 2 (9) | 42 (4) | 49,549 | 0.02 | 1 | 1 | 1 | 0 | 0 |
| Copia | 1 (3) | 29 (3) | 43,892 | 0.02 | 0 | 1 | 1 | 0 | 0 |
| Gypsy | 0 (2) | 0 (0) | 0 | 0.00 | 0 | 0 | 0 | 0 | 0 |
| Bel-Pao | 0 (4) | 0 (0) | 0 | 0.00 | 0 | 0 | 0 | 0 | 0 |
| Unclassified LTR retrotransposons | 1 (0) | 13 (1) | 5,657 | 0.00 | 1 | 0 | 0 | 0 | 0 |
| | 2 (0) | 9 (3) | 12,472 | 0.01 | 0 | 0 | 2 | 0 | 0 |
| | 3 (4) | 140 (3) | 83,103 | 0.04 | 2 | 0 | 1 | 0 | 0 |
| R2 (NeSL, R2, R4, CRE) | 2 (1) | 112 (2) | 72,107 | 0.03 | 2 | 0 | 0 | 0 | 0 |
| Jockey (Rex, Jockey, Cr1, Kiri, L2, crack, Daphne) | 0 (1) | 0 (0) | 0 | 0.00 | 0 | 0 | 0 | 0 | 0 |
| I (R1, I, Nimb, outcast, Tad, Loa) | 1 (1) | 28 (1) | 10,996 | 0.00 | 0 | 0 | 1 | 0 | 0 |
| Unclassified LINE | 0 (1) | 0 (0) | 0 | 0.00 | 0 | 0 | 0 | 0 | 0 |
| | 19 (0) | 222 (29) | 69,938 | 0.03 | 0 | 0 | 0 | 0 | 0 |
| SS-Sine | 5 (0) | 31 (7) | 22,660 | 0.01 | 0 | na | 0 | 0 | 0 |
| Unclassified SINE | 14 (0) | 191 (22) | 47,278 | 0.02 | 0 | na | 0 | 0 | 0 |
| | 1 (0) | 2 (1) | 8,526 | 0.00 | 0 | na | 0 | 0 | 0 |
| | 301 (0) | 16,406 (348) | 7,256,932 | 3.10 | 1 | 0 | 0 | 0 | 0 |
| | 430 (0) | 4,423 (515) | 2,309,684 | 0.99 | 0 | 0 | 0 | 0 | 0 |
| | 51 (52) | 3,209 (93) | 1,339,131 | 0.57 | 7 | 6 | 27 | 2 | 5 |
| | 50 (46) | 3,200 (89) | 1,335,380 | 0.57 | 7 | 6 | 27 | 2 | 5 |
| Tc1/Mariner | 43 (40) | 2,636 (80) | 1,147,521 | 0.49 | 5 | 6 | 25 | 2 | 5 |
| PiggyBac | 2 (6) | 184 (2) | 87,963 | 0.04 | 2 | 0 | 2 | 0 | 0 |
| Unclassified TIR DNA transposons | 5 (0) | 380 (7) | 99,896 | 0.04 | 0 | na | 0 | 0 | 0 |
| | 0 (6) | 0 (0) | 0 | 0.00 | 0 | 0 | 0 | 0 | 0 |
| | 1 (0) | 9 (4) | 3,751 | 0.00 | 0 | na | 0 | 0 | 0 |
| | 72 (0) | 3,551 (106) | 1,518,149 | 0.65 | 14 | na | 9 | 0 | 1 |
| | 158 (0) | 13,760 (250) | 6,934,063 | 2.96 | 17 | 0 | 0 | 0 | 0 |
| Not categorized | 6 (0) | 946 (11) | 1,233,884 | 0.53 | 0 | na | 0 | 0 | 0 |
| Potential host gene (d) | 152 (0) | 12,814 (239) | 5,700,179 | 2.44 | 17 | na | 0 | 0 | 0 |
For each group, the table shows the number of elements (putative families), the number of element fragments or copies in the genome, the cumulative length, the proportion of the genome, and other features: elements containing chimeric or nested inserts of other elements (A); elements that appear to be complete, with all typical structural and coding parts present even if stop codons or frameshifts are present (B); elements with an RT or Tase domain detected (C); potentially active elements that contain an intact ORF with all the typical domains, although these can lack terminal repeats (D); and elements with an intact ORF for the RT domain or parts of the Tase domain that could thus be partly active (E). The elements that could not be categorized or that contained features of A. mellifera coding regions are shown at the bottom; these are probably not transposable elements.
a. The numbers of chimeric/nested elements within elements of other categories are not included in the total numbers of elements.
b. The software uses alignments to identify the longest fragment, which it deems as full-length. The number of full-length copies is also included in the total number of fragments.
c. Additional columns:
A. No. elements containing inserts
B. No. complete elements
C. No. elements with RT or Tase domains
D. No. potentially active elements
E. No. potentially partially active elements
d. Potential host genes were predicted by software using DNA characteristics, not by overlap analysis with gene predictions. An example of a potential host gene element is a coding sequence for a repeated protein domain or motif.