Reuben W Nowell1, Ben Elsworth1, Vicencio Oostra2, Bas J Zwaan3, Christopher W Wheat4, Marjo Saastamoinen5, Ilik J Saccheri6, Arjen E Van't Hof6, Bethany R Wasik7, Heidi Connahs8, Muhammad L Aslam8, Sujai Kumar1, Richard J Challis1, Antónia Monteiro7,8,9, Paul M Brakefield10, Mark Blaxter1.
Abstract
The mycalesine butterfly Bicyclus anynana, the "Squinting bush brown," is a model organism in the study of lepidopteran ecology, development, and evolution. Here, we present a draft genome sequence for B. anynana to serve as a genomics resource for current and future studies of this important model species. Seven libraries with insert sizes ranging from 350 bp to 20 kb were constructed using DNA from an inbred female and sequenced using both Illumina and PacBio technology; 128 Gb of raw Illumina data was filtered to 124 Gb and assembled to a final size of 475 Mb (∼×260 assembly coverage). Contigs were scaffolded using mate-pair, transcriptome, and PacBio data into 10 800 sequences with an N50 of 638 kb (longest scaffold 5 Mb). The genome comprises 26% repetitive elements and encodes a total of 22 642 predicted protein-coding genes. Recovery of a BUSCO set of core metazoan genes was almost complete (98%). Overall, these metrics compare well with other recently published lepidopteran genomes. We report a high-quality draft genome sequence for Bicyclus anynana. The genome assembly and annotated gene models are available at LepBase (http://ensembl.lepbase.org/index.html).
Keywords: bicyclus anynana; lepidopteran genome; nymphalidae, nymphalid; satyrid; squinting bush brown
Year: 2017 PMID: 28486658 PMCID: PMC5493746 DOI: 10.1093/gigascience/gix035
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Figure 1: Wet-season morph of Bicyclus anynana (picture credit: William H. Piel and Antónia Monteiro).
Data counts and library information.
| Library type | Platform | Read length | Insert size (expected) | Number of reads (raw) | Number of reads (trimmed) | Number of bases (trimmed) | SRA run accessions |
|---|---|---|---|---|---|---|---|
| Short insert | Illumina HiSeq2500 | 125 bp paired-end | 350 bp | 271 808 057 pairs | 267 241 712 (98.3%) | 66 334 099 834 (97.6%) | ERR1102671-2, ERR1102675-6 |
| Short insert | Illumina HiSeq2500 | 125 bp paired-end | 550 bp | 241 050 065 pairs | 234 269 871 (97.2%) | 57 913 474 128 (96.1%) | ERR1102673-4, ERR1102677-8 |
| Mate pair | Illumina HiSeq2500 | 100 bp paired-end | 3 kb | 77 105 680 pairs | 31 848 200 (41.3%) | 5 758 856 502 (37.3%) | ERR1750945 |
| Mate pair | Illumina MiSeq | 100 bp paired-end | 3 kb | 5 641 764 pairs | 2 170 610 (38.5%) | 397 993 018 (35.3%) | ERR754051 |
| Mate pair | Illumina HiSeq2500 | 100 bp paired-end | 5 kb | 77 614 870 pairs | 45 676 725 (58.9%) | 8 203 769 131 (52.8%) | ERR1750946 |
| Mate pair | Illumina MiSeq | 100 bp paired-end | 5 kb | 7 939 601 pairs | 4 734 000 (59.6%) | 861 352 793 (54.2%) | ERR754052 |
| Long read | PacBio P6 | 0.80–50 kb | 10 kb | 1 388 796 | 1 199 064 (86.3%) | 4 086 394 966 | ERR1797559-74 |
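The coverage figure quoted in the abstract can be checked against this table. The following is a minimal sketch, assuming that the 124 Gb of filtered Illumina data corresponds to the two short-insert libraries (their trimmed base counts sum to that value) and that the 475.4 Mb assembly span from the summary table is the denominator; the variable names are illustrative only.

```python
# Trimmed base counts for the two short-insert libraries (from the table above).
short_insert_bases = [66_334_099_834, 57_913_474_128]

# Final assembly span in bp, taken from the assembly summary table (475.4 Mb).
assembly_span_bp = 475.4e6

total_bases = sum(short_insert_bases)
coverage = total_bases / assembly_span_bp

print(f"Total short-insert data: {total_bases / 1e9:.1f} Gb")
print(f"Approximate assembly coverage: x{coverage:.0f}")
```

Under these assumptions the result lands close to the reported value of roughly ×260.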
Figure 2: Kmer frequency distribution for B. anynana short-insert libraries (k = 31). The bimodality of the distribution, with peaks at approximately ×105 and ×210, is the result of heterozygosity in the sequence data.
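To illustrate what a kmer frequency distribution is, here is a toy sketch in Python. It is not the tool used to produce Figure 2 (which this record does not name); the function `kmer_spectrum` and the toy reads are hypothetical, and unlike most real kmer counters the sketch does not merge a kmer with its reverse complement. In real data, kmers spanning a heterozygous site occur on only one haplotype and so appear at about half the coverage of homozygous kmers, which would produce two peaks such as the ×105 and ×210 reported in the caption.

```python
from collections import Counter


def kmer_spectrum(reads, k):
    """Return {multiplicity: number of distinct kmers seen that many times}."""
    counts = Counter()
    for read in reads:
        # Slide a window of width k along the read.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    # Histogram of the per-kmer counts.
    return Counter(counts.values())


# Toy reads and a small k, for illustration only (the paper used k = 31).
toy_reads = ["ACGTACGT", "CGTACGTA", "ACGTTCGT"]
spectrum = kmer_spectrum(toy_reads, k=4)

for multiplicity in sorted(spectrum):
    print(multiplicity, spectrum[multiplicity])
```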
Figure 3: Taxon-annotated GC-coverage plots for (a) draft and (b) final B. anynana genome assemblies. Each contig/scaffold in the assembly is represented by a circle, coloured according to the best match to taxonomically annotated sequence databases (see legends) and distributed according to the proportion GC (x-axis) and read coverage (y-axis). The upper- and right-hand panels show the distribution of the total span (kb) of contigs/scaffolds for a given coverage (upper panel) or GC (right panel) bin. The heterozygosity in the sample is evident in the bimodal coverage distribution seen in (a). The cluster of orange-coloured contigs at a lower coverage and higher GC than the main cloud were likely derived from contaminant Enterococcus present in the sample. The final assembly (b) shows the effective collapse of heterozygous regions, the removal of contaminant sequences, and the scaffolding of contigs into long contiguous sequences. Note that only taxon annotations with a span > 1 Mb are shown in the legend for clarity.
Summary of B. anynana genome assembly and comparison to selected lepidopteran genomes.
| | B. anynana | B. mori | D. plexippus | H. melpomene | M. cinxia |
|---|---|---|---|---|---|
| Assembly version | 1.2 | ASM15162v1 | 3 | Hmel2 | MelCinx1.0 |
| Span | 475.4 Mb | 481.8 Mb | 248.6 Mb | 275.2 Mb | 389.9 Mb |
| Contigs | | | | | |
| Number | 23 699 | 88 673 | 10 682 | 3100 | 48 180 |
| N50 | 78.7 kb | 15.5 kb | 111.0 kb | 328.9 kb | 14.1 kb |
| NumN50b | 1543 | 8075 | 548 | 214 | 7366 |
| Scaffolds | | | | | |
| Number | 10 800 | 43 379 | 5397 | 795 | 8261 |
| N50 | 638.3 kb | 4008.4 kb | 715.6 kb | 2102.7 kb | 119.3 kb |
| NumN50 | 194 | 38 | 101 | 34 | 970 |
| N90 | 99.3 kb | 61.1 kb | 160.5 kb | 273.1 kb | 29.6 kb |
| NumN90 | 909 | 258 | 366 | 176 | 3396 |
| Shortest/longest | 201 b/5 Mb | 53 b/16.2 Mb | 300 b/6.2 Mb | 394 b/9.4 Mb | 1.5 kb/668 kb |
| G+C content | 36.5% | 37.7% | 31.6% | 32.8% | 32.6% |
| NNNs | | | | | |
| Span | 5.8 Mb (1.2%) | 50.1 Mb (10.4%) | 6.7 Mb (2.7%) | 986 kb (0.4%) | 28.9 Mb (7.4%) |
| N50 | 1.4 kb | 5.0 kb | 2.5 kb | 2.4 kb | 1.4 kb |
| CEGMAc ( | C: 81.1%; D: 1.1; F: 97.2% | C: 76.6%; F: 96.8% | C: 90.3%; F: 96% | C: 88.7%; F: 96.8% | NA |
| BUSCOc ( | C: 98.3%; D: 1%; F: 99.2% | C: 97.5%; D: 0.5%; F: 98.4% | C: 97.4%; D: 8.6%; F: 98.5% | C: 98.8%; D: 0.7%; F: 99.3% | C: 85.7%; D: 0.2%; F: 91.8% |
a N50: the length of the contig/scaffold at which 50% of the genome span is accounted for, given a list of sequences sorted by length.
b NumN50: the number of sequences required to reach the N50 sequence.
c CEGMA/BUSCO notation: C, proportion (%) of genes completely recovered; D, duplication rate; F, proportion (%) of genes at least partially recovered (including complete genes); n, number of queries. Note that duplication rate (D) for CEGMA is given as the average number of (complete) genes recovered, whereas for BUSCO it is the proportion of complete genes recovered multiple times. BUSCO values are based on comparisons to the Arthropoda gene set.
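The N50 and NumN50 definitions in the footnotes can be expressed as a short function. This is a minimal sketch under the definitions given here, not the authors' script; the name `nx_stats` and the toy lengths are hypothetical. Setting `fraction=0.9` gives the N90 and NumN90 analogues.

```python
def nx_stats(lengths, fraction=0.5):
    """Return (Nx, NumNx) for a collection of contig or scaffold lengths.

    Nx is the length of the sequence at which `fraction` of the total span
    is reached when sequences are sorted longest first; NumNx is how many
    sequences were needed to get there.
    """
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count
    raise ValueError("lengths must be non-empty")


# Toy scaffold lengths, for illustration only.
toy_lengths = [20, 100, 40, 80, 60]
n50, num_n50 = nx_stats(toy_lengths)
n90, num_n90 = nx_stats(toy_lengths, fraction=0.9)
print(n50, num_n50, n90, num_n90)
```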
Major types of repeat content for B. anynana.
| Repeat type | Span (Mb) | Proportion of genome |
|---|---|---|
| SINE | 10.8 | 2.3% |
| LINE | 15.3 | 3.2% |
| LTR elements | 1.1 | 0.2% |
| DNA elements | 0.8 | 0.2% |
| Small RNA | 10.8 | 2.3% |
| Unclassified | 86.2 | 18.1% |
| Total | 122.6 | 25.8% |
Number of genes in potential error categories.
| Category | Description | Number of genes |
|---|---|---|
| (a) | Single-exon | 7112 |
| (b) | Small exon (<9 bp) | 1866 |
| (c) | Small intron (≤40 bp) | 45 |
| (d) | Short (CDS < 120 bp) | 127 |
| (e) | No hit to | 6532 |
| (f) | Duplicate (≥98% identity over ≥98% query length) | 822 |
| Totala | | 4080 |
a Defined as the non-redundant total of the intersection of each category (a) to (d) with category (e), plus the shorter of any duplicates identified in category (f).
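The footnote's definition of the total amounts to a small piece of set arithmetic. The sketch below assumes each category is available as a set of gene identifiers; the function name `flag_potential_errors` and all gene IDs are hypothetical and serve only to show how overlapping categories are counted once.

```python
def flag_potential_errors(structural, no_hit, shorter_duplicates):
    """Combine error categories as described in the table footnote.

    structural: dict mapping categories (a)-(d) to sets of gene IDs
    no_hit: set of gene IDs in category (e)
    shorter_duplicates: the shorter member of each duplicate pair in (f)
    """
    flagged = set()
    for members in structural.values():
        # Keep only genes that are in this category AND in category (e).
        flagged |= members & no_hit
    # Using a set makes the total non-redundant across categories.
    return flagged | shorter_duplicates


# Hypothetical gene IDs, for illustration only.
structural = {
    "a": {"g1", "g2", "g3"},
    "b": {"g2", "g4"},
    "c": {"g5"},
    "d": set(),
}
no_hit = {"g2", "g3", "g5", "g6"}
shorter_duplicates = {"g3", "g7"}

flagged = flag_potential_errors(structural, no_hit, shorter_duplicates)
print(sorted(flagged))
```

In this toy case, g2 falls in both (a) and (b) but is counted once, and g6 is not flagged because it belongs to (e) alone.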
Summary of B. anynana gene prediction.
| | B. anynana | B. mori | D. plexippus | H. melpomene | M. cinxia |
|---|---|---|---|---|---|
| Assembly version | 1.2 | ASM15162v1 | 3 | Hmel2 | MelCinx1.0 |
| Number of CDS | 22 642 | 19 618 | 15 130 | 13 178 | 16 668 |
| Mean length | 1.4 kb | 1.6 kb | 1.4 kb | 1.3 kb | 958 bp |
| Median length | 1.2 kb | 1.2 kb | 981 bp | 927 bp | 693 bp |
| Min/max | 84 bp/28.3 kb | 23 bp/60.3 kb | 9 bp/58.9 kb | 45 bp/46.4 kb | 6 bp/45.4 kb |
| Introns | | | | | |
| Mean number per gene | 4.4 | 9.9 | 5.7 | 5 | NA |
| Length (mean/median) | 1.3/0.6 kb | 2.4/0.8 kb | 795/280 bp | 960/416 bp | NA |
| Exons | | | | | |
| Length (mean/median) | 208/126 bp | 283/161 bp | 206/149 bp | 284/157 bp | NA |
| Number of single-exon genes | 3571 | 1744 | 1461 | 3113 | NA |
| Transcript GC | 49.2% | 48.3% | 46.5% | 43% | 41.7% |
| Gene frequencyb (genes per Mb) | 47.7 | 32.1 | 60.9 | 55.5 | NA |
a GFF for M. cinxia not available.
b Defined as the number of genes divided by the total genome span (Mb).
Figure 4: Assembly and gene prediction comparison among 10 lepidopteran genomes. (a) Cumulative assembly curves showing the relationship between the number of scaffolds (x-axis) and the cumulative span of each assembly (y-axis), coloured by species. Higher-quality assemblies are represented by an almost-vertical line (e.g., H. melpomene Hmel2 assembly in black), indicating that a relatively small number of scaffolds is required to reach the final genome span; conversely, a long tail indicates that the assembly includes a large number of smaller scaffolds. The curve for B. anynana (brown and bold) suggests a good assembly for this species, with the majority of the assembly comprised of relatively few scaffolds. (b) B. anynana v. 1.2 encodes the greatest number of genes of the 10 genomes and is particularly different from B. mori, which is of equivalent length. Species names/colours are as follows: “bicyclus” (brown), B. anynana; “bombyx” (blue), B. mori; “danaus” (light green), D. plexippus; “heliconius” (black), H. melpomene; “lerema” (dark green), L. accius; “melitaea” (orange), M. cinxia; “glaucus” (red), P. glaucus; “polytes” (pink), P. polytes; “xuthus” (violet), P. xuthus; “plutella” (grey), P. xylostella.
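The cumulative assembly curve in panel (a) can be sketched as a running sum over scaffold lengths sorted longest first. This is an illustrative sketch rather than the plotting code behind Figure 4; `cumulative_span` and both toy assemblies are hypothetical.

```python
from itertools import accumulate


def cumulative_span(lengths):
    """Cumulative span after adding scaffolds one by one, longest first."""
    return list(accumulate(sorted(lengths, reverse=True)))


# Two toy assemblies with the same total span (300), for illustration only.
contiguous = cumulative_span([150, 100, 50])
fragmented = cumulative_span([60, 50, 40, 40, 30, 30, 20, 20, 10])

# The y-values of each curve; the x-values are simply 1, 2, 3, ...
print(contiguous)
print(fragmented)
```

The contiguous toy assembly reaches its full span in three steps, which would plot as a near-vertical line, whereas the fragmented one needs nine and shows the long tail described in the caption.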