Jinhui Shen1, Qian Cong1, Nick V Grishin2. 1. Department of Biophysics and Biochemistry, University of Texas Southwestern Medical Center, 5323 Harry Hines Boulevard, Dallas, TX 75390-8816, USA. 2. Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, 5323 Harry Hines Boulevard, Dallas, TX 75390-9050, USA ; Department of Biophysics and Biochemistry, University of Texas Southwestern Medical Center, 5323 Harry Hines Boulevard, Dallas, TX 75390-8816, USA.
Abstract
Due to the intriguing morphology, lifecycle, and diversity of butterflies and moths, Lepidoptera are emerging as model organisms for the study of genetics, evolution and speciation. The progress of these studies relies on decoding Lepidoptera genomes, both nuclear and mitochondrial. Here we describe a protocol to obtain mitogenomes from Next Generation Sequencing reads performed for whole-genome sequencing and report the complete mitogenome of Papilio (Pterourus) glaucus. The circular mitogenome is 15,306 bp in length and rich in A and T. It contains 13 protein-coding genes (PCGs), 22 transfer-RNA-coding genes (tRNA), and 2 ribosomal-RNA-coding genes (rRNA), with a gene order typical for mitogenomes of Lepidoptera. We performed phylogenetic analyses based on PCG and RNA-coding genes or protein sequences using Bayesian Inference and Maximum Likelihood methods. The phylogenetic trees consistently show that among species with available mitogenomes Papilio glaucus is the closest to Papilio (Agehana) maraho from Asia.
Due to the intriguing morphology, lifecycle, and diversity of butterflies and moths, Lepidoptera are emerging as model organisms for the study of genetics, evolution and speciation. The progress of these studies relies on decoding Lepidoptera genomes, both nuclear and mitochondrial. Here we describe a protocol to obtain mitogenomes from Next Generation Sequencing reads performed for whole-genome sequencing and report the complete mitogenome of Papilio (Pterourus) glaucus. The circular mitogenome is 15,306 bp in length and rich in A and T. It contains 13 protein-coding genes (PCGs), 22 transfer-RNA-coding genes (tRNA), and 2 ribosomal-RNA-coding genes (rRNA), with a gene order typical for mitogenomes of Lepidoptera. We performed phylogenetic analyses based on PCG and RNA-coding genes or protein sequences using Bayesian Inference and Maximum Likelihood methods. The phylogenetic trees consistently show that among species with available mitogenomes Papilio glaucus is the closest to Papilio (Agehana) maraho from Asia.
The order Lepidoptera contains approximately 160,000 described and half a million estimated species (Kristensen et al., 2007), and it represents one of the most diverse and fascinating group of insects. Recent studies have revealed their potential as model organisms to study the genetics of interesting phenotypic traits in butterflies, such as the Batesian mimicry in swallowtails (Clarke and Sheppard, 1972, Nishikawa et al., 2013), migration in the monarch (Zhan et al., 2011, Zhan et al., 2014) and wing pattern development in longwings (Hines et al., 2012, Surridge et al., 2011). The profound diversity and the recent evolutionary radiation of Lepidoptera provide rich materials to study evolution, speciation and adaptation (Engsontia et al., 2014, Zhang et al., 2013a). These studies benefit significantly from decoding the genomes of various Lepidoptera species.Recently, we published the draft genome of Eastern Tiger swallowtailPapilio (Pterourus) glaucus using next generation sequencing techniques (Cong et al., 2015). This nuclear whole genome was the first reported from the Papilionidae family. Traditional genome assemblers failed to automatically assemble the mitogenome, probably due to the difficulty in distinguishing NGS reads of the mitogenome from those of nuclear genome, the presence of nuclear copies of mitochondrial (NUMT) DNA (which can lead to conflicts in assembly), and the poor signal-to-noise ratio caused by the high coverage of mitochondrial DNA (Hahn et al., 2013). However, a dedicated effort should be able to assemble the mitogenome from whole-genome sequencing reads. The mitogenome sequences are expected to be useful for phylogenetic studies, and they have been obtained for many species. The Papilionidae family has over 570 species worldwide (C.A. Bridges, 1988), while only 17 species have complete mitogenomes currently available in GenBank (included in Table 1, accessed on: 11/28/2014).
Table 1
List of taxa analyzed in present paper.
Family
Species
Length (bp)
Accession number
References
Papilionidae
Papilio glaucus
15,306
KR822739
This study
Papilio bianor
15,357
NC_018040.1
Unpublished
Papilio dardanus
15,337
JX313686.2
Unpublished
Papilio maackii
15,357
KC433408.1
Dong et al. (2013)
Papilio machaon
15,185
HM243594.1
Unpublished
Papilio maraho
16,094
FJ810212.1
Unpublished
Papilio polytes
15,256
KM014701.1
Wang et al. (2014a)
Papilio syfanius
15,359
KJ396621.1
Dong et al. (2014)
Parnassius apollo
15,404
KF746065.1
Chen et al. (2014a)
Parnassius bremeri
15,389
NC_014053.1
Kim et al. (2009)
Parnassius imperator
15,424
KM507326.1
Wang et al. (2014b)
Sericinus montela
15,242
HQ259122.1
Ji et al. (2012)
Luehdorfia taibai
15,553
KC952673.1
Lian-Xi et al. (2014)
Teinopalpus aureus
15,242
HM563681.1
Qin et al. (2012a)
Lamproptera curius
15,277
KJ141168.1
Unpublished
Graphium timur
15,226
KJ472924.1
Chen et al. (2014b)
Atrophaneura alcinous
15,266
KJ540880.1
Chen et al. (2014c)
Troides aeacus
15,263
EU625344.1
Unpublished
Lycaenidae
Coreana raphaelis
15,314
DQ102703.1
Kim et al. (2006)
Cupido argiades
15,330
KC310728.1
Zhang et al. (2013b)
Curetis bulis
15,162
JX262888.1
Zhang et al. (2013c)
Lycaena phlaeas
15,280
JX262887.1
Zhang et al. (2013c)
Protantigius superans
15,248
HQ184265.1
Kim et al. (2011a)
Spindasis takanonis
15,349
HQ184266.1
Kim et al. (2011a)
Riodinidae
Abisara fylloides
15,301
HQ259069.1
Unpublished
Apodemia mormo
15,262
KJ647171.1
Kim and Kim (2014)
Pieridae
Anthocharis bambusarum
15,180
KC465748.1
Unpublished
Aporia crataegi
15,140
JN796473.1
Park et al. (2012)
Aporia intercostata
15,144
KC461928.1
Unpublished
Catopsilia pomona
15,142
JX274649.1
Hao et al. (2014)
Delias hyparete
15,186
JX094279.1
Shi et al. (2012)
Eurema hecabe
15,160
KC257480.1
Sun et al. (2014)
Hebomoia glaucippe
15,701
KC489093.1
Hao et al. (2013a)
Leptidea morsei
15,122
JX274648.1
Hao et al. (2014)
Artogeia melete
15,140
EU597124.1
Hong et al. (2009)
Pieris rapae
15,157
NC_015895.1
Mao et al. (2010)
Hesperiidae
Ampittia dioscorides
15,313
KM102732.1
Unpublished
Carterocephalus silvicola
15,765
KJ629163.1
Kim et al. (2014)
Celaenorrhinus maculosa
15,282
KF543077.1
Wang et al. (2013a)
Choaspes benjaminii
15,300
KJ629164.1
Kim et al. (2014)
Ctenoptilum vasava
15,468
JF713818.1
Hao et al. (2012)
Daimio tethys
15,350
KJ629165.1
Kim et al. (2014)
Erynnis montanus
15,530
KC659955.1
Wang et al. (2014c)
Lobocla bifasciatus
15,366
KJ629166.1
Kim et al. (2014)
Ochlodes venata
15,622
HM243593.1
Unpublished
Potanthus flavus
15,267
KJ629167.1
Kim et al. (2014)
Nymphalidae
Abrota ganga
15,356
KF590536.1
Wu et al. (2014)
Acraea issoria
15,245
GQ376195.1
Hu et al. (2010)
Apatura ilia
15,242
JF437925.1
Chen et al. (2012)
Apatura metis
15,236
JF801742.1
Zhang et al. (2012)
Argynnis childreni
15,131
KF590547.1
Wu et al. (2014)
Argynnis hyperbius
15,156
JF439070.1
Wang et al. (2011)
Athyma asura
15,181
KF590542.1
Wu et al. (2014)
Athyma cama
15,269
KF590526.1
Wu et al. (2014)
Athyma kasa
15,230
KF590524.1
Wu et al. (2014)
Athyma opalina
15,240
KF590551.1
Wu et al. (2014)
Athyma perius
15,277
KF590528.1
Wu et al. (2014)
Athyma selenophora
15,208
KF590529.1
Wu et al. (2014)
Athyma sulpitia
15,268
JQ347260.1
Tian et al. (2012)
Bhagadatta austenia
15,615
KF590545.1
Wu et al. (2014)
Calinaga davidis
15,267
HQ658143.1
Xia et al. (2011)
Danaus chrysippus
15,236
KF690637.1
Gan et al. (2014a)
Danaus plexippus
15,314
KC836923.1
Servin-Garciduenas and Martinez-Romero (2014)
Dichorragia nesimachus
15,355
KF590541.1
Wu et al. (2014)
Dophla evelina
15,320
KF590532.1
Wu et al. (2014)
Euploea core
15,192
KF590546.1
Wu et al. (2014)
Euploea midamus
15,187
KJ866207.1
Unpublished
Euploea mulciber
15,166
HQ378507.1
Hao et al. (2013b)
Euthalia irrubescens
15,365
KF590527.1
Wu et al. (2014)
Fabriciana nerippe
15,140
JF504707.1
Kim et al. (2011b)
Hamadryas epinome
15,207
KM378244.1
Cally et al. (2014)
Heliconius cydno
15,367
KM208636.1
Qian (2014)
Heliconius hecale
15,338
KM068091.1
Shen and Wang (2014)
Heliconius melpomene
15,328
KP100653.1
(Heliconius Genome, 2012, Meng et al., 2014)
Heliconius pachinus
15,369
KM014809.1
Huang et al. (2014a)
Hipparchia autonoe
15,489
GQ868707.1
Kim et al. (2010)
Ideopsis similis
15,200
KJ476729.1
Gan et al. (2014b)
Issoria lathonia
15,172
HM243590.1
Unpublished
Junonia almana
15,256
KF590539.1
Wu et al. (2014)
Junonia orithya
15,214
KF199862.1
Shi et al. (2013a)
Kallima inachus
15,183
JN857943.1
Qin et al. (2012b)
Lexias dirtea
15,250
KF590531.1
Wu et al. (2014)
Libythea celtis
15,164
HQ378508.1
Unpublished
Melanargia asiatica
15,142
KF906486.1
Huang et al. (2014b)
Melanitis leda
15,122
JF905446.1
Shi et al. (2013b)
Melanitis phedima
15,142
KF590538.1
Wu et al. (2014)
Melitaea cinxia
15,162
HM243592.1
Unpublished
Neope pulaha
15,209
KF590543.1
Wu et al. (2014)
Neptis philyra
15,164
KF590552.1
Wu et al. (2014)
Neptis soma
15,130
KF590533.1
Wu et al. (2014)
Pandita sinope
15,257
KF590530.1
Wu et al. (2014)
Pantoporia hordonia
15,603
KF590534.1
Wu et al. (2014)
Parantica sita
15,211
KF590544.1
Wu et al. (2014)
Pararge aegeria
15,240
KJ547676.1
Teixeira da Costa (2014)
Parasarpa dudu
15,236
KF590537.1
Wu et al. (2014)
Parthenos sylvia
15,249
KF590550.1
Wu et al. (2014)
Polyura arja
15,363
KF590540.1
Wu et al. (2014)
Sasakia charonda
15,233
JX119051.1
Wang et al. (2012)
Sasakia funebris
15,233
JX131328.1
Wang et al. (2013b)
Tanaecia julii
15,316
KF590548.1
Wu et al. (2014)
Timelaea maculata
15,178
KC572131.1
Cao et al. (2013)
Tirumala limniace
15,285
KJ784473.1
Gan et al. (2014c)
Triphysa phryne
15,143
KF906487.1
Zhang et al. (2014a)
Yoma sabina
15,330
KF590535.1
Wu et al. (2014)
Ypthima akragas
15,227
KF590553.1
Wu et al. (2014)
Cossidae
Eogystia hippophaecolus
15,431
KC831443.1
Gong et al. (2014)
Bombycidae
Bombyx mori
15,643
KM279431.1
Zhang et al. (2014b)
Hepialidae
Thitarodes renzhiensis
16,173
NC_018094.1
Cao et al. (2012)
The insect mitogenome is a circular DNA of 14–19 kilobases (kb), containing 13 protein-coding genes (PCGs), 2 ribosomal-RNA-coding genes (rRNAs), 22 transfer-RNA-coding genes (tRNAs), and an A + T rich displacement loop (D-loop) control region (Cameron, 2014). Because of their maternal inheritance, compact structure, lack of genetic recombination, and relatively fast evolutionary rate, mitogenomes have been used widely in molecular phylogenetics and evolution studies (Cameron, 2014, Moritz et al., 1987). Here, we reconstruct and annotate the complete mitogenome of Papilio glaucus from next generation sequencing reads, and perform phylogenetic analyses of P. glaucus mitogenome with available complete mitogenomes of butterflies.
Materials and methods
Library preparation and Illumina sequencing
A male P. glaucus was caught and freshly frozen from Lake Ray Roberts State Park, Greenbelt Corridor along the Elm Fork of the Trinity River, 33.2536, − 97.0434, Denton County, Texas, USA (date 4-VIII-2013). The specimen will be deposited in the National Museum of Natural History, Smithsonian Institution, Washington, DC, USA (USNM). Detailed procedures and protocols for library preparation were described in Cong et al. (2015). Briefly, we extracted genomic DNA from a piece of muscle dissected from the butterfly thorax using the ChargeSwitch gDNA mini tissue kit (Life Technologies, Grand Island, NY, USA). 250 bp and 500 bp paired-end libraries were made following the Illumina TruSeq DNA sample preparation guide using enzymes from NEBNext Modules (New England Biolabs, Ipswich, MA, USA). These libraries were sequenced at the genomics core facility in UT Southwestern Medical Center for 150 bp from both ends with a rapid run on Illumina HiSeq1500.
Sequence assembly
Sequencing reads were processed sequentially by MIRABAIT (Chevreux et al., 1999) to remove contamination from sequence adapters, by Fastq_quality_trimmer (http://hannonlab.cshl.ed/fastx_toolkit/) to trim low-quality regions at both ends and by QUAKE to correct errors (Cong et al., 2015, Kelley et al., 2010). From either the 250 bp or 500 bp library, we used mitochondrial baiting and iterative mapping (MITObim) v1.6 (Hahn et al., 2013) to assemble the mitogenome using two approaches: (1) using mitogenomes of Papilio maackii (KC433408.1), Papilio polytes (KM014701.1) and Papilio maraho (FJ810212.1) as references to guide the assembly; (2) using a short COI barcode sequence (a segment of about 600 bp from the mitochondrial gene cytochrome oxidase I) of P. glaucus (GU090087.1) as the starting seed. We used the default parameters for MITObim except for setting the — kbait to be 35 instead of 31.The genome assemblies directly produced by the reference-guided mode in MITObim did not directly consider that the mitogenome should be a circular DNA and that the reads mapped to the N-terminus of the reference sequence could overlap with reads mapped to the C-terminus of the reference mitogenome. Therefore, it produced sequences whose C-terminal segment (usually about several hundred base pairs) is a duplication of the N-terminal segment. We detected such duplicated regions by aligning the N-terminal half and C-terminal half with BLASTN and manually removed the redundant segments. We also adjusted the linear representation of the circular DNA by circular permutation so that the sequence starts with the coding gene for ND2, which is the convention for most Lepidoptera sequences deposited in the database.We aligned the mitogenome sequences produced by different methods with MAFFT (Katoh and Standley, 2013) (Supplementary data). These assemblies mostly agree with each other, with most discrepancies located in the D-loop region. We derived our final mitogenome sequence from the alignment of these different assemblies by taking the dominant nucleotide or gap at each position.
Annotation and analysis of the mitochondrial genome
The mitogenome sequence was annotated using the MITOS web server (Bernt et al., 2013). We translated the sequences of PCGs to protein sequences using the genetic code for invertebrate mitogenomes. Secondary structures of tRNA genes were predicted using the same server.
Assembly quality assessment
We first checked if the assembly was well-supported by the sequencing reads by mapping the reads to the mitogenome using bowtie2 v2.2.3 (Langmead and Salzberg, 2012). The alignments were combined into one single SAM-format file, processed with SAMtools (Li et al., 2009) and visualized in Integrative Genomics Viewer (IGV) (Robinson et al., 2011). Number of mapped reads (coverage by reads) at each position was calculated using bedtools v2.20.1 (Quinlan and Hall, 2010) and the histogram of the coverage was prepared in IBM SPSS Statistics v21 and Microsoft Excel 2010.Second, we assessed the quality of our assembly by its consistency with other published Papilio sequences in the protein-, rRNA- and tRNA-coding regions. We aligned the rRNA- and tRNA-coding sequences directly and aligned translated sequences for PCGs to the corresponding proteins of other published Papilio mitogenomes. Alignments confirmed that our sequences are consistent with the majority of these published mitogenomes, and gaps are only in regions that are poorly conserved among other Papilio species.Finally, we confirmed the assembly by comparing with Sanger sequencing results. We compared the mitogenome sequence with a reported 2291 bp partial mitogenome sequence of P. glaucus (accession: AF044013) (Caterino and Sperling, 1999) using BLAST. In addition, we sequenced the D-loop region and one arbitrarily selected coding region that partly covers ND4 and ND5. We amplified the two fragments from genomic DNA using primers shown in Table 2 with AmpliTaq Gold® 360 Master Mix (Life Technologies, Grand Island, NY, USA), following the manufacturer's protocol. Amplified products were separated by 2% E-Gel® EX Agarose Gels (Life Technologies, Grand Island, NY, USA) and purified by Zymoclean™ Gel DNA Recovery Kit (Zymo Research, Irvine, CA, USA). The purified DNA fragments were sent for Sanger sequencing in the sequencing core facility at UT Southwestern Medical Center. Sequencing results were manually confirmed by visualizing the traces in FinchTV v1.4 and then compared with the assembled mitogenome.
Table 2
Primers used for amplification of the fragments containing D-loop and ND4 end region.
Fragment location on genome
Primer
Primer sequence (5′ to 3′)
14,387–15,306, 1–104
D-loopF
GCAACTGCTGGCACAAAAT
D-loopR
CCAATTCAACATCCCAATCA
7428–8136
ND4F
CTAATCCTAACCCATCCCAACC
ND4R
TAGCTGCTCCTCCTTCTATGA
Phylogenetic analysis
104 complete, non-redundant butterfly mitogenomes that are currently available were downloaded from GenBank (Table 1). Moth mitogenomes from Thitarodes renzhiensis (NC_018094.1), Bombyx mori (KM279431.1), and Eogystia hippophaecolus (KC831443.1) were also downloaded and used as outgroups. DNA sequences of the 37 protein- and RNA-coding genes were aligned by MAFFT. We manually checked the alignments of each gene, corrected sequences from some species with annotation errors based on consensus, replaced mitogenomes of poor quality (for example, sequences with frame-shift mutation that causes premature ending of proteins) with alternative mitogenome from the same species, and removed positions with large gaps and their surrounding regions with uncertain alignment.The processed alignments were analyzed for phylogeny with Bayesian Inference and Maximum likelihood methods using MrBayes v3.2 (Huelsenbeck and Ronquist, 2001) and RaxML v8.1 (Stamatakis, 2006). We built trees with a different partitioning of the data sets (unpartitioned; partitioned by genes, PCG codon positions, and the exclusion of 3rd codon positions of PCGs). In addition, the translated protein sequences were aligned with MAFFT and analyzed with MrBayes. For analyses based on DNA alignments, the most suitable nucleotide substitution model (GTR + I + G) selected by jModelTest v2.1.7 (Darriba et al., 2012) was used, and for the protein-based analyses, we used a mixed model (Poisson, Jones, Dayhoff, Mtrev, Mtmam, Wag, Rtrev, Cprev, Vt, and Blosum) provided by the MrBayes program. The resulting phylogenetic trees were visualized in FigTree v1.4.2.
Results
Coverage of the mitogenome assembly by the reads
The coverage of the assembled P. glaucus mitogenome was high by the sequencing reads, with about 6000 fold mean coverage at base pair level and 500 fold minimal coverage (Fig. 1A and B). As shown in Fig. 1A, the regions with the lowest coverage were the beginning, the end and the D-loop regions. The low coverage at the beginning and the end of the mitogenome was primarily an artifact from limiting each read to map to only one most likely position: the circular DNA was represented as a linear sequence, and reads that should map partly to the beginning and partly to the end were restricted to map to either the beginning or the end.
Fig. 1
Coverage of Papilio glaucus mitogenome by sequencing reads (250 bp and 500 bp) mapped to them by Bowtie2. (A). Coverage at each base position. (B). Histogram of coverage distribution. (C). Negative correlation between the coverage and A + T contents of 50 bp windows in the genome.
The lower coverage in the D-loop might indicate potential errors in that region. However, independent Sanger sequencing of both the D-loop region (from the end of rrnS to the beginning of ND2) and another arbitrarily selected region (from the end of ND4 to the beginning of ND5), completely matched our assembly, indicating its high quality. In addition, the mitogenome sequence also agreed with the partial mitogenome (2291 bp) sequence of P. glaucus (accession: AF044013) in the database. Our sequence showed 0.3% sequence divergence from the previous sequence (only 6 out of 2291 positions are different), which likely corresponded to sequence variations between different individuals of the same species.Instead, the low coverage of the D-loop region is probably related to its AT-rich composition, which tends to break during library preparation and is thus underrepresented in the sequencing libraries (Benjamini and Speed, 2012). Indeed, we observed that the percentage of A and T in a 50 bp window from the mitogenome is negatively correlated with the coverage for that region by the reads (Fig. 1C).
Base composition and genome structure
The P. glaucus mitogenome is a closed circular DNA of 15,306 bp. The nucleotide composition of the majority strand is A = 6117 (39.96%), T = 6193 (40.46%), G = 1167 (7.62%), and C = 1829 (11.95%), which is highly biased towards A and T. The majority strand has a negative AT-skew (− 0.0062) and GC-skew (− 0.2210) (Table 3), indicating a higher occurrence of T over A, C over G nucleotides on this strand. The P. glaucus mitogenome retains the typical insect mitogenome gene set, including 13 PCGs (ND1-6, COX1-3, ND4L, ATP8, ATP6, and CYTB), 22 tRNA genes (two for serine and leucine and one for each of the remaining amino acids), 2 ribosomal RNAs (rrnL and rrnS), and an A + T rich D-loop control region (Fig. 2 and Table 4).
Table 3
Composition and skewness of Papilio glaucus mitogenome regions.
Map of genes in the Papilio glaucus mitochondrial genome. PCGs are colored in yellow, tRNA-coding genes are in purple, rrnL and rrnS are in green. Each gene is shown as an arrow indicating the transcription direction. The black line in the middle shows the coordinate of each gene in the mitogenome. The arrows on top of the line correspond to genes coded on the majority strand, and those below show genes on the minority strand.
Table 4
Summary of the Papilio glaucus mitogenome.
Gene
Direction
From
To
Size
Intergenic nucleotides
Anticodon
Start codon
Stop codon
ND2
F
1
1014
1014
− 2
–
ATT
TAA
trnW
F
1013
1077
65
− 8
TCA
–
–
trnC
R
1070
1131
62
3
GCA
–
–
trnY
R
1135
1199
65
2
GTA
–
–
COX1
F
1202
2732
1531
0
–
CGA
T
trnL2
F
2733
2800
68
0
TAA
–
–
COX2
F
2801
3482
682
0
–
ATG
T
trnK
F
3483
3552
70
0
CTT
–
–
trnD
F
3553
3619
67
0
GTC
–
–
ATP8
F
3620
3787
168
− 7
–
ATT
TAA
ATP6
F
3781
4458
678
7
–
ATG
TAA
COX3
F
4466
5254
789
2
–
ATG
TAA
trnG
F
5257
5322
66
− 3
TCC
–
–
ND3
F
5320
5676
357
− 2
–
ATA
TAG
trnA
F
5675
5738
64
− 1
TGC
–
–
trnR
F
5738
5801
64
− 1
TCG
–
–
trnN
F
5801
5866
66
0
GTT
–
–
trnS1
F
5867
5927
61
1
ACT
–
–
trnE
F
5929
5995
67
− 2
TTC
–
–
trnF
R
5994
6060
67
0
GAA
–
–
ND5
R
6061
7794
1734
0
–
ATA
TAA
trnH
R
7795
7860
66
0
GTG
–
–
ND4
R
7861
9202
1342
− 4
–
ATA
T
ND4L
R
9199
9489
291
6
–
ATG
TAA
trnT
F
9496
9559
64
0
TGT
–
–
trnP
R
9560
9623
64
2
TGG
–
–
ND6
F
9626
10,159
534
4
–
ATT
TAA
CYTB
F
10,164
11,318
1155
2
–
ATA
TAA
trnS2
F
11,321
11,385
65
16
TGA
–
–
ND1
R
11,402
12,340
939
1
–
ATG
TAG
trnL1
R
12,342
12,409
68
0
TAG
–
–
rrnL
R
12,410
13,728
1319
0
–
–
–
trnV
R
13,729
13,791
63
0
TAC
–
–
rrnS
R
13,792
14,572
781
0
–
–
–
A + T rich region
14,573
15,060
488
0
–
–
–
trnM
F
15,061
15,129
69
0
CAT
–
–
trnI
F
15,130
15,193
64
− 3
GAT
–
–
trnQ
R
15,191
15,258
68
48
TTG
–
–
Annotation of the mitogenome
The annotation of the mitogenome is illustrated in Fig. 2 and summarized in Table 4. Nine protein-coding genes (ND2, COX1, COX2, ATP8, ATP6, COX3, ND3, ND6, and CYTB) are coded on the majority strand. COX1 uses start codon CGA, which is consistent with many other insect mitogenomes (Kim et al., 2009). A recent study using an expressed sequence tag from a Lepidopteran species confirmed the presence of COXI transcripts starting from CGA (Margam et al., 2011). Each of the rest of the genes starts with the typical ATN. COX1, COX2 and ND4 use an incomplete stop codon T (Ojala et al., 1981), and a complete TAA codon will likely be formed during mRNA maturation (Boore, 1999, Ojala et al., 1981). The 13 PCGs have a total length of 11,214 bp (Table 4).14 out of the 22 tRNA-coding genes are encoded on the majority strand. The tRNAs have a total length of 1443 bp, and their individual lengths range from 61 bp to 70 bp (Table 4). Secondary structures predicted by MITOS suggest that all tRNA genes adopt a typical cloverleaf structure except for trnS1 (Fig. 3). The dihydrouridine (DHU) arm of trnS1 does not form a stable stem-loop structure, which is very common in butterfly mitogenomes (Kim et al., 2014, Lu et al., 2013). The two rRNA genes, rrnL and rrnS, are located on the minority strand, and their lengths are 1319 bp and 781 bp, respectively (Table 4).
Fig. 3
Secondary structure of 22 tRNA-coding genes of Papilio glaucus mitogenome predicted by the MITOS web server. The tRNAs are labeled by their corresponding amino acids in abbreviations.
A 488 bp A + T rich region (A + T content: 94.7%) connects rrnS and trnM. This region contains an “ATAGA” motif located 19 bp downstream from rrnS and followed by 14 bp of poly-T stretch that is consistent with a gene regulation element commonly found in Lepidoptera (Lu et al., 2013, Salvato et al., 2008). In addition to this A + T rich region, there are 94 bp non-coding nucleotides that make up 12 intergenic spacer sequences, ranging from 1 bp to 48 bp in length. The longest 48 bp spacer is located between trnQ and ND2, and a 16 bp spacer is located between trnS2 and ND1 (Table 4). In addition, there are 33 bp of overlapping sequences at 10 locations. The longest 8 bp overlap is between trnW and trnC. There is a 7 bp “ATGATAA” overlap between ATP8 and ATP6 (Table 4), and this is a common feature for Lepidopteran mitogenomes (Lu et al., 2013).
Phylogenetic relationships
We phylogenetically analyzed 105 butterfly species from 6 families: Papilionidae, Hesperiidae, Pieridae, Lycaenidae, Riodinidae, and Nymphalidae, and used 3 species of moths as outgroups. Maximum Likelihood analysis of the DNA alignments (Fig. 4A) and Bayesian Inference of DNA (Fig. 4B and Supplemental S2) and protein sequences (Fig. 4C) correctly partitioned butterflies into 6 families and suggested the same tree topology at the family level: (Papilionidae + (Hesperiidae + (Pieridae + ((Riodinidae + Lycaenidae) + Nymphalidae)))). Tree topologies between different methods were very similar. The positions of several species vary between trees obtained with different methods, such as Carterocephalus silvicola, Hebomoia glaucippe, and Ypthima akragas. However, all the trees consistently place P. glaucus, the only available mitogenome from the subgenus Pterourus, as a sister of P. maraho, the only available mitogenome from the subgenus Agehana (Fig. 4 and Supplemental S2).
Fig. 4
Phylogeny of butterflies. (A). Phylogenetic tree obtained by RaxML with the data set of PCG and RNA genes, partitioned into 13 PCGs, 1 tRNA, and 2 rRNA groups. (B). Phylogenetic tree obtained by MrBayes based on PCG and RNA genes, partitioned into 13 PCGs, 1 tRNA, and 2 rRNA groups. (C). Phylogenetic tree obtained by MrBayes based on 13 protein sequences, unpartitioned. Number at each node shows confidence of that group by bootstrap in (A), or posterior probability in (B) and (C). Thitarodes renzhiensis (NC_018094.1), Bombyx mori (KM279431.1), and Eogystia hippophaecolus (KC831443.1) were used as outgroup.
Discussion
The traditional method to obtain the mitogenome is through Sanger sequencing of a couple of overlapping segments. Here, we describe our protocol of assembling mitogenomes from Next Generation Sequencing reads for whole genome sequencing, and report the mitogenome of Eastern tiger swallowtail, P. glaucus. We used MITObim, a published tool designed for this task. MITObim has a reference-guided mode: it finds the conserved regions of a mitogenome using related reference species and extends these regions by baiting reads that overlap with the assembled regions until no gaps are left in between. Another mode of MITObim works without a reference mitogenome: it starts with a short COI sequence and extends by baiting reads with overlaps, till the N- and C-termini of the sequence can be mapped to the same reads, indicating that a circular mitogenome has been assembled (Hahn et al., 2013). Compared with the traditional PCR method, this method of mitogenome assembly does not require multiple primer designing and optimization, especially for species with limited knowledge available for primers' design.However, MITObim could make mistakes due to (1) ambiguity in mapping and aligning the reads, (2) presence of sequencing error, and (3) difficulty in distinguishing mitochondrial DNA reads from nuclear DNA reads, especially those from nuclear copies of mitochondrial DNA or low-complexity regions. Therefore, taking the consensus of different MITObim runs with different modes and references, and careful validation of the result are needed. We produced a highly accurate assembly by integrating assemblies made by different MITObim modes (using other mitogenomes as references or the COI barcode sequence as seed). Even for the D-loop region that contains multiple short repeats, sequences obtained from Sanger sequencing showed no differences from our assembly.Our phylogenetic analysis yielded a detailed tree of butterflies with available complete mitogenomes. Traditionally, Hesperiidae were considered to be at the root of the phylogenetic tree for all butterflies due to their similarity in morphology to moths (Kristensen and Skalski, 1998). However, several recent studies with molecular evidence have suggested a different evolutionary history of butterflies: (Papilionidae + (Hesperiidae + (Pieridae + (Nymphalidae + (Lycaenidae + Riodinidae))))) (Heikkila et al., 2012, Kim et al., 2014, Regier et al., 2013). Our phylogenetic analysis contained many more taxa compared with previous analyses and supported the same result in placing Papilionidae at the base. The reason for this apparent contradiction between the traditional morphology and recent molecular-based phylogeny is still poorly understood and requires future analysis.It is notable that P. glaucus, a swallowtail native of eastern North America, is found to be confidently grouped with P. maraho based on our phylogeny analysis. P. maraho is a threatened swallowtail endemic to Taiwan in Asia (Baillie and Groombridge, 1996). Morphology, behavior and phylogenetic studies based on COI barcode suggested that P. maraho is very close to Papilio elwesi (Igarashi, 1979, Lu et al., 2009). Both P. elwesi and P. maraho are frequently attributed to the subgenus Agehana, which in some studies is included in subgenus Chilasa (Hancock, 1983, Zakharov et al., 2004). Despite a wide geographic separation, Agehana and Chilasa native to Asia are likely to be the closest relatives of the subgenus Pterourus (Zakharov et al., 2004), which is native to America. We speculate that these butterflies might have migrated between Asia (Agehana) and North America (Pterourus). Swallowtail butterflies have dispersed between continents (Condamine et al., 2012, Condamine et al., 2013). A recent report had found that Polyommatus blue butterflies traveled from Asia to North America via the Bering Strait, ultimately migrating to South America (Vila et al., 2011). Several other reports found that other butterfly species, animals, and plants followed the same route to the New World (Donoghue and Smith, 2004, Enghoff, 1995, Mullen, 2006, Peña et al., 2010).Although our phylogenetic analyses produced similar phylogenetic trees, despite being performed with different methods, on different subsets of data (without or without RNA-coding sequences, DNA and translated proteins), and with different partition schemes, minor variations in positions of some taxa were observed. These inconsistencies might be caused by the available mitogenomes not carrying sufficient information to deduce accurate phylogeny, available software not being fully adequate for the task, or our position selection strategy requiring improvement. The current sample of taxa did not provide a fully resolved, consistent and accurate phylogeny. Additional analyses, mitogenomes, and possibly information from nuclear genomes are needed to address the question of the applicability of mitogenomes to infer a more accurate phylogenetic tree of butterflies.The following are the supplementary data related to this article.