Literature DB >> 30949689

De novo genome assembly of the white-spotted flower chafer (Protaetia brevitarsis).

Kui Wang1, Pengpeng Li2, Yongyang Gao2, Chunqin Liu3, Qinglei Wang3, Jiao Yin1, Jie Zhang1, Lili Geng1, Changlong Shu1.   

Abstract

BACKGROUND: Protaetia brevitarsis, commonly known as the white-spotted flower chafer, is an important Scarabaeidae insect that is distributed in most Asian countries. Recently, research on the insect's harmfulness to crops, usefulness in agricultural waste utilization, edibility, medicinal value, and usability in insect immunology has provided sufficient impetus to demonstrate the need for a detailed study of its biology. Herein, we sequenced the whole genome of this species to improve our understanding and study of P. brevitarsis.
FINDINGS: We developed a highly reliable genome resource for P. brevitarsis (Lewis, 1879; Coleoptera: Cetoniinae) using Illumina and PacBio sequencing platforms. A total of 135.75 gigabases (Gb) was generated, providing 150-fold coverage based on the 810-megabases (Mb) estimated genome size. The assembled P. brevitarsis genome was 751 Mb (including the scaffolds longer than 2 kilobases (kb)) with 327 scaffolds, and the N50 length of the assembly was 2.94 Mb. A total of 34,110 (22,229 in scaffolds and 11,881 located in alleles) genes were identified using Evidence Modeler, which was based on the gene prediction results obtained from 3 different methods (ab initio, RNA sequencing based, and known gene based).
CONCLUSIONS: We assembled a high-quality P. brevitarsis genome, which will not only provide insight into the biology of the species but also provide a wealth of information that will inform researchers on the evolution, control, and utilization of P. brevitarsis.
© The Author(s) 2019. Published by Oxford University Press.

Entities:  

Keywords:  zzm321990 Protaetia brevitarsiszzm321990 ; assembly; genome; white-spotted flower chafer

Mesh:

Year:  2019        PMID: 30949689      PMCID: PMC6449472          DOI: 10.1093/gigascience/giz019

Source DB:  PubMed          Journal:  Gigascience        ISSN: 2047-217X            Impact factor:   6.524


Data Description

Context

Protaetia brevitarsis (Protaetia brevitarsis, NCBI:txid348688), commonly known as the white-spotted flower chafer (Fig. 1), is an important Scarabaeidae insect that is distributed throughout China and surrounding countries (Mongolia, Russia, Japan, South Korea, and North Korea) [1]. P. brevitarsis adults feed on multiple plant parts, while larvae live in the topsoil and feed on soil humus, decaying plant residues, and even animal dung. P. brevitarsis adults represent one of the most destructive pests in agriculture, and these insects cause direct damage to ≥29 important plant species [2]. In contrast, P. brevitarsis larvae are considered resource insects, and researchers in China investigated the use of the larvae to convert crop straw and other agricultural wastes to organic fertilizer [3]. Furthermore, research examined the potential of the insects to mitigate pollution caused by the improper treatment of crop straw and to produce insect protein fodder. In South Korea, P. brevitarsis was recently registered as a temporal standard food ingredient by the Ministry of Food and Drug Safety, and the insects were mass reared for commercial purposes [4,5]. Larval stage insects have been used in traditional medicine to treat inflammatory disease, breast cancer, hepatic cancer, liver cirrhosis, and hepatitis. Furthermore, researchers have identified and characterized compounds that were associated with activity against microbial pathogens [6] and cancer cells [7, 8], as well as those that inhibited platelet aggregation or thrombosis [9]. Furthermore, P. brevitarsis larvae are also considered a good model for insect immune system studies [10-12]. P. brevitarsis have well-developed cellular and humoral defence systems, and P. brevitarsis last instar larvae can produce approximately 0.5 mL of haemolymph, which is sufficient for most immunological experiments.
Figure 1

Image of adult of the white-spotted flower chafer, P. brevitarsis.

Image of adult of the white-spotted flower chafer, P. brevitarsis. These significant properties provided enough impetus for a detailed study of P. brevitarsis biology. However, the genetic basis and the evolutionary characteristics of P. brevitarsis remain unclear, and little information about this insect is available in public databases. In this study, we provide the first report of the draft P. brevitarsis genome assembly with high sequencing depth coverage that was generated using the Illumina and PacBio genome-sequencing platforms. These data will provide valuable information for further studies, as well as the control or utilization of this insect.

Samples and sequencing

A single P. brevitarsis pupa was selected from the laboratory population for genome sequencing. The laboratory population was derived from a field population collected in Gongzhuling, Jilin province, China. The genomic DNA of the pupa was extracted using a Qiagen Blood and Tissue Kit (Qiagen, Valencia, CA, USA) according to the manufacturer's instructions. A 20-kb SMRTbell library was generated using a BluePippin DNA Size Selection instrument (Sage Science, Beverly, MA, USA), and the prepared library was sequenced using P6/C4 chemistry according to the manufacturer's protocols (Pacific Biosciences, Menlo Park, CA, USA). The single-molecule real-time (SMRT) sequencing of long reads was conducted on a PacBio RS II System, and we obtained 27.98 Gb PacBio data (Table 1).
Table 1.

Summary statistics of generated sequence data

Library nameExperiment titleSequencing instrumentTotal bases (bp)Accession No.
Raw_200_DNA_HiseqDNA pair end (PE) libraryIllumina HiSeq 250048,637,157,380-
Raw_420_DNA_HiseqDNA PE libraryIllumina HiSeq 250059,133,181,272-
Filtered_200_DNA_HiseqDNA PE libraryIllumina HiSeq 250046,322,512,285SRR7421508
Filtered_420_DNA_HiseqDNA PE libraryIllumina HiSeq 250040,349,624,172SRR7421507
DNA_PacBio1DNA PacBio libraryPacBio RS II1,248,598,019SRR7429397
DNA_PacBio2DNA PacBio libraryPacBio RS II1,742,919,487SRR7429396
DNA_PacBio3DNA PacBio libraryPacBio RS II1,471,376,296SRR7429395
DNA_PacBio4DNA PacBio libraryPacBio RS II1,446,032,590SRR7429394
DNA_PacBio5DNA PacBio libraryPacBio RS II1,410,533,432SRR7429401
DNA_PacBio6DNA PacBio libraryPacBio RS II1,303,543,797SRR7429400
DNA_PacBio7DNA PacBio libraryPacBio RS II1,185,731,970SRR7429399
DNA_PacBio8DNA PacBio libraryPacBio RS II1,360,241,545SRR7429398
DNA_PacBio9DNA PacBio libraryPacBio RS II1,033,036,210SRR7429403
DNA_PacBio10DNA PacBio libraryPacBio RS II981,818,132SRR7429402
DNA_PacBio11DNA PacBio libraryPacBio RS II1,192,589,806SRR7429389
DNA_PacBio12DNA PacBio libraryPacBio RS II707,437,407SRR7429388
DNA_PacBio13DNA PacBio libraryPacBio RS II659,418,664SRR7429391
DNA_PacBio14DNA PacBio libraryPacBio RS II618,638,129SRR7429390
DNA_PacBio15DNA PacBio libraryPacBio RS II630,384,409SRR7429393
DNA_PacBio16DNA PacBio libraryPacBio RS II761,167,622SRR7429392
DNA_PacBio17DNA PacBio libraryPacBio RS II2,180,394,708SRR7470031
DNA_PacBio18DNA PacBio libraryPacBio RS II2,035,388,872SRR7470028
DNA_PacBio19DNA PacBio libraryPacBio RS II1,796,143,706SRR7470027
DNA_PacBio20DNA PacBio libraryPacBio RS II1,980,034,243SRR7470030
DNA_PacBio21DNA PacBio libraryPacBio RS II2,229,575,050SRR7470029
EggRNA-Seq libraryIllumina HiSeq 25006,049,557,600SRR7418793
LarvaRNA-Seq libraryIllumina HiSeq 25006,112,599,900SRR7418797
PrepupalRNA-Seq libraryIllumina HiSeq 25006,168,021,600SRR7418791
Middle pupalRNA-Seq libraryIllumina HiSeq 25006,015,743,700SRR7418789
Late pupalRNA-Seq libraryIllumina HiSeq 25006.260,516,400SRR7418796
Male adultRNA-Seq libraryIllumina HiSeq 25006.054,195,300SRR7418798
Female adultRNA-Seq libraryIllumina HiSeq 25006.188,099,400SRR7418790
Forewing (D1)RNA-Seq libraryIllumina HiSeq 25006.234,580,800SRR7585362
Forewing (D3)RNA-Seq libraryIllumina HiSeq 25006.208,411,800SRR7418792
Underwing (D1)RNA-Seq libraryIllumina HiSeq 25006.154,223,400SRR7418801
Underwing (D3)RNA-Seq libraryIllumina HiSeq 25006.172,792,500SRR7418794
Head (D1)RNA-Seq libraryIllumina HiSeq 25006.090,345,900SRR7418799
Head (D3)RNA-Seq libraryIllumina HiSeq 25006.247,745,100SRR7418800

Note: D1 or D3: tissues of newly (1-day) or 3-day emerged adults.

Summary statistics of generated sequence data Note: D1 or D3: tissues of newly (1-day) or 3-day emerged adults. Furthermore, 2 paired-end libraries with insert sizes of 200 and 420 bp, respectively, were constructed using the TruSeq DNA PCR-Free Library Prep Kit, and sequencing was performed on an Illumina HiSeq 2500 sequencer (Illumina, San Diego, CA, USA), producing 107.77 Gb of raw data (Table 1). The following reads were then removed: (1) reads with Ns, >20% low-quality bases (quality criterion: Q20), or >10 bp that overlapped with adapter sequences (allowing ≤3 bp mismatches) and (2) duplicated reads generated by polymerase chain reaction amplification during library construction. Therefore, a total of 86.67 Gb of clean data were obtained (Table 1). For transcriptome sequencing, total RNA from P. brevitarsis whole eggs, larvae, 3 different pupal stages, male adults, female adults, and tissues (forewing, underwing, and head) of newly (1-day) and 3-day emerged adults were collected and prepared using TRIzol reagent (Invitrogen, CA, USA). RNA quality was confirmed by gel electrophoresis, and the quantity was determined using a Nanodrop spectrophotometer. Sequencing libraries were generated using an Illumina TruSeq Stranded mRNA Library Prep Kit (Illumina, CA, USA), and sequencing was also performed on an Illumina HiSeq 2500 sequencer. In total, 79.96 Gb of data (Table 1), comprising 533.05 million reads (Table 2), were generated.
Table 2.

Summary statistics of RNA-Seq reads mapped onto the assemblies

 SampleNo. of reads Reads mapped to scaffolds (No. [%])Reads mapped to ASs (No. [%])
Egg40,330,38435,659,462 (88.42)14,961,772 (37.10)
Larva40,750,66633,467,876 (82.13)15,266,678 (37.46)
Prepupal stage41,120,14434,780,542 (84.58)15,172,926 (36.90)
Middle pupal stage40,104,95836,742,877 (91.62)15,307,418 (38.17)
Late pupal stage41,736,77637,468,206 (89.77)17,178,294 (41.16)
Male adult40,361,30236,198,806 (89.69)14,518,842 (35.97)
Female adult41,253,99632,620,778 (79.07)16,135,954 (39.11)
Forewing (D1)41,563,87236,449,354 (87.69)9233,440 (22.22)
Forewing (D3)41,389,41235,727,909 (86.32)13,516,032 (32.66)
Underwing (D1)41,028,15636,669,771 (89.38)14,943,970 (36.42)
Underwing (D3)41,151,95037,484,048 (91.09)16,851,062 (40.95)
Head (D1)40,602,30632,278,844 (79.50)11,935,160 (29.40)
Head (D3)41,651,63435,214,660 (84.55)11,779,198 (28.28)

Note: D1 or D3: tissues of newly (1-day) or 3-day emerged adults.

Summary statistics of RNA-Seq reads mapped onto the assemblies Note: D1 or D3: tissues of newly (1-day) or 3-day emerged adults.

Genome size and heterozygosity estimation

The k-mer analysis approach was used to estimate the genome size and heterozygosity. Quality-filtered 420 bp–insert size clean reads (Illumina) were used to perform the k-mer (k = 17) analysis. A total of 60,101,962,676 k-mers were counted from these clean reads. The count distribution of 17-mers with the highest peak occurred at a depth of 63 (Fig. 2), the estimated genome size was ∼810 Mb, and the heterozygosity was 2.35% (Table S1).
Figure 2

The 17-mer distribution of the P. brevitarsis genome using the jellyfish [13] program with 420-bp paired-end whole-genome sequencing data.

The 17-mer distribution of the P. brevitarsis genome using the jellyfish [13] program with 420-bp paired-end whole-genome sequencing data.

Genome assembly

The k-mer analysis indicated that the P. brevitarsis genome exhibited high heterozygosity, and a hierarchical assembly stratagem was used for genome assembly. Allele sequences (ASs) that differentiated from different sister chromatids could potentially generate bubbles and junctions in the string graph, which would hinder the genome assembler's generation of longer contigs. To achieve a high-quality assembly, we used PacBio long reads during the assembly process, and we detected and separated ASs during the assembly process in the hierarchical stratagem. Before assembly, all PacBio reads were quality filtered using SMRT Portal, and polymerase reads with read score <0.80 and subread lengths shorter than 500 bp were removed. After data filtering, 14.25 Gb of PacBio subreads were left (Table 3). The N50 value and mean size of filtered PacBio subreads were 16.06 and 10.53 kb, respectively, and the mean read score of filtered PacBio subreads was 0.837.
Table 3.

Summary statistics of data during the assembly process

No.Total bases (bp)N50Mean length (bp)
Filtered PacBio reads1,353,92614,251,368,54616,05910,525
Elementary contigs8,7601,127,134,570190,967128,668
HGCs3,816738,878,186347,620193,626
ASs4,939391,445,91991,68779,256
Scaffolds313751,076,2572,939,5222,399,604
Corrected HGCs3,821739,117,100327,214193,435
Corrected ASs4,939393,190,60992,10579,609
Corrected scaffolds313751,076,2572,939,5222,399,604
Final scaffolds327750,736,5012,939,5212,295,830
Summary statistics of data during the assembly process The mitochondrial genome was assembled first. The mitochondrial genome reads were picked out by alignment to the published reference P. brevitarsis mitochondrial genome (Genebank: NC_02 3453.1) using Blasr (BLASR, RRID:SCR_000764) (Table S2). Then, the selected reads were assembled using Canu (Canu, RRID:SCR_015880) (Table S2). When comparing the new assembled mitochondrial genome with the previous one (Genebank: NC_02 3453.1), there were 116 single-nucleotide variations and 12 insertions or deletions. Then, we used Marvel (Table S2) [14] to construct string graphs of filtered PacBio reads, and we assembled them into unitigs. In this step, both unitigs and singletons were collected as elementary contigs, and the total size of the elementary contigs was 1,127,134,570 bp (N50 = 190,967 bp; Table 3). We then selected ASs and employed a whole-genome alignment strategy to recognize alternative heterozygous ASs after masking all repeat sequences in the elementary contigs. As shown in Fig. 3, MUMmer (Table S2) [15] was used conduct whole-genome self-alignments. Small individual matches were clustered using the longest increasing subset algorithm and were then merged into larger matches. These matches were used to calculate the coverage of overlapping lengths of each pair of elementary contigs. The short one was defined as the AS if 85% no-repeat sequence of the total length was aligned to the long elementary contigs or if 85% of the reads were the same as longer elementary contigs, while the longer one was kept in elementary contigs. Each AS was confirmed via dot plot examination, and sequences were used to restore the AS to elementary contigs if the alignment quality was poor. After this step, elementary contigs were separated into 2 parts, haploid genome contigs (HGCs) and the ASs. Finally, 3,816 HGCs were retained (N50 = 347,620 bp; total length = 738,878,186 bp), and 4,939 ASs were retained (N50 = 91,687 bp; total length = 391,445,919 bp) (Table 3). HGCs were joined and elementary scaffolds were produced using SSPACE (SSPACE, RRID:SCR_005056) (Table S2) [16] and all PacBio RSII subread information. With the above procedure, we obtained a haploid genome assembly with a size of 751.08 MB, 313 raw scaffolds, and an N50 scaffold size of 2.94 Mb (Table 3). In the last step, we used Pilon (Pilon, RRID:SCR_014731) (Table S2) [17] to correct single-base differences, small insertions or deletions, block substitution events, and gaps in HGCs, ASs, and elementary scaffolds. All Illumina genome sequence data were aligned using BWA (BWA, RRID:SCR_010910) (Table S2) [18], and the corresponding alignments were provided as input to Pilon to conduct consensus polishing. Finally, the total size of the corrected HCGs and ASs was 739.12 Mb (including 3,821 contigs) and 393.19 Mb (including 4,939 sequences), respectively. The total size of the corrected scaffolds was 751.08 Mb (including 313 scaffolds), and the N50 was 2.94 Mb (Table 3). Then, we ran the assembly sequences through Contamination Screen and removed the contaminated sequences, trimming any Ns at the ends of the sequence. The total size of the final scaffolds was 750.74 Mb (including 327 scaffolds), and the N50 was 2.94 Mb (Table 3).
Figure 3

Schematic illustration of the method used to detect aligned sequences in the assembly. MUMmer was used to perform self-alignment on the elementary contigs; paired contigs were categorized into 4 types of outcome. In Case I, the contig aligns to itself, which will be ignored. In Case II, Contig 2 represents a contig with no obvious alignment with other contigs, and the contig type is defined as haploid genome contig. In Cases III and IV, the contig under analysis can align with another contig; in the figure, Contigs 4 and 6 are defined as haploid genome contigs because B is longer than A. In Case III, Contig 3 (the shorter contig) is defined as AS because the aligned sequence (a+b+c) accounted for >85% of the no-repeat sequence total length (A). In Case IV, Contig 5 (the shorter contig) is considered to be a duplication of sequence because the aligned sequence (a+b+c) accounted for <85% of the no-repeat sequence total length (A); therefore, Contig 5 is defined as a haploid genome contig.

Schematic illustration of the method used to detect aligned sequences in the assembly. MUMmer was used to perform self-alignment on the elementary contigs; paired contigs were categorized into 4 types of outcome. In Case I, the contig aligns to itself, which will be ignored. In Case II, Contig 2 represents a contig with no obvious alignment with other contigs, and the contig type is defined as haploid genome contig. In Cases III and IV, the contig under analysis can align with another contig; in the figure, Contigs 4 and 6 are defined as haploid genome contigs because B is longer than A. In Case III, Contig 3 (the shorter contig) is defined as AS because the aligned sequence (a+b+c) accounted for >85% of the no-repeat sequence total length (A). In Case IV, Contig 5 (the shorter contig) is considered to be a duplication of sequence because the aligned sequence (a+b+c) accounted for <85% of the no-repeat sequence total length (A); therefore, Contig 5 is defined as a haploid genome contig.

Validation and quality control

The completeness and accuracy of the assembly were assessed using 3 independent measures. We first mapped all Illumina paired-end reads onto the assemblies (scaffolds and ASs), and the results indicated that >73.24-fold effective depth was obtained across all of the scaffolds. Regarding ASs, the lowest depth was 13.08-fold. These data indicated that the genome was extensively covered by sequence reads (Table 4). We then aligned RNA-sequencing (RNA-Seq) reads to our assemblies (scaffold and ASs) using Spliced Transcripts Alignment to a Reference (STAR) (STAR, RRID:SCR_015899) (Table S2) [19]. For the RNA-Seq reads, the data indicated that 79.07–91.62% of reads generated from these samples could be correctly mapped to the scaffolds with appropriate splicing, while 22.22–41.16% of RNA-Seq reads were mapped to the ASs (Table 2). Furthermore, the benchmarking universal single-copy orthologs (BUSCO, RRID:SCR_015008) (BUSCO, Table S2) [20] data set was used to evaluate the completeness of the assembly. Approximately 93% of complete BUSCOs were found in the assembly. When compared to other sequenced coleopteran genomes, the data indicated that the complete BUSCOs found in the current assembled P. brevitarsis genome totaled 93.00%. Therefore, this percentage was lower than that observed in Tribolium castaneum (96.59%) and Pyrocoelia pectoral (98.80%) but higher than that observed in other genomes (Table 5). In summary, these results suggested that the genome assembly was complete and of high quality.
Table 4.

Summary statistics of Illumina genome-sequencing reads mapped onto the assemblies

Mean depthLowest depthHighest depth
Corrected HGCs121.973.24167.07
Corrected ASs85.6313.081,221.48
Corrected scaffolds122.273.24167.07
Table 5.

BUSCOs found in coleopteran genomes

SpeciesComplete (%)Fragment (%)Missing (%)Duplication (%)
O. taurus 80.4510.009.558.80
D. ponderosae 81.478.5310.0010.87
A. glabripennis 82.318.708.999.12
A. planipennis 91.902.705.404.10
P. brevitarsis 93.001.905.107.20
T. castaneum 96.592.900.519.40
P. pectoral 98.800.600.607.20
Summary statistics of Illumina genome-sequencing reads mapped onto the assemblies BUSCOs found in coleopteran genomes

Genome annotation

Repetitive sequences, including tandem repeats and interspersed repeats, were searched for in the P. brevitarsis genome. Tandem repeats in the genome were defined as ≥2 adjacent, approximate copies of a pattern of nucleotides. Tandem Repeats Finder (Table S2) [21] was used to search for tandem repeats in the genome. Two independent methods, homology based and de novo prediction, were used to identify interspersed repeats in the assembly. Regarding the homology-based method, the assembled genome was compared with Repbase (V.22.11) [22] using RepeatMasker (RepeatMasker, RRID:SCR_012954) and RepeatProteinMasker (Table S2) with default settings [23]. For de novo predictions, we built a de novo repeat library with LTR Finder (Table S2) [24] and RepeatScout (RepeatScout, RRID:SCR_014653) (Table S2) [25]. RepeatProteinMask (Table S2) was then used to identify putative transposable element (TE)-related proteins. After merging all of the repetitive elements identified using the aforementioned tools, we identified a total of 396.23 Mb of repetitive sequences, accounting for 51.82% of the haploid genome (Table 6). Regarding the ASs, 220.22 Mb of repetitive sequences were identified, accounting for 56.02% of the total length of the genome (Table 6).
Table 6.

Summary of identified repeat elements in the P. brevitarsis genome

Repeat element Repeat elements from haploid genomeRepeat elements from ASs
Length (bp)Percentage (%)Length (bp)Percentage (%)
Long terminal repeat109,722,08514.3560,133,49115.29
Long interspersed nuclear element101,529,62713.2852,758,84913.42
Short interspersed nuclear element259,9360.0350,3660.01
DNA element166,788,39221.8192,972,80123.65
Simple repeat4,749,9080.622,485,6610.63
Low complexity1,132,9190.15656,6260.17
Rolling circle7,162,2760.945,220,6921.33
Satellite304,7340.04221,4370.06
Other131,6050.0299,1130.03
Unclassified4,451,2770.585,618,7121.43
Total396,232,75951.82220,217,74856.02
Summary of identified repeat elements in the P. brevitarsis genome Four types of noncoding RNAs were searched for across the P. brevitarsis genome. Transfer RNAs (tRNAs) were annotated using tRNAscan-SE (tRNAscan-SE, RRID:SCR_010835) (Table S2) [26] with default parameters for eukaryotes. Ribosomal RNAs (rRNAs) were identified using BlastN (BLASTN, RRID:SCR_001598) alignments, and RNAmmer (Table S2) [27] was used to predict rRNAs and their subunits. Small nuclear RNAs and microRNAs were predicted using the Rfam (Rfam, RRID:SCR_007891) [28] database and BlastN (Table S2). These analyses identified 864 microRNAs, 3,277 tRNAs, 113 rRNAs, and 95 small nuclear RNAs. The protein-coding genes were annotated on the basis of evidence obtained using the homology-based method, ab initio prediction, and RNA-Seq data. Regarding the homology-based method, protein sequences from all Coleoptera in the NCBI Reference Sequence Database (2 October 2017) were collected and aligned with our genome scaffolds using GenBlastA (Table S2) [29]. Target regions were then expanded to 10 kb both for upstream and downstream analyses and were then used to determine accurate gene structures using GeneWise (GeneWise, RRID:SCR_015054) software (Table S2) [30]. For de novo prediction, AUGUSTUS (Augustus, RRID:SCR_008417) (Table S2) [31], Genemark (GeneMark, RRID:SCR_011930) (Table S2) [32], and SNAP (SNAP, RRID:SCR_007936) (Table S2) [33] programs were used to obtain predicted gene structures from repeat-masked genomes. The top 300 longest coding sequence identities (>90%) associated with RNA-Seq unigenes were selected to train these programs, and the resulting suitable parameters were used for P. brevitarsis gene de novo prediction. Furthermore, we identified gene structures with the assistance of RNA-Seq data. First, RNA-Seq reads were aligned against the genome using STAR (Table S2) to identify candidate exon regions with default parameters. StringTie (StringTie, RRID:SCR_016323) (Table S2) [34] was then used to assemble the aligned reads into transcripts. Finally, all data were combined using Evidence Modeler (EVidenceModeler, RRID:SCR_014659) [35] to produce the consensus gene set, and 22,229 and 11,881 protein-coding genes were generated from scaffolds and ASs, respectively. There were 469 identical genes detected between the 2 methods. Functional annotation of genes was performed using BlastP (BLASTP, RRID:SCR_001010) (Table S2) alignment to the Kyoto Encyclopedia of Genes and Genomes (KEGG) (KEGG, RRID:SCR_012773) [36, 37], Nr/Nt (2 March 2016), [38], Swiss-Prot/Uniprot and TrEMBL databases [39, 40]. Motifs and domains were determined using InterProScan (InterProScan, RRID:SCR_005829) [41, 42] against protein databases, including Pfam (Pfam, RRID:SCR_004726) [43, 44], SMART (SMART, RRID:SCR_005026) [45, 46], PANTHER (PANTHER, RRID:SCR_004869) [47, 48], and PROSITE (PROSITE, RRID:SCR_003457) [49, 50]. The results indicated that 17,625 genes from the haploid genome were annotated, while 8,887 genes from ASs were annotated (Table 7).
Table 7.

Summary of annotated genes in the P. brevitarsis genome

Database Annotated genes
From haploid genome (No. [%])From ASs (No. [%])
KEGG15,828 (71.16)7,980 (67.17)
Swiss-Prot10,509 (47.25)5,179 (43.59)
Nr17,487 (78.62)8,757 (73.71)
Nt3,688 (16.58)1,855 (15.61)
TrEMBL_eggNOG15,986 (71.87)8,029 (67.58)
No. of total annotated genes17,625 (79.24)8,887 (74.80)
Summary of annotated genes in the P. brevitarsis genome

Phylogenetic tree reconstruction and divergence time estimation

To investigate the phylogenetic position of P. brevitarsis, protein data from the NCBI database were retrieved for coleopteran insects Anoplophora glabripennis, Dendroctonus ponderosae, T. castaneum, Onthophagus taurus, P. pectoral, and Agrilus planipennis, and the lepidopteran insect Danaus plexippus was used to root the tree. All proteins were pooled together, and OrthoMCL (Table S2) [51] was used for orthologue group identification. A total of 76,623 orthologue groups were identified, and 13,627 gene families were specific to P. brevitarsis. Moreover, 2,354 orthologue groups, which were identified as single-copy genes that were shared between these species, were selected for subsequence analyses. The selected proteins from these species were concatenated and subjected to multiple alignment using MAFFT (MAFFT, RRID:SCR_011811) (Table S2) [52] and profile trimming with TrimAI (Table S2) [53]. After that, Beast 2 (Table S2) [54] was used to conduct phylogenetic analyses. The phylogenetic tree indicated that P. brevitarsis was closely related to O. taurus, and the estimated divergence time was ∼140 million years ago (Fig. 4).
Figure 4

Phylogenetic relationship of P. brevitarsis and 6 Coleoptera insects based on 2,354 orthologue genes. Estimated divergence times using D. ponderosae-T. castaneum [180Mya] as the calibration time are shown [55].

Phylogenetic relationship of P. brevitarsis and 6 Coleoptera insects based on 2,354 orthologue genes. Estimated divergence times using D. ponderosae-T. castaneum [180Mya] as the calibration time are shown [55].

Discussion

Scarabaeoidea is a diverse lineage of predominantly plant- and dung-feeding beetles that consists of >31,000 described species [55]. In this study, we sequenced the genome of P. brevitarsis, and this represents the first high-quality genome of a plant-feeding scarab. Plant- and dung-feeding scarab beetles are considered sister lineages [56], and they exhibit modes that can be used to test hypotheses of species diversification that may have been driven by interactions with angiosperm and mammal lineages. Therefore, P. brevitarsis genomic data could provide useful resources for studies that examine the evolution of insect lineages and major biotic changes in Earth's history. Furthermore, this high-quality reference genome will contribute to research associated with several recent investigations regarding P. brevitarsis’ harmfulness to crops, usefulness in agricultural waste utilization, edibility, medicinal value, and applications to insect immunology research.

Availability of supporting data

Raw sequencing reads have been deposited in the Sequence Read Archive database with NCBI Bioproject ID PRJNA477715 and PRJNA482477. The genome assembly including haploid genome contigs, ASs, and complete mitochondrial genome has been deposited in NCBI Genomes with accession No. RXPK00000000. Gene models and other supporting data are available via the GigaScience database GigaDB [57]. Key parameters we used that may affect the software results are available in Table S2.

Additional files

Table S1. Estimation of genome characteristics based on 17-mer analysis. Table S2. The software used in the study.

Abbreviations

AS: allele sequence; bp: base pair; BUSCO: benchmarking universal single-copy ortholog; Gb: gigabases; HGC: haploid genome contig; kb: kilobases; KEGG: Kyoto Encyclopedia of Genes and Genomes; Mb: megabases; NCBI: National Center for Biotechnology Information; RNA-Seq: RNA sequencing; SMRT: single-molecule real time; STAR: Spliced Transcripts Alignment to a Reference; rRNA: ribosomal RNA; TE: transposable element; tRNA: transfer RNA.

Competing interests

The authors declare that they have no competing interests.

Funding

This study was supported by the National Key Research and Development Program of China (2018YFD0800906 and 2017YFD0201204) and National Natural Science Foundation of China (31530095 and 31872935).

Author contributions

C.S. and J.Z. designed the study; C.L. and Q.W. collected samples; C.L., J.Y., and L.G. extracted DNA and RNA samples; Y.G. and P.L. worked on sequencing; C.S., P.L., and K.W. worked on the genome assembly, assessment, and annotation; and C.S. and K.W. wrote the manuscript. All authors read and approved the final version of the manuscript. Click here for additional data file. Click here for additional data file. Click here for additional data file. 8/26/2018 Reviewed Click here for additional data file. 11/26/2018 Reviewed Click here for additional data file. 1/16/2019 Reviewed Click here for additional data file. Click here for additional data file.
  39 in total

1.  SMART 4.0: towards genomic data integration.

Authors:  Ivica Letunic; Richard R Copley; Steffen Schmidt; Francesca D Ciccarelli; Tobias Doerks; Jörg Schultz; Chris P Ponting; Peer Bork
Journal:  Nucleic Acids Res       Date:  2004-01-01       Impact factor: 16.971

2.  Scaffolding pre-assembled contigs using SSPACE.

Authors:  Marten Boetzer; Christiaan V Henkel; Hans J Jansen; Derek Butler; Walter Pirovano
Journal:  Bioinformatics       Date:  2010-12-12       Impact factor: 6.937

3.  Tandem repeats finder: a program to analyze DNA sequences.

Authors:  G Benson
Journal:  Nucleic Acids Res       Date:  1999-01-15       Impact factor: 16.971

4.  STAR: ultrafast universal RNA-seq aligner.

Authors:  Alexander Dobin; Carrie A Davis; Felix Schlesinger; Jorg Drenkow; Chris Zaleski; Sonali Jha; Philippe Batut; Mark Chaisson; Thomas R Gingeras
Journal:  Bioinformatics       Date:  2012-10-25       Impact factor: 6.937

5.  cDNA cloning and molecular characterization of a defensin-like antimicrobial peptide from larvae of Protaetia brevitarsis seulensis (Kolbe).

Authors:  Jiae Lee; Kyeongrin Bang; Sejung Hwang; Saeyoull Cho
Journal:  Mol Biol Rep       Date:  2016-03-12       Impact factor: 2.316

6.  GeneMark: web software for gene finding in prokaryotes, eukaryotes and viruses.

Authors:  John Besemer; Mark Borodovsky
Journal:  Nucleic Acids Res       Date:  2005-07-01       Impact factor: 16.971

7.  InterProScan: protein domains identifier.

Authors:  E Quevillon; V Silventoinen; S Pillai; N Harte; N Mulder; R Apweiler; R Lopez
Journal:  Nucleic Acids Res       Date:  2005-07-01       Impact factor: 16.971

8.  BUSCO Applications from Quality Assessments to Gene Prediction and Phylogenomics.

Authors:  Robert M Waterhouse; Mathieu Seppey; Felipe A Simão; Mosè Manni; Panagiotis Ioannidis; Guennadi Klioutchnikov; Evgenia V Kriventseva; Evgeny M Zdobnov
Journal:  Mol Biol Evol       Date:  2018-03-01       Impact factor: 16.240

9.  RNAmmer: consistent and rapid annotation of ribosomal RNA genes.

Authors:  Karin Lagesen; Peter Hallin; Einar Andreas Rødland; Hans-Henrik Staerfeldt; Torbjørn Rognes; David W Ussery
Journal:  Nucleic Acids Res       Date:  2007-04-22       Impact factor: 16.971

10.  Gene finding in novel genomes.

Authors:  Ian Korf
Journal:  BMC Bioinformatics       Date:  2004-05-14       Impact factor: 3.169

View more
  6 in total

1.  Lignocellulose degradation in Protaetia brevitarsis larvae digestive tract: refining on a tightly designed microbial fermentation production line.

Authors:  Kui Wang; Peiwen Gao; Lili Geng; Chunqin Liu; Jie Zhang; Changlong Shu
Journal:  Microbiome       Date:  2022-06-13       Impact factor: 16.837

2.  A high-quality genome assembly from a single, field-collected spotted lanternfly (Lycorma delicatula) using the PacBio Sequel II system.

Authors:  Sarah B Kingan; Julie Urban; Christine C Lambert; Primo Baybayan; Anna K Childers; Brad Coates; Brian Scheffler; Kevin Hackett; Jonas Korlach; Scott M Geib
Journal:  Gigascience       Date:  2019-10-01       Impact factor: 6.524

Review 3.  Drug Discovery Insights from Medicinal Beetles in Traditional Chinese Medicine.

Authors:  Stephen T Deyrup; Natalie C Stagnitti; Mackenzie J Perpetua; Siu Wah Wong-Deyrup
Journal:  Biomol Ther (Seoul)       Date:  2021-03-01       Impact factor: 4.634

4.  Digestibility of insect meals for Pacific white shrimp (Litopenaeus vannamei) and their performance for growth, feed utilization and immune responses.

Authors:  Jaehyeong Shin; Kyeong-Jun Lee
Journal:  PLoS One       Date:  2021-11-19       Impact factor: 3.240

5.  Identification and field verification of an aggregation pheromone from the white-spotted flower chafer, Protaetia brevitarsis Lewis (Coleoptera: Scarabaeidae).

Authors:  Xiaofang Zhang; Liuyang Wang; Chunqin Liu; Yongqiang Liu; Xiangdong Mei; Zhongyue Wang; Tao Zhang
Journal:  Sci Rep       Date:  2021-11-16       Impact factor: 4.379

6.  Characterization of Microorganisms from Protaetia brevitarsis Larva Frass.

Authors:  Huina Xuan; Peiwen Gao; Baohai Du; Lili Geng; Kui Wang; Kun Huang; Jie Zhang; Tianpei Huang; Changlong Shu
Journal:  Microorganisms       Date:  2022-01-28
  6 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.