Literature DB >> 30101298

Draft genome assembly of the invasive cane toad, Rhinella marina.

Richard J Edwards1, Daniel Enosi Tuipulotu1, Timothy G Amos1, Denis O'Meally2, Mark F Richardson3,4, Tonia L Russell5, Marcelo Vallinoto6,7, Miguel Carneiro6, Nuno Ferrand6,8,9, Marc R Wilkins1,5, Fernando Sequeira6, Lee A Rollins3,10, Edward C Holmes11, Richard Shine12, Peter A White1.   

Abstract

Background: The cane toad (Rhinella marina formerly Bufo marinus) is a species native to Central and South America that has spread across many regions of the globe. Cane toads are known for their rapid adaptation and deleterious impacts on native fauna in invaded regions. However, despite an iconic status, there are major gaps in our understanding of cane toad genetics. The availability of a genome would help to close these gaps and accelerate cane toad research. Findings: We report a draft genome assembly for R. marina, the first of its kind for the Bufonidae family. We used a combination of long-read Pacific Biosciences RS II and short-read Illumina HiSeq X sequencing to generate 359.5 Gb of raw sequence data. The final hybrid assembly of 31,392 scaffolds was 2.55 Gb in length with a scaffold N50 of 168 kb. BUSCO analysis revealed that the assembly included full length or partial fragments of 90.6% of tetrapod universal single-copy orthologs (n = 3950), illustrating that the gene-containing regions have been well assembled. Annotation predicted 25,846 protein coding genes with similarity to known proteins in Swiss-Prot. Repeat sequences were estimated to account for 63.9% of the assembly. Conclusions: The R. marina draft genome assembly will be an invaluable resource that can be used to further probe the biology of this invasive species. Future analysis of the genome will provide insights into cane toad evolution and enrich our understanding of their interplay with the ecosystem at large.

Entities:  

Mesh:

Year:  2018        PMID: 30101298      PMCID: PMC6145236          DOI: 10.1093/gigascience/giy095

Source DB:  PubMed          Journal:  Gigascience        ISSN: 2047-217X            Impact factor:   6.524


Data Description

Introduction

The cane toad (Rhinella marina formerly Bufo marinus) (Fig. 1) is a true toad (Bufonidae) native to Central and South America that has been introduced to many areas across the globe [1]. Since its introduction into Queensland in 1935, the cane toad has spread widely and now occupies more than 1.2 million square kilometers of the Australian continent, fatally poisoning predators such as the northern quoll, freshwater crocodiles, and several species of native lizards and snakes [1-5]. The ability of cane toads to kill predators with toxic secretions has contributed to the success of their invasion [1]. To date, research on cane toads has focused primarily on ecological impacts, rapid evolution of phenotypic traits, and population genetics using neutral markers [6, 7], with limited knowledge of the genetic changes that allow the cane toad to thrive in the Australian environment [8-11]. A reference genome will be useful for studying loci subject to rapid evolution and could provide valuable insights into how invasive species adapt to new environments. Amphibian genomes have a preponderance of repetitive DNA [12, 13], confounding assembly with the limited read lengths of first- and second-generation sequencing technologies. Here, we employ a hybrid assembly of Pacific Biosciences (PacBio) long reads and Illumina short reads (Fig. 2) to overcome assembly challenges presented by the repetitive nature of the cane toad genome. Using this approach, we assembled a draft genome of R. marina that is comparable in contiguity and completeness to other published anuran genomes [14-17]. We used our previously published transcriptomic data [18] and other published anuran sequences to annotate the genome. Our draft cane toad assembly will serve as a reference for genetic and evolutionary studies and provides a template for continued refinement with additional sequencing efforts.
Figure 1:

An adult cane toad, Rhinella marina.

Figure 2:

Schematic overview of project workflow. A summary of the experimental methods used for sequencing, assembly, annotation, and size estimation of the cane toad genome. Transcriptome data (orange segment) were obtained from our previous study [18].

An adult cane toad, Rhinella marina. Schematic overview of project workflow. A summary of the experimental methods used for sequencing, assembly, annotation, and size estimation of the cane toad genome. Transcriptome data (orange segment) were obtained from our previous study [18].

Sample collection, library construction, and sequencing

Adult female cane toads were collected by hand from Forrest River in Oombulgurri, WA (15.1818oS, 127.8413oE) in June 2015. Toads were placed in individual damp cloth bags and transported by plane to Sydney, NSW, before they were anaesthetized by refrigeration for 4 hours and killed by subsequent freezing. High-molecular-weight genomic DNA (gDNA) was extracted from the liver of a single female using the genomic-tip 100/G kit (Qiagen, Hilden, Germany). This was performed with supplemental RNase (Astral Scientific, Taren Point, Australia) and proteinase K (NEB, Ipswich, MA, USA) treatment, as per the manufacturer's instructions. Isolated gDNA was further purified using AMPure XP beads (Beckman Coulter, Brea, CA, USA) to eliminate sequencing inhibitors. DNA quantity was assessed using the Quanti-iT PicoGreen dsDNA kit (Thermo Fisher Scientific, Waltham, MA, USA). DNA purity was calculated using a Nanodrop spectrophotometer (Thermo Fisher Scientific), and molecular integrity was assessed using pulse-field gel electrophoresis. For short-read sequencing, a paired-end library was constructed from the gDNA using the TruSeq polymerase chain reaction (PCR)-free library preparation kit (Illumina, San Diego, CA, USA). Insert sizes ranged from 200 to 800 bp. This library was sequenced (2 × 150 bp) on the HiSeq X Ten platform (Illumina) to generate approximately 282.9 Gb of raw data (Table 1). Illumina short sequencing reads were assessed for quality using FastQC v0.10.1 [19]. Low-quality reads were filtered and trimmed using Trimmomatic v0.36 [20] with a Q30 threshold (LEADING:30, TRAILING:30, SLIDINGWINDOW:4:30) and a minimum 100-bp read length, leaving 64.9% of the reads generated, of which 75.2% were in retained read pairs.
Table 1:

Summary statistics of generated whole-genome shotgun sequencing data

PlatformLibrary typeMean insert size (kb)Mean read length (bp)Number of readsNumber of bases (Gb)
HiSeqX (raw)Paired-end0.35147.71857,762 ,090282.92
HiSeqX (filtered) 140.6 1,205,616,705 169.47
PacBio RS IISMRTbell15–508,8522,794,39124.736
PacBio RS IISMRTbell15–509,085595,4475.409
PacBio RS IISMRTbell15–5010,4321,867,54319.482
PacBio RS IISMRTbell20–5010,8342,487,85226.952
PacBio Total 9,887 7,745 ,233 76.58
PacBio Unique a 10 987 6,167,714 67.77

Bold rows indicate data used for assembly.

Longest read per sequenced molecule (single-molecule real-time zero-mode waveguide- ZMW).

Summary statistics of generated whole-genome shotgun sequencing data Bold rows indicate data used for assembly. Longest read per sequenced molecule (single-molecule real-time zero-mode waveguide- ZMW). For long-read sequencing, we used the single-molecule real-time (SMRT) sequencing technology (PacBio, Menlo Park, CA, USA). Four SMRTbell libraries were prepared from gDNA using the SMRTBell template preparation kit 1.0 (PacBio). To increase subread length, either 15–50 kb or 20–50 kb BluePippin size selection (Sage Science, Beverly, MA, USA) was performed on each library. Recovered fragments were sequenced using P6C4 sequencing chemistry on the RS II platform (240 min movie time). The four SMRTbell libraries were sequenced on 97 SMRT cells to generate 7,745,233 subreads for 76.6 Gb of raw data. Collectively, short- and long-read sequencing produced around 359.5 Gb of data (Table 1).

Genome assembly

We employed a hybrid de novo whole-genome assembly strategy, combining both short-read and long-read data. Trimmed Q30-filtered short reads were de novo assembled with ABySS v1.3.6 [21] using k = 64 and default parameters (contig N50 = 583 bp) (Table 2). Long sequence reads were de novo assembled using the program DBG2OLC [22] (k 17 AdaptiveTh 0.0001 KmerCovTh 2 MinOverlap 20 RemoveChimera 1) (contig N50 = 167.04 kbp) (Table 2). Following this, both assemblies were merged together using the hybrid assembler (“sparc”) tool of DBG2OLC with default parameters, combining the contiguity of the long-read data with the improved accuracy of the high-coverage Illumina assembly. This hybrid assembly (v2.0) was twice “polished” to remove errors. In the first round, the Q30-trimmed Illumina reads were mapped to the hybrid assembly with bowtie v2.2.9 [23] and filtered for proper pairs using samtools v1.3.1 [24]. Scaffolds were polished with Pilon v1.21 [25] to generate the second iteration of the assembled genome (v2.1). In the second round, PacBio subreads were mapped to assembly v2.1 for error correction using SMRT analysis software (PacBio). PacBio subreads for each library were converted to BAM format with bax2bam v0.0.08 and aligned to the genome using pbalign v.0.3.0. BAM alignment files were combined using samtools merge v1.3.1, and the scaffolds were polished with Arrow v2.1.0 to generate the final genome assembly (v2.2). Our final draft assembly of the cane toad genome (v2.2) has 31-392 scaffolds with an N50 of 167 kb (Table 2). The GC content (43.23%) is within 1% of the published estimate of 44.17%, determined by flow cytometry [26].
Table 2:

Summary of genome assemblies For comparison, statistics are provided for two existing neobatrachian genomes, Nanorana parkeri (v2) [15] and Lithobates catesbeianus (v2.1)[14], and two anuran reference genomes, Xenopus tropicalis (v9.1) [16] and Xenopus laevis (v9.2) [17]. Lengths are given to 3 significant figures (s.f.). All percentages are given to 1 decimal point (d.p).

Genome assemblyHybrid (v2.2)Short readLong read N. parkeri (v2.0) L. catesbeianus (v2.1) X. tropicalis(v9.1) X. laevis (v9.2)
Total length (Gb)2.553.752.692.076.251.442.72
No. scaffolds31,39219.9 Ma31,392a135,8081.54 M6,822108,033
Proportion gap (%N)0.00.10.03.911.64.911.4
N50168 kb583 bp167 kb1.06 Mb39.4 kb135 Mb137 Mb
L503,373715 k3,53155531,24859
Longest scaffold3.53 Mb72.6 kb3.64 Mb8.61 Mb1.38 Mb195 Mb220 Mb
GC (%)43.243.342.942.643.140.139.0
 BUSCOb
Complete single copy (%)80.915.52.283.442.387.552.9
Complete duplicate (%)2.20.70.01.60.91.039.8
Fragment (%)7.533.62.27.222.36.03.2

For comparison, statistics are provided for two existing neobatrachian genomes, Nanorana parkeri (v2) [15] and Lithobates catesbeianus (v2.1)[14], and two anuran reference genomes, Xenopus tropicalis (v9.1) [16] and Xenopus laevis (v9.2) [17]. Lengths are given to 3 significant figures (s.f.). All percentages are given to 1 d.p.

Statistics for short- and long-read assemblies refer to contigs used for hybrid assembly.

Benchmarking Universal Single-Copy Orthologs v2.0.1 short summary statistics (n = 3,950).

Summary of genome assemblies For comparison, statistics are provided for two existing neobatrachian genomes, Nanorana parkeri (v2) [15] and Lithobates catesbeianus (v2.1)[14], and two anuran reference genomes, Xenopus tropicalis (v9.1) [16] and Xenopus laevis (v9.2) [17]. Lengths are given to 3 significant figures (s.f.). All percentages are given to 1 decimal point (d.p). For comparison, statistics are provided for two existing neobatrachian genomes, Nanorana parkeri (v2) [15] and Lithobates catesbeianus (v2.1)[14], and two anuran reference genomes, Xenopus tropicalis (v9.1) [16] and Xenopus laevis (v9.2) [17]. Lengths are given to 3 significant figures (s.f.). All percentages are given to 1 d.p. Statistics for short- and long-read assemblies refer to contigs used for hybrid assembly. Benchmarking Universal Single-Copy Orthologs v2.0.1 short summary statistics (n = 3,950).

Assessment of genome completeness

BUSCO [27] analysis of conserved single-copy orthologues is widely used as a proxy for genome completeness and accuracy. While direct comparisons are only truly valid within an organism, comparing BUSCO scores to genomes from related organisms provides a useful benchmark. We ran BUSCO v2.0.1 (short mode, lineage tetrapoda_odb9, BLAST+ v2.2.31 [28], HMMer v3.1b2 [29], AUGUSTUS v3.2.2 [30], EMBOSS v6.5.7 [31] on each of our assemblies, along with four published anuran genomes (Fig. 3, Table 2). The hybrid assembly combined the completeness of the long-read assembly with the accuracy of the short-read assembly, providing an enormous boost in BUSCO completeness from less than 50% full and partial orthologs to more than 90%. Error correction through pilon and arrow polishing had a positive effect on the BUSCO measurement of genome completeness, with an increase of 7.8% in the number of full and partial orthologs between v2.0 and v2.2. For the polished assembly (v2.2), 3,279 (83.0%) of the 3,950 ultraconserved tetrapod genes were complete, 296 (7.5%) were fragmentary, and 375 (9.5%) were missing. It should be noted that these numbers mask some underlying complexity of BUSCO assessments; aggregate improvements in BUSCO scores with polishing include some losses as well as gains. Taking the best rating for each BUSCO in v2.0, v2.1, or v2.2 reduces the number of missing BUSCO genes to 326 (8.3%) and increases the complete number to 3,366 (85.2%) (Fig. 3, “R. marina (combined)”). This is explored further in the “Genome annotation and prediction” section below. Overall, BUSCO metrics indicate that our draft R. marina genome is approaching the quality and completeness of the widely used anuran amphibian reference genomes for X. laevis (v9.2) [17] and X. tropicalis (v.9.1) [16] and compares well to the recently published neobatrachian genomes of Nanorana parkeri (v2) [15] and Lithobates catesbeianus (v2.1) [14].
Figure 3:

Assessment of genome assembly completeness. BUSCO analysis of Rhinella marina genome assembly (v2.0 uncorrected, v2.1 pilon polishing, v2.2 pilon and arrow polishing, combined v2.1, 2.2, and 2.2 ratings), Lithobates catesbeianus (v2.1), Nanorana parkeri (v2.0), Xenopus tropicalis (v9.1), and Xenopus leavis (v9.2) genomes using the tetrapoda_odb9 orthologue set (n = 3,950). The Xenopus leavis genome duplication is made clear by the large number of paralogs (light blue) with respect to other assemblies.

Assessment of genome assembly completeness. BUSCO analysis of Rhinella marina genome assembly (v2.0 uncorrected, v2.1 pilon polishing, v2.2 pilon and arrow polishing, combined v2.1, 2.2, and 2.2 ratings), Lithobates catesbeianus (v2.1), Nanorana parkeri (v2.0), Xenopus tropicalis (v9.1), and Xenopus leavis (v9.2) genomes using the tetrapoda_odb9 orthologue set (n = 3,950). The Xenopus leavis genome duplication is made clear by the large number of paralogs (light blue) with respect to other assemblies.

Estimation of R. marina genome size

Previous reports have estimated the size of the cane toad genome at being from 3.98 to 5.65 Gb using either densitometry or flow cytometry analysis of stained nuclei within erythrocytes, hepatocytes, and renal cells [26, 32-38]. We employed two alternative strategies to measure the genome size, using short-read k-mer distributions and quantitative PCR (qPCR) of single copy genes. The k-mer frequencies were calculated for both raw and trimmed Q30-filtered paired-end short reads (Table 1) with Jellyfish v2.2.3 [39] using k = 21 and k = 23 and a maximum k-mer count of 10,000. The k-mer distributions were analyzed using GenomeScope [40] with mean read lengths of 148 bp (raw) or 141 bp (Q30) and k-mer coverage cutoffs of 1,000 and 10,000 (Table 3, Fig. 4). GenomeScope gave genome size estimates ranging from 1.77 Gb to 2.30 Gb, with the raw reads giving consistently larger estimates (1.85 Gb to 2.30 Gb) than the trimmed and filtered reads (1.77 Gb to 2.10 Gb). Estimates of the unique (single-copy) region of the genome were more consistent, ranging from 1.31 Gb to 1.46 Gb, with k = 23 estimates 99 Mb (raw) or 80 Mb (Q30) higher than k = 21. Increasing the GenomeScope maximum k-mer coverage threshold had the greatest effect on predicted genome size, increasing repeat length estimates by 274 Mb to 385 Mb. GenomeScope explicitly models heterozygous diploid k-mer distributions, which should make it robust to the additional challenge of sequencing a wild animal. However, GenomeScope predictions are affected by nonuniform repeat distributions, and this difference could indicate high copy number repeats in the genome that are difficult to model accurately. It is possible that high-frequency repeats with raw sequencing counts exceeding 10,000 are resulting in an underestimate of total repeat length and therefore genome size compared to the previous densitometry and flow cytometry predictions.
Table 3:

GenomeScope genome size estimates for Rhinella marina based on raw trimmed Illumina data using different combinations of k and maximum k-mer coverage .

Unique length (Mb)Repeat length (Mb)Genome size (Mb)
DataMax k-mer coverageMinMaxMinMaxMinMax
Raw (k = 21)1,0001,3651,3664894891,8531,855
Raw (k = 21)10,0001,3651,3658748742,2392,240
Raw (k = 23)1,0001,4531,4554704711,9241,926
Raw (k = 23)10,0001,4541,4548428422,2962,296
Q30 (k = 21)1,0001,3071,3084624621,7681,771
Q30 (k = 21)10,0001,3071,3087497492,0562,057
Q30 (k = 23)1,0001,3891,3914384391,8281,830
Q30 (k = 23)10,0001,3901,3917137132,1032,104

Lengths are in megabases (0 d.p.).

Figure 4:

GenomeScope k-mer frequency and log-transformed k-mer coverage profiles. (A) Raw Illumina data (k = 23). (B) Q30 trimmed Illumina data (k = 23). Profiles for k = 21 are similar (data not shown).

GenomeScope k-mer frequency and log-transformed k-mer coverage profiles. (A) Raw Illumina data (k = 23). (B) Q30 trimmed Illumina data (k = 23). Profiles for k = 21 are similar (data not shown). GenomeScope genome size estimates for Rhinella marina based on raw trimmed Illumina data using different combinations of k and maximum k-mer coverage . Lengths are in megabases (0 d.p.). In the second approach, the zfp292 (zinc finger protein 292) gene was selected from our BUSCO analysis as a single-copy target for genome estimation by qPCR [41]. First, PCR was used to amplify a 326-bp region of zfp292 (scaffold 6589, position 345 750–346 075) in a 25 µL reaction that contained 50 ng of gDNA, 200 µM dNTP, 0.625 units of Taq polymerase (Invitrogen), 10 × Taq polymerase buffer (Invitrogen), and 0.4 µM of each primer (Supplementary Table S1). The amplicon was cloned into the pGEM-T Easy vector (Promega, Madison, WI, USA), and the resultant plasmid was linearized with NdeI before being serially diluted to generate a qPCR standard (101–109 copies/µL). To amplify a smaller region (120 bp) within zfp292 (scaffold 6589, position 345 858–345 977), gDNA (10–25 ng) or 1 µL of the diluted standards was used as a template for a 20 µL qPCR reaction containing 2 × iTaq SYBR Green mastermix (BioRad, Hercules, CA, USA) and 0.5 µM of each primer (Supplementary Table S1). Cycle threshold values obtained for each plasmid dilution were used to generate a standard curve and infer the number of zfp292 amplicons generated from the template gDNA of known quantity. Genome sizes were generated from the formulae outlined by [41], and the average of two estimates (2.81 Gb and 1.94 Gb) was used to obtain a genome size of 2.38 Gb. This genome size provides an estimated combined 151X sequencing coverage (119X Illumina and 32X PacBio) (Table 4).
Table 4:

Estimation of Rhinella marina genome size using various methods and the corresponding level of sequencing coverage (3 s.f.)

MethodEstimated genome size (Gb)Illumina coverage (X)PacBio coverage (X)Reference
Flow cytometry (mean)4.3365.317.7[26, 33, 35, 38]
Flow cytometry (min)3.9871.119.2[38]
Flow cytometry (max)4.9057.715.6[35]
Densitometry (mean)4.9557.115.5[32, 34, 36, 37]
Densitometry (min)4.06a69.718.9[37]
Densitometry (max)5.6550.113.6[32]
GenomeScope (raw)2.0813636.8-
GenomeScope (Q30)1.9414639.4-
qPCR (zfp292)2.3811932.1-
Assembly (v2.2)2.5511130.0-

GenomeScope values in this table are mean values from the four setting combinations.

aValue adjusted to account for updated size of reference genome used to infer R. marina genome size.

Estimation of Rhinella marina genome size using various methods and the corresponding level of sequencing coverage (3 s.f.) GenomeScope values in this table are mean values from the four setting combinations. aValue adjusted to account for updated size of reference genome used to infer R. marina genome size. Our genome size estimation of 1.98 to 2.38 Gbp is smaller than the 2.55 Gbp assembly size and differs significantly from previously published estimates of 4 Gbp or more for this species. We suggest that this is a result of the repetitive nature of the genome (see below). Given this is the first estimate of the cane toad genome size using either k-mer or qPCR analysis, further investigations are required to more clearly understand the discrepancy in our estimates with respect to published genome sizes. Here, we estimate the depth of sequencing coverage using both sequence-based and cytometric genome size measures (Table 4).

Genome annotation and gene prediction

Annotation of the draft genome was performed using MAKER2 v2.31.6 [42], BLAST+ v2.2.31 [28], AUGUSTUS v3.2.2 [30], Exonerate v2.2.0 [43], RepeatMasker v4.0.6 [44] (DFAM [45], Library Dfam_1.2; RMLibrary v20150807), RepeatModeler v1.0.8 [46], and SNAP v2013-11-29 [47] using all Swiss-Prot protein sequences (downloaded 3 February 2017) [48]. AUGUSTUS was trained using BUSCO v2.0.1 (long mode, lineage tetrapoda_odb9) and a multitissue reference transcriptome we previously generated from tadpoles and six adult cane toads [18] (available from GigaDB [49], GenBank accession PRJNA383966). Whole tadpoles and the brain, liver, spleen, muscle, ovary, and testes of adult toads from Australia and Brazil were used to prepare cDNA libraries for the multitissue transcriptome sequencing. After the initial training run, two additional iterations of MAKER2 were run using hidden Markov models (HMMs) from SNAP training created from the previous run. Functional annotation of protein-coding genes predicted by MAKER2 were generated using Interproscan 5.25–64.0, with the following settings: -dp -t p -pa -goterms -iprlookup -appl TIGRFAM, SFLD, Phobius, SUPERFAMILY, PANTHER, Gene3D, Hamap, ProSiteProfiles, Coils, SMART, CDD, PRINTS, ProSitePatterns, SignalP_EUK, Pfam, ProDom, MobiDBLite, PIRSF, TMHMM. BLAST+ v2.6.0 [28] was used to annotate predicted genes using all Swiss-Prot proteins (release 2017_08, downloaded 2017-09-01) [48] using the following settings: -evalue 0.000001 -seg yes -soft_masking true -lcase_masking -max_hsps 1. In total, 58,302 protein-coding genes were predicted by the MAKER pipeline, with an average of 5.3 exons and 4.3 introns per gene (Table 5). Of these, 5,225 are single-exon genes, giving 4.7 introns per multi-exon gene with an average intron length of 4.08 kb. Predicted coding sequences make up 2.38% of the assembly. MAKER predicted considerably more than the approximately 20,000 genes expected for a typical vertebrate genome. There are two likely explanations for this: artifactual duplications in the genome assembly, either through underassembly or legitimate assembly of two heterozygous diploid copies; or the overprediction of proteins during genome annotation, including pseudogenes with high homology to functional genes, proteins from transposable elements or other repeats, and multiple fragments of open reading frames (ORFs) from the same gene (due to fragmentation of the genome) and lncRNA genes that have been incorrectly assigned a coding sequence. Of the 3,279 complete BUSCO genes identified (Table 2), only 85 (2.59%) were duplicated. This suggests that there is not widespread duplication in the assembly. Only 25,846 predicted genes were annotated as similar to known proteins in Swiss-Prot, with the remaining 32,456 predictions “of unknown function.” This is consistent with overprediction being the primary cause of inflated gene numbers. Poor-quality protein predictions are generally shorter (generated from fragmented or random ORFs) and have a larger annotation edit distance (AED) when compared to real proteins. Consistent with this, the predicted proteins of unknown function are shorter in sequence (median length 171 aa) compared to those with Swiss-Prot hits (median length 388 aa) (Fig. 5A) and have a greater AED (median 0.37 vs 0.2) (Fig. 5B). To investigate this further, predicted transcript and protein sequences were searched against the published de novo assembled transcriptome [18] using BLAST+ v2.2.31 [28] blastn or tblastn (top 10 hits, e-value <10−10) and compiled with GABLAM v2.28.3 [50]. For 56.5% of proteins with functional annotation, 95%+ of the protein length mapped to the top transcript hit (Table 6). Only 27.1% of unknown proteins had 95%+ coverage in the top transcript hit, which is again consistent with overprediction. We also reanalyzed the multitissue RNA sequencing (RNA-seq) data from Richardson et al. [18] by mapping the reads onto the MAKER predicted transcripts. Filtered reads (adaptor sequences and reads with average Phred <30 removed) were mapped with Salmon v0.8.0 [51] (Quasi-mapping default settings, IU libtype parameter). Read counts were converted into transcripts per million (TPM) by normalizing by transcript length, dividing by the sum of the length-normalized read counts, and then multiplying by 1 million. We observed lower expression levels overall in the “unknown” set (Fig. 6). With the caveat that real proteins may have very low expression, this is also consistent with the “unknown” gene set containing false annotations. Further review of the predicted protein descriptions revealed 4,357 with likely origins in transposable elements (including 4,114 long interspersed nuclear element-1 [LINE-1] ORFs) and 215 from viruses. However, many of these may be bona fide functional members of the cane toad proteome: 1,447 (33.2%) “transposon” and 151 (70.2%) of “viral” transcripts had support for expression >1 TPM.
Table 5:

Summary statistics of consensus protein-coding gene predictions and predicted repeat elements (including RNA genes) for the Rhinella marina v2.2 draft genome.

ElementCountNo. of scaffoldsAvg. lengthTotal lengthGenome coverage (%)PacBio depth (X)Illumina depth (X)
Protein-coding gene58,30219,53018.8 kb1.10 Gb42.9120.3258.07
Transcript58,30219,5301.24 kb72.3 Mb2.8320.4965.41
- Similar to known25,84611,9181.90 kb49.1 Mb1.9220.0856.42
- Unknown32,45615,213714 bp23.2 Mb0.9120.9868.82
Exon309,71819,530233 bp72.3 Mb2.8320.4965.41
- Coding294,53519,530207 bp60.8 Mb2.3820.6766.97
Intron251,41618,5094.08 kb1.03 Gb40.0920.3057.55
5’ untranslated region15,8558,839208 bp3.29 Mb0.1318.6953.86
Coding sequence58,30219,5301.04 kb60.8 Mb2.3820.6766.97
3’ untranslated region11,9655,780682 bp8.16 Mb0.3219.9158.52
BUSCO SC complete3,1942,01432.6 kb104 Mb4.0719.8953.01
Repeats
Short interspersed nuclear element21,6209,322338 bp7.31 Mb0.2919.4558.23
Long interspersed nuclear element268,56927,620513 bp138 Mb5.3821.0372.29
Long terminal repeat201,81724,949504 bp102 Mb3.9822.6268.96
DNA817,40530,689600 bp490 Mb19.1721.6768.37
Helitron20,3199,340826 bp16.8 Mb0.6619.3256.81
Retroposon1,042829549 bp570 kb0.0218.2250.87
Other1817209 bp3.7 kb0.00%14.2724.60
Unknown1,610,88330,966513 bp826 Mb32.2820.1259.39
Satellite25,55710,270440 bp11.3 Mb0.4418.3854.21
Simple repeats968,94730,62056.9 bp55.1 Mb2.1618.8848.51
Low complexity141,02824,02051.8 bp7.30 Mb0.2922.4864.48
rRNA5,2272,923422 bp2.20 Mb0.0940.88142.42
tRNA5,5584,474105 bp583 kb0.0229.15140.06
snRNA21,7889,432546 bp11.9 Mb0.4724.6389.12
srpRNA1711268 bp4.55 kb0.0022.11140.44
scRNA3369.0 bp207 bp0.0015.5347.29
RNA418266482 bp202 kb0.0132.65173.99
Repeat TOTALa 4,110,22231,179406 bp1.63 Gb63.920.8263.79

Lengths are given to 3 s.f. Coverage and mean depth statistics for PacBio and Q30-trimmed Illumina reads are given to 2 d.p. LINE: long interspersed nuclear element

Values for repeat totals account for overlapping repeats.

Figure 5:

Key protein statistics for predicted genes with and without annotated similarity to known genes. Histograms of (A) protein length and (B) MAKER2 AED for “similar” (blue) and “unknown” (red) classes of predicted genes.

Table 6:

Proportions of predicted protein and transcript sequences exceeding 50%, 80%, 95%, or 99% coverage in the top BLAST+ hit from the published transcriptome [18], and combined coverage for the top 10 transcript hits.

TypeCountCoverage in top transcript hitCoverage in top 10 transcript hits
50%+80%+95%+99%+50%+80%+95%+99%+
Protein (similar to known)25,84693.676.756.540.797.590.372.754.2
Transcript (similar to known)25,84675.050.030.821.482.673.157.240.9
Protein (unknown)32,45679.949.827.115.885.766.344.429.9
Transcript (unknown)32,45643.621.512.18.6152.637.325.419.1

All percentages given to 3 s.f.

Figure 6:

Multitissue gene expression for predicted genes with and without annotated similarity to known genes. (A) Histograms of RNA-seq TPM for “similar” (blue) and “unknown” (red) classes of predicted genes, capped at 100 TPM. (B) “Similar” and (C) “unknown” gene expression, rated as very low (<1 TPM), low (1–9 TPM), medium (10–99 TPM), or high (100+ TPM).

Key protein statistics for predicted genes with and without annotated similarity to known genes. Histograms of (A) protein length and (B) MAKER2 AED for “similar” (blue) and “unknown” (red) classes of predicted genes. Multitissue gene expression for predicted genes with and without annotated similarity to known genes. (A) Histograms of RNA-seq TPM for “similar” (blue) and “unknown” (red) classes of predicted genes, capped at 100 TPM. (B) “Similar” and (C) “unknown” gene expression, rated as very low (<1 TPM), low (1–9 TPM), medium (10–99 TPM), or high (100+ TPM). Summary statistics of consensus protein-coding gene predictions and predicted repeat elements (including RNA genes) for the Rhinella marina v2.2 draft genome. Lengths are given to 3 s.f. Coverage and mean depth statistics for PacBio and Q30-trimmed Illumina reads are given to 2 d.p. LINE: long interspersed nuclear element Values for repeat totals account for overlapping repeats. Proportions of predicted protein and transcript sequences exceeding 50%, 80%, 95%, or 99% coverage in the top BLAST+ hit from the published transcriptome [18], and combined coverage for the top 10 transcript hits. All percentages given to 3 s.f. To investigate the role of fragmented ORFs, we downloaded the Quest For Orthologues (QFO) reference proteomes (QFO 04/18) [52] and used BLAST+ v2.2.31 [28] blastp (e-value <10−7) to identify the top hit for each predicted protein in all eukaryote reference proteomes and in the X. tropicalis reference proteome. BLAST results were converted into global coverage with GABLAM v2.28.3 [50]. As expected, the vast majority (99.6%) of “similar” proteins had a blastp hit the QFO proteomes (data not shown). Perhaps surprisingly, nearly two thirds (66.5%) of “unknown” proteins also had a blastp hit, but these had lower coverage of the reference proteins than did proteins in the “similar” class (data not shown). A “combined coverage” score was calculated for each protein, taking the minimum percentage coverage of either the query protein or its top QFO hit. This metric was related to annotation quality, showing an inverse relationship with AED (data not shown). Excluding proteins with annotation indicating possible viral or transposable element origin, 45.7% of “similar” proteins and 96.8% of “unknown” proteins had the same closest X. tropicalis blastp hit as another predicted protein. Consistent with this being related to gene fragmentation, there was a negative relationship between the number of cane toad proteins sharing a given X. tropicalis top hit and how much of the X. tropicalis hit was covered by each cane toad protein. Nevertheless, it is likely that some of these protein fragments represent allelic variants that have been redundantly assembled. We ran BUSCO v2.0.1 (short mode, lineage tetrapoda_odb9, BLAST+ v2.2.31 [28], HMMer v3.1b2 [29], AUGUSTUS v3.2.2 [30], EMBOSS v6.5.7 [31]) on the MAKER2 transcriptome and proteome and retained the most complete rating for each gene (Fig. 7A, Supplementary Table S2, “Annotation”). MAKER annotation had fewer missing BUSCO genes than the v2.2 assembly (314 vs 375) but many more fragmented (561 vs 296). Equivalent BUSCO analysis of the Richardson et al. transcriptome [18] was only missing 296 genes. However, as seen with the assembly versions, these values mask hidden complexity. Combined BUSCO analysis of our hybrid assembly (v2.0, v2.1, v2.2) and annotation, revealed only 181 missing genes (Fig. 7A, Supplementary Table S2, “GigaDB”). Furthermore, >50% of the 279 genes “Missing” in the transcriptome are found in the genome and/or its annotation (Fig. 7B, Supplementary Table S2). When the transcriptome and our genome are combined, only 68 BUSCO genes (1.7%) are “Missing” and 3,845 (97.3%) are “Complete” (Fig. 7A, Supplementary Table S2, “CaneToad”). This highlights the usefulness of our assembly and illustrates the complementary nature of genome and transcriptome data. The former is more comprehensive but more difficult to assemble and annotate, whereas the latter is easier to assemble into full-length coding sequences but will miss some tissue-specific and lowly expressed genes. Some of the remaining “Missing” BUSCO genes may be present but too fragmented to reach the score threshold.
Figure 7:

Assessment of assembly annotation completeness. BUSCO analysis for (A) all BUSCO tetrapoda genes (n = 3950) and (B) the subset of BUSCO genes rated as “Missing” from the Richardson et al. transcriptome [18]. Rhinella marina (combined): combined v2.0, v2.1, and v2.2 ratings; Annotation: combined MAKER proteome and transcriptome ratings; GigaDB: combined assembly and annotation ratings; Cane Toad: combined assembly, annotation, and Richardson et al. transcriptome [18].

Assessment of assembly annotation completeness. BUSCO analysis for (A) all BUSCO tetrapoda genes (n = 3950) and (B) the subset of BUSCO genes rated as “Missing” from the Richardson et al. transcriptome [18]. Rhinella marina (combined): combined v2.0, v2.1, and v2.2 ratings; Annotation: combined MAKER proteome and transcriptome ratings; GigaDB: combined assembly and annotation ratings; Cane Toad: combined assembly, annotation, and Richardson et al. transcriptome [18]. Future work is needed to improve the quality of gene annotation. We have included all of the MAKER2 predictions in our annotation as well as a full table of protein statistics and top blastp hits from this analysis for further biological analyses (Supplementary Table S3). Annotation has also been made available via a WebApollo [53] genome browser [54] and an associated search tool [55]. This will facilitate community curation and annotation of genes of interest. For researchers who would like to use cane toad proteins in general evolutionary analyses, we have also created a “high-quality” dataset of 6,580 protein-coding genes with an AED no greater than 0.25 and at least 90% reciprocal coverage of its top QFO blastp hit, excluding possible viral and transposon proteins, available from the GigaScience database.

Phylogenetic analysis of high-quality proteins

To further validate the high-quality protein dataset, GOPHER [56] v3.4.2 was used to predict orthologues for each protein. QFO (04/18) [52] eukaryotic reference proteomes were supplemented with Uniprot Reference proteomes for Lithobates catesbeiana (UP000228934) [14] and Xenopus laevis (UP000186698) [17] and the annotated protein sequences of Nanorana parkeri v2 [15]. GOPHER orthologues were predicted with default settings based on a modified mutual best-hit algorithm that accounts for one-to-many or many-to-many orthologous relationships and retains the closest orthologue from each species. The closest orthologues were aligned with MAFFT [57] v7.310 (default settings) and phylogenetic trees inferred with IQ-TREE [58] v1.6.1 (default settings) for alignments containing at least three sequences. Phylogenetic trees were inferred in this manner for 6,417 of the 6,580 high-quality proteins. A supertree was then constructed from the 6,417 individual protein trees using CLANN [59] v4.2.2 (DFIT Most Similar Supertree Algorithm) (Fig. 8, Supplementary Fig. S1). Branch consistency was calculated for each branch as the proportion of source trees with taxa on either side of the branch that have no conflicts in terms of the placement of those taxa. The supertree supports the known phylogeny for amphibians used in this study, giving additional confidence in the quality and utility of these protein annotations. All alignments and trees are available in Supplementary Data via the GigaScience database.
Figure 8:

Phylogenetic supertree of 15 selected chordate taxa constructed from phylogenetic trees for 6,417 high-confidence cane toad proteins. Branch labels indicate percentage consistency (see text), rounded down. Numbers following each taxon are the number and percentage of source trees containing that taxon. The tree has been rooted using fish as an outgroup and visualized with FigTree [60]. The full supertree of 52 taxa is available as Supplementary Fig. S1.Pl

Phylogenetic supertree of 15 selected chordate taxa constructed from phylogenetic trees for 6,417 high-confidence cane toad proteins. Branch labels indicate percentage consistency (see text), rounded down. Numbers following each taxon are the number and percentage of source trees containing that taxon. The tree has been rooted using fish as an outgroup and visualized with FigTree [60]. The full supertree of 52 taxa is available as Supplementary Fig. S1.Pl

Repeat identification and analysis

The cane toad genome has proven very difficult to assemble using short reads alone, which suggests a high frequency of repetitive sequences, as for other amphibians [12, 13]. RepeatMasker annotations from the MAKER pipeline support this interpretation, with more than 4.1 million repeat sequences detected, accounting for 63.9% of the assembly (Table 5). The mean repeat length is 406 bp, which exceeds the Illumina read length used in our study (mean 140.6 bp paired-end). This makes short-read assembly of these regions difficult, as reflected by the poor ABySS contiguity (contig N50 = 583 bp; Table 2), and emphasizes the need for long-read data in this organism. The most abundant class of repeat elements is of unknown type (1.61 million elements covering 32.28% of the assembly), with DNA transposons the most abundant known class of element (817,262 repeats; 19.17% coverage). Of these, the most abundant are of the hAT-Ac (231,332 copies) and TcMar-Tc1 (226,145copies) superfamilies (Supplementary Table S4). Accounting for overlaps between repeat and gene features, 18.7% of the assembly (479,397,014 bp) has no annotation (Fig. 9).
Figure 9:

Summary of the main annotation classes for Rhinella marina genome assembly. Identified repeat classes exceeding 2% of assembly have been plotted separately (1 d.p.). All other repeats, including “Unknown,” have been grouped as “Other repeats.” The percentage for introns excludes any repeat sequences within those introns.

Summary of the main annotation classes for Rhinella marina genome assembly. Identified repeat classes exceeding 2% of assembly have been plotted separately (1 d.p.). All other repeats, including “Unknown,” have been grouped as “Other repeats.” The percentage for introns excludes any repeat sequences within those introns.

Conclusion

This draft genome assembly will be an invaluable tool for advancing knowledge of anuran biology, genetics, and the evolution of invasive species. Furthermore, we envisage these data will facilitate the development of biocontrol strategies that reduce the impact of cane toads on native fauna. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. 4/9/2018 Reviewed Click here for additional data file. 7/3/2018 Reviewed Click here for additional data file. 4/15/2018 Reviewed Click here for additional data file. 7/9/2018 Reviewed Click here for additional data file. 4/24/2018 Reviewed Click here for additional data file. 7/15/2018 Reviewed Click here for additional data file. 4/24/2018 Reviewed Click here for additional data file. Click here for additional data file.
  46 in total

1.  The evolution of genome size: what can be learned from anuran development?

Authors:  A D Chipman; O Khaner; A Haas; E Tchernov
Journal:  J Exp Zool       Date:  2001-12-15

2.  UniProt: the Universal Protein knowledgebase.

Authors:  Rolf Apweiler; Amos Bairoch; Cathy H Wu; Winona C Barker; Brigitte Boeckmann; Serenella Ferro; Elisabeth Gasteiger; Hongzhan Huang; Rodrigo Lopez; Michele Magrane; Maria J Martin; Darren A Natale; Claire O'Donovan; Nicole Redaschi; Lai-Su L Yeh
Journal:  Nucleic Acids Res       Date:  2004-01-01       Impact factor: 16.971

3.  Mixed population genomics support for the central marginal hypothesis across the invasive range of the cane toad (Rhinella marina) in Australia.

Authors:  Daryl R Trumbo; Brendan Epstein; Paul A Hohenlohe; Ross A Alford; Lin Schwarzkopf; Andrew Storfer
Journal:  Mol Ecol       Date:  2016-08-08       Impact factor: 6.185

4.  MAFFT multiple sequence alignment software version 7: improvements in performance and usability.

Authors:  Kazutaka Katoh; Daron M Standley
Journal:  Mol Biol Evol       Date:  2013-01-16       Impact factor: 16.240

5.  Phylogeography of Bufo marinus from its natural and introduced ranges.

Authors:  R W Slade; C Moritz
Journal:  Proc Biol Sci       Date:  1998-05-07       Impact factor: 5.349

6.  GenomeScope: fast reference-free genome profiling from short reads.

Authors:  Gregory W Vurture; Fritz J Sedlazeck; Maria Nattestad; Charles J Underwood; Han Fang; James Gurtowski; Michael C Schatz
Journal:  Bioinformatics       Date:  2017-07-15       Impact factor: 6.937

7.  Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement.

Authors:  Bruce J Walker; Thomas Abeel; Terrance Shea; Margaret Priest; Amr Abouelliel; Sharadha Sakthikumar; Christina A Cuomo; Qiandong Zeng; Jennifer Wortman; Sarah K Young; Ashlee M Earl
Journal:  PLoS One       Date:  2014-11-19       Impact factor: 3.240

8.  The North American bullfrog draft genome provides insight into hormonal regulation of long noncoding RNA.

Authors:  S Austin Hammond; René L Warren; Benjamin P Vandervalk; Erdi Kucuk; Hamza Khan; Ewan A Gibb; Pawan Pandoh; Heather Kirk; Yongjun Zhao; Martin Jones; Andrew J Mungall; Robin Coope; Stephen Pleasance; Richard A Moore; Robert A Holt; Jessica M Round; Sara Ohora; Branden V Walle; Nik Veldhoen; Caren C Helbing; Inanc Birol
Journal:  Nat Commun       Date:  2017-11-10       Impact factor: 14.919

9.  DBG2OLC: Efficient Assembly of Large Genomes Using Long Erroneous Reads of the Third Generation Sequencing Technologies.

Authors:  Chengxi Ye; Christopher M Hill; Shigang Wu; Jue Ruan; Zhanshan Sam Ma
Journal:  Sci Rep       Date:  2016-08-30       Impact factor: 4.379

10.  Draft genome assembly of the invasive cane toad, Rhinella marina.

Authors:  Richard J Edwards; Daniel Enosi Tuipulotu; Timothy G Amos; Denis O'Meally; Mark F Richardson; Tonia L Russell; Marcelo Vallinoto; Miguel Carneiro; Nuno Ferrand; Marc R Wilkins; Fernando Sequeira; Lee A Rollins; Edward C Holmes; Richard Shine; Peter A White
Journal:  Gigascience       Date:  2018-09-01       Impact factor: 6.524

View more
  18 in total

1.  Chromosome-level assembly of the mustache toad genome using third-generation DNA sequencing and Hi-C analysis.

Authors:  Yongxin Li; Yandong Ren; Dongru Zhang; Hui Jiang; Zhongkai Wang; Xueyan Li; Dingqi Rao
Journal:  Gigascience       Date:  2019-09-01       Impact factor: 6.524

2.  Transposable elements expression in Rhinella marina (cane toad) specimens submitted to immune and stress challenge.

Authors:  Adriana Ludwig; Michelle Orane Schemberger; Camilla Borges Gazolla; Joana de Moura Gama; Iraine Duarte; Ana Luisa Kalb Lopes; Carolina Mathias; Desirrê Alexia Lourenço Petters-Vandresen; Michelle Louise Zattera; Daniel Pacheco Bruschi
Journal:  Genetica       Date:  2021-08-12       Impact factor: 1.082

3.  Pseudogenized Amelogenin Reveals Early Tooth Loss in True Toads (Anura: Bufonidae).

Authors:  John Shaheen; Austin B Mudd; Thomas G H Diekwisch; John Abramyan
Journal:  Integr Comp Biol       Date:  2021-11-17       Impact factor: 3.326

4.  Intergenerational effects of manipulating DNA methylation in the early life of an iconic invader.

Authors:  Roshmi R Sarma; Michael R Crossland; Harrison J F Eyck; Jayna L DeVore; Richard J Edwards; Michael Cocomazzo; Jia Zhou; Gregory P Brown; Richard Shine; Lee A Rollins
Journal:  Philos Trans R Soc Lond B Biol Sci       Date:  2021-04-19       Impact factor: 6.671

5.  Evolutionary dynamics of DIRS-like and Ngaro-like retrotransposons in Xenopus laevis and Xenopus tropicalis genomes.

Authors:  Camilla Borges Gazolla; Adriana Ludwig; Joana de Moura Gama; Daniel Pacheco Bruschi
Journal:  G3 (Bethesda)       Date:  2022-02-04       Impact factor: 3.542

6.  Genome of Spea multiplicata, a Rapidly Developing, Phenotypically Plastic, and Desert-Adapted Spadefoot Toad.

Authors:  Fabian Seidl; Nicholas A Levis; Rachel Schell; David W Pfennig; Karin S Pfennig; Ian M Ehrenreich
Journal:  G3 (Bethesda)       Date:  2019-12-03       Impact factor: 3.154

7.  Developmental assays using invasive cane toads, Rhinella marina, reveal safety concerns of a common formulation of the rice herbicide, butachlor.

Authors:  Molly E Shuman-Goodier; Grant R Singleton; Anna M Forsman; Shyann Hines; Nicholas Christodoulides; Kevin D Daniels; Catherine R Propper
Journal:  Environ Pollut       Date:  2020-11-06       Impact factor: 8.071

Review 8.  Perspectives on studying molecular adaptations of amphibians in the genomic era.

Authors:  Yan-Bo Sun; Yi Zhang; Kai Wang
Journal:  Zool Res       Date:  2020-07-18

9.  Draft genome assembly of the invasive cane toad, Rhinella marina.

Authors:  Richard J Edwards; Daniel Enosi Tuipulotu; Timothy G Amos; Denis O'Meally; Mark F Richardson; Tonia L Russell; Marcelo Vallinoto; Miguel Carneiro; Nuno Ferrand; Marc R Wilkins; Fernando Sequeira; Lee A Rollins; Edward C Holmes; Richard Shine; Peter A White
Journal:  Gigascience       Date:  2018-09-01       Impact factor: 6.524

10.  Incomer, a DD36E family of Tc1/mariner transposons newly discovered in animals.

Authors:  Yatong Sang; Bo Gao; Mohamed Diaby; Wencheng Zong; Cai Chen; Dan Shen; Saisai Wang; Yali Wang; Zoltán Ivics; Chengyi Song
Journal:  Mob DNA       Date:  2019-11-23
View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.