Genome Sequencing of Musa acuminata Dwarf Cavendish Reveals a Duplication of a Large Segment of Chromosome 2.

Mareike Busche1, Boas Pucker1, Prisca Viehöver1, Bernd Weisshaar1, Ralf Stracke2.   

Abstract

Different Musa species, subspecies, and cultivars are currently investigated to reveal their genomic diversity. Here, we compare the genome sequence of one of the commercially most important cultivars, Musa acuminata Dwarf Cavendish, against the Pahang reference genome assembly. Numerous small sequence variants were detected and the ploidy of the cultivar presented here was determined as triploid based on sequence variant frequencies. Illumina sequence data also revealed a duplication of a large segment on the long arm of chromosome 2 in the Dwarf Cavendish genome. Comparison against previously sequenced cultivars provided evidence that this duplication is unique to Dwarf Cavendish. Although no functional relevance of this duplication was identified, this example shows the potential of plants to tolerate such aneuploidies.
Copyright © 2020 Busche et al.

Keywords:  banana; crop genome assembly; pan-genomics; small sequence variants

Year:  2020        PMID: 31712258      PMCID: PMC6945009          DOI: 10.1534/g3.119.400847

Source DB:  PubMed          Journal:  G3 (Bethesda)        ISSN: 2160-1836            Impact factor:   3.154


Bananas (Musa) are monocotyledonous perennial plants. The edible fruit (botanically a berry) is among the most popular fruits in the world. In 2016, about 5.5 million hectares of land were used for the production of more than 112 million tons of bananas (FAO 2019). The majority of bananas were grown in Africa, Latin America, and Asia, where they offer employment opportunities and are important export commodities (FAO 2019). Furthermore, with an annual per capita consumption of more than 200 kg in Rwanda and more than 100 kg in Angola, bananas provide food security in developing countries (FAO 2019; Arias ). While plantains or cooking bananas are commonly eaten as a staple food in Africa and Latin America, the softer and sweeter dessert bananas are popular in Europe and Northern America. Between 1998 and 2000, around 47% of the world banana production and the majority of the dessert banana production relied on the Cavendish subgroup of cultivars (Arias ). Within this subgroup, the Dwarf Cavendish banana (“Dwarf” referring to the height of the pseudostem, not to the fruit size) is one of the commercially most important cultivars, along with Grand Naine (“Chiquita banana”). Although Cavendish bananas are almost exclusively traded internationally, numerous varieties are used for local consumption in Africa and Southeast Asia. Bananas went through a long domestication process which started at least 7,000 years ago (Denham ). The first step towards edible bananas was interspecific hybridization between subspecies from different regions, which caused incorrect meiosis and diploid gametes (Perrier ). The diversity of edible triploid banana cultivars resulted from human selection and triploidization of Musa acuminata as well as Musa balbisiana (Perrier ). These insights into the evolution of bananas were revealed by the analysis of genome sequences.
Technological advances boosted sequencing capacities and allowed the (re-)sequencing of genomes from multiple subspecies and cultivars. M. acuminata can be divided into several subspecies and cultivars. The first M. acuminata (DH Pahang) genome sequence was published in 2012 (D’Hont ), and many more genomes have been sequenced since, including banksii, burmannica, zebrina (Rouard ), malaccensis (SRR8989632, SRR6996493), Baxijiao (SRR6996491, SRR6996491), and Sucrier: Pisang_Mas (SRR6996492). Additionally, the genome sequences of other Musa species, M. balbisiana (Davey ), M. itinerans (Wu ), and M. schizocarpa (Belser ), have already been published. Here we report our investigation of the genome of M. acuminata Dwarf Cavendish, one of the commercially most important cultivars. We identified an increased copy number of a segment of the long arm of chromosome 2, indicating that this region was duplicated in one haplophase.

Materials and Methods

Plant material and DNA extraction

Musa acuminata Dwarf Cavendish tissue culture seedlings were obtained from FUTURE EXOTICS/SolarTek (Düsseldorf, Germany) (Figure 1). Plants were grown under natural daylight at 21°. Genomic DNA was isolated from leaves following the protocol of Dellaporta .
Figure 1

M. acuminata Dwarf Cavendish plant, nine months old.

Library preparation and sequencing

Genomic DNA was subjected to sequencing library preparation via the TruSeq v2 protocol as previously described (Pucker ). Paired-end sequencing was performed on an Illumina HiSeq1500 and a NextSeq500, resulting in 2x250 nt and 2x154 nt read data sets, respectively, with an average Phred score of 38. These data sets provide 55x and 65.6x coverage, respectively, for the approximately 523 Mbp (D’Hont ) haploid banana genome.
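
The coverage values above follow from dividing the number of sequenced bases by the haploid genome size. The following Python sketch illustrates that arithmetic only. The function names are illustrative, and the base and read-pair totals it prints are back-calculated from the reported coverage under the assumption of full-length reads; they are not figures taken from the study.

```python
GENOME_SIZE_BP = 523_000_000  # approximate haploid genome size quoted in the text


def coverage(total_bases, genome_size=GENOME_SIZE_BP):
    """Average sequencing depth: sequenced bases divided by genome size."""
    return total_bases / genome_size


def bases_for_coverage(target_coverage, genome_size=GENOME_SIZE_BP):
    """Number of sequenced bases implied by a given average depth."""
    return target_coverage * genome_size


def read_pairs_for_coverage(target_coverage, read_length,
                            genome_size=GENOME_SIZE_BP):
    """Read pairs implied by a depth, assuming every read has full length."""
    return bases_for_coverage(target_coverage, genome_size) / (2 * read_length)


if __name__ == "__main__":
    # (label, reported coverage, read length) for the two data sets in the text
    for label, cov, length in (("2x250 nt", 55, 250), ("2x154 nt", 65.6, 154)):
        gbp = bases_for_coverage(cov) / 1e9
        pairs = read_pairs_for_coverage(cov, length) / 1e6
        print(f"{label}: ~{gbp:.1f} Gbp, ~{pairs:.0f} million read pairs")
```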

Read mapping, variant calling, and variant annotation

All reads were mapped to the DH Pahang v2 reference genome sequence via BWA-MEM v0.7 (Li 2013) using -M to mark shorter split hits as secondary for downstream filtering. This read mapping was analyzed by the HaplotypeCaller of the Genome Analysis ToolKit (GATK) v3.8 (McKenna ; Van der Auwera ) to identify sequence variations in single nucleotides, called “single nucleotide variants” (SNVs), and also insertions/deletions (InDels). SNVs and InDels were called using the following filter rules in accordance with the GATK developer recommendation: ‘QD < 2.0’, ‘FS > 60.0’, and ‘MQ < 40’ for SNVs and ‘QD < 2.0’ and ‘FS > 200.0’ for InDels. An InDel length cutoff of 100 bp was applied to restrict downstream analyses to a set of high quality variants called from 2x250 nt reads. Only variants supported by at least five reads were kept. The resulting variant set was subjected to SnpEff (Cingolani ) to assign predictions about the functional impact to the variants in the set. Variants with disruptive effects were selected using a customized Python script as described earlier (Pucker ). The genome-wide distribution of SNVs and InDels was assessed based on previously developed scripts (Baasner et al. 2019). The length distribution of InDels inside coding sequences was compared to the length distribution of InDels outside coding sequences using a customized Python script (Pucker ).
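
The filter rules listed above can be expressed as a small predicate. The sketch below is not the authors' script and does not call GATK; it only restates the quoted thresholds in Python. Two points are assumptions: that InDels longer than 100 bp are discarded (the text gives the cutoff but not the direction), and that "supported by at least five reads" refers to the read depth at the variant position.

```python
def indel_length(ref, alt):
    """Length difference between reference and alternative allele."""
    return abs(len(ref) - len(alt))


def passes_filters(ref, alt, qd, fs, mq, depth,
                   max_indel_len=100, min_reads=5):
    """Return True if a variant survives the hard filters quoted in the text.

    qd, fs, mq correspond to the GATK annotations QD, FS, and MQ.
    depth is the number of reads at the position (assumed meaning of
    'supported by at least five reads').
    """
    if depth < min_reads:
        return False
    if qd < 2.0:  # 'QD < 2.0' removes both SNVs and InDels
        return False
    is_snv = len(ref) == 1 and len(alt) == 1
    if is_snv:
        # 'FS > 60.0' and 'MQ < 40' remove SNVs
        return fs <= 60.0 and mq >= 40.0
    if fs > 200.0:  # 'FS > 200.0' removes InDels
        return False
    # Assumption: InDels above the length cutoff are dropped
    return indel_length(ref, alt) <= max_indel_len
```

For example, `passes_filters("A", "G", qd=10, fs=5, mq=60, depth=20)` keeps a well-supported SNV, while the same call with `mq=30` rejects it.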

De novo genome assembly

Trimmomatic v0.38 (Bolger ) was applied to remove low quality sequences (i.e., four consecutive bases below Phred 15) and remaining adapter sequences (based on similarity to all known Illumina adapter sequences). Different sets of trimmed reads were subjected to SOAPdenovo2 (Luo ) for assembly using optimized parameters (Pucker et al. 2019) including avg_ins = 600, asm_flags = 3, rd_len_cutoff = 300, pair_num_cutoff = 3, and map_len = 100. K-mer sizes ranged from 67 to 127 in steps of 10. Resulting assemblies were evaluated using previously described criteria (Pucker et al. 2019) including general assembly statistics (e.g., number of contigs, assembly size, N50, and N90) and a BUSCO v3 (Simão ) assessment. Polishing was done by removing potential contaminations and adapters as described before (Pucker et al. 2019). The DH Pahang v2 assembly (D’Hont ; Martin ) was used in the contamination detection process to distinguish between bona fide banana contigs and sequences of unknown origin. Contigs with high sequence similarity to non-plant sequences were removed as previously described (Pucker et al. 2019). Remaining contigs were sorted based on the DH Pahang v2 reference genome sequence and concatenated to build pseudochromosomes to facilitate downstream analyses. The de novo Dwarf Cavendish assembly generated with a k-mer size of K = 127 was chosen as the representative assembly for reporting statistics.
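
The general assembly statistics named above (number of sequences, assembly size, N50, N90) can be computed from a list of sequence lengths. The following Python sketch uses the standard definition of N50 and N90; it is not the evaluation script referenced in the text, and the function names are illustrative. The k-mer list simply enumerates the range given above.

```python
# K-mer sizes from 67 to 127 in steps of 10, as stated in the text
KMER_SIZES = list(range(67, 128, 10))


def nx(lengths, fraction):
    """Return length L such that sequences of length >= L together
    contain at least `fraction` of all assembled bases."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    threshold = sum(lengths) * fraction
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length


def assembly_stats(lengths):
    """Summarize an assembly by the statistics used for evaluation."""
    return {
        "number": len(lengths),
        "size": sum(lengths),
        "max": max(lengths),
        "N50": nx(lengths, 0.5),
        "N90": nx(lengths, 0.9),
    }


if __name__ == "__main__":
    toy_lengths = [100, 80, 60, 40, 20]  # toy values, not real scaffolds
    print(KMER_SIZES)
    print(assembly_stats(toy_lengths))
```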

Data availability

Supplemental tables and figures are available at GSA figshare. File S1 shows the per chromosome read coverage distribution of Dwarf Cavendish reads. File S2 presents a comparison of SNVs in the duplicated segment on the long arm of chromosome 2 to all other SNVs in the genome. The higher read coverage at variants indicates a duplication of this region. File S3 lists the public genomic banana sequence read samples used for comparison against Dwarf Cavendish based on the DH Pahang reference. File S4 gives coverage plots of public genomic banana sequence read samples for comparison against Dwarf Cavendish based on the DH Pahang reference. Samples are: Musa acuminata, Musa acuminata AYP_BOSN_r1, Musa acuminata ssp. banksii, Musa acuminata ssp. burmannica, Musa acuminata Cavendish BaXiJiao, Musa acuminata Gros Michel, Musa acuminata ssp. malaccensis, Musa acuminata Sucrier (Pisang Mas), Musa acuminata Sucrier (Pisang Mas 1998-2307), Musa acuminata ssp. zebrina (blood banana), Musa balbisiana Pisang Klutuk Wulung, Musa itinerans, Musa schizocarpa. File S5 lists selected high impact variants between Dwarf Cavendish and DH Pahang with resulting effects predicted by SnpEff. Sequencing data sets were submitted to the European Nucleotide Archive (Study PRJEB33317 with the runs ERR3412983, ERR3412984, ERR3413471, ERR3413472, ERR3413473, ERR3413474). Python scripts are freely available on github (https://github.com/bpucker/banana). SNVs and InDels detected between the M. acuminata cultivars DH Pahang and Dwarf Cavendish are available in VCF format at https://doi.org/10.4119/unibi/2936278. The Dwarf Cavendish genome assembly is available in FASTA format at https://doi.org/10.4119/unibi/2937697. Supplemental material available at figshare: https://doi.org/10.25387/g3.9994643.

Results and Discussion

Structural variants

Mapping of M. acuminata Dwarf Cavendish reads against the DH Pahang v2 reference sequence assembly revealed several copy number variations in different parts of the genome (Figure 2, File S1). The most remarkable difference between the Dwarf Cavendish and Pahang genome sequences is the amplification of a continuous region of about 6.2 Mbp (length deduced from the reference genome) on the long arm of chromosome 2 (Figure 2, File S1, S2). An investigation of allele frequencies in the duplicated segment on chromosome 2 revealed that this duplication originates from a haplophase with high similarity to the reference sequence (Figure 3). Such a duplication was not observed in any of the other publicly available genomic sequencing data sets when compared against the DH Pahang v2 genome sequence (File S3, S4). At first glance, read mapping also suggests at least four large scale deletions in Dwarf Cavendish compared to Pahang v2 on chromosomes 2, 4, 5 and 7 (Figure 2). However, analysis of the underlying sequence revealed long stretches of ambiguous bases (Ns) at these positions in the Pahang assembly as the cause of these apparent low coverage regions.
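
The reasoning in this paragraph can be illustrated with a small coverage calculation: average the read depth in windows, skip reference positions that are N, and scale to the genome-wide median so that a typical window corresponds to three copies. This is a minimal Python sketch under stated assumptions, not the authors' analysis. The window size, the function names, and the use of the median as baseline are all illustrative choices; in a triploid background, a segment with one extra copy is expected at about 4/3 of the baseline depth.

```python
from statistics import median


def window_means(depths, ref_seq, window=100_000):
    """Mean depth per window, ignoring positions where the reference is N.

    Windows consisting only of Ns yield None instead of a misleading zero.
    """
    means = []
    for start in range(0, len(depths), window):
        usable = [
            d for d, base in zip(depths[start:start + window],
                                 ref_seq[start:start + window])
            if base not in "Nn"
        ]
        means.append(sum(usable) / len(usable) if usable else None)
    return means


def copy_number_estimates(means, base_ploidy=3):
    """Scale window means so the median window equals `base_ploidy` copies."""
    baseline = median(m for m in means if m is not None)
    return [None if m is None else base_ploidy * m / baseline for m in means]


if __name__ == "__main__":
    # Toy data: six normal windows, two with elevated depth, one N stretch
    depths = [30] * 24 + [40] * 8 + [0] * 4 + [30] * 4
    ref = "ACGT" * 8 + "NNNN" + "ACGT"
    print(copy_number_estimates(window_means(depths, ref, window=4)))
```

Masking the N stretches is what separates a real deletion from the technical artifact described above: an all-N window has no usable positions and is reported as missing rather than as zero coverage.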
Figure 2

Coverage distribution. Chromosomes are ordered by increasing number with the north end on the left hand side. Centromere positions (D’Hont ) are indicated by thin vertical gray lines. Mapping of M. acuminata Dwarf Cavendish reads against the DH Pahang v2 reference sequence assembly revealed a 6.2 Mbp tetraploid region on the long arm of chromosome 2 in Dwarf Cavendish (see enlarged box in the upper right). Apparent large scale deletions, indicated by regions with almost zero coverage, are technical artifacts caused by large stretches of ambiguous bases (Ns) in the Pahang assembly that cannot be covered by reads; these artifacts are marked with horizontal gray lines. Plots with higher per chromosome resolution data are presented in Supplementary File S1.

Figure 3

Allele frequency histogram. Visualization of mapping results of Dwarf Cavendish Illumina reads against the Pahang v2 reference sequence, used for the determination of SNV frequencies. The frequencies of the reference allele at SNV positions are displayed here, excluding those positions at which the Pahang reference sequence deviates from an invariant sequence position of Dwarf Cavendish. Black vertical lines indicate allele frequencies of 0.33, 0.5, and 0.66, respectively. SNVs in the duplicated segment on the long arm of chromosome 2 (magenta) are distinguished from all other variants (lime). Within the duplicated segment on chromosome 2, the frequency of the reference alleles is often 0.75 or 0.25 indicating the presence of three similar alleles and one diverged allele.

Ploidy of M. acuminata Dwarf Cavendish

Based on the allele frequencies of small sequence variants (SNVs and InDels), Dwarf Cavendish was identified as triploid (Figure 3). Many heterozygous variant positions display a frequency of the reference allele close to 0.33 or close to 0.66. This fits the expectation for two copies of the reference allele and one copy of a different allele, or vice versa. Deviation from the precise values is explained by random fluctuation of the read distribution at the given position. Since the peak around 0.66 for the frequency of the allele identical to the reference is substantially higher than the peak around 0.33, it is reasonable to assume that two haplophases are very similar to the reference. The third haplophase is the one that contains more deviating positions and differs more from the reference. It is likely that reads of the divergent haplophase are mapped at a slightly reduced rate. This might explain why the peak at 0.66 is slightly more than twice the size of the peak at 0.33. In the duplicated segment on chromosome 2 the allele frequency peaks are shifted to 0.25 and 0.75 (Figure 3), indicating a tetraploid region with three haplophases identical to the reference and one haplophase divergent from the Pahang reference. To test hypotheses regarding differences between the haplophases of the Dwarf Cavendish genome, a phased assembly of high continuity would be needed. Current long-read sequencing technologies like Single Molecule Real-Time (Pacific Biosciences) or nanopore sequencing (Oxford Nanopore Technologies) in principle allow the generation of such assemblies. However, successful phase separation currently requires tools like TrioCanu (Koren ), which use Mendelian relationships between parents and F1 (i.e., crosses) to assign reads to phases. Generating such data sets will be very difficult for banana and goes significantly beyond the scope of this study.
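
The expected peak positions follow directly from the copy number: with n copies of a locus, a heterozygous site can carry the reference allele on 1 to n-1 of them. The Python sketch below enumerates these frequencies and assigns an observed frequency to the nearest expected value. The function names and the read counts in the example are illustrative and not taken from the study.

```python
from fractions import Fraction


def expected_ref_allele_freqs(copies):
    """Reference allele frequencies possible at a heterozygous site
    when the locus is present in `copies` copies (copies >= 2)."""
    return [Fraction(k, copies) for k in range(1, copies)]


def ref_allele_freq(ref_reads, alt_reads):
    """Observed reference allele frequency from read counts."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads at this position")
    return ref_reads / total


def nearest_expected(freq, copies):
    """Expected frequency closest to the observed one."""
    return min(expected_ref_allele_freqs(copies),
               key=lambda e: abs(e - freq))


if __name__ == "__main__":
    print(expected_ref_allele_freqs(3))  # triploid background
    print(expected_ref_allele_freqs(4))  # duplicated segment
    # Hypothetical read counts at two variant positions
    print(nearest_expected(ref_allele_freq(40, 20), 3))
    print(nearest_expected(ref_allele_freq(45, 15), 4))
```

In the triploid case the only possible values are 1/3 and 2/3, which matches the two peaks described above; with four copies, 1/4, 1/2, and 3/4 become possible.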

Genome-wide distribution of small sequence variants

In total, 10,535,983 SNVs and 1,466,047 InDels were identified between the Dwarf Cavendish reads and the Pahang v2 assembly (see Data availability above). The genome-wide distribution of these variants is shown in Figure 4. As previously observed in other re-sequencing studies (Pucker ), the number of SNVs exceeds the number of InDels substantially. Moreover, InDels are more frequent outside of annotated coding regions. Inside coding regions, InDels show an increased proportion of lengths which are divisible by 3, a bias introduced due to the avoidance of frameshifts.
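
The frameshift argument can be checked by comparing the share of InDel lengths divisible by 3 inside and outside coding sequences. The following Python sketch shows one way to do this; it is not the customized script referenced in the Methods. The input layout (tuples of chromosome, position, and length), the interval dictionary, and the chromosome name in the example are hypothetical.

```python
def split_by_cds(indels, cds_intervals):
    """Split InDel lengths into those inside and outside coding sequences.

    indels: iterable of (chrom, pos, length) tuples
    cds_intervals: dict mapping chrom to a list of inclusive (start, end)
    """
    inside, outside = [], []
    for chrom, pos, length in indels:
        in_cds = any(start <= pos <= end
                     for start, end in cds_intervals.get(chrom, []))
        (inside if in_cds else outside).append(length)
    return inside, outside


def fraction_divisible_by_three(lengths):
    """Share of InDels whose length keeps the reading frame intact."""
    if not lengths:
        raise ValueError("no InDel lengths given")
    return sum(1 for n in lengths if n % 3 == 0) / len(lengths)


if __name__ == "__main__":
    # Toy data with a hypothetical chromosome name and coordinates
    cds = {"chr02": [(100, 200)]}
    indels = [("chr02", 150, 3), ("chr02", 160, 6), ("chr02", 170, 1),
              ("chr02", 500, 1), ("chr02", 600, 2), ("chr02", 700, 3),
              ("chr02", 800, 4)]
    inside, outside = split_by_cds(indels, cds)
    print(fraction_divisible_by_three(inside))
    print(fraction_divisible_by_three(outside))
```

The linear scan over intervals keeps the sketch short; for a whole genome, an interval tree or sorted intervals with binary search would be the more practical choice.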
Figure 4

Genome-wide distribution of small sequence variants. SNVs (green) and InDels (magenta) distinguish Dwarf Cavendish from Pahang. Variants were counted in 100 kb windows and are displayed on two different y-axes to allow maximal resolution (Pucker ).

SnpEff predicted 4,163 premature stop codons, 3,238 lost stop codons, and 8,065 frameshifts based on this variant set (File S5). Even given the larger genome size, these numbers are substantially higher than the numbers of high impact variants previously observed in re-sequencing studies of homozygous species (Pucker ; Xu ). One explanation could be the presence of three alleles for each locus, which allows compensation of disrupted alleles. Since banana plants are propagated vegetatively, breeders do not face inbreeding depression.

To facilitate wet lab applications like oligonucleotide design and validation of amplicons, the genome sequence of Dwarf Cavendish was assembled de novo. The assembly comprises 256,523 scaffolds with an N50 of 5.4 kb (Table 1). Differences between the three haplophases are one possible explanation for the low assembly contiguity. The assembly size slightly exceeds the size of one haplotype. Due to the low contiguity of this assembly and only marginally more than 50% complete BUSCOs (Benchmarking Universal Single-Copy Orthologs) (Simão ), annotation was omitted. Nevertheless, we successfully used the produced genome assembly for primer design and detection of small sequence variants.
Table 1

M. acuminata Dwarf Cavendish de novo genome assembly statistics

Parameter                 Value
Number of scaffolds       256,523
Maximal scaffold length   240,314 bp
Assembly size             963,409,601 bp (0.96 Gbp)
GC content                38.78%
N50                       5,432 bp
N90                       1,592 bp

References (10 of 17 listed)

1.  The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data.

Authors:  Aaron McKenna; Matthew Hanna; Eric Banks; Andrey Sivachenko; Kristian Cibulskis; Andrew Kernytsky; Kiran Garimella; David Altshuler; Stacey Gabriel; Mark Daly; Mark A DePristo
Journal:  Genome Res       Date:  2010-07-19       Impact factor: 9.043

2.  A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3.

Authors:  Pablo Cingolani; Adrian Platts; Le Lily Wang; Melissa Coon; Tung Nguyen; Luan Wang; Susan J Land; Xiangyi Lu; Douglas M Ruden
Journal:  Fly (Austin)       Date:  2012 Apr-Jun       Impact factor: 2.160

3.  Multidisciplinary perspectives on banana (Musa spp.) domestication.

Authors:  Xavier Perrier; Edmond De Langhe; Mark Donohue; Carol Lentfer; Luc Vrydaghs; Frédéric Bakry; Françoise Carreel; Isabelle Hippolyte; Jean-Pierre Horry; Christophe Jenny; Vincent Lebot; Ange-Marie Risterucci; Kodjo Tomekpe; Hugues Doutrelepont; Terry Ball; Jason Manwaring; Pierre de Maret; Tim Denham
Journal:  Proc Natl Acad Sci U S A       Date:  2011-07-05       Impact factor: 11.205

4.  The banana (Musa acuminata) genome and the evolution of monocotyledonous plants.

Authors:  Angélique D'Hont; France Denoeud; Jean-Marc Aury; Franc-Christophe Baurens; Françoise Carreel; Olivier Garsmeur; Benjamin Noel; Stéphanie Bocs; Gaëtan Droc; Mathieu Rouard; Corinne Da Silva; Kamel Jabbari; Céline Cardi; Julie Poulain; Marlène Souquet; Karine Labadie; Cyril Jourda; Juliette Lengellé; Marguerite Rodier-Goud; Adriana Alberti; Maria Bernard; Margot Correa; Saravanaraj Ayyampalayam; Michael R Mckain; Jim Leebens-Mack; Diane Burgess; Mike Freeling; Didier Mbéguié-A-Mbéguié; Matthieu Chabannes; Thomas Wicker; Olivier Panaud; Jose Barbosa; Eva Hribova; Pat Heslop-Harrison; Rémy Habas; Ronan Rivallan; Philippe Francois; Claire Poiron; Andrzej Kilian; Dheema Burthia; Christophe Jenny; Frédéric Bakry; Spencer Brown; Valentin Guignon; Gert Kema; Miguel Dita; Cees Waalwijk; Steeve Joseph; Anne Dievart; Olivier Jaillon; Julie Leclercq; Xavier Argout; Eric Lyons; Ana Almeida; Mouna Jeridi; Jaroslav Dolezel; Nicolas Roux; Ange-Marie Risterucci; Jean Weissenbach; Manuel Ruiz; Jean-Christophe Glaszmann; Francis Quétier; Nabila Yahiaoui; Patrick Wincker
Journal:  Nature       Date:  2012-08-09       Impact factor: 49.962

5.  From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline.

Authors:  Geraldine A Van der Auwera; Mauricio O Carneiro; Christopher Hartl; Ryan Poplin; Guillermo Del Angel; Ami Levy-Moonshine; Tadeusz Jordan; Khalid Shakir; David Roazen; Joel Thibault; Eric Banks; Kiran V Garimella; David Altshuler; Stacey Gabriel; Mark A DePristo
Journal:  Curr Protoc Bioinformatics       Date:  2013

6.  Origins of agriculture at Kuk Swamp in the highlands of New Guinea.

Authors:  T P Denham; S G Haberle; C Lentfer; R Fullagar; J Field; M Therin; N Porch; B Winsborough
Journal:  Science       Date:  2003-06-19       Impact factor: 47.728

7.  Whole genome sequencing of a banana wild relative Musa itinerans provides insights into lineage-specific diversification of the Musa genus.

Authors:  Wei Wu; Yu-Lan Yang; Wei-Ming He; Mathieu Rouard; Wei-Ming Li; Meng Xu; Nicolas Roux; Xue-Jun Ge
Journal:  Sci Rep       Date:  2016-08-17       Impact factor: 4.379

8.  A De Novo Genome Sequence Assembly of the Arabidopsis thaliana Accession Niederzenz-1 Displays Presence/Absence Variation and Strong Synteny.

Authors:  Boas Pucker; Daniela Holtgräwe; Thomas Rosleff Sörensen; Ralf Stracke; Prisca Viehöver; Bernd Weisshaar
Journal:  PLoS One       Date:  2016-10-06       Impact factor: 3.240

9.  SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler.

Authors:  Ruibang Luo; Binghang Liu; Yinlong Xie; Zhenyu Li; Weihua Huang; Jianying Yuan; Guangzhu He; Yanxiang Chen; Qi Pan; Yunjie Liu; Jingbo Tang; Gengxiong Wu; Hao Zhang; Yujian Shi; Yong Liu; Chang Yu; Bo Wang; Yao Lu; Changlei Han; David W Cheung; Siu-Ming Yiu; Shaoliang Peng; Zhu Xiaoqian; Guangming Liu; Xiangke Liao; Yingrui Li; Huanming Yang; Jian Wang; Tak-Wah Lam; Jun Wang
Journal:  Gigascience       Date:  2012-12-27       Impact factor: 6.524

10.  De novo assembly of haplotype-resolved genomes with trio binning.

Authors:  Sergey Koren; Arang Rhie; Brian P Walenz; Alexander T Dilthey; Derek M Bickhart; Sarah B Kingan; Stefan Hiendleder; John L Williams; Timothy P L Smith; Adam M Phillippy
Journal:  Nat Biotechnol       Date:  2018-10-22       Impact factor: 54.908
