
Construction of Brassica A and C genome-based ordered pan-transcriptomes for use in rapeseed genomic research.

Zhesi He¹, Feng Cheng², Yi Li¹, Xiaowu Wang², Isobel A P Parkin³, Boulos Chalhoub⁴, Shengyi Liu⁵, Ian Bancroft¹.

Abstract

This data article reports the establishment of the first pan-transcriptome resources for the Brassica A and C genomes. These were developed using existing coding DNA sequence (CDS) gene models from the now-published Brassica oleracea TO1000 and Brassica napus Darmor-bzh genome sequence assemblies representing the chromosomes of these species, along with preliminary CDS models from an updated Brassica rapa Chiifu genome sequence assembly. The B. rapa genome sequence scaffolds required splitting and re-ordering to match the expected genome organisation based on a high density SNP linkage map, but the B. oleracea assembly was used unchanged. The resulting B. rapa (A genome) pseudomolecules contained 47,656 ordered CDS models and the B. oleracea (C genome) pseudomolecules contained 54,766 ordered CDS models. Interpolation of B. napus CDS models not already represented by orthologues resulted in 52,790 and 63,308 ordered CDS models in the A and C pan-transcriptomes, an increase of 13,676 overall. Comparison of the organisation of this resource with publicly available genome sequences for B. napus showed excellent consistency for the B. napus Darmor-bzh resource, but more breakdown of collinearity for the B. napus ZS11 resource. CDS datasets comprising the pan-transcriptomes are available with this article (B. rapa) or from public repositories (B. oleracea and B. napus).
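As a quick consistency check on the counts quoted above, the following minimal Python sketch recomputes the overall increase from the per-genome totals. The variable names are illustrative and are not taken from the scripts accompanying the article.

```python
# Ordered CDS counts quoted in the abstract.
progenitor = {"A": 47_656, "C": 54_766}  # B. rapa and B. oleracea CDS models
pan = {"A": 52_790, "C": 63_308}         # pan-transcriptome CDS models

# B. napus-derived additions, per genome and overall.
added = {genome: pan[genome] - progenitor[genome] for genome in pan}
total_added = sum(added.values())

print(added)        # per-genome additions
print(total_added)  # overall increase
```

The per-genome differences sum to the overall increase of 13,676 stated in the abstract.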


Year: 2015 PMID: 26217816 PMCID: PMC4510581 DOI: 10.1016/j.dib.2015.06.016

Source DB: PubMed Journal: Data Brief ISSN: 2352-3409

Specifications Table

Value of the data

Provides an updated pseudomolecule description, with genome sequence scaffolds from B. rapa and B. oleracea representing genome organisation in B. napus.
Provides, for the first time, pan-transcriptome resources for use in Brassica species containing the A and/or C genomes.
Provides insights into the extent of gene content variation between the Brassica A and C genomes as represented in an allopolyploid and its diploid progenitors.
Provides a hypothetical gene order resource for the Brassica A and C pan-genomes for use in genome evolution studies and Associative Transcriptomics.

Experimental design, materials and methods

Transcriptome-based molecular marker systems have been developed and deployed with great success in the crop species B. napus for both genome organisation studies [1] and association genetics [2]. These studies exploit mRNAseq data, which must be mapped to a suitable transcriptome reference sequence for single nucleotide polymorphism (SNP) identification and transcript quantification. The first-generation approach used unigene assemblies as the reference sequences [3], which permitted some resolution of the contributions to the transcriptome of homoeologous gene pairs [4]. However, the genome sequences reported for B. napus Darmor-bzh indicate that sequence exchanges between the constituent genomes of this allotetraploid species (A genome from an unknown B. rapa and C genome from an unknown B. oleracea) may occur very frequently [5], making it imperative that the genome of origin of any given gene be determined as clearly as possible. The most reliable way of achieving this is to base resources primarily on those derived from the constituent genomes in the diploid progenitors of B. napus, i.e. B. rapa and B. oleracea. As an improvement on the existing resource based on unigenes assembled across Brassica species [3,6], we therefore aimed to develop a new transcriptome reference, based on coding DNA sequence (CDS) gene models derived primarily from the Brassica A and C genomes as represented in the progenitor species. As the B. napus genome sequence annotation identified many gene models without orthologues in B. rapa and B. oleracea, we further aimed to interpolate those B. napus-specific CDS models, thus producing pan-transcriptome resources for the Brassica A and C genomes as represented by the union of orthologous genes of B. rapa, B. oleracea and B. napus.

The version 2 B. rapa Chiifu genome sequence scaffolds represent a major advance on the published version 1 sequences [7] in that they provide more comprehensive coverage of the genome, with aggregate scaffold size increasing from 248 Mb to 370 Mb. A preliminary annotation of the genome sequence scaffolds that had been organised into chromosomes was undertaken, essentially as described for the version 1 genome sequences [7]. Briefly, Genscan and Augustus, with parameters established using Arabidopsis thaliana gene models, were used to perform de novo gene prediction in the new B. rapa genome assembly after masking the Class I and Class II transposable elements. Predicted genes with CDS models shorter than 150 bp were filtered out. We further performed homology-based gene prediction by aligning A. thaliana, Carica papaya, Populus trichocarpa, Vitis vinifera and Oryza sativa protein sequences to the B. rapa genome: TBLASTN was used for fast alignment (threshold e-value 1E−5), then Genewise for precise alignment. Additionally, we assembled the Brassica ESTs downloaded from NCBI using PASA and aligned them to the B. rapa genome using BLAT. Because fragmented exons in EST data might lead to false results, we filtered out alignments with gaps (introns) spanning more than 10 kb. We then ran GLEAN to merge the gene sets generated from the de novo and homology-based predictions, using mRNA-Seq data as supporting evidence. Finally, the B. rapa gene set was aligned to the TE protein database of Repbase, and those hits with e-value>1E−5 and coverage≥50% were filtered out. The remaining gene models were reported as Brassica gene set Version 2.0 (Additional file 1).

These CDS models were then used in BLAST sequence similarity searches to identify the highest-scoring significant hit (threshold e-value 1E−30) for each CDS model in both the version 2 B. rapa Chiifu genome sequence scaffolds and the A genome pseudomolecules reported previously [6], which were based on the version 1 B. rapa Chiifu genome sequence [7] reordered relative to the B. napus genome via high-density transcriptome SNP linkage mapping [1]. This enabled the identification of chimeric scaffolds in the version 2 assembly that could be split (Additional file 2) and re-organised (Additional file 3) to form pseudomolecules representative of the organisation of the Brassica A genome. The CDS models from the B. oleracea TO1000 [8] genome sequence were similarly used to assess collinearity with the C genome pseudomolecules reported previously [6] and were found to be in excellent agreement, so the B. oleracea TO1000 assembly was adopted unaltered as the Brassica C genome pseudomolecule resource.

The B. rapa Chiifu CDS models, along with CDS models from the published B. oleracea TO1000 genome sequence [8], were mapped onto the respective genome sequence pseudomolecules using BLAST to identify the highest-scoring significant hit (threshold e-value 1E−30). This resulted in the mapping and ordering of 47,656 B. rapa CDS models to the A genome and 54,766 B. oleracea CDS models to the C genome.

A total of 101,040 CDS models were annotated in the B. napus Darmor-bzh genome [5]. Of these, the 80,927 CDS models that had been anchored to the 19 B. napus pseudomolecules were mapped onto the respective (B. rapa- and B. oleracea-based) genome sequence pseudomolecules by BLAST (threshold e-value 1E−30). B. napus CDS models mapping redundantly with CDS models derived from B. rapa and B. oleracea (threshold e-value 1E−30) were excluded, resulting in the addition of 2165 and 3032 CDS models to the A and C genomes, respectively.

Finally, CDS models from the B. napus Darmor-bzh genome sequence that did not have significant (threshold e-value 1E−30) BLAST hits in the (B. rapa- and B. oleracea-based) genome sequence pseudomolecules were interpolated based on the positions of flanking gene models that did map. This was done by combining the sorted location of the B. napus Darmor-bzh CDS models on the B. napus Darmor-bzh chromosomes with the mapped location of flanking genes on the B. rapa- or B. oleracea-based pseudomolecules, using an R script (Additional file 4) to perform the following: (1) sort B. napus CDS models by B. napus Darmor-bzh pseudomolecule, then by their hit locations on the B. rapa- or B. oleracea-based pseudomolecules; (2) interpolate CDS models (or runs of adjacent CDS models) that have no hit on the B. rapa- or B. oleracea-based pseudomolecules onto those pseudomolecules, with a three-digit suffix starting from the boundary of the point of insertion. When the boundaries are not in the right order, the interpolation starts from the boundary number closest to the mean of the nearest 10 neighbours of the run of CDS models. If there is no mapping in the 10 nearest neighbours, the interpolation starts from the minimum of the boundary numbers. This resulted in the addition of 2969 and 5510 further CDS models to the A and C genomes, respectively.

The final AC pan-transcriptome resource therefore comprises a total of 116,098 hypothetically ordered CDS models (Additional files 5, 6 and 7), 52,790 in the Brassica A genome and 63,308 in the Brassica C genome. This represents an increase of 35,171 over the 80,927 CDS models annotated in the published B. napus Darmor-bzh pseudomolecules, 15,058 over the full complement of gene models for B. napus, which includes the 20,113 in sequence scaffolds not incorporated into the B. napus Darmor-bzh pseudomolecules [5], and 13,676 over the number identified in the B. rapa and B. oleracea pseudomolecules.

The order of CDS models in the pan-transcriptome was compared with the order of orthologous sequences in two publicly available B. napus genome sequence resources.
This was conducted by BLAST sequence similarity search to identify the highest-scoring significant hit (threshold e-value 1E−30) for each CDS model in the pan-transcriptome in each of the B. napus Darmor-bzh and B. napus ZS11 chromosome assemblies. Of the aggregate 116,098 CDS models in the pan-transcriptome, 107,292 (92.4%) returned significant hits (threshold e-value 1E−30) in the B. napus Darmor-bzh assembly and 99,395 (85.6%) returned significant hits in the B. napus ZS11 assembly. The order of these best similarity matches in each resource is illustrated in Fig. 1.

The inferred gene order in the pan-transcriptome and the B. napus Darmor-bzh genome assembly shows excellent collinearity. A small number of local rearrangements can be observed in regions with relatively high densities of non-collinear matches, possibly corresponding to paracentromeric regions. In addition, two prominent segments shadowing the main collinearity diagonal can be observed amongst the background of CDS models mapping to non-orthologous positions. Such shadows have been observed in previous studies [6] and were shown to correspond to sequences missing from the genome sequence resource, with consequent mapping of sequences to one of the two paralogous segments of these paleohexaploid genomes.

The inferred gene order in the pan-transcriptome and the B. napus ZS11 genome assembly shows extensive collinearity, but with more disruption by rearrangements than was observed with the B. napus Darmor-bzh resource. Linkage group C6 is also presented in the B. napus ZS11 genome sequence resource in the opposite orientation to the current reference genetic map for the Brassica C genome.

These analyses, which together indicate extensive collinearity of the Brassica A and C genomes as represented in the allotetraploid B. napus and representatives of its progenitors, are also consistent with early observations of extensive collinearity, but with some divergence in gene content between orthologous regions of Brassica genomes, including both loss and mobility of coding sequences [9].
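The interpolation rule described in the methods above (runs of unmapped B. napus CDS models placed relative to flanking mapped models) can be sketched in Python as follows. This is an illustrative reading of the text, not a port of the R script in Additional file 4. Several details are assumptions: positions are treated as integer order indices, the three-digit suffix is written after a dot, the "10 nearest neighbours" are taken as five models on each side of the run excluding the two boundary models, and the function names are hypothetical.

```python
from statistics import mean
from typing import List, Optional


def choose_anchor(hits: List[Optional[int]], start: int, end: int,
                  window: int = 10) -> int:
    """Pick the order index from which the unmapped run hits[start:end] is numbered."""
    left = hits[start - 1] if start > 0 else None
    right = hits[end] if end < len(hits) else None
    bounds = [b for b in (left, right) if b is not None]
    if not bounds:
        raise ValueError("no mapped CDS model on this chromosome")
    # A single boundary, or boundaries in the expected order: use the first one.
    if len(bounds) == 1 or left <= right:
        return bounds[0]
    # Boundaries out of order: consult nearby mapped models (assumed window split).
    half = window // 2
    nearby = hits[max(0, start - 1 - half):start - 1] + hits[end + 1:end + 1 + half]
    mapped = [h for h in nearby if h is not None]
    if not mapped:
        return min(bounds)  # fallback: minimum of the boundary numbers
    centre = mean(mapped)
    return min(bounds, key=lambda b: abs(b - centre))


def interpolate(hits: List[Optional[int]], window: int = 10) -> List[str]:
    """Label CDS models sorted along one B. napus chromosome.

    hits[i] is the order index of the i-th model on the progenitor-based
    pseudomolecule, or None if it had no significant hit.
    """
    labels: List[str] = []
    i = 0
    while i < len(hits):
        if hits[i] is not None:
            labels.append(str(hits[i]))
            i += 1
            continue
        j = i
        while j < len(hits) and hits[j] is None:
            j += 1  # extend the run of unmapped models
        anchor = choose_anchor(hits, i, j, window)
        labels.extend(f"{anchor}.{k:03d}" for k in range(1, j - i + 1))
        i = j
    return labels


print(interpolate([100, 101, None, None, 104, 105]))
print(interpolate([10, 11, 12, 50, None, 13, 14, 15]))
```

In the first call the two unmapped models fall between boundaries that are in order, so they are numbered from the left boundary. In the second call the boundaries (50 and 13) are out of order, so the run is numbered from the boundary closest to the mean of the surrounding mapped positions.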

Fig. 1

Collinearity of ordered pan-transcriptomes and the genome sequences of B. napus Darmor-bzh and B. napus ZS11. The positions of best sequence matches in the B. napus chromosome assemblies are plotted for CDS models with significant similarity matches (threshold e-value 1E−30) in the B. napus Darmor-bzh assembly and B. napus ZS11 assembly.
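The best-match selection underlying Fig. 1 can be illustrated with a short Python sketch that keeps, for each CDS model, the highest-scoring hit passing the e-value threshold of 1E−30. It assumes the searches were exported in the standard 12-column BLAST tabular layout (query, subject, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, e-value, bit score), which the article does not state. It also assumes that a hit exactly at the threshold counts as significant. The identifiers in the example are hypothetical.

```python
import csv
import io

E_VALUE_THRESHOLD = 1e-30  # threshold quoted in the article


def best_hits(tabular_text, max_evalue=E_VALUE_THRESHOLD):
    """Return {query: (subject, start position, bit score)} for the best significant hit."""
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        query, subject = row[0], row[1]
        sstart, send = int(row[8]), int(row[9])
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > max_evalue:
            continue  # not significant at the chosen threshold
        if query not in best or bitscore > best[query][2]:
            # min() gives the leftmost coordinate for hits on either strand
            best[query] = (subject, min(sstart, send), bitscore)
    return best


# Hypothetical hits: two for cdsA (one per genome) and a weak one for cdsB.
example = "\n".join([
    "cdsA\tchrA01\t99.0\t900\t9\t0\t1\t900\t5000\t5899\t1e-180\t1500",
    "cdsA\tchrC01\t92.0\t900\t72\t0\t1\t900\t7000\t7899\t1e-120\t1100",
    "cdsB\tchrA01\t80.0\t120\t24\t0\t1\t120\t9120\t9001\t1e-12\t80.5",
])
hits = best_hits(example)
print(hits)
```

Plotting each retained position against the order of the corresponding CDS model in the pan-transcriptome would give a collinearity plot of the kind shown in the figure; CDS models with no retained hit, such as cdsB here, would simply be absent from it.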

Subject area	Biology
More specific subject area	Plant genome organisation
Type of data	CDS gene model sequences for the A genome, in FASTA format. Tables (MS Excel spreadsheets) providing the A genome pseudomolecule specification based on genome sequence scaffolds, and the inferred order and anchoring positions of CDS models in the A and C genome pseudomolecules. A figure illustrating the collinearity of the ordered pan-transcriptome and two genome sequences reported for B. napus.
How data was acquired	CDS gene model sequences for the A genome were developed as part of the reported work. Genome sequence scaffolds and other CDS data were obtained from the groups generating them prior to publication.
Data format	The data accompanying this article are provided as text files (for B. rapa CDS models and R scripts) and MS Excel spreadsheets providing CDS and scaffold identifiers and sequence similarity coordinates.
Experimental factors	n/a
Experimental features	CDS modelling was undertaken using V2.0 B. rapa genome sequence scaffolds. A previously-reported set of Brassica A genome pseudomolecules was used to produce improved pseudomolecules derived from an updated B. rapa genome assembly in order to represent the organisation of the A genome in B. napus. Integration and interpolation of gene models called only in a B. napus genome sequence was undertaken, resulting in the establishment of a pan-transcriptome resource for the Brassica A and C genomes. Collinearity analysis with public B. napus genome sequences was undertaken, based on BLAST similarity hits of CDS models, to compare the order of genes in the pan-transcriptome resource with that of their orthologues in two published B. napus genome sequences.
Data source location	SRA, NCBI, ENA
Data accessibility	All genome sequence datasets were provided for analysis prior to publication, but are now available. The B. napus Darmor-bzh assembly is available at ENA (European Nucleotide Archive), in the WGS section for contigs (accession numbers CCCW010000001 to CCCW010044187) and the CON section for scaffolds, chromosomes, and annotation (accession numbers LK031787 to LK052685). The B. napus ZS11 assembly is available at http://www.ncbi.nlm.nih.gov/Traces/wgs/?val=JMKK01#. The B. oleracea assembly is available via Sequence Read Archive accession number PRJNA158027. The B. rapa version 2 assembly is in the process of publication and in the meantime is available from Xiaowu Wang (wangxiaowu@caas.cn).

References

1. Associative transcriptomics of traits in the polyploid crop species Brassica napus.

Authors: Andrea L Harper; Martin Trick; Janet Higgins; Fiona Fraser; Leah Clissold; Rachel Wells; Chie Hattori; Peter Werner; Ian Bancroft
Journal: Nat Biotechnol Date: 2012-08

2. Dissecting the genome of the polyploid crop oilseed rape by transcriptome sequencing.

Authors: Ian Bancroft; Colin Morgan; Fiona Fraser; Janet Higgins; Rachel Wells; Leah Clissold; David Baker; Yan Long; Jinling Meng; Xiaowu Wang; Shengyi Liu; Martin Trick
Journal: Nat Biotechnol Date: 2011-07-31

3. Comparative analysis between homoeologous genome segments of Brassica napus and its progenitor species reveals extensive sequence-level divergence.

Authors: Foo Cheung; Martin Trick; Nizar Drou; Yong Pyo Lim; Jee-Young Park; Soo-Jin Kwon; Jin-A Kim; Rod Scott; J Chris Pires; Andrew H Paterson; Chris Town; Ian Bancroft
Journal: Plant Cell Date: 2009-07-14

4. Single nucleotide polymorphism (SNP) discovery in the polyploid Brassica napus using Solexa transcriptome sequencing.

Authors: Martin Trick; Yan Long; Jinling Meng; Ian Bancroft
Journal: Plant Biotechnol J Date: 2009-01-21

5. The genome of the mesopolyploid crop species Brassica rapa.

Authors: Xiaowu Wang; Hanzhong Wang; Jun Wang; Rifei Sun; Jian Wu; Shengyi Liu; Yinqi Bai; Jeong-Hwan Mun; Ian Bancroft; Feng Cheng; Sanwen Huang; Xixiang Li; Wei Hua; Junyi Wang; Xiyin Wang; Michael Freeling; J Chris Pires; Andrew H Paterson; Boulos Chalhoub; Bo Wang; Alice Hayward; Andrew G Sharpe; Beom-Seok Park; Bernd Weisshaar; Binghang Liu; Bo Li; Bo Liu; Chaobo Tong; Chi Song; Christopher Duran; Chunfang Peng; Chunyu Geng; Chushin Koh; Chuyu Lin; David Edwards; Desheng Mu; Di Shen; Eleni Soumpourou; Fei Li; Fiona Fraser; Gavin Conant; Gilles Lassalle; Graham J King; Guusje Bonnema; Haibao Tang; Haiping Wang; Harry Belcram; Heling Zhou; Hideki Hirakawa; Hiroshi Abe; Hui Guo; Hui Wang; Huizhe Jin; Isobel A P Parkin; Jacqueline Batley; Jeong-Sun Kim; Jérémy Just; Jianwen Li; Jiaohui Xu; Jie Deng; Jin A Kim; Jingping Li; Jingyin Yu; Jinling Meng; Jinpeng Wang; Jiumeng Min; Julie Poulain; Jun Wang; Katsunori Hatakeyama; Kui Wu; Li Wang; Lu Fang; Martin Trick; Matthew G Links; Meixia Zhao; Mina Jin; Nirala Ramchiary; Nizar Drou; Paul J Berkman; Qingle Cai; Quanfei Huang; Ruiqiang Li; Satoshi Tabata; Shifeng Cheng; Shu Zhang; Shujiang Zhang; Shunmou Huang; Shusei Sato; Silong Sun; Soo-Jin Kwon; Su-Ryun Choi; Tae-Ho Lee; Wei Fan; Xiang Zhao; Xu Tan; Xun Xu; Yan Wang; Yang Qiu; Ye Yin; Yingrui Li; Yongchen Du; Yongcui Liao; Yongpyo Lim; Yoshihiro Narusaka; Yupeng Wang; Zhenyi Wang; Zhenyu Li; Zhiwen Wang; Zhiyong Xiong; Zhonghua Zhang
Journal: Nat Genet Date: 2011-08-28

6. Plant genetics. Early allopolyploid evolution in the post-Neolithic Brassica napus oilseed genome.

Authors: Boulos Chalhoub; France Denoeud; Shengyi Liu; Isobel A P Parkin; Haibao Tang; Xiyin Wang; Julien Chiquet; Harry Belcram; Chaobo Tong; Birgit Samans; Margot Corréa; Corinne Da Silva; Jérémy Just; Cyril Falentin; Chu Shin Koh; Isabelle Le Clainche; Maria Bernard; Pascal Bento; Benjamin Noel; Karine Labadie; Adriana Alberti; Mathieu Charles; Dominique Arnaud; Hui Guo; Christian Daviaud; Salman Alamery; Kamel Jabbari; Meixia Zhao; Patrick P Edger; Houda Chelaifa; David Tack; Gilles Lassalle; Imen Mestiri; Nicolas Schnel; Marie-Christine Le Paslier; Guangyi Fan; Victor Renault; Philippe E Bayer; Agnieszka A Golicz; Sahana Manoli; Tae-Ho Lee; Vinh Ha Dinh Thi; Smahane Chalabi; Qiong Hu; Chuchuan Fan; Reece Tollenaere; Yunhai Lu; Christophe Battail; Jinxiong Shen; Christine H D Sidebottom; Xinfa Wang; Aurélie Canaguier; Aurélie Chauveau; Aurélie Bérard; Gwenaëlle Deniot; Mei Guan; Zhongsong Liu; Fengming Sun; Yong Pyo Lim; Eric Lyons; Christopher D Town; Ian Bancroft; Xiaowu Wang; Jinling Meng; Jianxin Ma; J Chris Pires; Graham J King; Dominique Brunel; Régine Delourme; Michel Renard; Jean-Marc Aury; Keith L Adams; Jacqueline Batley; Rod J Snowdon; Jorg Tost; David Edwards; Yongming Zhou; Wei Hua; Andrew G Sharpe; Andrew H Paterson; Chunyun Guan; Patrick Wincker
Journal: Science Date: 2014-08-21

7. Use of mRNA-seq to discriminate contributions to the transcriptome from the constituent genomes of the polyploid crop species Brassica napus.

Authors: Janet Higgins; Andreas Magusin; Martin Trick; Fiona Fraser; Ian Bancroft
Journal: BMC Genomics Date: 2012-06-15

8. Collinearity analysis of Brassica A and C genomes based on an updated inferred unigene order.

Authors: Ian Bancroft; Fiona Fraser; Colin Morgan; Martin Trick
Journal: Data Brief Date: 2015-02-10

9. Transcriptome and methylome profiling reveals relics of genome dominance in the mesopolyploid Brassica oleracea.

Authors: Isobel A P Parkin; Chushin Koh; Haibao Tang; Stephen J Robinson; Sateesh Kagale; Wayne E Clarke; Chris D Town; John Nixon; Vivek Krishnakumar; Shelby L Bidwell; France Denoeud; Harry Belcram; Matthew G Links; Jérémy Just; Carling Clarke; Tricia Bender; Terry Huebert; Annaliese S Mason; J Chris Pires; Guy Barker; Jonathan Moore; Peter G Walley; Sahana Manoli; Jacqueline Batley; David Edwards; Matthew N Nelson; Xiyin Wang; Andrew H Paterson; Graham King; Ian Bancroft; Boulos Chalhoub; Andrew G Sharpe
Journal: Genome Biol Date: 2014-06-10
