Glaucia Mendes Souza, Marie-Anne Van Sluys, Carolina Gimiliani Lembke, Hayan Lee, Gabriel Rodrigues Alves Margarido, Carlos Takeshi Hotta, Jonas Weissmann Gaiarsa, Augusto Lima Diniz, Mauro de Medeiros Oliveira, Sávio de Siqueira Ferreira, Milton Yutaka Nishiyama, Felipe Ten-Caten, Geovani Tolfo Ragagnin, Pablo de Morais Andrade, Robson Francisco de Souza, Gianlucca Gonçalves Nicastro, Ravi Pandya, Changsoo Kim, Hui Guo, Alan Mitchell Durham, Monalisa Sampaio Carneiro, Jisen Zhang, Xingtan Zhang, Qing Zhang, Ray Ming, Michael C Schatz, Bob Davidson, Andrew H Paterson, David Heckerman.
Abstract
BACKGROUND: Sugarcane cultivars are polyploid interspecific hybrids with giant genomes, typically carrying 10-13 sets of chromosomes from 2 Saccharum species. The ploidy, hybridity, and size of the genome, estimated at >10 Gb, pose a challenge for sequencing.
Keywords: allele; bioenergy; biomass; genome; polyploid
Year: 2019 PMID: 31782791 PMCID: PMC6884061 DOI: 10.1093/gigascience/giz129
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Genome sequencing: technology and assembly details and gene prediction features
| Description | Genomic DNA | BAC clones |
|---|---|---|
| Sequencing data | 26 Illumina synthetic long-read libraries | Single-end Roche 454 of BAC library clones |
| Total sequence (Gb) | 19 | 6.6 |
| Genome coverage | 1.9× | 0.66× |
| Read length minimum/maximum/mean (bp) | 1,500/22,904/4,930 | 8/2,611/368.5 |
| Assembler software | Celera Assembler (Overlap Graph) | PHRAP/CONSED |
| Total reads used in assembly | 3,857,849 | 17,894,306 |
| Total assembly size | 4.26 Gb | 49.6 Mb |
| Number of unitigs/contigs + singletons | 450,609 | 463 |
| Contigs length minimum/maximum/mean (bp) | 1,500/468,011/9,452 | 11,723/235,533/107,129 |
| NG50 (bp) | 41,394 | 109,618 |
| N50 (bp) | 13,157 | N/A |
| No. genes | 373,869 | 3,550 |
| No. transcripts | 374,774 | |
| No. exons | 1,035,764 | 13,132 |
| Mean GC content (%) | 43.20 | 44.99 |
| Mean No. exons per gene | 2.8 | 3.7 |
| Mean exon size (bp) | 291 | 271.8 |
| Median exon size (bp) | 171 | 154 |
| Mean intron size (bp) | 352.6 | 539.2 |
| Median intron size (bp) | 132 | 139 |
| Mean gene size (bp) with UTR | 1,437.80 | 2,429.20 |
| Median gene size (bp) with UTR | 806 | 1,260.50 |
| Mean gene size (bp) without UTR | 1,318.80 | 2,351.30 |
| Median gene size (bp) without UTR | 771 | 1,199.50 |
| Mean gene density (kb per gene) | 11.4 | 14 |
GC: guanine-cytosine; UTR: untranslated region.
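Several rows of the table can be cross-checked against one another with simple arithmetic. The sketch below is only an illustration: the input numbers are copied from the table, while the 10 Gb genome size is an assumption taken from the lower bound quoted in the abstract, and the names `ASSEMBLIES`, `ASSUMED_GENOME_GB`, and `derived_stats` are hypothetical.

```python
# Values copied from the table above (genomic DNA and BAC clone columns).
ASSEMBLIES = {
    "genomic": {
        "total_sequence_gb": 19.0,
        "assembly_bp": 4.26e9,
        "genes": 373_869,
        "exons": 1_035_764,
    },
    "bac": {
        "total_sequence_gb": 6.6,
        "assembly_bp": 49.6e6,
        "genes": 3_550,
        "exons": 13_132,
    },
}

# Assumption: a genome of about 10 Gb, the lower bound given in the abstract.
ASSUMED_GENOME_GB = 10.0


def derived_stats(row, genome_gb=ASSUMED_GENOME_GB):
    """Recompute coverage, exons per gene, and gene density for one column."""
    return {
        # Sequenced bases divided by the assumed genome size.
        "coverage_x": row["total_sequence_gb"] / genome_gb,
        # Total exons divided by total genes.
        "exons_per_gene": row["exons"] / row["genes"],
        # Assembly size per predicted gene, in kilobases.
        "kb_per_gene": row["assembly_bp"] / row["genes"] / 1_000,
    }


for name, row in ASSEMBLIES.items():
    stats = derived_stats(row)
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

Under that genome-size assumption, the recomputed values round to the coverage, mean exons per gene, and gene density reported in the table.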
Figure 1: Frequency histogram of expressed sequence tags (ESTs) and CEGMA region alignment on sugarcane genome assembly. For 125,072 aligned ESTs, 106,133 (84.8%) show 2–30 matches on the genome (A), while for CEGMA regions, 205 (87.2%) range from 2 to 17 matches on the genome (B). SPALN v2.3.3 [32] was used for alignment.
Figure 2: Gene copy number estimation. (A) Distribution of copy counts for putative single-copy genes in diploid grasses. From the 2,051 single-copy genes in sorghum, rice, and Brachypodium, 1,592 single-copy genes matched to ≥1 sugarcane predicted gene. More than 99.9% of the aligned single-copy genes are present between 1 and 15 times in the sugarcane assembly. (B) Copy differentiation between sugarcane coding sequences (CDSs) and upstream regions (sequences 100 bp, 500 bp, and 1,000 bp upstream of the CDS), based on pairwise sequence alignment of gene clusters. Genetic dissimilarity increases with increasing distance from the translation start site. (C) Indel length distribution in sugarcane putative homo(eo)logs. Frame-preserving indels are more common than frameshifts for this set of genes.
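The distinction drawn in panel C rests on a simple rule: an indel within a coding sequence preserves the reading frame when its length is a multiple of 3, and otherwise shifts it. The sketch below illustrates that rule only. The function names and the example lengths are hypothetical and are not data from the study.

```python
from collections import Counter


def classify_indel(length_bp: int) -> str:
    """Label an indel by its effect on the reading frame."""
    if length_bp <= 0:
        raise ValueError("indel length must be a positive number of bases")
    # A multiple of 3 adds or removes whole codons and keeps the frame.
    return "frame-preserving" if length_bp % 3 == 0 else "frameshift"


def summarize(lengths):
    """Count frame-preserving indels and frameshifts in a list of lengths."""
    return Counter(classify_indel(n) for n in lengths)


# Hypothetical indel lengths, for illustration only.
example_lengths = [3, 6, 1, 9, 2, 3, 12]
print(summarize(example_lengths))
```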
Figure 3: Homo(eo)log expression. The percentage frequency of sugarcane genes plotted against the total number of homo(eo)logs per gene and the number of expressed homo(eo)logs per gene. Genes with complementary DNAs aligned with FPKM > 1 were considered expressed. Plots show sense (A) and antisense (B) transcripts. Reads from Ion PGM Sequencing were used, and strand orientation is maintained [29].
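The FPKM > 1 threshold can be illustrated with the standard definition of FPKM (fragments per kilobase of transcript per million mapped fragments). The sketch below is a minimal example under that definition, not the pipeline used in the study. The function names and all input numbers are hypothetical.

```python
def fpkm(fragments: int, length_bp: int, total_mapped: int) -> float:
    """Standard FPKM: fragments per kilobase per million mapped fragments."""
    if length_bp <= 0 or total_mapped <= 0:
        raise ValueError("length and total mapped fragments must be positive")
    return fragments * 1e9 / (length_bp * total_mapped)


def count_expressed(fpkm_values, threshold: float = 1.0) -> int:
    """Count homo(eo)logs whose FPKM is strictly above the threshold."""
    return sum(1 for value in fpkm_values if value > threshold)


# Hypothetical gene with four homo(eo)logs: (fragments, transcript length in bp).
homeologs = [(50, 1500), (0, 1400), (3, 1600), (120, 1450)]
TOTAL_MAPPED = 20_000_000  # hypothetical library size

values = [fpkm(frags, length, TOTAL_MAPPED) for frags, length in homeologs]
print(f"{count_expressed(values)} of {len(values)} homo(eo)logs expressed")
```

In this made-up case two of the four copies pass the threshold, which is the kind of per-gene count that each panel plots against the total number of homo(eo)logs.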
Figure 4: Phylogeny, putative regulatory regions, and expression of sucrose synthase (SuSy) and phenylalanine-ammonia lyase (PAL) gene family. Phylogenetic analysis of (A) SuSy and (C) PAL genes from SP80-3280, R570, S. spontaneum, and sorghum. SuSy sequences from Saccharum ssp. [36] were also included. For both SuSy and PAL, nucleotide sequences (CDS) were aligned with CLUSTALW [37] software in MEGA 7.0 [38] and maximum likelihood trees were constructed with 1,000 bootstraps. Core promoter analysis (gray columns in B and D) using TSSPlant [39] suggests ScSuSy2 (B) and most ScPAL (D) as TATA-less (absence of black squares). Transcription factor binding site (TFBS) prediction (colored symbols in B and D) using MEME [40] and MotifSampler [41] suggests specific motif for each group (ScSuSy1, ScSuSy2, and ScSuSy5 and PAL I, PAL III, PAL Va, and PAL Vb). The three SP80-3280 PAL genes marked with an asterisk in D are present in the same contig. Transposable elements (TEs) were identified within 10 kb upstream from the gene (B and D). Heat map analysis of RNA-Seq data [29] (expression profile in B and D) shows more pronounced expression in SP80-3280 internodes (I1 and I5) of ScSuSy1, ScSuSy2, ScSuSy5, and PAL from group V. RNA-Seq of leaf tissues (L) indicates more pronounced expression of ScPAL from groups II and III. ScSuSy3 presents high numbers of TFBS and TE and low expression in all samples.
Figure 5: SNVs. Alignment of sugarcane contigs to the genic regions of sorghum chromosomes (chromosome 1 is on top and 10 is at the bottom). X and Y axes indicate physical distance on each chromosome (Mb) and the number of SNVs compared to the sorghum reference genome, respectively. Each dot indicates sorghum genes matching ≥2 sugarcane contigs.
Figure 6: Pseudoassembly of contigs. Multiple correspondence analysis (MCA) with hierarchical clustering of the SP80-3280 assembly against the S. spontaneum tetraploid AP85-441 homo(eo)log-resolved assembly [14] and the R570 [13] monoploid genome. (A) SP80-3280 contigs best hits against AP85-441 and R570 chromosomes and corresponding size of the preliminary scaffolds; cluster = hierarchical cluster from the MCA. (B and C) Circos plot of the proportion of proteins from SP80-3280 (classified into 1 of the 6 clusters or as “non-clustered”) that align to the AP85-441 (chr 01-08) and R570 (sh 01-10) putative chromosomes, respectively.