Seungill Kim, Myung-Shin Kim, Yong-Min Kim, Seon-In Yeom, Kyeongchae Cheong, Ki-Tae Kim, Jongbum Jeon, Sunggil Kim, Do-Sun Kim, Seong-Han Sohn, Yong-Hwan Lee, Doil Choi.
Abstract
The onion (Allium cepa L.) is one of the most widely cultivated and consumed vegetable crops in the world. Although a considerable amount of onion transcriptome data has been deposited into public databases, the sequences of the protein-coding genes are not accurate enough to be used, owing to non-coding sequences intermixed with the coding sequences. We generated a high-quality, annotated onion transcriptome from de novo sequence assembly and intensive structural annotation using the integrated structural gene annotation pipeline (ISGAP), which identified 54,165 protein-coding genes among 165,179 assembled transcripts totalling 203.0 Mb by eliminating the intron sequences. ISGAP performed reliable annotation, recognizing accurate gene structures based on reference proteins, and ab initio gene models of the assembled transcripts. Integrative functional annotation and gene-based SNP analysis revealed a whole biological repertoire of genes and transcriptomic variation in the onion. The method developed in this study provides a powerful tool for the construction of reference gene sets for organisms based solely on de novo transcriptome data. Furthermore, the reference genes and their variation described here for the onion represent essential tools for molecular breeding and gene cloning in Allium spp.
Keywords: de novo transcriptome; gene prediction; non-coding sequence; onion; reference gene set
Year: 2014 PMID: 25362073 PMCID: PMC4379974 DOI: 10.1093/dnares/dsu035
Source DB: PubMed Journal: DNA Res ISSN: 1340-2838 Impact factor: 4.458
Figure 1. Integrated structural gene annotation pipeline (ISGAP). (A) ISGAP based on reference proteins and ab initio prediction. (B) The six-frame translation method as an independent process.
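The six-frame translation method named in panel (B) can be illustrated with a minimal sketch. This is a generic textbook version, not the authors' implementation: it assumes the standard genetic code, and the function names (`six_frame_translations`, `longest_orf`) and the "longest stretch starting at the first methionine" rule are illustrative choices that do not come from the paper.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered over T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Translate a transcript in three offsets on each strand."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[(strand, offset)] = translate(s[offset:])
    return frames


def longest_orf(seq):
    """Pick the longest stop-free stretch that starts at a methionine.

    Hypothetical selection rule for illustration; a terminal stop codon is
    not required, so a stretch running to the end of the transcript counts.
    """
    best = ("", None, None)
    for (strand, offset), protein in six_frame_translations(seq).items():
        for segment in protein.split("*"):
            start = segment.find("M")
            if start != -1 and len(segment) - start > len(best[0]):
                best = (segment[start:], strand, offset)
    return best
```

Because this approach reads each transcript as one uninterrupted frame, it has no way to skip a retained intron, which is consistent with the "N/A" entries for introns in the six-frame column of the statistics table.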
Statistics of the annotated onion gene sets from ISGAP
| | Combined | H6 | SP3B |
|---|---|---|---|
| **Whole** | | | |
| Number of genes | 54,165 | 38,004 | 35,750 |
| Total length (Mb) | 59.6 | 42.3 | 40.6 |
| Average length (bp) | 1,100.4 | 1,112.1 | 1,136.5 |
| **Representative** | | | |
| Number of genes | 20,447 | 18,034 | 17,101 |
| Total length (Mb) | 22.0 | 20.5 | 19.4 |
| Average length (bp) | 1,075.0 | 1,135.2 | 1,134.9 |
Detailed statistics for all the annotated gene sets from six-frame translation and ISGAP using the combined library
| | Six-frame translation | Step 1ᵃ | Step 2ᵇ | Step 3ᶜ | Steps 2 + 3ᵈ | Final |
|---|---|---|---|---|---|---|
| Number of genes | 65,645 | 42,435 | 42,435 | 51,092 | 61,852 | 54,165 |
| Number of genes containing multiple exons | N/A | 9,481 | 9,481 | 10,207 | 13,516 | 11,496 |
| Number of introns | N/A | 11,015 | 11,015 | 12,436 | 16,600 | 13,543 |
| Average length of exons (bp) | 945.0 | 752.9 | 859.8 | 886.1 | 813.6 | 880.3 |
| Average length of introns (bp) | N/A | 298.9 | 298.9 | 344.7 | 310.6 | 307.7 |

ᵃ Gene model derived from reference proteins.
ᵇ Extended gene model through the translation of partial genes in Step 2.
ᶜ Ab initio predicted gene model.
ᵈ Integrated gene model of Steps 2 and 3.
Figure 2. Comparison of the annotated onion gene sets predicted from the combined library by six-frame translation and ISGAP. To the left and right of the black dotted line, the histogram shows the numbers of covered query sequences and predicted genes, respectively. (A) Validation of the predicted gene sets using 511 onion proteins. (B) Assessment of the predicted proteins against the plant proteins in the RefSeq database.
Figure 3. Representative cases of well-annotated genes predicted by ISGAP compared with the genes predicted by six-frame translation. The genes predicted by ISGAP and six-frame translation are shown, as well as the onion, RefSeq, reference, and ab initio gene models. The plus and minus signs in the brackets indicate the strand of mapped or predicted genes. (A and B) Cases of genes containing multiple exons. (C) Gene annotation with the correct region. (D) Gene annotation with the correct strand.
Figure 4. Distribution of biological functions and coverage graph for monocot and dicot plants. (A) Top 20 InterPro domains among the onion genes from the combined library. (B) Coverage graph of the onion genes in the assembly of the combined library on monocot and dicot plants. The line graph and histogram illustrate the proportions of onion genes and plant proteins in each species, respectively.
Sequence variation between two cultivated onions, H6 and SP3B
| | Whole: Exon | Whole: Intron | Whole: Othersᵇ | Whole: Sum | Confirmedᵃ: Exon | Confirmedᵃ: Intron | Confirmedᵃ: Othersᵇ | Confirmedᵃ: Sum |
|---|---|---|---|---|---|---|---|---|
| SNPs | 9,875 | 1,357 | 38,832 | 50,064 | 5,502 | 300 | 5,642 | 11,444 |
| INDELs | 766 | 834 | 12,416 | 14,016 | 47 | 19 | 431 | 497 |
| Total | 10,641 | 2,191 | 51,248 | 64,080 | 5,549 | 319 | 6,073 | 11,941 |

ᵃ Variations that have conserved flanking sequences in both assemblies.
ᵇ Regions except exon and intron.
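As a worked check on the variation table above, the short sketch below recomputes the row sums and the share of each variation type that was confirmed by conserved flanking sequences. The counts are copied from the table; the dictionary layout and the helper name `confirmed_fraction` are illustrative assumptions, not part of the original analysis.

```python
# Counts copied from the H6 vs SP3B variation table.
WHOLE = {
    "SNPs": {"Exon": 9875, "Intron": 1357, "Others": 38832},
    "INDELs": {"Exon": 766, "Intron": 834, "Others": 12416},
}
CONFIRMED = {
    "SNPs": {"Exon": 5502, "Intron": 300, "Others": 5642},
    "INDELs": {"Exon": 47, "Intron": 19, "Others": 431},
}


def confirmed_fraction(kind, region=None):
    """Fraction of variations of one kind that were confirmed.

    With region=None the fraction is taken over all regions combined.
    """
    if region is None:
        return sum(CONFIRMED[kind].values()) / sum(WHOLE[kind].values())
    return CONFIRMED[kind][region] / WHOLE[kind][region]


if __name__ == "__main__":
    for kind in WHOLE:
        print(kind, "sum:", sum(WHOLE[kind].values()),
              "confirmed:", sum(CONFIRMED[kind].values()),
              f"fraction: {confirmed_fraction(kind):.3f}")
        for region in WHOLE[kind]:
            print(f"  {region}: {confirmed_fraction(kind, region):.3f}")
```

Calling `confirmed_fraction("SNPs", "Exon")` gives the exonic share on its own, which is the figure most relevant to gene-based marker development.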