Jun Yu, Jun Wang, Wei Lin, Songgang Li, Heng Li, Jun Zhou, Peixiang Ni, Wei Dong, Songnian Hu, Changqing Zeng, Jianguo Zhang, Yong Zhang, Ruiqiang Li, Zuyuan Xu, Shengting Li, Xianran Li, Hongkun Zheng, Lijuan Cong, Liang Lin, Jianning Yin, Jianing Geng, Guangyuan Li, Jianping Shi, Juan Liu, Hong Lv, Jun Li, Jing Wang, Yajun Deng, Longhua Ran, Xiaoli Shi, Xiyin Wang, Qingfa Wu, Changfeng Li, Xiaoyu Ren, Jingqiang Wang, Xiaoling Wang, Dawei Li, Dongyuan Liu, Xiaowei Zhang, Zhendong Ji, Wenming Zhao, Yongqiao Sun, Zhenpeng Zhang, Jingyue Bao, Yujun Han, Lingli Dong, Jia Ji, Peng Chen, Shuming Wu, Jinsong Liu, Ying Xiao, Dongbo Bu, Jianlong Tan, Li Yang, Chen Ye, Jingfen Zhang, Jingyi Xu, Yan Zhou, Yingpu Yu, Bing Zhang, Shulin Zhuang, Haibin Wei, Bin Liu, Meng Lei, Hong Yu, Yuanzhe Li, Hao Xu, Shulin Wei, Ximiao He, Lijun Fang, Zengjin Zhang, Yunze Zhang, Xiangang Huang, Zhixi Su, Wei Tong, Jinhong Li, Zongzhong Tong, Shuangli Li, Jia Ye, Lishun Wang, Lin Fang, Tingting Lei, Chen Chen, Huan Chen, Zhao Xu, Haihong Li, Haiyan Huang, Feng Zhang, Huayong Xu, Na Li, Caifeng Zhao, Shuting Li, Lijun Dong, Yanqing Huang, Long Li, Yan Xi, Qiuhui Qi, Wenjie Li, Bo Zhang, Wei Hu, Yanling Zhang, Xiangjun Tian, Yongzhi Jiao, Xiaohu Liang, Jiao Jin, Lei Gao, Weimou Zheng, Bailin Hao, Siqi Liu, Wen Wang, Longping Yuan, Mengliang Cao, Jason McDermott, Ram Samudrala, Jian Wang, Gane Ka-Shu Wong, Huanming Yang.
Abstract
We report improved whole-genome shotgun sequences for the genomes of indica and japonica rice, both with multimegabase contiguity, or almost 1,000-fold improvement over the drafts of 2002. Tested against a nonredundant collection of 19,079 full-length cDNAs, 97.7% of the genes are aligned, without fragmentation, to the mapped super-scaffolds of one or the other genome. We introduce a gene identification procedure for plants that does not rely on similarity to known genes to remove erroneous predictions resulting from transposable elements. Using the available EST data to adjust for residual errors in the predictions, the estimated gene count is at least 38,000-40,000. Only 2%-3% of the genes are unique to any one subspecies, comparable to the amount of sequence that might still be missing. Despite this lack of variation in gene content, there is enormous variation in the intergenic regions. At least a quarter of the two sequences could not be aligned, and where they could be aligned, single nucleotide polymorphism (SNP) rates varied from as little as 3.0 SNP/kb in the coding regions to 27.6 SNP/kb in the transposable elements. A more inclusive new approach for analyzing duplication history is introduced here. It reveals an ancient whole-genome duplication, a recent segmental duplication on Chromosomes 11 and 12, and massive ongoing individual gene duplications. We find 18 distinct pairs of duplicated segments that cover 65.7% of the genome; 17 of these pairs date back to a common time before the divergence of the grasses. More important, ongoing individual gene duplications provide a never-ending source of raw material for gene genesis and are major contributors to the differences between members of the grass family.
Year: 2005 PMID: 15685292 PMCID: PMC546038 DOI: 10.1371/journal.pbio.0030038
Source DB: PubMed Journal: PLoS Biol ISSN: 1544-9173 Impact factor: 8.029
Figure 1. Basic Algorithm for Construction of Scaffolds and Super-Scaffolds
We start with the smallest plasmids and progressively work our way up to the largest BACs. Only links with two or more pieces of supporting evidence are made. These include 34,190 “anchor points” constructed from a comparison of indica and japonica. Each anchor is a series of high-quality BlastN hits (typically 98.5% identity) put together by a dynamic programming algorithm that allows for small gaps to accommodate the polymorphic intergenic repeats. Typical anchor points contain four BlastN hits at a total size of 9 kb (including gaps). Notice how in the beginning indica and japonica are processed separately, to construct what we called scaffolds. Only at the end do we use data from one subspecies to link scaffolds in the other subspecies, and these are what we called super-scaffolds.
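The "two or more pieces of supporting evidence" rule can be illustrated with a small sketch. This is not the authors' assembler: the function name `link_scaffolds`, the input layout, and the use of a union-find structure are assumptions made for illustration, and the sketch ignores ordering, orientation, and the progression from small to large insert sizes.

```python
from collections import Counter


def link_scaffolds(contigs, evidence, min_support=2):
    """Group contigs joined by links that have at least min_support pieces of evidence.

    contigs  -- iterable of contig names (hypothetical identifiers)
    evidence -- iterable of (contig_a, contig_b) pairs, one per piece of evidence
    """
    # Count evidence per unordered pair of contigs.
    support = Counter(frozenset(pair) for pair in evidence)

    # Union-find: every contig starts as its own group.
    parent = {c: c for c in contigs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for pair, count in support.items():
        if count >= min_support and len(pair) == 2:
            a, b = pair
            parent[find(a)] = find(b)

    groups = {}
    for c in contigs:
        groups.setdefault(find(c), set()).add(c)
    return sorted(sorted(g) for g in groups.values())
```

In this toy version, a pair of contigs supported by a single clone link stays unjoined, whereas a pair supported by, say, one clone link and one anchor point would be merged into the same group.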
Summary of Assembled Contigs, Scaffolds, and Super-Scaffolds
Each piece can be further subdivided on the basis of whether or not it is mapped and, if not, on the basis of its size. N50 refers to the size above which half of the total length of the sequence set can be found. An equivalent size for the unassembled reads is computed by dividing the number of high-quality Q20 bases (estimated single-base error rate of 10^-2) by the effective shotgun coverage.
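As a minimal sketch of the two quantities defined in this caption, the functions below compute N50 from a list of piece lengths and the equivalent size of unassembled reads. The function names and argument layout are illustrative assumptions, not taken from the paper.

```python
def n50(lengths):
    """Return the length L such that pieces of length >= L hold half the total length."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no lengths given")


def equivalent_size(q20_bases, effective_coverage):
    """Equivalent size of unassembled reads: Q20 bases divided by shotgun coverage."""
    return q20_bases / effective_coverage
```

For example, pieces of lengths 2, 3, 4, 5, and 6 total 20; the two largest pieces (6 and 5) already reach half of that, so the N50 is 5.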
Summary of nr-KOME cDNAs with Complete Alignments (Not Including UTRs) in Each of the Three Rice Assemblies
We require that 95% of the gene be aligned, but there are two ways to count. “Found in genome” accepts fragmented genes that are aligned in multiple pieces, whereas “aligned in one piece” does not.
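The two counting rules can be sketched as follows. This is an illustration under stated assumptions: the function name `classify_alignment` is hypothetical, each alignment is reduced to a (piece identifier, aligned coding bases) pair, and alignments are assumed to cover non-overlapping parts of the cDNA.

```python
def classify_alignment(cds_length, pieces, min_fraction=0.95):
    """Return (found_in_genome, aligned_in_one_piece) for one cDNA.

    cds_length -- coding length of the cDNA, excluding UTRs
    pieces     -- iterable of (piece_id, aligned_bases) pairs (assumed non-overlapping)
    """
    per_piece = {}
    for piece_id, aligned_bases in pieces:
        per_piece[piece_id] = per_piece.get(piece_id, 0) + aligned_bases

    needed = min_fraction * cds_length
    # "Found in genome": coverage may be spread across several assembly pieces.
    found = sum(per_piece.values()) >= needed
    # "Aligned in one piece": a single assembly piece must reach the threshold alone.
    one_piece = max(per_piece.values(), default=0) >= needed
    return found, one_piece
```

A 1,000-bp coding region split 600/380 across two scaffolds would count as found in the genome but not as aligned in one piece.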
Figure 2. A Region on Beijing indica Chromosome 2, Showing Three Gene Islands Separated by Two Intergenic Repeat Clusters of High 20-mer Copy Number
Transposable elements identified by RepeatMasker are classified based on the nomenclature of Table S2. Depicted genes include both nr-KOME cDNAs and FGENESH predictions.
Number of FGENESH Predictions in All Three Rice Assemblies
Filtering refers to the process in which we remove predictions where 50% of the coding region is attributable to any combination of RepeatMasker TEs or 20-mers of copy number over ten. EST confirmation requires 100 bp of exact match.
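A minimal sketch of such a filter is shown below. It assumes that TE annotations and high-copy 20-mer positions have already been merged into one list of masked intervals, that intervals are half-open, and that "50%" means "at least 50%"; the caption does not specify these details, and the function names are hypothetical.

```python
def masked_fraction(exons, masked):
    """Fraction of coding bases that fall inside any masked interval.

    exons  -- list of half-open (start, end) coding intervals of one prediction
    masked -- list of half-open (start, end) intervals flagged as TE or high-copy 20-mer
    """
    coding = set()
    for start, end in exons:
        coding.update(range(start, end))
    if not coding:
        raise ValueError("prediction has no coding bases")

    hit = set()
    for start, end in masked:
        hit |= coding.intersection(range(start, end))
    return len(hit) / len(coding)


def keep_prediction(exons, masked, threshold=0.5):
    """Keep a prediction only if less than `threshold` of its coding region is masked."""
    return masked_fraction(exons, masked) < threshold
```

Storing every coding position in a set keeps the sketch short; a real pipeline working at genome scale would more likely use sorted intervals or an interval tree.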
Characteristics of FGENESH Predictions and nr-KOME cDNAs
Predicted genes do not include UTRs. Both mean and median values are given, as mean (median).
Figure 3. Overlapping FGENESH Predictions in All Three Rice Assemblies
Two predictions are shared when 50% of their coding regions can be aligned. Because of imprecision in the predictions and overlap criteria, we get slightly different numbers for each assembly, and these are encoded through multiple color-coded numbers in the Venn diagram. EST confirmation requires 100 bp of exact match. Unlike for the gene counts, we do not show a separate EST confirmation number for each assembly, because the numbers are very similar.
Variation between indica and japonica Defined by SNP and Insertion–Deletion (Indel) Rates
Variation rates for 5′ UTR, coding, intron, and 3′ UTR refer to gene regions defined by nr-KOME. To demonstrate where the high SNP rates come from, we consider regions of 20-mer copy number under ten and RepeatMasker TEs.
Figure 4. Functional Classifications from GO, Focused on Plant-Specific Categories Outlined by Gramene
(A) compares predicted genes from Arabidopsis and Beijing indica. (B) compares predicted genes from Beijing indica with nr-KOME cDNAs. We ignore categories with less than 0.1% of the genes.
Figure 5. A Sample Bioverse-Predicted Interaction Network for Defense Proteins and Their Direct Neighbors
The symbols are colored to indicate some of the major GO categories under “molecular function.” We draw a cross over the symbol for an NH gene. Rectangles indicate proteins that are manually classified as being R-genes. They appear on genes that are not colored as defense, because some genes have multiple functions, not because of an annotation error. The white circles with green outline are unannotated genes that might also belong to this network, at a lower confidence.
Figure 6. Duplicated Segments in the Beijing indica Assembly
Depicted here are the plots for Chromosomes 2 (A) and 6 (B). Each data point represents the coordinated genomic positions in a homolog pair, consisting of one nr-KOME cDNA and its one and only TBlastN homolog in rice. Shown on the x-axis is the position of a gene on the indicated chromosome, and shown on the y-axis is the position of its homolog on any of the rice chromosomes, with chromosome number encoded by the colors indicated on the legend at the right.
Figure 7. Graphical View of All Duplicated Segments
The 12 chromosomes are depicted along the perimeter of a circle, not in order but slightly rearranged so as to untangle the connections between segments. Overall, we cover 65.7% of the genome.
Summary of Duplicated Segments in the Beijing indica Assembly
We give start and stop positions on the pseudo-chromosome, segment sizes, number of homolog pairs, mean Ks rates, percentage of homolog pairs with Ks < 0.25, and flanking nr-KOME cDNAs. One set of numbers is for the initial analysis of those cDNAs with one and only one homolog. A second set is for the analysis of additional cDNAs with higher-order homologs.
a Computed total and mean omit the recent segmental duplication on Chromosomes 11 and 12
Chr, Chromosome
Figure 8. Distribution of Substitutions per Silent Site (Ks) for Homolog Pairs in Segmental, Tandem, and Background Duplications
In (A), contributions from the recent segmental duplication on Chromosomes 11 and 12 are colored in red. The tandem duplication data are shown on two different scales, one to emphasize the magnitude of the zero peak (B) and another to highlight the exponential decay (C). Background duplications are shown in (D).
Figure 9. A View of All Duplications Found on Rice Chromosome 2
In contrast to Figure 6, where we featured those cDNAs with one and only one TBlastN homolog, here we show all detectable TBlastN homologs, up to a maximum of 1,000 per cDNA.
Figure 10. Ka/Ks Distribution for Homolog Pairs
Ka and Ks are the fraction of the available nonsynonymous and synonymous sites that are changed in the homolog pairs. Ka/Ks > 1 is an indicator of positive selection. Shown is the Ka/Ks distribution for segmental duplications (A) and for tandem duplications (B).
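Taking the caption's definition literally, Ka and Ks can be computed as simple proportions, as in the sketch below. The function name and inputs are illustrative assumptions. The sketch applies no correction for multiple substitutions at the same site, and the caption does not state which estimation method the authors used.

```python
def ka_ks(nonsyn_changes, nonsyn_sites, syn_changes, syn_sites):
    """Return (Ka, Ks, Ka/Ks) as uncorrected proportions of changed sites.

    Ka/Ks is returned as None when Ks is zero, since the ratio is undefined.
    """
    ka = nonsyn_changes / nonsyn_sites
    ks = syn_changes / syn_sites
    ratio = ka / ks if ks > 0 else None
    return ka, ks, ratio
```

A homolog pair with 3 changes at 300 nonsynonymous sites and 5 changes at 100 synonymous sites gives Ka = 0.01 and Ks = 0.05, so Ka/Ks is about 0.2, well below the threshold of 1 that would suggest positive selection.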