
Sequence level analysis of recently duplicated regions in soybean [Glycine max (L.) Merr.] genome.

Kyujung Van¹, Dong Hyun Kim, Chun Mei Cai, Moon Young Kim, Jin Hee Shin, Michelle A Graham, Randy C Shoemaker, Beom-Soon Choi, Tae-Jin Yang, Suk-Ha Lee.

Abstract

A single recessive gene, rxp, on linkage group (LG) D2 controls bacterial leaf-pustule resistance in soybean. We identified two homoeologous contigs (GmA and GmA') composed of five bacterial artificial chromosomes (BACs) during the selection of BAC clones around the Rxp region. With the recombinant inbred line population from the cross of Pureunkong and Jinpumkong 2, single-nucleotide polymorphism and simple sequence repeat marker genotyping located GmA' on LG A1. On the basis of information in the Soybean Breeders Toolbox and our results, parts of LG A1 and LG D2 share duplicated regions. Alignment and annotation revealed that many homoeologous regions contained kinases and proteins related to signal transduction pathways. Interestingly, inserted sequences from GmA and GmA' had homology with transposase and integrase. Estimation of evolutionary events revealed that speciation of soybean from Medicago and the recent divergence of the two soybean homoeologous regions occurred approximately 60 and 12 million years ago, respectively. The distribution of synonymous substitutions, K(s), between soybean homologous regions showed a first secondary peak (mode K(s) = 0.10-0.15) followed by two smaller bulges. Thus, diploidized paleopolyploidy of the soybean genome was again supported by our study.



Year: 2008 PMID： 18334514 PMCID： PMC2650623 DOI： 10.1093/dnares/dsn001

Source DB: PubMed Journal: DNA Res ISSN： 1340-2838 Impact factor: 4.458

Introduction

Legumes have begun to draw much attention through recent genomic and phylogenetic studies.[1] The crop legumes, such as Lotus, Medicago, Pisum, Glycine, Phaseolus, and Vigna, also receive attention from researchers because they are economically important.[2] Asian countries have a long history of making food products such as soymilk, tofu, and soybean sprouts from soybean seeds because the seeds have high protein and oil content. Thus, soybean is considered a very valuable crop among legumes.[3] As approximately 20 000 species belong to the legume family, a wide range of genetic and morphological diversity can be observed.[2] Unlike many other plants, legumes have a symbiotic relationship with the soil-borne bacteria, Rhizobia. Medicago truncatula and Lotus japonicus were selected as model legume plants because they have not only plant–Rhizobium interactions for nitrogen fixation but also small genomes suitable for full genome sequencing.[4,5]

Most Papilionoids are diploids except Glycine. An ancient genome duplication occurred in Glycine, leading to 2n = 38, 40 or 78–80 depending on whether the species is annual or perennial and on geographic location.[6,7] Polyploidy has had an evolutionary impact on the structure of the soybean genome.[8-10] Using restriction fragment-length polymorphism (RFLP) analysis with nine populations (Glycine max × G. soja and G. max × G. max) of the Glycine subgenus soja, it was shown that the soybean genome presents about 2.55 copies per digest. This suggests that an additional round of genome duplication might have occurred in at least one of the original genomes.[8] Other studies have supported those observations. RFLP and simple sequence repeat (SSR) analyses showed that parts of linkage groups (LGs) B1/S, H, and F of the soybean genome shared homoeologous regions.[9] Other genetic mapping analysis suggested that extensive rearrangements and additional duplications were present in the soybean genome.[10] Also, high similarity in physical organization between soybean duplicated regions and a high percentage of microsynteny were shown by characterizing bacterial artificial chromosome (BAC) clones of soybean and other model plants.[11,12] In addition, BACs containing FAD2 genes also contained a number of syntenic genes and were positioned on LGs I and O, again indicating duplication of the soybean genome.[13] Fluorescence in situ hybridization of BACs visualized segmental duplications within the soybean genome.[14] The M. truncatula genome also presents segmental duplications identified by high-throughput genome sequencing.[15]

The processes of genome evolution and patterns of divergence can be studied by duplicate gene analysis.[16] Because the full genome sequence of many plants is not yet available, ESTs provide resources for studying evolutionary events such as ancient bursts of gene duplications. Because the accumulation of synonymous substitutions occurs stochastically over time, the level of divergence (age of duplication) is estimated by nucleotide substitution in coding sequences.[2,17] Putative genome duplication events were identified with large EST collections from eight plant species using synonymous substitution measurements (Ks) of duplicated genes.[18] Soybean was estimated to have had two major genome duplication events at 15 million years ago (MYA) and 44 MYA. A genome duplication event was also observed in M. truncatula at ∼58 MYA. With a different calibration, duplications were also observed in both soybean and M. truncatula.[17] A multigene approach combined with a phylogenetic approach suggested that soybean and Medicago shared a round of gene duplications, along with about 7000 other legume plants.[19]

Xanthomonas axonopodis pv. glycines (Xag) causes bacterial leaf pustule (BLP) in soybean, which occurs in Korea and the southern United States, where hot and humid weather conditions are prevalent.[20] Typically, small yellow-to-brown lesions with a raised center are formed in early development and develop into large necrotic lesions, causing substantial losses in yield through premature defoliation.[21-24] Twenty consensus LGs of the soybean genome, representing the 20 soybean chromosomes, were reported[25] with a joined map from three different populations spanning 2400 cM in length. A total of 420 SSR markers were added to the integrated genetic linkage map, and its length was expanded to 2523.6 cM of Kosambi map distance across 20 LGs.[26] In addition, 1141 single-nucleotide polymorphism (SNP) markers were later located on the soybean genetic map.[27] Among the 20 LGs, we are interested in LG D2 because the recessive gene conditioning resistance to BLP, rxp,[28] was mapped to LG D2 only 3.9 cM away from Satt372.[29] Also, the Rxp locus is linked to the malate dehydrogenase (Mdh) locus with an estimated recombination frequency of 15.2 ± 3.8%.[30] In the process of BAC clone selection for ‘chromosome walking’ around the Rxp region, we were able to create two contigs, which represent homoeologous regions of the soybean genome. Here, we describe the consequences of the duplication events around the Rxp region. Annotation, gene arrangement, and evolutionary events estimated by Ks (the number of synonymous substitutions per synonymous site) are also presented.

Methods

Primary BAC library screening

The ‘Iowa State’ BAC library of soybean ‘Williams 82’ (gmw1)[31] was used for selection of BAC clones around the Rxp locus. For the first round of PCR-based library screening, Satt372 (forward, 5′-CAG AAA AGG AAT AAT AAC AAC ATC AC-3′; reverse, 5′-GCG AAA ACA TAA TTC ACA CAA AAG ACA G-3′), Satt486 (forward, 5′-GCG CAT GCA TTA CCA TAG GCT ATA ATA-3′; reverse, 5′-GGG GTC ATG CAT AAT AGA GAT AGA ACC A-3′), and Satt498 (forward, 5′-CAA CCC CGA AAT ACA ACT AAT GTT-3′; reverse, 5′-TGG TGA GGC TCA TTT TCA TAA GA-3′) were used as PCR primers, and pooled DNAs, prepared from copies of the library by combining overnight cell cultures,[31] were used as templates. Basic PCR protocols were followed as described, with minor alterations,[31] using a PTC-110 Peltier Thermal Cycler (MJ Research, Inc., Watertown, MA, USA). The reaction mixture, in a total volume of 20 µL, contained 0.5 U of Taq polymerase (Invitrogen, Carlsbad, CA, USA); the remaining components were as described.[31] Cycling conditions started with initial denaturation at 94°C for 3 min, followed by 35 cycles of 94°C for 30 s, 48°C for 30 s, and 72°C for 30 s, with a final step at 72°C for 2 min. The amplified PCR products were analyzed on 1.5% agarose gels stained with ethidium bromide. The BAC library was screened systematically in several rounds using, in order, full-plate super pool DNA, individual full-plate pools, row and column super pools, and row and column pools. All PCRs were performed as described above, and DNA of ‘Williams 82’ was used as a positive control for all screening steps.
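The final screening round (row and column pools) narrows a positive plate down to single wells. The sketch below illustrates that intersection step only. It assumes 384-well plates (rows A–P, columns 1–24) and assumes that clone names such as gmw1-29F06 encode library, plate, row, and column; both assumptions are inferred from the clone names in this article, and the function names are hypothetical.

```python
import string

# Assumption: 384-well plates, i.e. 16 rows (A-P) and 24 columns.
ROWS = string.ascii_uppercase[:16]
N_COLUMNS = 24


def candidate_wells(positive_rows, positive_columns):
    """Intersect PCR-positive row pools and column pools into candidate wells."""
    return [(r, c) for r in sorted(positive_rows) for c in sorted(positive_columns)]


def clone_name(library, plate, row, column):
    """Build a clone name such as 'gmw1-29F06' (assumed naming scheme)."""
    if row not in ROWS:
        raise ValueError(f"row must be one of {ROWS}")
    if not 1 <= column <= N_COLUMNS:
        raise ValueError("column must be between 1 and 24")
    return f"{library}-{plate}{row}{column:02d}"


# One positive row pool and one positive column pool give a single candidate.
for row, column in candidate_wells({"F"}, {6}):
    print(clone_name("gmw1", 29, row, column))
```

When more than one row pool and more than one column pool are positive on the same plate, the intersection is ambiguous and each candidate well would need to be confirmed individually by PCR.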

Shotgun plasmid library and DNA sequencing

After BAC DNAs were prepared with a Plasmid Midi Kit (Qiagen, Hilden, Germany) and the insert size of each selected BAC clone was estimated,[31] a random plasmid library for shotgun sequencing was constructed with 10–15 µg of the extracted BAC clone DNA and pUC118 vector using the Takara BKL Kit (Takara Bio, Inc., Otsu, Japan). The remaining steps were performed as described previously.[32] Full sequencing of each BAC DNA was performed with the BigDye Terminator (v. 3.1) cycle sequencing kit (Applied Biosystems, Foster City, CA, USA). Cycle conditions for sequencing and the analysis of BAC sequences were as described previously.[32] The individual sequences were assembled with Phred/Phrap software, and the remaining gaps in each clone were closed by direct sequencing using plasmid DNA.[32] Image v. 3.0 and FPC v. 4.7.9 were used to confirm BAC contig assignment.[33,34]

Secondary BAC library screening and sequence analysis

After the sequences of each BAC clone were aligned,[35] BAC end sequences (BES) were selected for extending BAC contigs. The Primer3 program[36] was used to design primers for secondary BAC library screening. With primers derived from BES, the BAC library screening was performed again as described above. In addition to the ‘Iowa State’ BAC library, the ‘Missouri’ soybean BAC library (gmw2), constructed from BstYI partially digested ‘Williams 82’ DNA, was also used for screening.[37] After BAC contigs were confirmed, the alignments between BAC contigs were inspected with GBrowse (http://www.gmod.org/?q=node/71) and SynBrowse (http://www.synbrowse.org/). Gene annotation was conducted with the web-based gene prediction programs FGENESH (http://sun1.softberry.com/berry.phtml) and GeneMark (http://exon.gatech.edu/GeneMark/) against the Medicago (legume plant) database. Putative amino acid sequences from the predicted genes were used as queries to search for similar known proteins using BLASTP. For each predicted gene of GmA, EST information was first searched against the G. max EST database at the G. max Genome Database (http://bionary.agry.purdue.edu/GmaxGDB/index.php). If no ESTs corresponding to a predicted gene in GmA were identified, nucleotide blast or tblastn (http://www.ncbi.nlm.nih.gov/blast/Blast.cgi) was used to search for EST information with the corresponding predicted gene of GmA′ or MtA. The rates of non-synonymous (Ka) and synonymous (Ks) nucleotide substitution were obtained with the CODEML program[38] of the PAML package.[39] Ks was used to estimate the divergence time between two sequences. Accordingly, coding sequences of the predicted genes from the two contigs were used for analysis of Ks, as described.[35] Divergence times (T) were calculated using a synonymous mutation rate of 6.1 × 10⁻⁹ substitutions per synonymous site per year[18,40] as T = Ks/(2 × 6.1 × 10⁻⁹).
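The divergence-time formula above can be written as a short function. This is a minimal sketch rather than the authors' own script; the function and constant names are hypothetical, and the example input (Ks = 0.1498) is the median value between GmA and GmA′ reported in the Results.

```python
# Synonymous substitution rate used in this study [18,40],
# in substitutions per synonymous site per year.
SYNONYMOUS_RATE = 6.1e-9


def divergence_time_mya(ks, rate=SYNONYMOUS_RATE):
    """Return T = Ks / (2 * rate), converted from years to million years (MYA).

    The factor of 2 reflects that substitutions accumulate independently
    along both lineages after the duplication or speciation event.
    """
    if ks < 0:
        raise ValueError("Ks must be non-negative")
    return ks / (2.0 * rate) / 1e6


print(f"{divergence_time_mya(0.1498):.1f} MYA")
```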

SNP detection, SNP genotyping, and generation of linkage map

To locate each BAC contig on the LGs, contig-specific regions longer than 4.5 kb were surveyed. Seven different primer sets (Supplementary Table S1) were designed from these contig-specific regions using Primer3 (http://primer3.sourceforge.net/). SNPs in the contig-specific regions between two soybean genotypes, Pureunkong and Jinpumkong 2, were then detected as described.[3] An SNP capture probe (5′-GTT TTT TCA TCA ATC TTC CTC TAA A-3′) was designed to be complementary to the 5′ region from the SNP site within an amplicon using SBEPrimer version 1.1,[41] and single base extension reactions followed by fluorescence polarization (FP) measurements were performed on a Victor3 microplate reader (PerkinElmer Life Science, Boston, MA, USA).[42] SNP primers were tested using genomic DNA of each parent and a mixture of both parents as artificial heterozygotes. An SNP primer was accepted as an SNP marker only if the results of genotyping by AcycloPrime FP analysis were confirmed by sequencing data; it was then used for genotyping in a segregating population, an F2-derived soybean population of 90 recombinant inbred lines (RILs) from the cross of Pureunkong and Jinpumkong 2.[43] Genotyping data were automatically transferred to Microsoft Excel, and the genotypes of the segregating population were determined if the clusters were separated by at least 40 mP (thousandths of a polarization unit), at least seven times the standard deviation of the negatives (>99% significance level).[44] The linkage map was constructed with the SNP marker genotyping data, and these markers were integrated on the LGs, as described.[43] SNP genotyping data from heterozygous lines were treated as missing data. The five SSRs located on GmA′ sequences were additionally used for accurate mapping of GmA′, after the SSRs were identified by Sputnik, a DNA microsatellite repeat search utility (http://cbi.labri.fr/outils/Pise/sputnik.html). LGs were designated according to the USDA genetic map,[26] and MapChart v. 2.1 was used to generate the linkage map.[45]

Accession numbers

Accession nos gmw1-20O10, EU028328; gmw1-24M16, EU028329; gmw1-29F06, EU028330; gmw1-89M01, EU028331; gmw2-77P21, EU028332. Sequence data from this article can be found in the GenBank/EMBL data libraries.

Results

Identification of soybean duplicated regions by BAC selection

To obtain BAC clones around the Rxp locus, we screened the gmw1 BAC library with three SSR markers, Satt372, Satt486, and Satt498. Among a total of six BAC clones identified by these SSR markers, we selected gmw1-29F06 and gmw1-24M16 for determining DNA sequences because they represented an SSR marker with long length (Fig. 1).

Figure 1

Schematic relationships between homologous regions (GmA and GmA′) containing the Rxp locus from LGs A1 (red) and D2 (blue). The positions (in cM) and their corresponding SSR markers are located on the upper and lower sides of the black solid and dashed bars for LG D2 and A1, respectively. The linkage map of LG D2 was taken from the Soybean Breeders Toolbox (http://soybase.org), and the mapped position of GmA′ is shown on the detailed genetic linkage map of LG A1 in the RIL population of Pureunkong × Jinpumkong 2. GmA was composed of gmw1-29F06 and gmw1-24M16; gmw1-20O10, gmw1-89M01 and gmw2-77P21 made up GmA′.

DNA sequences of gmw1-29F06 and gmw1-24M16 were aligned to create the GmA contig, comprising 264 239 bp and including overlapping DNA sequences of ∼73 kb (Fig. 1). Primers were then designed from BES of GmA for extending the contig. Clone gmw1-20O10 (120 kb) was selected with primers from BES of gmw1-29F06, and primers designed from BES of gmw1-24M16 selected gmw1-89M01, creating an apparent extension of 175 kb. We expected to extend the contig with the fully sequenced gmw1-20O10 and gmw1-89M01 clones. However, when the full DNA sequences of gmw1-20O10 and gmw1-89M01 were compared with GmA, the expected overlapping regions showed only an approximate 90% match. To close the gap between gmw1-20O10 and gmw1-89M01, we selected gmw2-77P21 (94 kb) from another soybean BAC library, gmw2. After the DNA sequences of gmw2-77P21 were aligned with gmw1-20O10 and gmw1-89M01, the GmA′ contig (gmw1-20O10_gmw2-77P21_gmw1-89M01, 292 895 bp) was formed with a 100% match (Fig. 1).

Mapping of soybean duplicated regions

To locate GmA′ on the soybean genetic linkage map, SNP genotyping was performed. First, unique regions longer than 4.5 kb in GmA′ were surveyed. Seven different unique regions were identified, and seven different primer sets were randomly designed from these seven contig-specific regions (Supplementary Table S1). One SNP locus between Pureunkong (deletion) and Jinpumkong 2 (A) was identified by primers (forward, 5′-TTC GTG CTA AGT GGA ACT TCT G-3′; reverse, 5′-TAC AAC AAC GAT GTT CAT GAC G-3′) designed between 159 723 and 160 465 bp of GmA′. SNP genotyping of GmA′ was conducted with the RIL population from the cross of Pureunkong and Jinpumkong 2. The SNP marker locus was incorporated into the frame map,[43] placing GmA′ at the top of LG A1, 1.9 cM away from Satt684 (Fig. 1). Five SSRs identified by Sputnik on GmA′ were additionally analyzed. Only one designed SSR showed polymorphism between Pureunkong and Jinpumkong 2 (data not shown), and this turned out to be Satt684, which was positioned between 64 495 and 64 682 bp of GmA′. On the basis of all genotyping and mapping data, we were able to determine that the duplicated regions are located on LG A1 for GmA′ and LG D2 for GmA (Fig. 1).

Alignments, annotations, and Ks estimation

After BAC contigs were confirmed and inspected with GBrowse and SynBrowse, genes were annotated with FGENESH or GeneMark against the Medicago database. Fig. 2 shows a schematic representation of approximate gene lengths, gene locations, and homologous regions (linked by shaded lines). A total of 54 and 58 genes were predicted in GmA and GmA′, respectively (Fig. 2). Gene density along these two sequenced BAC contigs was approximately one gene per 5.0 kb. A similar gene density (45 predicted genes along 219 028 bp) was also detected in Medicago along Contig 962B (MtA, as of January 2007 at http://www.medicago.org). The Medicago contig showed homology with the two soybean contigs. Gene order was conserved among syntenic blocks, except for one case (GmA_18 versus GmA′_10 and GmA_27 versus GmA′_10), and the predicted genes had the same orientation. Gene order was maintained in Medicago, although linearity was fragmented (Fig. 2).
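The gene densities quoted above follow directly from the reported gene counts and contig lengths (264 239 bp for GmA, 292 895 bp for GmA′, 219 028 bp for MtA). The sketch below only redoes that arithmetic; the dictionary layout and function name are illustrative choices.

```python
# (predicted genes, contig length in bp), as reported in this article.
CONTIGS = {
    "GmA": (54, 264_239),
    "GmA'": (58, 292_895),
    "MtA": (45, 219_028),
}


def kb_per_gene(n_genes, length_bp):
    """Average spacing between predicted genes, in kb per gene."""
    if n_genes <= 0:
        raise ValueError("n_genes must be positive")
    return length_bp / 1000.0 / n_genes


for name, (n_genes, length_bp) in CONTIGS.items():
    print(f"{name}: one gene per {kb_per_gene(n_genes, length_bp):.2f} kb")
```

All three contigs come out close to one gene per 5 kb, consistent with the approximate figure given in the text.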

Figure 2

Comparative genome alignments among M. truncatula (MtA) and G. max (GmA and GmA′) based on discontinuous Megablast results. Colored arrows (red, common genes between Medicago and soybean contigs; black, common genes between soybean contigs; gray, unique genes) indicate the positions and orientations of predicted genes, and their length represents the length of the predicted gene. Their locations were linked among the three contigs if the predicted genes were found on all contigs. Darker fill in a link indicates a higher e-value between two contigs (Supplementary Table S2). Numbers indicate the number of genes annotated by FGENESH. A detailed description of gene annotations is listed in Supplementary Table S2.

Supplementary Table S2 lists all pairwise comparisons of the predicted genes among homoeologous contigs. Segments showing no homology with known genes were excluded from this table. The EST information corresponding to the predicted genes was also included in Supplementary Table S2 after three BLAST searches (G. max Genome Database, nucleotide blast, and tblastn) were run. Eight regions were unique in GmA, whereas GmA′ had 13 unique regions. However, 22 of the 45 segments of MtA were not similar to any segment from the two soybean homoeologous regions. Twenty-seven pairwise comparisons between soybean homoeologous regions are also listed in Supplementary Table S2. When each pairwise comparison was searched with BLASTP, nine of the 27 comparisons between soybean homoeologous regions showed >90% identity with putative genes, with very low e-values. However, a wide range of conservation levels was detected in the putative promoter regions and the introns between GmA and GmA′, averaging 59.7% and 54.9% similarity, respectively (Supplementary Table S2). Many homoeologous regions contained kinases and proteins related to signal transduction pathways. Interestingly, some unique segments from GmA and GmA′ showed homology with transposase (GmA′_07) or integrase (GmA_13, 14 and 15) (Supplementary Table S2). Among the 27 pairwise comparisons between soybean homoeologous regions, nine showed alignment with MtA (Supplementary Table S2).

Using the maximum-likelihood method in the CODEML program,[39] synonymous (Ks) and non-synonymous (Ka) distances were estimated. This method was based on the F3×4 model of codon substitution,[38] which accounts for both transition/transversion and codon usage biases. Supplementary Table S3 shows the results of the analysis between homologous gene pairs from each contig, along with percent identity of amino acid and cDNA sequences. The median Ka value (0.0426) between the two soybean contigs was about 3.5 times smaller than the median Ks value (0.1472). The Ka/Ks ratio was higher than 1 in only one case (1.5479, GmA_39 versus GmA′_40) (Supplementary Table S3). This Ka/Ks ratio might be non-significant because of the moderate length of the exons (282, 51, and 159 bp). When the substitutions per synonymous site (Ks) were plotted against the fraction of duplication events, secondary peaks were observed in the distribution for the two contigs (data not shown). A first secondary peak (mode Ks = 0.10–0.15) followed by two smaller bulges (mode Ks = 0.20–0.25 and 0.30–0.35) was displayed, indicating a burst of gene duplications in soybean. For GmA versus MtA, the median Ka value was 0.2888, which is 2.7 times smaller than the median Ks value (0.7755). MtA_12 showed homology with both GmA_21 and GmA_23. The Ks value for GmA_23 versus MtA_12 was extremely high because they aligned along only 107 amino acids (Supplementary Table S3). The median Ka value (0.2876) between GmA′ and MtA was about 2.8 times smaller than the median Ks value (0.8003).

To determine the timing of the duplication event giving rise to the two contigs, the Ks value was used. Synonymous substitutions are thought to be evolutionarily neutral because the mutations cause no amino acid change[35] and therefore accumulate stochastically over time. Ks values less than 0.05 and greater than 1 were not included when searching for mixtures of normal distributions.[19] Divergence times (T) were estimated from the Ks values, assuming a mutation rate of 6.1 × 10⁻⁹ substitutions per synonymous site per year.[18,40] T ranged from 5.55 to 39.92 MYA between GmA and GmA′, and the median T was 12.3 MYA, corresponding to a low Ks value (0.1498) (Fig. 3, Supplementary Table S3). For comparisons with MtA, including only values with 0.05 < Ks < 1.0, median Ks values were 0.7654 (0.5986 to 0.9342) and 0.6877 (0.4592 to 0.9128) for GmA and GmA′, respectively. Therefore, MtA and the soybean homoeologous contigs diverged at 56.4–62.7 MYA (Fig. 3, Supplementary Table S3). The two soybean homoeologous contigs were duplicated more recently, in agreement with the previous study.[17]
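The filtering and dating steps in this section can be sketched as follows. The per-gene Ks list is invented purely to show the filter (the real values are in Supplementary Table S3), and the function names are hypothetical; only the two median Ks values for the Medicago comparisons and the substitution rate come from the text.

```python
from statistics import median

RATE = 6.1e-9  # substitutions per synonymous site per year [18,40]


def filter_ks(values, low=0.05, high=1.0):
    """Drop Ks values below `low` or above `high`, as done before dating."""
    return [ks for ks in values if low <= ks <= high]


def divergence_time_mya(ks, rate=RATE):
    """T = Ks / (2 * rate), expressed in million years."""
    return ks / (2 * rate) / 1e6


# Invented values, for illustration only: the first and last are filtered out.
illustrative_ks = [0.03, 0.62, 0.77, 0.91, 1.40]
kept = filter_ks(illustrative_ks)
print(kept, median(kept))

# Median Ks values reported in the text for the Medicago comparisons.
for label, ks in [("GmA vs MtA", 0.7654), ("GmA' vs MtA", 0.6877)]:
    print(f"{label}: {divergence_time_mya(ks):.1f} MYA")
```

The two reported medians reproduce the 56.4–62.7 MYA range given for the Medicago–soybean divergence.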

Figure 3

Ks values and estimation of evolutionary events in three contigs. The numbers shown above the self-comparison diagonal are the median Ks values. Supplementary Table S3 shows statistics for Ks values between homologous regions of the three contigs (GmA, GmA′ and MtA). Divergence times in millions of years, calculated as T = Ks/(2 × 6.1 × 10⁻⁹),[18,40] are shown below the self-comparison diagonal. Colored boxes represent different evolutionary events: orange, Medicago–soybean speciation; sky blue, segmental duplication in soybean. Estimated dates of speciation and duplication events are given in the phylogenetic tree.

Discussion

Paleopolyploidy of the soybean genome

Diploidization or gene duplication is a process of switching from tetrasomic to disomic inheritance and a common process in plant genome evolution.[13,46] Many studies have suggested that paleopolyploidy is a common phenomenon in most plant species.[17,18] Previous studies suggested that the soybean genome has undergone two or more large-scale duplications and is probably an ancient polyploid.[8,11,18,47] In the present study, we identified and evaluated the duplication events in the soybean genome using two contigs. In the process of full DNA sequencing of BAC clones, nucleotide sequences of gmw1-20O10 and gmw1-89M01 did not align perfectly with GmA (gmw1-29F06_gmw1-24M16). Therefore, another round of BAC library screening was performed to close the gap between gmw1-20O10 and gmw1-89M01, and GmA′ (gmw1-20O10_gmw2-77P21_gmw1-89M01) was assembled after selection of gmw2-77P21 (Fig. 1). Although BAC-end sequences were used for BAC-by-BAC selection, the alignment of our two contigs was not perfect, and gaps in the alignment were observed. To locate these two homologous contigs, SNP genotyping was performed with the one SNP between Pureunkong (deletion) and Jinpumkong 2 (A). This SNP marker locus for GmA′ was located 1.9 cM away from Satt684 on LG A1, and SSR marker analysis positioned Satt684 between 64 495 and 64 682 bp of GmA′ (Fig. 1). Linkage maps from the Soybean Breeders Toolbox (http://soybase.org) were compared to locate the duplicated region (GmA versus GmA′). A comparison between the soybean composite maps for LG A1 and D2 indicated that homoeologous regions exist between them. In addition to the two homoeologous contigs, five RFLP markers were also common to the two LGs. These results suggest that GmA and GmA′ are indeed homoeologous.

Genome dynamics among homoeologous regions

Both soybean homoeologous contigs showed a gene density of 1 gene per 5 kb in the present study (Fig. 2). The gene density of soybean on LG G, estimated from 28 BAC ends[11] and subclone sequences from two contigs, was approximately 1 gene per 14 kb. Arabidopsis and tomato showed similar gene density, with an average of 1 gene per 5 kb, although there is more than a sevenfold difference in genome size between these two species.[11,48] Wheat, barley, and rice were compared near the Lrk locus and showed a maximal density of 1 gene per 4–5 kb.[49] An unusual relationship between physical distance and genetic distance for the homoeologous region on LG A1 was revealed by comparing the physical map and the genetic map in the distal region of LG A1 (Fig. 1). The physical distance between two markers (gmw1-89M01-32 and Satt684) on GmA′ was ∼100 kb, whereas their genetic distance was 1.9 cM on the genetic map of LG A1. The resulting physical-to-genetic distance ratio of 52.6 kb/cM suggests that this homoeologous region is a high-recombination region, for two reasons: gene duplications, including dispersed gene duplicates (single-gene duplicates located on different chromosomes), occur more often in high-recombination regions,[50] and distal chromosomal regions show high recombination rates.[50] The rice genome also showed a high recombination frequency, with 244 kb as the average physical distance per centimorgan, although the physical-to-genetic distance ratio differed depending on position along the chromosome.[51] High-resolution mapping with various markers and genome sequencing would clarify the relationship between physical distance and genetic distance for this homoeologous region on LG A1.

Genic regions of the two soybean contigs and MtA retained gene structure in both order and orientation (Fig. 2). Similar conservation has been observed previously.[13,17,52] Sequence similarity was >60% in intergenic regions of most soybean homologous regions. This high level of similarity was also seen in previous studies,[11,13,52] suggestive of either a relatively recent duplication or concerted evolution.[53] A homology search using BLASTP identified high homology with kinases and proteins related to signal transduction pathways within the homoeologous regions. Among them, the first aligned segment (GmA_01 versus GmA′_10) was very similar to the putative receptor-like protein kinase INRPK1 (Ipomoea nil receptor-like protein kinase), which shows homology with Xa21, the rice pathogen recognition receptor.[54] INRPK1 and Xa21, as typical RPKs, are composed of an extracellular leucine-rich repeat domain, a transmembrane domain, and a cytoplasmic kinase domain. Twenty-one dominant loci responsible for resistance to bacterial blight disease indicated that a multigene family is involved in resistance to Xanthomonas oryzae pv. oryzae. Sequence analysis of two classes of Xa21, defined at the predicted amino acid sequence level, suggested that duplication played a role in the evolution of the Xa21 gene family.[55] Thus, as with Xa21, important defense-related genes contained in the soybean homoeologous regions could have been duplicated. However, it is difficult to say that Rxp encodes an RPK that recognizes the pathogen (Xag) through its extracellular domain and transduces the pathogen attack signal through an intracellular kinase, because INRPK1 showed only 30% homology with Xa21 at the amino acid sequence level.

Supplementary Table S2 provides information on insertion/deletion of sequences in terms of genome dynamics in Medicago and soybean. Several blocks of segments from each contig showing no homology with any other homologous regions were identified, indicating insertion or deletion of sequences within homologous contigs. Some segments draw attention because they were similar to transposase from Medicago (GmA′_07) and integrase from Medicago (GmA_13, 14 and 15). Potential transposable element (TE) activity is interesting because TEs could be used for transposon mutagenesis in gene cloning and functional genomics in plants.[56,57] Many researchers have put effort into identifying and cloning soybean TEs by finding mutable alleles because no active TE has yet been isolated from soybean.[57] The w4-mutable line, which contains an autonomous TE at the W4 locus, and the unstable k2 Mdh1-n y20 chromosomal region, caused by a non-autonomous TE, have been identified in soybean.[57,58] Interestingly, this unstable chromosomal region was mapped on LG H, and the Mdh gene is also located on LG D2,[30] near the Rxp locus studied in this research.
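The physical-to-genetic distance ratio discussed earlier in this section for the LG A1 region is a single division of the two reported distances. The sketch below reproduces it and compares it with the rice average quoted from ref. 51; the function name is hypothetical, and the ∼100 kb physical distance is itself an approximation in the text.

```python
def kb_per_cm(physical_kb, genetic_cm):
    """Physical-to-genetic distance ratio in kb per centimorgan."""
    if genetic_cm <= 0:
        raise ValueError("genetic distance must be positive")
    return physical_kb / genetic_cm


# ~100 kb between gmw1-89M01-32 and Satt684, 1.9 cM apart on LG A1.
soybean_a1 = kb_per_cm(100.0, 1.9)
rice_average = 244.0  # kb per cM, as cited in the text [51]

print(f"LG A1 region: {soybean_a1:.1f} kb/cM")
print(f"Rice average is {rice_average / soybean_a1:.1f}x larger")
```

A smaller kb/cM value means more recombination per unit of physical distance, which is the basis for describing this region as a high-recombination region.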

Genome evolution

With comparisons of the sequences of the same gene from two species or gene family, counting of the number of non-synonymous changes (amino acid sequences change) and synonymous change (no change of amino acids sequences) is a good indicator of the degree of divergence between two sequences.[59] A total of 23 homologous regions between two soybean contigs showed Ka values ranging from 0.0168 to 0.2807 and Ks values ranging from 0.0318 to 0.4797 (Supplementary Table S3). It suggested that the Ka/Ks ratio could test for assessing the protein-coding potentials of genomic regions.[60] Depending on Ks as the background rate of evolution, the selection pressure in protein-coding regions could be explained by deviations of this ratio.[59] In this study, the ratio was less than 1 except for one comparison (GmA_39 versus GmA′_40) because of very low Ks. Short length of exons or highly divergent sequences (<70% nucleotide identity) could cause the Ka/Ks ratio to be higher than 1.[60] Although their percentage of identity at nucleotide level was >95%, this homologous region (GmA_39 versus GmA′_40) showed very low Ks because of very short length of exon 2 (50 bp). Paleopolyploid species like soybean indicate the presence of duplications by showing secondary peaks in the age distributions of paralogous pairs.[17] Three secondary peaks were observed in this study. Distributions were indicated with a major and first peak at mode Ks = 0.10–0.15. These data were consistent, and the first secondary peak at the same mode also identified after comparisons of pairs of paralogous genes in 14 model plant species including soybean.[17] In addition, two minor bulges were identified in our study, indicating additional duplication events in this homologous region between two contigs. 
To estimate the divergence time of the gene duplication, the Ks values were used, assuming a synonymous substitution rate of 6.1 × 10⁻⁹ substitutions per synonymous site per year.[18,40,61] In this study, the soybean homologous region was duplicated mainly at 12.3 MYA, and a speciation event separating soybean from Medicago at 60 MYA was also suggested (Fig. 3), in agreement with the rapid diversification of legumes 50–60 MYA.[2,61] However, the estimated ages of the secondary peaks depend on the assumed substitution rate. The rate used above corresponds to an estimate of 6.1 silent-site substitutions per silent site per billion years.[18,40,61] By contrast, a synonymous rate of 1.5 × 10⁻⁸ substitutions per synonymous site per year for dicots has also been used to calculate absolute dates for duplication events.[17,62] With a synonymous rate of 1.5 × 10⁻⁸, the average divergence time was 6.3 MYA in this study (median = 5.0 MYA), similar to the estimate for the recent duplication (3.3–5.0 MYA) in soybean.[17] These estimates are only approximations, however, because the rate of synonymous substitution differs among genes and species, and generation time also affects the mutation rate.[17]

Our study provides additional evidence of the paleopolyploidy of the soybean genome. We also showed that the organization and sequence of the duplicated segments were very similar. In this study, the homoeologous regions were so similar that the contig on LG A1 was originally sequenced instead of that on LG D2, even though BAC-end sequences located near the Rxp locus on LG D2 were used for BAC selection in genome sequencing. Thus, in future studies, to avoid walking in the wrong direction, BAC-by-BAC soybean genome sequencing should be performed in concert with whole-genome physical mapping, because of the high level of similarity between homoeologous contigs.
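The dependence of the divergence-time estimates on the assumed substitution rate can be illustrated with a short Python sketch. It assumes the standard relation T = Ks / (2r), where r is the synonymous substitution rate per site per year; the text does not state this formula explicitly, and the function name and the example Ks value of 0.15 (the upper edge of the reported mode) are chosen here only for illustration.

```python
def divergence_time_mya(ks: float, rate_per_site_per_year: float) -> float:
    """Estimate divergence time in million years ago (MYA).

    Assumes T = Ks / (2 * r): both duplicated copies accumulate
    synonymous substitutions independently, hence the factor of 2.
    """
    if rate_per_site_per_year <= 0:
        raise ValueError("substitution rate must be positive")
    years = ks / (2.0 * rate_per_site_per_year)
    return years / 1e6


# The two rates discussed in the text (substitutions per synonymous site per year).
RATE_LEGUME = 6.1e-9
RATE_DICOT = 1.5e-8

ks_example = 0.15  # illustrative value, not a measured data point
for label, rate in (("6.1e-9", RATE_LEGUME), ("1.5e-8", RATE_DICOT)):
    t = divergence_time_mya(ks_example, rate)
    print(f"rate {label}: about {t:.1f} MYA")
```

Under this assumed relation, the same Ks yields an estimate roughly 2.5 times younger with the faster dicot rate, which is why the absolute dates should be treated as approximations.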

Supplementary Data

Supplementary data are available online at www.dnaresearch.oxfordjournals.org.
