Jessica A Schlueter, Jer-Young Lin, Shannon D Schlueter, Iryna F Vasylenko-Sanders, Shweta Deshpande, Jing Yi, Majesta O'Bleness, Bruce A Roe, Rex T Nelson, Brian E Scheffler, Scott A Jackson, Randy C Shoemaker.
Abstract
BACKGROUND: Soybean, Glycine max (L.) Merr., is a well-documented paleopolyploid. What remains relatively undercharacterized is the level of sequence identity in retained homeologous regions of the genome. Recently, the Department of Energy Joint Genome Institute and the United States Department of Agriculture jointly announced the sequencing of the soybean genome. One of the initial concerns is the extent to which sequence identity in homeologous regions will affect whole genome shotgun sequence assembly.
Year: 2007 PMID: 17880721 PMCID: PMC2077340 DOI: 10.1186/1471-2164-8-330
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
General BAC information
| BAC | Linkage group | Genbank accession | SNP ID (b) | Length (bp) | Phase | Gap | ORFs (c) | Average EST coverage (d) | Ratio of EST-based coverage (e) | Overall gene homeology (f) | Gene density (g) |
| gmw2-133d1 | F | | 8001 | 117591 | III | 0 | 13 | 32.6 | 38.2 | 3 of 13 | 1/9.05 |
| gmw1-93l19 | M | | | 51037 | III | 0 | 5 | 62.4 | 50.5 | 3 of 5 | 1/10.2 |
| gmw1-105h23 | O | | 30491 | 134287 | III | 0 | 18 | 82.0 | 76.4 | 18 of 18 | 1/7.46 |
| gmw1-15k6 | I | | 26051 | 148858 | III | 0 | 22 | 77.0 | 71.1 | 18 of 22 | 1/6.77 |
| gmw1-11j16 | L | | | 69947 | III | 0 | 9 | 82.2 | 83.0 | 2 of 9 | 1/7.77 |
| gmw1-45m6 | (a) | | | 143028 | III | 0 | 7 | 53.6 | 53.0 | 1 of 7 | 1/20.4 |
| gmw1-5g16 | O | | | 115953 | II | 2 | 11 | 74.0 | 68.8 | 4 of 11 | 1/9.66 |
| gmw1-103e11 | I | | | 89397 | III | 0 | 12 | 78.6 | 81.3 | 4 of 12 | 1/7.45 |
| gmw1-58k3 | O | | | 177331 | II | 2 | 8 | 50.7 | 47.5 | 3 of 8 | 1/22.2 |
| gmw1-57d24 | D1a | | 20113 | 162359 | II | 2 | 19 | 75.0 | 71.5 | 3 of 19 | 1/9.02 |
| gmw1-27d20 | D1b | | 16079 | 227022 | I | 6 | 24 | 65.4 | 61.9 | 3 of 24 | 1/9.46 |
| gmw1-74i13 | C1 | | 5981 | 173654 | III | 0 | 18 | 68.3 | 70.4 | 13 of 18 | 1/9.65 |
| gmw1-52d3 | C2 | | | 98675 | III | 0 | 10 | 59.2 | 62.1 | 9 of 10 | 1/9.87 |
| gmw1-13o17 | D1a | | | 89030 | II | 5 | 9 | 41.5 | 48.0 | 1 of 9 | 1/11.1 |
| gmw1-8g7 | (a) | | | 53292 | III | 0 | 4 | 32.6 | 30.7 | 1 of 4 | 1/13.3 |
| UMb001-24d13 | E | | 13567 | 111223 | II | 1 | 8 | 84.0 | 79.3 | 3 of 8 | 1/13.9 |
| UMb001-5f5 | A2 | | 42937 | 65475 | II | 2 | 5 | 91.9 | 94.6 | 3 of 5 | 1/10.9 |
| Average | | | | 119303 | | | 14 | 59.1 | 59.05 | | 1/11.1 |
a Unmappable; no polymorphic SSRs identified or any matches of CDS to SNP data
b SNP IDs are taken directly from Choi et al. (2007). The EST sequences from which the SNPs were derived are given in Methods and Materials.
c Does not include ORFs that are alternatively spliced
d An average across the BAC of the number of bp supported by an EST or cDNA divided by the total number of bp for each annotation
e A ratio of the total number of bp on the BAC that are annotated divided by the total number of bases that have EST or cDNA support
f Count is based upon the number of homeologs shared between BACs out of the total number of genes
g Gene density is expressed as 1 gene per x kilobases
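The gene density column can be checked from the Length and ORFs columns. The following is a minimal Python sketch, assuming that density is simply BAC length in kilobases divided by the number of predicted ORFs; the function name `gene_density_kb` is illustrative and does not come from the paper.

```python
def gene_density_kb(length_bp: int, n_orfs: int) -> float:
    """Return kilobases per gene, i.e. the x in '1 gene per x kb'."""
    if n_orfs <= 0:
        raise ValueError("n_orfs must be positive")
    return length_bp / n_orfs / 1000.0


# Length (bp) and ORF counts copied from the first three rows of the table.
bacs = {
    "gmw2-133d1": (117591, 13),
    "gmw1-93l19": (51037, 5),
    "gmw1-105h23": (134287, 18),
}

for name, (length_bp, n_orfs) in bacs.items():
    # Three significant figures, matching the precision used in the table.
    print(f"{name}: 1/{gene_density_kb(length_bp, n_orfs):.3g}")
```

For these three BACs the sketch gives 1/9.05, 1/10.2 and 1/7.46, in agreement with the tabulated values.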
Figure 1. Summary of genic conservation from putative homeologous BACs in soybean. Duplicate genes from six soybean BACs (3 different pairs) show the range of gene conservation found in the soybean genome. Each block-arrow represents a predicted gene structure. Black arrows are genes with no homeolog. Colored arrows are genes with a homeolog. A heat map for percent nucleotide identity shows the average nucleotide identity between duplicate genes for each conserved homeolog. Gray boxes between structures show homeologous relationships. All gene structure predictions are available online [30]. The first BAC pair has been reprinted with permission from The Plant Genome [19].
Duplicate gene homeology/paralogy between BAC pairs
| BAC homeologs | Putative function | # of exons | Coding length (a) | Nucleotide identity | Protein identity | Protein similarity | Ks | Ka | Date (Mya) |
| gmw1-74i13 gmw1-52d3 | (b) | (b) | (b) | 89.8 | 88.0 | 90.7 | 0.1490 | 0.0335 | 12.2 |
| gmw1-105h23 gmw1-15k6 | (c) | (c) | (c) | 90.7 | 88.9 | 90.4 | 0.1061 | 0.0326 | 8.70 |
| UMb001-24d13 | DNA binding | 6 | 1338 | 92.7 | 88.7 | 92.2 | 0.1177 | 0.0468 | 9.65 |
| UMb001-5f5 | DNA binding | 7 | 1473 | | | | | | |
| UMb001-24d13 | Gamma response I | 9 | 987 | 95.9 | 95.7 | 96.3 | 0.1405 | 0.0152 | 11.52 |
| UMb001-5f5 | Gamma response I | 9 | 984 | | | | | | |
| UMb001-24d13 | Selenium binding | 4 | 1881 | 56.3 | 54.6 | 56.4 | 0.1709 | 0.0575 | 14.01 |
| UMb001-5f5 | Selenium binding | 5 | 585 | | | | | | |
| gmw1-103e11 | | 7 | 510 | 96.4 | 95.8 | 97.2 | 0.0933 | 0.0188 | 7.65 |
| gmw1-5g16 | | 7 | 1002 | | | | | | |
| gmw1-103e11 | Beta-fructofuranosidase | 6 | 1944 | 94.4 | 92.7 | 94.1 | 0.0716 | 0.0276 | 5.87 |
| gmw1-5g16 | Beta-fructofuranosidase | 6 | 1956 | | | | | | |
| gmw1-103e11 | Galactinol synthase | 4 | 732 | 90.5 | 93.5 | 94.7 | 0.3208 | 0.0316 | 26.30 |
| gmw1-5g16 | Galactinol synthase | 3/4 | 669/987 | | | | | | |
| gmw1-103e11 | RAD-like protein | 6/7 | 564/900 | 96.9 | 92.9 | 97.6 | 0.0432 | 0.0442 | 3.54 |
| gmw1-5g16 | RAD-like protein | 5 | 240 | | | | | | |
| gmw2-133d1 | GTPase | 14 | 3183 | 96.9 | 98.1 | 99.1 | 0.1055 | 0.0084 | 8.65 |
| gmw1-93l19 | GTPase | 16 | 3480 | | | | | | |
| gmw2-133d1 | Cellulose synthase | 9 | 2211 | 67.6 | 65.1 | 67.0 | 0.1109 | 0.0438 | 9.09 |
| gmw1-93l19 | Cellulose synthase | 5 | 924 | | | | | | |
| gmw2-133d1 | Chain A protein | 1 | 1608 | 81.1 | 76.4 | 80.1 | 0.1856 | 0.077 | 15.21 |
| gmw1-93l19 | Chain A protein | 1 | 1452 | | | | | | |
| gmw1-13o17 | Raffinose synthase | 5 | 2277 | 66.4 | 71.5 | 81.5 | 2.5495 | 0.2051 | 208.98 |
| gmw1-8g7 | Raffinose synthase | 6 | 2190 | | | | | | |
| gmw1-57d24 | Phospholipase C | 8 | 1308 | 80.5 | 78.7 | 87.6 | 0.5457 | 0.114 | 44.73 |
| gmw1-58k3 | Phospholipase C | 8 | 1299 | | | | | | |
| gmw1-57d24 | COMT | 5 | 747 | 79.7 | 79.0 | 88.3 | 0.6442 | 0.1204 | 52.80 |
| gmw1-58k3 | COMT | 4/5 | 615/354 | | | | | | |
| gmw1-58k3 | COMT | 4/5 | 615/354 | 73.6 | 76.3 | 87.7 | 1.7076 | 0.1667 | 139.97 |
| gmw1-27d20 | COMT | 5 | 744 | | | | | | |
| gmw1-58k3 | Otubain | 6 | 1992 | 53.7 | 42.5 | 53.3 | 4.024 | 0.3023 | 329.84 |
| gmw1-27d20 | Otubain | 7 | 1860 | | | | | | |
| gmw1-57d24 | CBS | 6/8 | 399/687 | 74.9 | 73.7 | 89.5 | 2.0095 | 0.1562 | 164.71 |
| gmw1-27d20 | CBS | 8 | 678 | | | | | | |
| gmw1-57d24 | COMT | 5 | 747 | 74.1 | 81.6 | 91.0 | 1.5875 | 0.1196 | 130.12 |
| gmw1-27d20 | COMT | 5 | 744 | | | | | | |
a Coding length in base pairs based upon CDS (from start to stop not including introns).
b The values for homeologs between gmw1-74i13 and gmw1-52d3 were previously reported (Schlueter et al. 2006). Identity, similarity, Ks, Ka and dates shown are averages across BACs.
c The values for homeologs between gmw1-105h23 and gmw1-15k6 were previously reported (Schlueter et al. 2007). Identity, similarity, Ks, Ka and dates shown are averages across BACs.
d Recalculated average not including the highly divergent homeologs from gmw1-13o17, gmw1-8g7, gmw1-57d24, gmw1-58k3 and gmw1-27d20.
e Recalculated average for just the highly divergent homeologs from gmw1-13o17, gmw1-8g7, gmw1-57d24, gmw1-58k3 and gmw1-27d20.
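The Date column is consistent with the standard molecular-clock conversion, date = Ks / (2 × rate). The rate is not stated in this section. The sketch below assumes 6.1 × 10^-9 synonymous substitutions per site per year, a value chosen here only because it reproduces the tabulated dates; the function name `divergence_mya` is illustrative.

```python
# ASSUMPTION: synonymous substitution rate per site per year.
# Not stated in this section; inferred from the Ks and Date columns.
SYN_RATE = 6.1e-9


def divergence_mya(ks: float, rate: float = SYN_RATE) -> float:
    """Convert a synonymous distance Ks into a divergence date in Mya.

    The factor of 2 accounts for substitutions accumulating
    independently along both duplicate lineages.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    return ks / (2.0 * rate) / 1e6


# Ks values copied from the table (putative function, Ks).
pairs = [
    ("DNA binding", 0.1177),
    ("Galactinol synthase", 0.3208),
    ("Raffinose synthase", 2.5495),
]

for function, ks in pairs:
    print(f"{function}: Ks={ks} -> {divergence_mya(ks):.2f} Mya")
```

Under this assumed rate the three examples give 9.65, 26.30 and 208.98 Mya, matching the table. Very large Ks values, such as those of the highly divergent homeologs, are likely saturated, so the corresponding dates should be treated as rough.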
Figure 2. Reassembly of highly identical homeologous soybean BACs. Output of Phred/Phrap batch re-assembly of traces from gmw1-105h23 and gmw1-15k6 as viewed using Consed. Grey boxes represent the assembled contigs and are scaled in base pairs across each contig. Contig numbers are shown in pink boxes and are arbitrarily assigned by Phred/Phrap during sequence assembly. The blue and green boxes above each assembly show the predicted gene positions for gmw1-15k6 and gmw1-105h23, respectively. The green line-plot above each contig shows the average clone pair consistency. Sequence matches within and between contigs were determined with Cross-Match as part of Consed. Black lines within and between contigs show sequence matches that are in reverse orientation, while the orange lines show sequence matches in the same orientation. The bars between sequence matches correspond to the length of the match. Purple peak-shaped lines between contigs show clone pairs that span a gap. Below each contig is a purple line containing either blue (gmw1-15k6) or green (gmw1-105h23) tick marks; these are the tags that distinguish between traces from each BAC.
Assessment and quantification of reassembly of duplicate BAC sequences
| Assembly number | Parameters | Total # contigs | # contigs (> 100) (a) | % Coverage of old contigs (b) | % Identity to old contigs (c) | % Coverage +103e11 (d) | % Identity +103e11 (d) |
| 1 | standard | 551 | 44 | 98.52% | 99.07% | 98.44% | 97.39% |
| 2 | revise_greedy | 2538 | 45 | 91.41% | 99.08% | 92.74% | 98.43% |
| 3 | forcelevel 5 | 2140 | 40 | 96.13% | 99.21% | 95.56% | 98.52% |
| 4 | minmatch 30 | 2184 | 50 | 94.77% (e) | 98.92% (e) | 95.51% | 97.91% |
| 5 | forcelevel 3 | 2326 | 43 | 98.40% | 98.60% | 97.74% | 97.96% |
| 6 | forcelevel 5 minmatch 30 | 1781 | 43 | 88.75% (e) | 99.18% (e) | 86.17% | 98.04% |
| 7 | forcelevel 3 minmatch 30 | 1950 | 46 | 93.38% (f) | 99.18% (f) | | |
a Total number of contigs that contain greater than 100 sequence traces
b Total length of the resulting contigs (not including any overlapping regions) divided by the length of the originally assembled BAC
c Percent identity as calculated from Vmatch
d Recalculated percent coverage and percent identity to include contigs containing traces from gmw1-103e11; these contigs did not meet the 80% sequence identity cutoff for Vmatch
e One contig from gmw1-103e11 met the cutoff criteria of 80% sequence identity for Vmatch and was included in this estimation. The second contig was included in the +103e11 calculations
f This parameter set matches the parameter set that was determined to give the best reassembly of gmw1-103e11 as a single BAC reassembly. Both resulting contigs met the 80% sequence identity cutoff for Vmatch and are included in these averages.
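Footnote b defines coverage as the non-overlapping length of the new contigs divided by the length of the original BAC. One way to sketch that calculation in Python is to merge overlapping match intervals before summing. The interval representation, the function names and the example coordinates below are all hypothetical; the paper's own values were derived from Vmatch output, which is not reproduced here.

```python
from typing import Iterable, List, Tuple

# Half-open [start, end) coordinates in bp on the originally assembled BAC.
Interval = Tuple[int, int]


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Collapse overlapping or touching intervals so no base is counted twice."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def percent_coverage(intervals: Iterable[Interval], bac_length_bp: int) -> float:
    """Percent of the original BAC covered by the reassembled contigs."""
    if bac_length_bp <= 0:
        raise ValueError("bac_length_bp must be positive")
    covered = sum(end - start for start, end in merge_intervals(intervals))
    return 100.0 * covered / bac_length_bp


# HYPOTHETICAL contig placements on a 100,000 bp BAC (not data from the paper).
hits = [(0, 40_000), (35_000, 70_000), (72_000, 98_000)]
print(f"{percent_coverage(hits, 100_000):.2f}%")
```

Merging first means the 5,000 bp shared by the first two hypothetical contigs is counted once, which is the behaviour footnote b describes.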
Figure 3. Repetitive sequences in BAC gmw1-103e11. Gene positions and repetitive sequences found in the region of 30,000 bp to 53,000 bp on gmw1-103e11. Predicted gene structures are shown as green boxes and arrows, with the boxes representing exons and the lines representing introns. Black tick marks on a gene show the start positions of repeated PPR domains within the gene. The blue boxes show the repetitive sequences identified by Vmatch. Orange gene alignments reflect the realignment of predicted gene structures back to the genomic sequence.
Figure 4. Sequence composition of highly represented sequences in a small subset of JGI sequence traces. A pie-chart representation of repetitive sequences from assembly of 80,000 JGI soybean whole-genome shotgun trace files. BAC corresponds to any contig that showed greatest identity to already assembled soybean BAC sequence. Mdh refers to a previously sequenced region of soybean containing repetitive sequence. No hit means that there was no BLAST-based match to the nonredundant database. Other indicates a best match to an uncharacterized sequence (BAC or genomic) from another organism. Satellite refers to known Sb92 or Str120 centromeric repeat sequences. The rest of the categories are as described in the figure legend.