Ronald L Frank, Cyriac Kandoth, Fikret Ercal.
Abstract
BACKGROUND: Gene family identification from ESTs can be a valuable resource for analysis of genome evolution but presents unique challenges in organisms for which the entire genome is not yet sequenced. We have developed a novel gene family identification method based on negative selection patterns (NSP) between family members to screen EST-generated contigs. This strategy was tested on five known gene families in Arabidopsis to see if individual paralogs could be identified with accuracy from EST data alone when compared to the actual gene sequences in this fully sequenced genome.
Year: 2008 PMID: 18793465 PMCID: PMC2537573 DOI: 10.1186/1471-2105-9-S9-S2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Graph representing potential paralogs with dS/dN >= 2. Edges are labeled with the dS/dN ratio, followed by the number of substitutions (Sd+Sn) observed. An edge indicates dS/dN >= 2.00; no edge indicates dS/dN < 2.00 or dS/dN = NA.
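The edge rule in the Figure 1 caption can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the function name `build_paralog_graph` and the tuple layout are hypothetical, and the sample values are a subset of the PAL comparisons tabulated in this record (whether Figure 1 depicts exactly these contigs is an assumption).

```python
THRESHOLD = 2.0  # dS/dN cutoff stated in the Figure 1 caption

# (contig_a, contig_b, dS/dN or None for NA, Sd + Sn)
# Values are a subset of the PAL comparisons; None stands for "NA".
pairs = [
    ("Contig1", "Contig4", 14.22, 57.00),
    ("Contig1", "Contig3", 7.07, 82.00),
    ("Contig1", "Contig6", None, 191.00),
    ("Contig4", "Contig3", 21.64, 345.00),
    ("Contig3", "Contig9", 0.33, 16.00),
]


def build_paralog_graph(pairs, threshold=THRESHOLD):
    """Return an adjacency dict; an edge exists only when dS/dN >= threshold."""
    graph = {}
    for a, b, ratio, substitutions in pairs:
        # Every contig becomes a node, even if it ends up with no edges.
        graph.setdefault(a, {})
        graph.setdefault(b, {})
        if ratio is not None and ratio >= threshold:
            # Edge label: ratio followed by the substitution count (Sd + Sn).
            label = f"{ratio:.2f} ({substitutions:.0f})"
            graph[a][b] = label
            graph[b][a] = label
    return graph


graph = build_paralog_graph(pairs)
for node in sorted(graph):
    print(node, graph[node])
```

In this subset, Contig3 and Contig9 share no edge (dS/dN = 0.33), which under the caption's rule means they are not flagged as distinct paralogs, while Contig6 stays isolated because its only comparison is NA.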
dS/dN calculations for phenylalanine ammonia-lyase (PAL) contigs
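| Seq 1 | Seq 2 | Sd | Sn | S | N | ps | pn | ds | dn | ds/dn | ps/pn |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Contig1 | Contig4 | 38.50 | 18.50 | 62.83 | 219.17 | 0.61 | 0.08 | 1.27 | 0.09 | 14.22 | 7.26 |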
| Contig1 | Contig3 | 42.17 | 39.83 | 65.17 | 216.83 | 0.65 | 0.18 | 1.49 | 0.21 | 7.07 | 3.52 |
| Contig1 | Contig9 | 42.33 | 36.67 | 65.50 | 216.50 | 0.65 | 0.17 | 1.48 | 0.19 | 7.73 | 3.82 |
| Contig1 | Contig6 | 52.50 | 138.50 | 63.33 | 197.67 | 0.83 | 0.70 | NA | 0.00 | NA | 1.18 |
| Contig1 | Contig8 | 53.50 | 138.50 | 63.33 | 197.67 | 0.84 | 0.70 | NA | 0.00 | NA | 1.21 |
| Contig1 | Contig7 | 45.17 | 153.83 | 62.83 | 210.17 | 0.72 | 0.73 | 2.39 | 2.80 | 0.85 | 0.98 |
| Contig4 | Contig3 | 211.83 | 133.17 | 286.33 | 985.67 | 0.74 | 0.14 | 3.22 | 0.15 | 21.64 | 5.48 |
| Contig4 | Contig9 | 143.00 | 105.00 | 191.67 | 639.33 | 0.75 | 0.16 | 3.94 | 0.19 | 21.27 | 4.54 |
| Contig4 | Contig6 | 152.50 | 490.50 | 201.50 | 665.50 | 0.76 | 0.74 | NA | 0.00 | NA | 1.03 |
| Contig4 | Contig8 | 80.33 | 225.67 | 99.83 | 314.17 | 0.80 | 0.72 | NA | 0.00 | NA | 1.12 |
| Contig4 | Contig7 | 65.00 | 233.00 | 93.17 | 326.83 | 0.70 | 0.71 | 2.00 | 2.25 | 0.89 | 0.98 |
| Contig3 | Contig9 | 1.50 | 14.50 | 197.17 | 633.83 | 0.01 | 0.02 | 0.01 | 0.02 | 0.33 | 0.33 |
| Contig3 | Contig6 | 160.50 | 486.50 | 206.83 | 660.17 | 0.78 | 0.74 | NA | 0.00 | NA | 1.05 |
| Contig3 | Contig8 | 81.83 | 222.17 | 102.83 | 311.17 | 0.80 | 0.71 | NA | 0.00 | NA | 1.11 |
| Contig3 | Contig7 | 70.50 | 228.50 | 96.00 | 324.00 | 0.73 | 0.71 | 2.90 | 2.11 | 1.37 | 1.04 |
| Contig9 | Contig6 | 150.00 | 454.00 | 196.17 | 613.83 | 0.76 | 0.74 | NA | 0.00 | NA | 1.03 |
| Contig9 | Contig8 | 81.33 | 220.67 | 103.17 | 310.83 | 0.79 | 0.71 | NA | 0.00 | NA | 1.11 |
| Contig9 | Contig7 | 71.67 | 229.33 | 96.33 | 323.67 | 0.74 | 0.71 | 3.61 | 2.17 | 1.66 | 1.05 |
| Contig6 | Contig8 | 2.00 | 4.00 | 108.33 | 305.67 | 0.02 | 0.01 | 0.02 | 0.01 | 1.42 | 1.41 |
| Contig6 | Contig7 | 68.33 | 200.67 | 97.50 | 301.50 | 0.70 | 0.67 | 2.04 | 1.64 | 1.25 | 1.05 |
| Contig8 | Contig7 | 68.33 | 199.67 | 97.50 | 301.50 | 0.70 | 0.66 | 2.04 | 1.61 | 1.27 | 1.06 |
SNAP output results for all 21 pairwise comparisons of 7 contigs in which an ORF was identified. A ds/dn value greater than 2.00 was chosen as the threshold to indicate contigs that potentially represent distinct gene family members.
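The numeric columns of the table above are related by simple arithmetic, which the following sketch reproduces. It assumes the columns are, in order, Sd, Sn, S, N, ps, pn, ds, dn, ds/dn and ps/pn, and that ds and dn are Jukes-Cantor corrections of the proportions ps = Sd/S and pn = Sn/N. These assumptions agree with the Contig1/Contig4 row to two decimals, but the function names are hypothetical and this is not the SNAP program itself.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor corrected distance; undefined (None) when p >= 0.75."""
    if p >= 0.75:
        return None
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def snap_ratios(sd, sn, s, n):
    """Derive ps, pn, ds, dn and their ratios from the observed counts.

    sd, sn: observed synonymous / non-synonymous substitutions
    s, n:   potential synonymous / non-synonymous sites
    """
    ps = sd / s
    pn = sn / n
    ds = jukes_cantor(ps)
    dn = jukes_cantor(pn)
    # A ratio is reported only when both distances are defined and dn > 0.
    ds_dn = ds / dn if ds is not None and dn else None
    ps_pn = ps / pn if pn else None
    return {"ps": ps, "pn": pn, "ds": ds, "dn": dn, "ds/dn": ds_dn, "ps/pn": ps_pn}


# Contig1 vs Contig4: defined ratio, well above the 2.00 threshold.
row_1_4 = snap_ratios(38.50, 18.50, 62.83, 219.17)
# Contig1 vs Contig6: ps exceeds 0.75, so ds (and the ratio) is undefined.
row_1_6 = snap_ratios(52.50, 138.50, 63.33, 197.67)

print({k: (round(v, 2) if v is not None else "NA") for k, v in row_1_4.items()})
print({k: (round(v, 2) if v is not None else "NA") for k, v in row_1_6.items()})
```

The second call illustrates why several rows report NA: once the proportion of synonymous differences reaches 0.75 the correction has no real solution, so no ds/dn ratio can be formed. The sketch returns `None` for any undefined distance and does not try to mimic how the table prints dn in those rows.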
a – See for explanation of abbreviations and calculations.
MapViewer locus for ESTs of NSP-generated contigs
| Putative gene | Contig | EST | MapViewer locus | Arabidopsis gene |
|---|---|---|---|---|
| GeneB | contig3 | CK121258 | AT4G39330 | AtCAD1 |
| | | CB074210 | AT4G39330 | AtCAD1 |
| GeneC | contig1 | BP561562 | ELI3-1 | AtCAD4 |
| | | BP796450 | ELI3-1 | AtCAD4 |
| | | CD530744 | ELI3-1 | AtCAD4 |
| GeneA | contig1 | AV823314 | ERF1-3 | AteRF1-3 |
| GeneB | contig3 | AV822373 | ERF1-2 | AteRF1-2 |
| | | BP803175 | ERF1-2 | AteRF1-2 |
| | | Z18188 | ERF1-2 | AteRF1-2 |
| GeneC | contig6 | AV825957 | ERF1-1 | AteRF1-1 |
| | | BE845168 | ERF1-1 | AteRF1-1 |
| GeneA | contig1 | 8720101 | PAL1 | AtPAL1 |
| | | 8736225 | PAL1 | AtPAL1 |
| GeneB | contig3 | 8722848 | AT3G10340 | AtPAL4 |
| | | 8723431 | AT3G10340 | AtPAL4 |
| | | 8728745 | AT3G10340 | AtPAL4 |
| | | 8730514 | AT3G10340 | AtPAL4 |
| | | 9780248 | AT3G10340 | AtPAL4 |
| | | 9788228 | AT3G10340 | AtPAL4 |
| GeneC | contig6 | 8690351 | PAL2 | AtPAL2 |
| | | 8724245 | PAL2 | AtPAL2 |
| | | 8725529 | PAL2 | AtPAL2 |
| | | 19869024 | PAL2 | AtPAL2 |
| | | 19869200 | PAL2 | AtPAL2 |
| | | 37426635 | PAL2 | AtPAL2 |
| GeneC | contig8 | 9786707 | PAL2 | AtPAL2 |
| | | 37426640 | PAL2 | AtPAL2 |
| GeneC | contig4 | 8719100 | PAL2 | AtPAL2 |
| | | 14580232 | PAL2 | AtPAL2 |
| | | 19855615 | PAL2 | AtPAL2 |
| | | 49165014 | PAL2 | AtPAL2 |
| | | 59667557 | PAL2 | AtPAL2 |
| GeneA | contig1 | 5761694 | AT1G18540 | AtL6A |
| | | 8724065 | AT1G18540 | AtL6A |
| | | 19802678 | AT1G18540 | AtL6A |
| GeneB | contig4 | 19868834 | AT1G74060 | AtL6B |
| | | 23303389 | AT1G74060 | AtL6B |
| GeneC | contig6 | 8714872 | AT1G74050 | AtL6C |
| GeneA | contig1 | AV518555 | VAR2 | AtFtsH2 |
| | | AV558102 | VAR2 | AtFtsH2 |
| | | AV800962 | VAR2 | AtFtsH2 |
| | | BP785237 | VAR2 | AtFtsH2 |
| GeneB | contig6 | BP626558 | FTSH8 | AtFtsH8 |
Individual ESTs of representative contigs for putative gene family members of the 5 Arabidopsis families tested were each mapped to a specific locus with NCBI MapViewer.
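One way to read the table above is as a consistency check: if a putative gene really corresponds to a single paralog, all ESTs from its contigs should map to one locus. The sketch below illustrates that check on a few of the PAL rows. The record layout and the function names `loci_per_gene` and `is_consistent` are hypothetical, and this is not a description of how the authors processed the MapViewer results.

```python
from collections import defaultdict

# (putative gene, contig, EST identifier, MapViewer locus)
# A subset of the PAL rows from the table above.
records = [
    ("GeneA", "contig1", "8720101", "PAL1"),
    ("GeneA", "contig1", "8736225", "PAL1"),
    ("GeneB", "contig3", "8722848", "AT3G10340"),
    ("GeneB", "contig3", "8723431", "AT3G10340"),
    ("GeneC", "contig6", "8690351", "PAL2"),
    ("GeneC", "contig8", "9786707", "PAL2"),
    ("GeneC", "contig4", "8719100", "PAL2"),
]


def loci_per_gene(records):
    """Collect the set of loci hit by the ESTs of each putative gene."""
    loci = defaultdict(set)
    for gene, _contig, _est, locus in records:
        loci[gene].add(locus)
    return dict(loci)


def is_consistent(records):
    """True when every putative gene maps to exactly one locus."""
    return all(len(loci) == 1 for loci in loci_per_gene(records).values())


print(loci_per_gene(records))
print(is_consistent(records))
```

In this subset the three contigs assigned to GeneC (contig6, contig8, contig4) all resolve to the PAL2 locus, so the check passes even though the gene is represented by more than one contig.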
Percent similarity for NSP-generated contigs aligned with actual ribosomal protein L6 genes
| 100 | 98 | 98 | 82 | 83 | 82 | 84 | |
| 83 | 83 | 82 | 99 | 99 | 99 | 93 | |
| 84 | 84 | 83 | 93 | 93 | 93 | 99 | |
Representative contigs for 3 putative gene family members, GeneA, GeneB, and GeneC, identified by the NSP method were aligned with actual Arabidopsis gene family members and percent similarity determined.
Percent similarity for NSP-generated contigs aligned with actual CAD genes
| NSS(a) | 99 | NSS | NSS | |
| 99 | NSS | NSS | NSS | |
| NSS | NSS | 78 | 82 | |
| NSS | 76 | 100 | 87 | |
| NSS | 72 | 84 | 100 | |
| 79 | NSS | NSS | NSS | |
| NSS | 78 | 72 | NSS | |
| NSS | NSS | NSS | NSS | |
| NSS | NSS | NSS | NSS | |
Representative contigs for 4 putative gene family members, GeneA, GeneB, GeneC, and GeneD, identified by the NSP method were aligned with actual Arabidopsis gene family members and percent similarity determined.
a – NSS: no significant similarity, as returned by the bl2seq program.
Percent similarity for NSP-generated contigs aligned with actual release factor genes
| 82 | 83 | 99 | |
| 88 | 97 | 83 | |
| 99 | 85 | 82 | |
Representative contigs for 3 putative gene family members, GeneA, GeneB, and GeneC, identified by the NSP method were aligned with actual Arabidopsis gene family members and percent similarity determined.
Percent similarity for NSP-generated contigs aligned with actual FtsH genes
| NSS | 71 | 73 | NSS | |
| 100 | 97 | 86 | 83 | |
| NSS | 78 | 70 | NSS | |
| NSS | 79 | 78 | NSS | |
| NSS | 73 | 73 | NSS | |
| 72 | 73 | 69 | 77 | |
| NSS | 68 | 73 | NSS | |
| 88 | 85 | 99 | 100 | |
| NSS | 68 | NSS | NSS | |
| NSS | 75 | 74 | NSS | |
| NSS | 77 | 77 | NSS | |
| NSS | NSS | NSS | NSS | |
Representative contigs for 2 putative gene family members, GeneA and GeneB, identified by the NSP method were aligned with actual Arabidopsis gene family members and percent similarity determined.