Genis Parra, Keith Bradnam, Zemin Ning, Thomas Keane, Ian Korf.
Abstract
Genome sequencing projects have been initiated for a wide range of eukaryotes. A few projects have reached completion, but most exist as draft assemblies. As one of the main reasons to sequence a genome is to obtain its catalog of genes, an important question is how complete or completable the catalog is in unfinished genomes. To answer this question, we have identified a set of core eukaryotic genes (CEGs) that are extremely highly conserved and that we believe are present in low copy numbers in higher eukaryotes. From an analysis of a phylogenetically diverse set of eukaryotic genome assemblies, we found that the proportion of CEGs mapped in draft genomes provides a useful metric for describing the gene space, and complements the commonly used N50 length and x-fold coverage values.
Year: 2008 PMID: 19042974 PMCID: PMC2615622 DOI: 10.1093/nar/gkn916
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Reducing the number of orthologs in the original set of 458 CEGs
| 458 CEGs | 248 CEGs | |||||
|---|---|---|---|---|---|---|
| Average number of orthologs per CEG | Percentage CEGs with more than one ortholog | Percentage CEGs with more than two orthologs | Average number of orthologs per CEG | Percentage CEGs with more than one ortholog | Percentage CEGs with more than two orthologs | |
| 2.49 ± 1.89 | 65.7 | 34.7 | 2.04 ± 1.47 | 52.4 | 21.3 | |
| 1.34 ± 0.80 | 22.4 | 6.7 | 1.17 ± 0.55 | 11.1 | 2.3 | |
| 1.32 ± 0.69 | 22.9 | 6.5 | 1.16 ± 0.45 | 12.7 | 2.8 | |
| 2.84 ± 2.67 | 62.4 | 37.3 | 2.13 ± 1.73 | 49.6 | 23.2 | |
| 1.31 ± 0.65 | 23.8 | 4.8 | 1.10 ± 0.35 | 8.8 | 0.8 | |
| 1.20 ± 0.49 | 17.2 | 3.2 | 1.11 ± 0.39 | 8.8 | 2.0 | |
For the sets of 458 and 248 CEGs, the average number of orthologs per CEG and the percentages of CEGs with more than one and more than two orthologs are listed. SDs are shown for the average number of orthologs per CEG.
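The three quantities reported per species in the table above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input list `toy_counts` is invented, the function name `ortholog_summary` is hypothetical, and the source does not say whether the SD is a sample or population value (the population form is used here).

```python
from statistics import mean, pstdev


def ortholog_summary(counts):
    """Summarise ortholog counts (one integer per CEG) for a single species.

    Assumption: the population standard deviation is used; the source
    does not state which form of SD was reported.
    """
    n = len(counts)
    if n == 0:
        raise ValueError("counts must be non-empty")
    return {
        "mean": mean(counts),
        "sd": pstdev(counts),
        # Share of CEGs with at least two orthologs.
        "pct_more_than_one": 100.0 * sum(c > 1 for c in counts) / n,
        # Share of CEGs with at least three orthologs.
        "pct_more_than_two": 100.0 * sum(c > 2 for c in counts) / n,
    }


# Invented counts for eight CEGs in one hypothetical species.
toy_counts = [1, 1, 2, 3, 1, 1, 4, 1]
summary = ortholog_summary(toy_counts)
```

Running the same function on a reduced gene set (for example, after discarding CEGs with many orthologs) would be expected to lower all three values, which is the trend the table shows between the 458 and 248 CEG sets.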
Assembly statistics and results of mapping 248 CEGs in C. briggsae, H. sapiens and T. gondii
| Contigs | Scaffolds | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Species | Assembly details | N50 length (Kb) | Total number of genes (%) | Number of mapped CEGs (%) | N50 length (Kb) | Total number of genes (%) | No. of mapped CEGs (%) | ||
| 2× | 34 456 | 2.3 | 9912 (46.1) | 110 (44.3) | 10 297 | 14.2 | 11 006 (57.0) | 132 (53.2) | |
| 108 Mb | 4× | 20 421 | 7.4 | 15 372 (79.7) | 200 (80.6) | 2268 | 16.4 | 17 738 (91.9) | 226 (91.1) |
| 19 296 genes | 6× | 11 399 | 16.4 | 17 470 (90.5) | 227 (91.5) | 1028 | 465 | 18 809 (97.4) | 238 (95.9) |
| 8× | 7 363 | 28.9 | 18 311 (94.8) | 231 (93.1) | 971 | 983 | 19 071 (98.8) | 241 (97.2) | |
| 10× | 5614 | 37.4 | 18 578 (96.3) | 243 (97.9) | 675 | 1032 | 19 106 (99.0) | 245 (98.8) | |
| CB25 12× | 5341 | 40.7 | 18 530 (96.0) | 239 (96.4) | 899 | 474 | 19 141 (99.1) | 244 (98.3) | |
| 0.7× | 39 143 | 0.8 | 889 (11.4) | 10 (4.0) | – | – | – | – | |
| 63 Mb | 1× | 45 663 | 1.1 | 1 499 ( | 19 (7.6) | – | – | – | – |
| 7 793 genes | 2× | 36 333 | 2.8 | 3 813 (48.9) | 82 (33.0) | – | – | – | – |
| 4× | 10 594 | 13.9 | 6358 (81.5) | 163 (65.3) | – | – | – | – | |
| 6× | 4 198 | 95.7 | 7 557 (96.9) | 199 (80.2) | 586 | 1000 | 7745 (99.3) | 212 (85.6) | |
| 10× | 3 922 | 397 | 7 678 (98.3) | 207 (83.5) | 669 | 2474 | 7793 (100) | 213 (85.9) | |
| draft 1.9× | 590 603 | 3.1 | 4 963 (20.9) | 52 (21.0) | 130 283 | 51.9 | 7930 (33.4) | 105 (42.3) | |
| 3 253 Mb | draft 3× | 795 203 | 4.1 | 7 414 (31.2) | 88 (35.5) | 449 727 | 13.5 | 10 189 (43.0) | 125 (50.4) |
| 23 713 genes | draft 4.2× | 435 593 | 13.1 | 12 006 (50.6) | 142 (57.2) | 81 459 | 2425 | 19 333 (81.5) | 225 (90.7) |
| draft 5.3× | 368 201 | 14.7 | 13 009 (54.8) | 148 (59.7) | 28 863 | 692 | 20 557 (86.7) | 228 (91.9) | |
| draft 6× | 296 517 | 19.1 | 12 739 (53.7) | 149 (60.1) | 62 471 | 436 | 18 769 (79.2) | 212 (85.4) | |
| draft 6.6× | 292 555 | 28.8 | 15 592 (65.7) | 179 (72.2) | 77 769 | 8217 | 20 978 (88.5) | 238 (95.9) | |
| draft 7.1× | 131 620 | 44.3 | 16 205 (68.3) | 198 (79.8) | 16 098 | 1042 | 19 895 (83.9) | 230 (92.7) | |
‘CB25’ refers to the 2002 published assembly of C. briggsae. ‘Total number of genes’ refers to the number of genes from the final (highest coverage) assembly for each species that are present in each lower coverage assembly (see Methods section). The total number of genes is listed beneath each species name, along with the estimated genome size. The ‘Mapped CEGs’ column lists numbers of the 248 CEGs that were mapped in the genome of each species. Results are shown for both contig- and scaffold-based assemblies. Figures in parentheses show values as percentages.
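The two kinds of metric compared in this table, N50 length and the percentage of mapped CEGs, can be illustrated with a minimal sketch. The N50 definition used below is the common one (the length at which the sorted sequences first account for half of the assembled bases) and is an assumption, since the source does not spell out its exact convention; the function names and the toy lengths are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a list of contig or scaffold lengths.

    Assumed definition: the length L such that sequences of length >= L
    contain at least half of the total assembled bases.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


def percent_mapped(mapped, total=248):
    """Percentage of the CEG set found in an assembly, to one decimal place."""
    return round(100.0 * mapped / total, 1)


# Toy contig lengths (arbitrary units), not taken from any real assembly.
toy_n50 = n50([2, 3, 4, 5, 6])

# 226 of 248 CEGs mapped, as in one of the scaffold rows of the table.
toy_pct = percent_mapped(226)
```

N50 describes contiguity only, whereas the mapped-CEG percentage describes how much of a conserved gene set is recoverable, which is why the two can diverge between assemblies.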
Results of mapping CEGs against the genomes of various eukaryotes
| Species | Genome size (Gb) | Coverage | Full-length mapped CEGs (%) | CEGs in annotations (%) | Full-length + partially mapped CEGs (%) | Paralogy index (%) | G1 Map (%) | G4 Map (%) | G1 Identity (%) | G4 Identity (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| Mammals (placental) | ||||||||||
| 2.532 | 7.5× | 243 (98.0) | 241 (97.2) | 247 (99.6) | 37.4 | 100 | 96.9 | 38.2 | 65.5 | |
| 3.247 | 7.1× | 244 (98.4) | 243 (98.0) | 246 (99.2) | 33.3 | 98.5 | 95.4 | 37.8 | 65.1 | |
| 3.350 | 6.6× | 240 (96.8) | 241 (97.2) | 247 (99.6) | 39.6 | 100 | 96.9 | 38.1 | 65.1 | |
| 3.097 | 5.3× | 238 (96.0) | 237 (95.5) | 248 (100) | 36.6 | 100 | 100 | 38.0 | 65.1 | |
| 3.000 | 2× | 144 (58.1) | – | 188 (75.8) | 17.4 | 69.7 | 75.4 | 36.3 | 61.1 | |
| 3.718 | 2× | 114 (46.0) | – | 170 (68.5) | 15.8 | 65.2 | 64.6 | 34.3 | 59.0 | |
| 3.414 | 1.9× | 114 (46.0) | – | 169 (68.1) | 17.5 | 65.2 | 67.7 | 33.7 | 58.7 | |
| Vertebrates | ||||||||||
| 2.073 | 6× | 185 (74.6) | 175 (70.6) | 210 (84.7) | 27.1 | 75.7 | 86.1 | 35.7 | 63.9 | |
| 1.100 | 6.6× | 208 (83.9) | 204 (82.3) | 212 (85.4) | 13.0 | 83.1 | 87.7 | 38.0 | 64.6 | |
| 1.511 | 7.7× | 237 (95.6) | 217 (87.5) | 243 (98.0) | 24.6 | 98.5 | 96.9 | 38.7 | 65.0 | |
| 0.393 | 8.7× | 243 (98.0) | 235 (94.7) | 248 (100) | 20.6 | 98.5 | 100 | 38.4 | 65.4 | |
| Insects | ||||||||||
| 0.278 | 10.2× | 245 (98.8) | 243 (98.0) | 247 (99.6) | 9.4 | 100 | 98.4 | 37.6 | 66.1 | |
| 0.231 | 7.5× | 228 (91.9) | 173 (69.7) | 243 (98.0) | 6.1 | 98.5 | 98.4 | 38.7 | 65.9 | |
| Nematodes | ||||||||||
| 0.108 | 12× | 246 (99.2) | 242 (97.6) | 247 (99.6) | 8.1 | 100 | 98.4 | 35.0 | 62.9 | |
| 0.150 | 9.5× | 245 (98.8) | – | 248 (100) | 53.5 | 98.5 | 98.4 | 34.6 | 62.1 | |
| 0.152 | 9× | 238 (96.0) | – | 245 (98.8) | 15.5 | 98.5 | 100 | 34.9 | 62.9 | |
| 0.065 | >30× | 233 (94.0) | – | 238 (96.0) | 7.7 | 97.0 | 98.4 | 34.8 | 61.5 | |
| Chordates | ||||||||||
| 0.173 | 11× | 239 (96.4) | 203 (81.8) | 243 (98.0) | 6.3 | 95.5 | 100 | 37.5 | 64.8 | |
| Plants | ||||||||||
| 0.480 | 7.5× | 244 (98.4) | 246 (99.2) | 248 (99.6) | 71.3 | 100 | 100 | 35.0 | 62.1 | |
| 0.430 | – | 244 (98.4) | 185 (74.6) | 246 (99.2) | 51.6 | 98.5 | 98.4 | 34.2 | 61.4 | |
| 0.120 | 12.8× | 231 (93.1) | 221 (89.1) | 233 (94.0) | 6.9 | 87.9 | 98.4 | 31.7 | 59.7 | |
| Fungi | ||||||||||
| 0.039 | >10× | 245 (98.8) | 236 (95.1) | 245 (98.8) | 3.7 | 97.0 | 100 | 33.3 | 58.8 | |
| 0.040 | 7× | 243 (97.9) | 237 (95.5) | 246 (99.6) | 4.1 | 98.5 | 98.4 | 33.0 | 59.3 | |
| Protozoan | ||||||||||
| 0.023 | – | 186 (75.0) | 204 (82.2) | 187 (75.4) | 4.3 | 56.1 | 96.9 | 25.6 | 52.4 | |
| 0.011 | 11× | 115 (46.4) | 135 (54.4) | 115 (46.4) | 3.4 | 18.2 | 67.7 | 26.7 | 44.7 |
Genome sizes are estimates from experimental data. Coverage refers to approximate values of sequence coverage for WGS genomes only. The ‘Full-length mapped CEGs’ column lists numbers and percentages (in parentheses) of the 248 CEGs that were mapped in the genome of each species. ‘CEGs in annotations’ refers to the number of CEGs found in the current set of gene annotations (when available) for each genome. The ‘Full-length + partially mapped CEGs’ column corresponds to the number of full-length CEGs that were mapped (column 4) plus the number of CEG fragments that were mapped. The ‘Paralogy index’ indicates the fraction of mapped CEGs for which we detected at least one potential paralog. The G1 and G4 map percentages correspond to the number of CEGs from the conservation groups (in Table 3) that have been partially mapped. The G1 and G4 identity percentages correspond to the average percentage identity of the global pairwise alignment of the predicted CEGs against the CEGs of the six original species. The latest available versions of genomes were used for this analysis (see Supplementary Table S6 for more details) apart from C. intestinalis, for which the v1.95 assembly was used.
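As a rough illustration of the ‘Paralogy index’ described above, the following sketch computes the percentage of mapped CEGs that have at least one potential paralog. The input format (a mapping from CEG identifier to the number of potential paralogs detected), the identifiers, and the function name are all hypothetical; how potential paralogs were actually detected is not covered here.

```python
def paralogy_index(paralog_counts):
    """Percentage of mapped CEGs with at least one potential paralog.

    `paralog_counts` maps each mapped CEG identifier to the number of
    potential paralogs detected for it (hypothetical input format).
    """
    mapped = len(paralog_counts)
    if mapped == 0:
        raise ValueError("no mapped CEGs supplied")
    with_paralog = sum(1 for n in paralog_counts.values() if n >= 1)
    return 100.0 * with_paralog / mapped


# Invented example: four mapped CEGs, two of which have potential paralogs.
toy_counts = {"ceg_a": 0, "ceg_b": 2, "ceg_c": 1, "ceg_d": 0}
toy_index = paralogy_index(toy_counts)
```

Because the denominator is the number of mapped CEGs rather than all 248, a low-coverage assembly with few mapped CEGs can still yield a well-defined index.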
Figure 1. Mapping results for six selected species in four subsets of core genes. Group 1 represents the least conserved of all CEGs and Group 4 the most conserved.
Figure 2. Summary of the three main patterns of results that can be expected when studying a new genome sequence. The x-axis represents whether the mapping protocol uses subsets of CEGs that are the most or least conserved.