Einat Hazkani-Covo, William F Martin.
Abstract
Fragments of organelle genomes are often found as insertions in nuclear DNA. These fragments of mitochondrial DNA (numts) and plastid DNA (nupts) are ubiquitous components of eukaryotic genomes. They are, however, often edited out during the genome assembly process, leading to systematic underestimation of their frequency. Numts and nupts, once inserted, can become further fragmented through subsequent insertion of mobile elements or other recombinational events that disrupt the continuity of the inserted sequence relative to the genuine organelle DNA copy. Because numts and nupts are typically identified through sequence comparison tools such as BLAST, disruption of insertions into smaller fragments can lead to systematic overestimation of numt and nupt frequencies. Accurate identification of numts and nupts is important, however, both for better understanding of their role during evolution, and for monitoring their increasingly evident role in human disease. Human populations are polymorphic for 141 numt loci, five numts are causal to genetic disease, and cancer genomic studies are revealing an abundance of numts associated with tumor progression. Here, we report investigation of salient parameters involved in obtaining accurate estimates of numt and nupt numbers in genome sequence data. Numts and nupts from 44 sequenced eukaryotic genomes reveal lineage-specific differences in the number, relative age and frequency of insertional events as well as lineage-specific dynamics of their postinsertional fragmentation. Our findings outline the main technical parameters influencing accurate identification and frequency estimation of numts in genomic studies pertinent to both evolution and human health.
Keywords: cancer genomics; mitochondria; numts; nupts; organelle insertions
Year: 2017 PMID: 28444372 PMCID: PMC5570036 DOI: 10.1093/gbe/evx078
Source DB: PubMed Journal: Genome Biol Evol ISSN: 1759-6653 Impact factor: 3.416
Number of Inferred Numts (Complex Numts in Parentheses) in Different Genomes as Determined by Strict, Intermediate, and Permissive Concatenation Criteria and for Different Distances (50 bp up to 10 kb) between Organelle DNA Fragments in Nuclear DNA
| #BLAST Hits | UBB (sum/unique) | Distance | S (strict) | I (intermediate) | P (permissive) |
|---|---|---|---|---|---|
| 1390 | 620480/554892 | 50 bp | 915 (96) | 883 (122) | 845 (115) |
| | | 500 bp | 845 (154) | 821 (171) | 773 (165) |
| | | 3 kb | 813 (181) | 744 (217) | 681 (216) |
| | | 10 kb | 792 (193) | 715 (234) | 626 (237) |
| 58 | 8877/8525 | 50 bp | 49 (8) | 48 (9) | 44 (12) |
| | | 500 bp | 48 (9) | 47 (10) | 43 (13) |
| | | 3 kb | 47 (9) | 46 (10) | 42 (13) |
| | | 10 kb | 46 (10) | 45 (11) | 40 (14) |
| 3 | 330/330 | 50 bp | 3 (0) | 3 (0) | 3 (0) |
| | | 500 bp | 3 (0) | 3 (0) | 3 (0) |
| | | 3 kb | 3 (0) | 3 (0) | 3 (0) |
| | | 10 kb | 3 (0) | 3 (0) | 3 (0) |
| 178 | 39441/31990 | 50 bp | 139 (30) | 138 (31) | 138 (31) |
| | | 500 bp | 138 (31) | 136 (31) | 135 (32) |
| | | 3 kb | 136 (30) | 135 (31) | 134 (32) |
| | | 10 kb | 134 (32) | 134 (32) | 133 (33) |
| 52 | 4295/3898 | 50 bp | 45 (6) | 45 (6) | 44 (7) |
| | | 500 bp | 45 (6) | 45 (6) | 43 (8) |
| | | 3 kb | 45 (6) | 45 (6) | 43 (8) |
| | | 10 kb | 45 (6) | 45 (6) | 43 (8) |
| 829 | 66090/55833 | 50 bp | 597 (75) | 590 (80) | 569 (93) |
| | | 500 bp | 591 (77) | 578 (85) | 551 (100) |
| | | 3 kb | 589 (78) | 571 (89) | 524 (111) |
| | | 10 kb | 576 (82) | 556 (97) | 487 (130) |
| 7 | 402/344 | 50 bp | 4 (1) | 4 (1) | 4 (1) |
| | | 500 bp | 4 (1) | 4 (1) | 4 (1) |
| | | 3 kb | 4 (1) | 4 (1) | 4 (1) |
| | | 10 kb | 4 (1) | 4 (1) | 4 (1) |
| 18 | 2298/1143 | 50 bp | 9 (9) | 9 (9) | 9 (9) |
| | | 500 bp | 6 (6) | 6 (6) | 6 (6) |
| | | 3 kb | 6 (6) | 6 (6) | 6 (6) |
| | | 10 kb | 3 (3) | 3 (3) | 3 (3) |
| 6550 | 1720939/1185113 | 50 bp | 3095 (1538) | 2984 (1499) | 2660 (1325) |
| | | 500 bp | 2900 (1470) | 2768 (1431) | 2249 (1130) |
| | | 3 kb | 2820 (1468) | 2661 (1420) | 2072 (1071) |
| | | 10 kb | 2772 (1476) | 2572 (1409) | 1914 (1026) |
| 449 | 116749/114219 | 50 bp | 405 (31) | 392 (41) | 380 (48) |
| | | 500 bp | 393 (40) | 378 (49) | 360 (57) |
| | | 3 kb | 376 (48) | 366 (54) | 348 (62) |
| | | 10 kb | 364 (54) | 357 (59) | 338 (68) |
Note: The number of hits obtained by BLASTing mitochondrial DNA against the nuclear genome (BLASTN E value 0.001) is shown. UBB (Unique Bases by BLAST): values in this column give the sum of bases across all BLAST hits (before the slash) and the number of unique bases covered by those hits (after the slash). Unless otherwise indicated, the values for permissive concatenation with a maximum distance of 500 bp (shown in red in the original table) were used in this study. Additional genomes appear in supplementary table S1, Supplementary Material online.
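The two UBB values can be illustrated with a short interval-merging sketch. This is not the authors' code: the function name `sum_and_unique_bases`, the `(seq_id, start, end)` tuple layout, and the choice to count overlapping hits once per covered position are all assumptions made for illustration.

```python
from collections import defaultdict


def sum_and_unique_bases(hits):
    """Return (sum of bases, unique bases) for a list of BLAST hits.

    hits: iterable of (seq_id, start, end) with 1-based inclusive
    coordinates (hypothetical layout, not taken from the paper).
    """
    total = 0
    by_seq = defaultdict(list)
    for seq_id, start, end in hits:
        lo, hi = min(start, end), max(start, end)  # minus-strand hits may be reversed
        total += hi - lo + 1
        by_seq[seq_id].append((lo, hi))

    unique = 0
    for intervals in by_seq.values():
        intervals.sort()
        cur_lo, cur_hi = intervals[0]
        for lo, hi in intervals[1:]:
            if lo <= cur_hi:
                # Overlapping hit: extend the current merged interval.
                cur_hi = max(cur_hi, hi)
            else:
                unique += cur_hi - cur_lo + 1
                cur_lo, cur_hi = lo, hi
        unique += cur_hi - cur_lo + 1
    return total, unique


example_hits = [("chr1", 1, 100), ("chr1", 51, 150), ("chr2", 10, 19)]
print(sum_and_unique_bases(example_hits))
```

When no hits overlap, the two numbers coincide, as in the table row reporting 330/330.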
Number of Inferred Nupts (Complex Nupts in Parentheses) in Different Genomes as Determined by Strict, Intermediate, and Permissive Concatenation Criteria and for Different Distances (50 bp up to 10 kb) Between Organelle DNA Fragments in Nuclear DNA
| #BLAST Hits | UBB (sum/unique) | Distance | S (strict) | I (intermediate) | P (permissive) |
|---|---|---|---|---|---|
| 442 | 62144/47921 | 50 bp | 302 (107) | 291 (110) | 287 (110) |
| | | 500 bp | 282 (114) | 278 (114) | 270 (114) |
| | | 3 kb | 271 (120) | 267 (120) | 257 (118) |
| | | 10 kb | 257 (115) | 251 (115) | 238 (111) |
| 48 | 62028/60350 | 50 bp | 40 (4) | 40 (4) | 40 (4) |
| | | 500 bp | 38 (6) | 38 (6) | 38 (6) |
| | | 3 kb | 34 (6) | 34 (6) | 34 (6) |
| | | 10 kb | 32 (6) | 32 (6) | 32 (6) |
| 19 | 3622/3604 | 50 bp | 19 (0) | 19 (0) | 19 (0) |
| | | 500 bp | 16 (3) | 16 (3) | 16 (3) |
| | | 3 kb | 13 (6) | 13 (6) | 13 (6) |
| | | 10 kb | 10 (3) | 10 (3) | 10 (3) |
| 374 | 144842/122167 | 50 bp | 260 (68) | 259 (68) | 257 (69) |
| | | 500 bp | 256 (65) | 256 (65) | 254 (66) |
| | | 3 kb | 256 (65) | 255 (66) | 250 (68) |
| | | 10 kb | 256 (65) | 255 (66) | 250 (68) |
| 143 | 8798/4304 | 50 bp | 63 (38) | 63 (38) | 63 (38) |
| | | 500 bp | 56 (31) | 56 (31) | 56 (31) |
| | | 3 kb | 50 (25) | 50 (25) | 50 (25) |
| | | 10 kb | 48 (25) | 48 (25) | 48 (25) |
| 3289 | 326917/108834 | 50 bp | 363 (42) | 360 (44) | 345 (53) |
| | | 500 bp | 358 (47) | 356 (48) | 338 (59) |
| | | 3 kb | 353 (48) | 346 (51) | 324 (62) |
| | | 10 kb | 347 (51) | 334 (59) | 306 (71) |
| 29 | 1829/1811 | 50 bp | 29 (0) | 29 (0) | 28 (1) |
| | | 500 bp | 29 (0) | 29 (0) | 28 (1) |
| | | 3 kb | 29 (0) | 29 (0) | 28 (1) |
| | | 10 kb | 27 (1) | 26 (2) | 25 (3) |
| 168 | 190296/176916 | 50 bp | 69 (35) | 68 (34) | 64 (31) |
| | | 500 bp | 53 (35) | 53 (35) | 47 (30) |
| | | 3 kb | 38 (27) | 38 (27) | 30 (21) |
| | | 10 kb | 38 (27) | 38 (27) | 30 (21) |
| 22 | 3775/2586 | 50 bp | 13 (9) | 13 (9) | 13 (9) |
| | | 500 bp | 12 (10) | 12 (10) | 12 (10) |
| | | 3 kb | 6 (4) | 6 (4) | 6 (4) |
| | | 10 kb | 6 (4) | 6 (4) | 6 (4) |
| 2850 | 1331500/988435 | 50 bp | 1746 (666) | 1722 (674) | 1642 (651) |
| | | 500 bp | 1662 (673) | 1630 (681) | 1483 (638) |
| | | 3 kb | 1627 (674) | 1579 (681) | 1387 (619) |
| | | 10 kb | 1595 (684) | 1544 (691) | 1313 (619) |
| 305 | 58953/48812 | 50 bp | 238 (65) | 235 (66) | 233 (65) |
| | | 500 bp | 235 (67) | 232 (67) | 229 (66) |
| | | 3 kb | 228 (60) | 222 (57) | 219 (56) |
| | | 10 kb | 218 (51) | 216 (51) | 212 (51) |
| 353 | 290772/215564 | 50 bp | 24 (18) | 23 (17) | 23 (17) |
| | | 500 bp | 24 (18) | 23 (17) | 23 (17) |
| | | 3 kb | 23 (17) | 23 (17) | 23 (17) |
| | | 10 kb | 23 (17) | 23 (17) | 23 (17) |
| 860 | 314782/253383 | 50 bp | 505 (324) | 504 (325) | 499 (328) |
| | | 500 bp | 492 (322) | 492 (322) | 484 (324) |
| | | 3 kb | 373 (207) | 372 (206) | 361 (209) |
| | | 10 kb | 343 (183) | 343 (183) | 330 (185) |
Note: The number of hits obtained by BLASTing plastid DNA against the nuclear genome (BLASTN E value 0.001) is shown. UBB (Unique Bases by BLAST): values in this column give the sum of bases across all BLAST hits (before the slash) and the number of unique bases covered by those hits (after the slash). Unless otherwise indicated, the values for permissive concatenation with a maximum distance of 500 bp (shown in red in the original table) were used in this study. Additional genomes appear in supplementary table S2, Supplementary Material online.
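The distance criterion used in both tables (50 bp up to 10 kb between organelle DNA fragments in nuclear DNA) can be sketched as a simple clustering of neighboring BLAST hits. The sketch below is illustrative only: `concatenate_hits`, `count_insertions`, and the `(chrom, start, end)` layout are hypothetical names, and "complex" is assumed here to mean an insertion built from more than one hit. The strict, intermediate, and permissive criteria are not defined in this excerpt, so they are not modeled.

```python
def concatenate_hits(hits, max_distance):
    """Group BLAST hits into inferred insertions by nuclear distance.

    hits: iterable of (chrom, start, end), 1-based inclusive nuclear
    coordinates (hypothetical layout). Hits on the same chromosome whose
    gap to the current group is at most max_distance bp are merged.
    """
    ordered = sorted((chrom, min(s, e), max(s, e)) for chrom, s, e in hits)
    insertions = []
    for chrom, start, end in ordered:
        if insertions:
            current = insertions[-1]
            same_chrom = current[0][0] == chrom
            current_end = max(h[2] for h in current)
            # Gap is the number of bases strictly between the two hits.
            if same_chrom and start - current_end - 1 <= max_distance:
                current.append((chrom, start, end))
                continue
        insertions.append([(chrom, start, end)])
    return insertions


def count_insertions(hits, max_distance):
    """Return (inferred insertions, multi-hit insertions) for one distance."""
    groups = concatenate_hits(hits, max_distance)
    return len(groups), sum(1 for g in groups if len(g) > 1)


example_hits = [("chr1", 100, 200), ("chr1", 230, 300),
                ("chr1", 900, 1000), ("chr2", 10, 50)]
for distance in (50, 500, 3000, 10000):
    print(distance, count_insertions(example_hits, distance))
```

Raising `max_distance` can only merge more hits, which matches the trend in the tables: the inferred count never increases as the distance grows from 50 bp to 10 kb.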
Figure: Number of inferred (A) numts and (B) nupts obtained with different clustering stringencies and concatenation distances. Clustering stringency is shown on the x-axis; concatenation distances are color-coded (blue, 50 bp; red, 500 bp; yellow, 3 kb; purple, 10 kb). Only species with at least 400 BLAST hits are shown.
Figure: (A) A schematic phylogenetic tree and the distribution of BLAST identity scores for (B) numts and (C) nupts. Distributions are shown for permissive counting with a concatenation distance of 500 bp, for species with at least 20 inferred insertions. Concatenated numt and nupt scores were calculated as a weighted mean. The identity axis runs from 60% to 100%, and the frequency axis is shown up to 30%; each distribution sums to one. A few species have columns exceeding 30%: Chlorella (40%), Coccomyxa (40%), and Emiliania (40%) for nupts, and Cyanophora (40%) and Nematostella (70%) for numts.
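A weighted mean identity for a concatenated insertion might be computed as below. The caption does not state the weights, so weighting each hit by its alignment length is an assumption, as are the function name `weighted_mean_identity` and the `(percent_identity, alignment_length)` layout.

```python
def weighted_mean_identity(hits):
    """Length-weighted mean percent identity of the hits in one insertion.

    hits: list of (percent_identity, alignment_length) pairs
    (hypothetical layout; the weighting scheme is an assumption).
    """
    total_length = sum(length for _, length in hits)
    if total_length == 0:
        raise ValueError("at least one hit with nonzero length is required")
    return sum(identity * length for identity, length in hits) / total_length


# Two hits merged into one insertion: the longer hit dominates the score.
print(weighted_mean_identity([(90.0, 100), (80.0, 300)]))
```

With this weighting, a single-hit insertion simply keeps its own BLAST identity score.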
Figure: Difference in percent identity between BLAST hits belonging to the same insertion. (A) For each inferred insertion, the maximum difference in identity score between its separate BLAST hits was calculated. (B) Histograms of the difference in percent identity for numts and nupts. Data are from permissive concatenation with a distance of up to 500 bp.
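The per-insertion statistic described in panel (A) reduces to the spread between the highest and lowest identity among the hits of one insertion. The following sketch shows that calculation and a simple binning for a histogram like panel (B); the function names and the 5% bin width are illustrative assumptions, not values from the paper.

```python
from collections import Counter


def max_identity_difference(identities):
    """Largest difference in percent identity between hits of one insertion."""
    if len(identities) < 2:
        return 0.0  # a single-hit insertion has no within-insertion spread
    return max(identities) - min(identities)


def histogram(differences, bin_width=5):
    """Count differences per bin, keyed by the lower edge of each bin."""
    return Counter(int(d // bin_width) * bin_width for d in differences)


# Each inner list holds the identity scores of the hits in one insertion.
insertions = [[95.0, 88.5, 91.0], [99.0, 99.0], [80.0, 92.0]]
diffs = [max_identity_difference(ids) for ids in insertions]
print(diffs)
print(sorted(histogram(diffs).items()))
```

A large difference within one insertion would suggest that its fragments have diverged unevenly or were not inserted together, which is the kind of signal the figure is meant to expose.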
Repeats Identified by RepeatMasker in Complex Numts and Complex Nupts in Genomes Harboring at Least 80 Inferred Insertions
| Numts/Nupts | Complex Numts/Nupts Out of the Total Number | Number of Numts/Nupts With at Least 10 bp Spacer | Number of Numts/Nupts With Repeats | Repeats Identified |
|---|---|---|---|---|
| 773 | 165 | 92 | 77 | 92 retroelements, 12 DNA transposons, 2 small RNA, 1 satellite, 7 low complexity regions |
| 2249 | 1130 | 196 | 82 | 64 retroelements, 25 DNA transposons, 8 small RNA, 1 simple repeat, 9 low complexity regions |
| 592 | 96 | 82 | 61 | 52 SINEs, 4 LTRs, 1 DNA element, 45 small RNA, 9 simple repeats, 25 low complexity regions |
| 1483 | 638 | 113 | 37 | 4 retroelements, 13 DNA transposons, 18 small RNA, 2 simple repeats, 31 low complexity regions |
Note: Values apply to interruptions of the numt or nupt by a nonorganelle-DNA spacer of >10 bp.