Anchal Vishnoi, Rahul Roy, Alok Bhattacharya.
Abstract
Comparative genomic approaches are useful in identifying molecular differences between organisms. Currently available methods fail to identify small changes in genomes, such as expansion of short repetitive motifs, and to analyse divergent sequences. In this report, we describe an anchor-based whole genome comparison (ABWGC) method. ABWGC is based on random sampling of anchor sequences from one genome, followed by analysis of sampled and homologous regions from the target genome. The method was applied to compare two strains of Mycobacterium tuberculosis, CDC1551 and H37Rv. ABWGC identified a total of 104 indels, including 20 expansions of short repetitive sequences and five recombination events; 18 of these genomic differences had not been identified previously. ABWGC also identified 188 SNPs, including eight new ones. The method was also used to compare the M. tuberculosis H37Rv and M. avium genomes. In contrast to MUMmer, a popular tool for comparative genomics, ABWGC correctly picked 1002 additional indels (size >100 nt) between the two organisms. ABWGC also correctly identified repeat expansions and indels in a set of simulated sequences. The study also revealed an important role of small repeat expansion in the evolution of M. tuberculosis strains.
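The random sampling step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `sample_anchors`, its parameters, the default anchor length and the toy genome are all assumptions made for illustration, and the search for homologous regions in the target genome is left out.

```python
import random


def sample_anchors(genome, anchor_len=100, n_anchors=1000, seed=0):
    """Return (start, sequence) pairs for randomly placed anchors.

    Starts are drawn without replacement and returned sorted by position,
    so anchors can be numbered sequentially along the genome.
    All names and defaults here are hypothetical.
    """
    if anchor_len <= 0 or anchor_len > len(genome):
        raise ValueError("anchor_len must be between 1 and the genome length")
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    n_starts = len(genome) - anchor_len + 1
    n = min(n_anchors, n_starts)
    starts = sorted(rng.sample(range(n_starts), n))
    return [(s, genome[s:s + anchor_len]) for s in starts]


# Toy input, not real genome data.
toy_genome = "ACGTTGCA" * 100
anchors = sample_anchors(toy_genome, anchor_len=25, n_anchors=10, seed=1)
```

Sorting the start positions is a design choice of this sketch: it makes the later comparison of consecutive anchors straightforward.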
Year: 2007 PMID: 17488849 PMCID: PMC1931498 DOI: 10.1093/nar/gkm209
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Cumulative normalized score (CNS) of different sized anchors. Anchors, ranging from 25 to 300 nt, were extracted from random positions of the M. tuberculosis H37Rv genome. The homologous anchors from M. avium were identified by BLAST. The mismatch score of each anchor pair was used to calculate the CNS. The anchor numbers represent anchors numbered sequentially by their position in the genome.
Figure 2. Distribution of individual mismatch scores. For every randomly picked anchor and its homologous sequence in the target genome (T), a mismatch score was computed as described in the Methods section. The individual scores were plotted against anchors numbered sequentially by their position in the genome (S). Homologous anchors on the positive strand are shown by plus (+) and those on the complementary strand by cross (×). (a) Mycobacterium tuberculosis CDC1551 (S) and M. tuberculosis H37Rv (T); (b) M. tuberculosis CDC1551 (S) and random genome (T). Frequency distribution of the anchor scores for (c) M. tuberculosis CDC1551 (S) and M. tuberculosis H37Rv (T); (d) M. tuberculosis CDC1551 (S) and random genome (T).
Figure 3. The inter-anchor length distribution. Anchors were extracted randomly from M. tuberculosis CDC1551 as described in the Methods section and ordered by their position in the genome. The homologues of these anchors were from M. tuberculosis H37Rv. The distance between two consecutive anchors was computed for both genomes, and the inter-anchor lengths of the two genomes were then plotted. Deviation from the diagonal represents a difference in inter-anchor length between a pair of homologous anchors.
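The inter-anchor length comparison in the Figure 3 caption can be sketched as follows. This is an illustrative sketch only: the function name `inter_anchor_differences`, the `min_diff` threshold and the toy coordinates are assumptions, and it presumes that the homologous anchors lie on the same strand and in the same order in both genomes.

```python
def inter_anchor_differences(source_pos, target_pos, min_diff=1):
    """Compare distances between consecutive anchors in two genomes.

    source_pos and target_pos hold the start positions of homologous
    anchors, ordered by position in the source genome. Returns tuples
    (left_anchor, right_anchor, diff), where diff is the source
    inter-anchor length minus the target inter-anchor length.
    """
    if len(source_pos) != len(target_pos):
        raise ValueError("each source anchor needs one target position")
    events = []
    for i in range(1, len(source_pos)):
        src_gap = source_pos[i] - source_pos[i - 1]
        tgt_gap = target_pos[i] - target_pos[i - 1]
        diff = src_gap - tgt_gap
        if abs(diff) >= min_diff:
            events.append((source_pos[i - 1], source_pos[i], diff))
    return events


# Toy coordinates: the target has 36 extra bases between the 2nd and 3rd anchor.
toy_source = [100, 500, 900, 1300]
toy_target = [100, 500, 936, 1336]
events = inter_anchor_differences(toy_source, toy_target)
```

In this sketch a negative `diff` means the region is longer in the target genome and a positive `diff` means it is longer in the source genome; a pair on the diagonal of the plot corresponds to `diff == 0`.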
M. tuberculosis CDC1551 anchors that mapped to the complementary strand of M. tuberculosis H37Rv
| Anchor start position | Anchor end position | Score | Strand | CDS | Function |
|---|---|---|---|---|---|
| 674519 | 674487 | 67 | Minus | Rv0578c | PE-PGRS |
| 1788224 | 1788125 | 0 | Minus | Rv1587c | Partial REP13E12 repeat protein |
| 1787355 | 1787256 | 0 | Minus | Rv1586c | Probable phiRv1 integrase |
| 1787238 | 1787139 | 0 | Minus | Rv1586c | Probable phiRv1 integrase |
| 1786632 | 1786533 | 0 | Minus | Rv1585c | Possible phage phiRv1 |
| 1786225 | 1786126 | 0 | Minus | Rv1583c | Possible phage phiRv1 |
| 1785944 | 1785845 | 0 | Minus | Rv1582c | Possible phage phiRv1 |
| 1785225 | 1785126 | 0 | Minus | Rv1582c | Possible phage phiRv1 |
| 1785007 | 1784908 | 0 | Minus | Rv1582c | Possible phage phiRv1 |
| 1783666 | 1783567 | 0 | Minus | Rv1579c | Possible phage phiRv1 |
| 1783346 | 1783247 | 0 | Minus | Rv1579c | Possible phage phiRv1 |
| 1782637 | 1782538 | 0 | Minus | Intergenic region | |
| 1781837 | 1781737 | 0 | Minus | Rv1576c | Possible phage phiRv1 |
| 1781292 | 1781193 | 0 | Minus | Rv1576c | Possible phage phiRv1 |
| 1780682 | 1780587 | 5 | Minus | Rv1576c | Possible phage phiRv1 |
| 1532953 | 1532854 | 0 | Minus | Rv1361c | PPE family protein |
| 1469637 | 1469538 | 0 | Minus | Rv1313c | Possible transposase |
Unique SNPs of M. tuberculosis H37Rv and M. tuberculosis CDC1551
| SNP position in H37Rv | SNP | SNP position in CDC1551 |
|---|---|---|
| 1789448 | T-C | 1780353 |
| 1789513 | G-A | 1780418 |
| 2266507 | .-A | 2267858 |
| 2266511 | A-T | 2267862 |
| 2266515 | T-. | 2267866 |
| 2266520 | G-T | 2267871 |
| 2266522 | C-G | 2267874 |
| 2338775 | G-A | 2341102 |
Partial list of genomic alterations between M. tuberculosis CDC1551 and M. tuberculosis H37Rv
| Nature of event | Start position | Size | CDS | Function |
|---|---|---|---|---|
| (A) | ||||
| Insertion | 71529 | 36 | Intergenic region (2) | |
| Insertion | 150882 | 179 | MT0132 | PE PGRS family protein (1) |
| Insertion | 483384 | 1357 | MT0413 | IS6110 (2) |
| Insertion | 483384 | 1357 | MT0414 | IS6110 (2) |
| Insertion | 483384 | 1357 | MT0415 | Hypothetical protein (2) |
| Insertion | 624648 | 83 | MT0556 | PE-PGRS family protein (1) |
| Insertion | 744075 | 532 | MT0676 | Glycosyl hydrolase, family 5 (1) |
| Insertion | 804401 | 215 | MT0730 | 50S ribosomal protein L23 (2) |
| Insertion | 804401 | 215 | MT0731 | 50S ribosomal protein L2 (2) |
| Insertion | 960065 | 105 | Intergenic region (2) | |
| Insertion | 1094076 | 11 | MT1006.1 | PE PGRS family protein (2) |
| Insertion | 1096183 | 11 | MT1008 | PE PGRS family protein (2) |
| Insertion | 1121702 | 14 | MT1033 | Hypothetical protein (1) |
| Insertion | 1191499 | 192 | MT1097 | PE-PGRS family protein (1) |
| Insertion | 1213836 | 44 | Intergenic region (1) | |
| Insertion | 1442915 | 55 | Intergenic region (2) | |
| Insertion | 1480513 | 1674 | MT1360 | Adenylate cyclase (1) |
| Insertion | 1612509 | 21 | MT1479 | Hypothetical protein (1) |
| Insertion | 1632400 | 26 | MT1497.1 | PE-PGRS family protein (1) |
| (B) | ||||
| Expansion of repeat | 24704 | 18 (2) | Rv0020c | Hypothetical protein (2) |
| Insertion | 32351 | 36 | Rv0029 | Hypothetical protein (1) |
| Insertion | 206812 | 56 | Rv0175 | Probable conserved mce associated membrane protein (1) |
| Expansion of repeat | 335812 | 18 (3) | Rv0278c | PE-PGRS family protein (2) |
| Insertion | 337806 | 32 | Rv0279c | PE-PGRS family protein (2) |
| Expansion of repeat | 427312 | 15 (4) | Rv0355c | PPE family protein (2) |
| Expansion of repeat | 428188 | 59 (3) | Rv0355c | PPE family protein (2) |
| Insertion | 577286 | 57 | Rv0487 | Hypothetical protein (2) |
| Insertion | 577286 | 57 | Rv0488 | Probable conserved integral membrane protein (2) |
| Insertion | 840167 | 47 | Rv0747 | PE-PGRS family protein (1) |
| Insertion | 1212109 | 77 | Rv1087 | PE-PGRS family protein (2) |
| Insertion | 1217495 | 653 | Rv1091 | PE-PGRS family protein (1) |
| Insertion | 1267172 | 57 | Intergenic region | |
| Insertion | 1895353 | 113 | Intergenic region | |
| Insertion | 2062036 | 89 | Rv1818c | PE-PGRS family protein (1) |
| Insertion | 2074436 | 111 | Rv1829 | Hypothetical protein (2) |
| Insertion | 2074436 | 111 | Rv1830 | Hypothetical protein (2) |
| Expansion of repeat | 2163731 | 69 (3) | Rv1917c | PPE family protein (2) |
Reported by:
(1) Fleischmann et al., 2002.
(2) ABWGC.
a. Number of copies is shown in brackets.
Figure 4. Variation in repetitive motifs in M. tuberculosis H37Rv in comparison with M. tuberculosis CDC1551. Homologous inter-anchor regions of M. tuberculosis CDC1551 and M. tuberculosis H37Rv were aligned. Repetitive motifs were identified automatically. (a) The 18-mer repetitive motif is highlighted in red and blue; (b) Deletion of a segment of nucleotides containing repetitive elements. Repetitive elements are highlighted in blue.
The tandem repeats identified in M. tuberculosis CDC1551 by ABWGC and tandem repeat finder
| ABWGC indices | ABWGC size | ABWGC copy number | Tandem repeat finder indices | Tandem repeat finder size | Tandem repeat finder copy number |
|---|---|---|---|---|---|
| 1121703–1121718 | 15 | 4 | 1121702–1121771 | 15 | 4.7 |
| 1612487–1612508 | 21 | 2 | 1612426–1612545 | 21 | 5.7 |
| 1946383–1946440 | 57 | 2 | 1946383–1946530 | 57 | 2.6 |
| 1974051–1974207 | 78 | 8 | 1973745–1974339 | 78 | 7.6 |
| 2143312–2143357 | 45 | 2 | 2143312–2143406 | 45 | 2.1 |
| 2160645–2160921 | 69 | 8 | 2160645–2161098 | 69 | 6.6 |
| 3730841–3730859 | 18 | 2 | 3730827–3730875 | 9 | 5.4 |
| 3940714–3940750 | 36 | 2 | 3940650–3940756 | 18 | 5.9 |
| 3941039–3941114 | 75 | 2 | 3941039–3941193 | 75 | 2.1 |
| 4149054–4149113 | 59 | 3 | 4149054–4149281 | 59 | 3.9 |
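To make the "size" and "copy number" columns concrete, the sketch below counts exact, whole tandem copies of a repeat unit starting at a given index. It is a simplified illustration under stated assumptions, not the procedure used by ABWGC or by the tandem repeat finder: the function name `count_tandem_copies`, the exact-match rule and the toy sequence are all hypothetical.

```python
def count_tandem_copies(seq, start, unit_len):
    """Count consecutive exact copies of seq[start:start + unit_len].

    Only whole, perfectly matching copies are counted, so the result is
    always an integer (hypothetical simplification).
    """
    if unit_len <= 0:
        raise ValueError("unit_len must be positive")
    unit = seq[start:start + unit_len]
    if len(unit) < unit_len:
        return 0  # not enough sequence for even one full unit
    copies = 1
    pos = start + unit_len
    while seq[pos:pos + unit_len] == unit:
        copies += 1
        pos += unit_len
    return copies


# Toy sequence: three copies of a 6-nt unit flanked by unrelated bases.
toy_seq = "TT" + "ACGTAC" * 3 + "GG"
n_copies = count_tandem_copies(toy_seq, 2, 6)
```

An exact-match count of this kind can only return whole copies, whereas the tandem repeat finder column in the table reports fractional copy numbers such as 4.7.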
Presence of indels in different strains of M. tuberculosis
| Size | H37Rv | F11 | C | Haarlem |
|---|---|---|---|---|
| (A) Insertion present in | ||||
| 37(71529) | P | N | N | N |
| 180(150882) | P | NA | N | N |
| 103(424203) | P | N | P | P |
| 83(624648) | P | N | N | N |
| 533(744075) | P | N | N | N |
| 216(804401) | P | P | P | P |
| 106(960065) | P | P | P | N |
| 12(1094076) | P | P | N | N |
| 12(1096183) | P | P | P | P |
| 15(1121703) | P | P | N | P |
| 119(1191499) | P | P | N | N |
| 45(1213836) | P | P | N | P |
| 56(1442915) | P | P | N | P |
| 21(1612487) | P | N | N | N |
| 26(1632401) | P | N | NA | NA |
| 207(1633340) | P | P | P | P |
| 53(1644353) | P | N | N | P |
| 10(1885207) | P | N | N | N |
| 57(1946384) | P | N | N | N |
| 156(1974210)* | P | P | P | P |
| 6808(1978716) | P | P | P | P |
| 15(2130692) | P | N | N | N |
| 45(2143313) | P | N | N | N |
| 276(2160446)* | P | P | P | N |
| 5000(2266058) | P | N | N | N |
| 115(2400403) | P | P | N | N |
| 940(2629977) | P | P | P | P |
| 767(2633468) | P | P | P | P |
| 21(2701714) | P | N | N | N |
| 676(2862694) | P | P | P | N |
| 55(2985372) | P | P | P | N |
| 71(3114712) | P | P | N | N |
| 59(3331026) | P | P | P | P |
| 68(3418973) | P | P | P | N |
| 2148(3524160) | P | P | P | N |
| 4059(3705273) | P | N | N | P |
| 18(3730860) | P | P | P | P |
| 78(3733427) | P | N | P | N |
| 75(3926618) | P | P | NA | NA |
| 15(3928764) | P | P | NA | NA |
| 35(3940688) | P | P | NA | NA |
| 75(3940715) | P | P | NA | NA |
| 117(3942726) | P | P | P | P |
| 18(4086509) | P | P | P | NA |
| Size | CDC1551 | F11 | C | Haarlem |
| (B) Insertion present in | ||||
| 18(24699) | P | P | P | P |
| 37(32351) | P | N | P | N |
| 57(206812) | P | N | N | N |
| 18(335811) | P | P | NA | NA |
| 33(337805) | P | N | NA | NA |
| 15(427311) | P | P | P | P |
| 60(428187) | P | N | P | N |
| 58(577286) | P | N | P | P |
| 48(840166) | P | N | NA | NA |
| 874(886541) | P | N | N | N |
| 79(1212108) | P | N | NA | NA |
| 654(1217494) | P | N | P | P |
| 90(2062023) | P | N | P | NA |
| 75(2165327) | P | N | N | N |
| 23(2180797) | P | N | P | P |
| 57(2372437) | P | P | N | N |
| 2273(2381411) | P | N | N | N |
| 499(2704307) | P | N | N | N |
| 213(3054706) | P | P | NA | N |
| 54(3171467) | P | N | P | N |
| 312(3501334) | P | P | N | N |
| 63(3663826) | P | P | P | P |
| 3197(3732759) | P | P | N | N |
| 46(3935411) | P | P | NA | NA |
| 189(3739700) | P | N | N | NA |
| 30(3936241) | P | P | NA | NA |
| 32(3934871) | P | P | NA | NA |
| 640(3955464) | P | N | N | N |
| 18(4359134) | P | P | P | P |
a. Insertion sites used in this analysis were defined by pairwise analysis of the two strains H37Rv and CDC1551.
Numbers in brackets in the Size column give the insertion site.
P: The presence of deletion. N: Absence of deletion.
NA: Data not available.
*: M. tuberculosis H37Rv and M. tuberculosis F11 differ in deletion pattern.
Alteration in anchor order in M. tuberculosis H37Rv as compared with M. tuberculosis CDC1551
| Preceding anchor (CDC1551/H37Rv) | Anchor (CDC1551/H37Rv) | Succeeding anchor (CDC1551/H37Rv) |
|---|---|---|
| 1481173/1481632 | 1482185/1480970 | 1483260/1482045 |
| 2265308/2268165 | 2267841/2266488 | 2271735/2269402 |
| 2629407/2633562 | 2631401/2633513 | 2634602/2637271 |
| 3887611/3893778 | 3888939/1532953 | 3890381/3895380 |
| 4244934/4252621 | 4245184/1469637 | 4245791/4253468 |
CDC1551 — M. tuberculosis CDC1551.
H37Rv — M. tuberculosis H37Rv. The numbers represent position in the respective genome.
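The rows of this table can be checked with a small sketch that tests whether an anchor's H37Rv position falls between the H37Rv positions of its neighbours. This is only an illustration of the idea of an altered anchor order; the function name `target_order_is_altered` and the neighbour-based rule are assumptions, not the authors' stated procedure.

```python
def target_order_is_altered(preceding, anchor, succeeding):
    """Each argument is a (cdc1551_pos, h37rv_pos) pair.

    The three anchors are assumed to be consecutive and increasing in
    CDC1551. The order counts as altered when the middle anchor's H37Rv
    position does not lie strictly between those of its neighbours.
    """
    lo, hi = preceding[1], succeeding[1]
    return not (lo < anchor[1] < hi)


# Rows copied from the table above as (CDC1551, H37Rv) pairs.
rows = [
    ((1481173, 1481632), (1482185, 1480970), (1483260, 1482045)),
    ((2265308, 2268165), (2267841, 2266488), (2271735, 2269402)),
    ((2629407, 2633562), (2631401, 2633513), (2634602, 2637271)),
    ((3887611, 3893778), (3888939, 1532953), (3890381, 3895380)),
    ((4244934, 4252621), (4245184, 1469637), (4245791, 4253468)),
]
altered = [target_order_is_altered(*row) for row in rows]

# Hypothetical control triple whose order is preserved in both genomes.
control = target_order_is_altered((100, 100), (200, 210), (300, 290))
```

The check looks at one triple at a time. Applied to a long run of anchors it would also flag the neighbours of a displaced anchor, so a fuller analysis would need a different rule.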
Figure 5. Distribution of individual mismatch scores of anchors of M. tuberculosis H37Rv compared to M. avium. For every randomly picked anchor from M. tuberculosis H37Rv (S) and its homologous sequence in M. avium (T), a mismatch score was computed as described in the Methods section. The individual scores were plotted against anchors numbered sequentially by their position in the genome (S). Homologous anchors on the positive strand are shown by plus and those on the complementary strand by cross.
Repeats detected by ABWGC but not by MUMmer and LAGAN. Data represent comparison of M. tuberculosis H37Rv with M. tuberculosis CDC1551
| Site of repeat | Size of repeat | Number of copies present |
|---|---|---|
| 24699 | 18 | 2 |
| 335812 | 18 | 3 |
| 427312 | 15 | 4 |
| 428188 | 60 | 2 |
| 2163731 | 69 | 5 |
| 2165328 | 75 | 3 |
| 2347415 | 58 | 3 |
| 3171467 | 54 | 2 |
| 3663826 | 63 | 2 |
| 3948753 | 603 | 2 |
Partial list of insertion sites (size more than 100 nt) in M. tuberculosis H37Rv and M. avium identified by ABWGC and MUMmer
| H37Rv (ABWGC) | M. avium (ABWGC) | H37Rv (MUMmer) | M. avium (MUMmer) |
|---|---|---|---|
| 274317–274819 | – | 274206–275017 | 4075151–4075297 |
| 274878–275039 | – | | |
| 400011–400242 | – | 399261–401994 | 4279373–4281116 |
| 401167–401445 | – | | |
| 490659–490760 | – | 479482–490852 | 4341052–4344911 |
| 561809–562411 | – | 561712–562821 | 4425498–4425734 |
| 562459–562728 | – | | |
| 921821–921921 | 683450–683552 | 919347–923827 | 682088–686944 |
| 683619–683770 |