Ha Minh Lam1,2, Oliver Ratmann3, Maciej F Boni1,2,4.
Abstract
Identifying recombinant sequences in an era of large genomic databases is challenging as it requires an efficient algorithm to identify candidate recombinants and parents, as well as appropriate statistical methods to correct for the large number of comparisons performed. In 2007, a computation was introduced for an exact nonparametric mosaicism statistic that gave high-precision P values for putative recombinants. This exact computation meant that multiple-comparisons corrected P values also had high precision, which is crucial when performing millions or billions of tests in large databases. Here, we introduce an improvement to the algorithmic complexity of this computation from O(mn^3) to O(mn^2), where m and n are the numbers of recombination-informative sites in the candidate recombinant. This new computation allows for recombination analysis to be performed in alignments with thousands of polymorphic sites. Benchmark runs are presented on viral genome sequence alignments, new features are introduced, and applications outside recombination analysis are discussed.
Keywords: mosaic structure; nonparametric; recombination
Year: 2018 PMID: 29029186 PMCID: PMC5850291 DOI: 10.1093/molbev/msx263
Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 16.240
Fig. 1. Relationship between ordering of informative sites along a genome and a hypergeometric random walk (HGRW). Below each set of axes, the 30 red bars and 30 blue bars show positions on a genome (informative sites) where a putative recombinant sequence is identical to parent P but different from parent Q (blue bars), or identical to parent Q but different from parent P (red bars). Each blue site can be mapped to an up-step in a random walk and each red site can be mapped to a down-step in a random walk, and there is a one-to-one correspondence between the space of informative-site arrangements and the space of hypergeometric random walks. (A) A random arrangement of informative sites, which does not visually suggest that the sequence is a mosaic of putative parents P and Q. The arrangement of sites maps to a random walk which stays fairly close to the horizontal axis. This walk’s maximum descent is eight steps, and ∼54% of HGRWs with 30 up-steps and 30 down-steps have a maximum descent of eight steps or greater. (B) A nonrandom arrangement of informative sites that clearly suggests that the candidate sequence is a mosaic of the two parental sequences P and Q. The probability of all the red sites appearing consecutively is 31! × 30!/60!, which is 2.62 × 10^-16. (C) An arrangement of red sites and blue sites that suggests the red sites may be clustered in the middle. When mapping the site arrangement to a hypergeometric random walk, the random walk has a maximum descent of 18 steps. The P value for a maximum descent of 18 steps cannot be written down in closed form but can be calculated from recursion (4). The P value for this maximum descent and for this arrangement of informative sites is 1.8 × 10^-4.
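The mapping from site arrangements to walks, and the tail probability of the maximum descent, can be illustrated with a short Python sketch. This is not recursion (4) from the article and not the O(mn^2) algorithm; it is a plain dynamic program over walk prefixes, written under the assumption that the descent is measured from the running peak of the walk, with the starting height counted as a possible peak. The function names `max_descent` and `descent_p_value` are hypothetical.

```python
from math import comb

def max_descent(steps):
    """Largest drop from any earlier height (start included) to any later height.

    `steps` is a sequence of +1 (up-step, blue site) and -1 (down-step, red site).
    """
    height = peak = worst = 0
    for s in steps:
        height += s
        peak = max(peak, height)
        worst = max(worst, peak - height)
    return worst

def descent_p_value(m, n, k):
    """P(max descent >= k) when all orderings of m up-steps and n down-steps
    are equally likely. Illustrative dynamic program, roughly O(m * n * k)."""
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    hits = 0
    # live[(u, d)]: number of prefixes with u up-steps so far whose current
    # drop below the running peak is d and has never reached k.
    live = {(0, 0): 1}
    for t in range(m + n):
        nxt = {}
        for (u, d), ways in live.items():
            downs = t - u
            if u < m:  # take an up-step: the drop shrinks, but not below zero
                key = (u + 1, max(d - 1, 0))
                nxt[key] = nxt.get(key, 0) + ways
            if downs < n:  # take a down-step: the drop grows by one
                if d + 1 == k:
                    # Threshold reached: every completion of the walk counts.
                    rem_u, rem_d = m - u, n - downs - 1
                    hits += ways * comb(rem_u + rem_d, rem_d)
                else:
                    key = (u, d + 1)
                    nxt[key] = nxt.get(key, 0) + ways
        live = nxt
    # Exact integer counts; a single division at the end keeps small values precise.
    return hits / comb(m + n, n)

# Panel (B): a descent of 30 requires all 30 red sites to be consecutive.
p_all_consecutive = descent_p_value(30, 30, 30)  # equals 31 / C(60, 30)
```

Counting the walks that reach the threshold, rather than subtracting the complementary count from 1, avoids cancellation when the P value is very small.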
Fig. 2. Screenshot of a new online tool that can be used to calculate P values testing the hypothesis of whether one binary outcome clusters in the middle of a (1D) sequence of binary outcomes. One input method is simply typing two characters in a text box (above, “U” for up and “D” for down) and letting the calculator return a P value showing whether one type of character is clustered in the middle. To test whether the other type of character is clustered, the “SWAP” button can be used. The hypergeometric walk is shown graphically. The exact P value, computed with the methods in this article, is shown. The two Hogan–Siegmund approximations for this P value are also shown.
Computation Times for Large Alignments of Viral Genomes.
| Gene/Genome | Number of Distinct Sequences | Sequence Length (nt) | Number of Polymorphic Sites | % of Triplets with Exact P Value | Dunn–Sidak Corrected P Value | Number of Identified Recombinant Sequences Longer Than 500 nt | Run Time |
|---|---|---|---|---|---|---|---|
| | 112 | 2,409 | 844 | 100 | 0.0016 | 0 | 11 s |
| | 160 | 906 | 298 | 100 | 1 | 0 | 24 s |
| | 164 | 30,130 | 1,150 | 100 | 1.72 × 10^-11 | 100 | 1.5 min |
| | 157 | 11,192 | 2,792 | >99.9 | 1.44 × 10^-37 | 6 | 2 min |
| | 982 | 18,980 | 2,535 | 100 | 6.49 × 10^-12 | 0 | 8.5 h |
| | 1,108 | 11,349 | 6,151 | 99.4 | 0 | 36 | 15.5 h |
Note.—Computations were done on a 2.6-GHz Linux laptop with 16 GB of RAM. The P value table used was 1,200 × 1,200 × 1,200, which has a memory footprint of 2.2 GB.
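The Dunn–Sidak correction reported in the table turns a single-test P value p into 1 − (1 − p)^N for N comparisons. The Python sketch below shows one way to evaluate that formula without losing precision when p is far below machine epsilon, which is the regime the abstract describes. The function name `dunn_sidak` is hypothetical, and `num_tests` is left as a plain parameter because the way comparisons are counted is not stated in this section.

```python
from math import expm1, log1p

def dunn_sidak(p, num_tests):
    """Dunn-Sidak corrected P value, 1 - (1 - p)**num_tests, evaluated stably.

    Uses log1p/expm1 so that tiny p values are not rounded away, which is
    what happens when 1 - p is formed directly in floating point.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p == 1.0:
        return 1.0  # log1p(-1.0) is undefined, so handle this edge explicitly
    return -expm1(num_tests * log1p(-p))

# Illustrative values only (not taken from the table):
corrected = dunn_sidak(1e-20, 10**6)   # close to 1e-14
naive = 1.0 - (1.0 - 1e-20) ** 10**6   # collapses to 0.0 in double precision
```

The naive expression returns exactly zero here because 1 − 10^-20 rounds to 1 in double precision, so a corrected value computed that way would carry no information about the strength of the signal.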