Valerio Freschi, Alessandro Bogliolo.
Abstract
The increasing availability of high-throughput sequencing technologies poses several challenges for the analysis of genomic data. Within this context, duplication-aware sequence alignment, which takes complex mutation events into account, is regarded as an important problem, particularly in light of recent research in evolutionary bioinformatics that has highlighted tandem duplications as one of the most important mutation events. Traditional sequence comparison algorithms do not take these events into account, mainly because they assume statistical independence among contiguous residues, and therefore produce alignments of poor biological significance. Several duplication-aware algorithms have been proposed in recent years; they differ either in the type of duplications they consider or in the methods adopted to identify and compare them. However, no solution clearly outperforms the others, and no methods exist for assessing the reliability of the resulting alignments. This paper proposes a Monte Carlo method for assessing the quality of duplication-aware alignment algorithms and for guiding the choice of the most appropriate alignment technique in a specific context. The applicability and usefulness of the proposed approach are demonstrated on a case study, namely the comparison of alignments based on edit distance with or without repeat masking.
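To make the case study concrete, the following is a minimal sketch of the two alignment variants it compares: plain edit distance, and edit distance computed after tandem repeats have been masked. The function `collapse_tandem_repeats` and its `max_period` parameter are hypothetical; it is a naive greedy masker that keeps one copy of each exact tandem repeat and is only a stand-in for a real repeat finder, not the masking procedure used in the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit costs, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def collapse_tandem_repeats(seq: str, max_period: int = 15) -> str:
    """Hypothetical masker: keep a single copy of each exact tandem repeat.

    Scans left to right and, at each position, tries the shortest period first.
    """
    out = []
    i = 0
    while i < len(seq):
        collapsed = False
        for period in range(1, max_period + 1):
            unit = seq[i:i + period]
            if len(unit) < period:
                break  # not enough sequence left for this period
            copies = 1
            while seq[i + copies * period:i + (copies + 1) * period] == unit:
                copies += 1
            if copies > 1:
                out.append(unit)        # keep one copy of the repeat unit
                i += copies * period    # skip all copies
                collapsed = True
                break
        if not collapsed:
            out.append(seq[i])
            i += 1
    return "".join(out)


def masked_edit_distance(a: str, b: str) -> int:
    """Edit distance after collapsing tandem repeats in both sequences."""
    return edit_distance(collapse_tandem_repeats(a), collapse_tandem_repeats(b))
```

For two sequences that differ only by extra copies of a repeat unit, such as `GATACGACGACGCT` and `GATACGCT`, the plain distance charges one edit per duplicated base, whereas the masked distance treats the duplication as a single collapsed event. Greedy collapsing can also merge short runs that are not biologically meaningful repeats, which is one reason the choice of masking settings matters.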
Keywords: Monte Carlo simulation; duplications; sequence alignment; significance metrics; tandem repeat
Year: 2011 PMID: 21698090 PMCID: PMC3118696 DOI: 10.4137/EBO.S6662
Source DB: PubMed Journal: Evol Bioinform Online ISSN: 1176-9343 Impact factor: 1.625
Figure 1. Pseudo-code of descendant().
Figure 2. Pseudo-code of rndTRperiod().
Ranges of the parameters used for Monte Carlo simulations.
| M | T | p (insertion) | p (deletion) | p (duplication) | p (extension) | p (mutation) | L | N |
| 160 | 40 | 0.00008 | 0.00008 | 0.0008 | 0.004 | 0.0008 | 12 | 80 |
| 240 | 60 | 0.000012 | 0.000012 | 0.0012 | 0.006 | 0.0012 | 18 | 120 |
| 200 | 50 | 0.000010 | 0.000010 | 0.0010 | 0.005 | 0.0010 | 15 | 100 |
Notes: M, number of ancestral DNA sequences; T, number of epochs considered as evolution time; p (insertion/deletion/duplication/extension/mutation), probability of the corresponding event; L, maximum size of a repeat unit; N, length of the ancestral DNA sequences.
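As an illustration of how such ranges could drive a Monte Carlo experiment, the sketch below draws one parameter set per run. Several things here are assumptions rather than facts from the source: the labels `p_ins`, `p_del`, `p_dup`, `p_ext`, and `p_mut` are hypothetical names for the five probability columns, the bounds of each parameter are taken to be the smallest and largest value in its column, and sampling is uniform within those bounds.

```python
import random

# Rows copied from the table above; column order follows the Notes line.
# The p_* labels are hypothetical names for the five probability columns.
COLUMNS = ["M", "T", "p_ins", "p_del", "p_dup", "p_ext", "p_mut", "L", "N"]
ROWS = [
    [160, 40, 0.00008, 0.00008, 0.0008, 0.004, 0.0008, 12, 80],
    [240, 60, 0.000012, 0.000012, 0.0012, 0.006, 0.0012, 18, 120],
    [200, 50, 0.000010, 0.000010, 0.0010, 0.005, 0.0010, 15, 100],
]
INTEGER_COLUMNS = {"M", "T", "L", "N"}


def parameter_bounds():
    """Assumption: each column's bounds are its smallest and largest entry."""
    bounds = {}
    for idx, name in enumerate(COLUMNS):
        values = [row[idx] for row in ROWS]
        bounds[name] = (min(values), max(values))
    return bounds


def sample_parameters(rng):
    """Draw one parameter set, uniformly within the bounds (an assumption)."""
    params = {}
    for name, (lo, hi) in parameter_bounds().items():
        if name in INTEGER_COLUMNS:
            params[name] = rng.randint(lo, hi)   # inclusive on both ends
        else:
            params[name] = rng.uniform(lo, hi)
    return params


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are reproducible
    for run in range(3):
        print(run, sample_parameters(rng))
```

Passing an explicit `random.Random` instance keeps every run reproducible, which helps when a quality metric later has to be correlated with the sampled parameters.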
Results: average and standard deviation for ranking error (E) and correlations between E and the parameters of Monte Carlo simulations.
| Avg. | 0.10 | 0.05 | 0.05 | 0.16 | 0.03 | 0.07 | 0.03 | 0.10 |
| St. D. | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
| Avg. | 0.10 | 0.05 | 0.05 | 0.15 | 0.04 | 0.07 | 0.04 | 0.08 |
| St. D. | 0.04 | 0.03 | 0.03 | 0.04 | 0.02 | 0.03 | 0.02 | 0.03 |
| 0.12 | 0.11 | 0.12 | 0.15 | 0.08 | 0.10 | 0.08 | 0.09 | |
| 0.61 | 0.58 | 0.56 | 0.63 | 0.60 | 0.62 | 0.68 | 0.74 | |
| 0.29 | 0.30 | 0.27 | 0.33 | 0.32 | 0.24 | 0.33 | 0.24 | |
| 0.21 | 0.12 | 0.08 | −0.01 | 0.10 | 0.05 | 0.15 | 0.11 | |
| −0.01 | 0.05 | 0.12 | 0.15 | 0.11 | 0.15 | 0.08 | 0.20 | |
| −0.05 | −0.08 | −0.05 | −0.06 | −0.05 | 0.01 | −0.06 | 0.00 | |
| 0.04 | 0.07 | 0.06 | 0.04 | 0.14 | 0.04 | 0.11 | 0.08 | |
| 0.46 | 0.52 | 0.58 | 0.43 | 0.48 | 0.53 | 0.126 | 0.28 | |
| −0.21 | −0.23 | −0.23 | −0.23 | −0.21 | −0.23 | −0.22 | −0.17 | |
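Correlations of this kind can be computed from the per-run values of a metric and of one simulation parameter. The sketch below uses the Pearson coefficient, which is an assumption: the record does not state which correlation measure was used. The function name `pearson` is illustrative.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)
```

In this setting `xs` would hold the value of one parameter in each Monte Carlo run and `ys` the metric measured in the same run; a coefficient near zero suggests the metric is largely insensitive to that parameter over the sampled range.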
Results: average and standard deviation for selectivity (S) and correlations between S and the parameters of Monte Carlo simulations.
| Avg. | 0.75 | 0.84 | 0.81 | 0.40 | 0.87 | 0.61 | 0.87 | 0.49 |
| St. D. | 0.02 | 0.02 | 0.02 | 0.04 | 0.03 | 0.03 | 0.03 | 0.03 |
| Avg. | 0.73 | 0.84 | 0.82 | 0.42 | 0.87 | 0.64 | 0.82 | 0.50 |
| St. D. | 0.08 | 0.07 | 0.08 | 0.09 | 0.06 | 0.10 | 0.07 | 0.09 |
| −0.12 | −0.10 | −0.14 | −0.25 | −0.12 | −0.21 | −0.12 | −0.22 | |
| −0.68 | −0.67 | −0.64 | −0.69 | −0.71 | −0.75 | −0.74 | −0.77 | |
| −0.30 | −0.28 | −0.28 | −0.36 | −0.32 | −0.24 | −0.31 | −0.27 | |
| −0.18 | −0.10 | −0.05 | 0.05 | −0.13 | −0.03 | −0.18 | −0.11 | |
| −0.02 | −0.06 | −0.13 | −0.26 | −0.08 | −0.26 | −0.08 | −0.25 | |
| 0.02 | 0.04 | 0.01 | 0.05 | 0.04 | 0.00 | −0.01 | −0.03 | |
| −0.03 | −0.08 | −0.07 | −0.09 | −0.09 | −0.06 | −0.11 | −0.05 | |
| −0.44 | −0.49 | −0.54 | −0.24 | −0.37 | −0.36 | −0.20 | −0.14 | |
| 0.22 | 0.25 | 0.27 | 0.23 | 0.22 | 0.23 | 0.25 | 0.17 | |
Parameter settings: values taken by mreps options res and allowsmall (used to control resolution and statistical significance of TRs) and by flag all (used for masking all TRs together with their original motif).
| Res | / | 0 | 0 | 0 | 2 | 2 | 5 | 5 |
| Allowsmall | / | f | t | f | f | t | f | t |
| All | / | f | f | t | f | f | f | f |
Notes: “/” means not applicable, “f” means false, “t” means true.
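The table reads column by column: each column is one masking configuration. A minimal sketch of that reading as a data structure follows. The class name `MaskingConfig` and the field name `mask_all` are hypothetical, and treating the "/" column as a configuration with no settings (presumably the alignment without masking) is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MaskingConfig:
    res: Optional[int]          # mreps "res" option, None if not applicable
    allowsmall: Optional[bool]  # mreps "allowsmall" option
    mask_all: Optional[bool]    # the "all" flag from the table


# Rows copied from the table above, one entry per column.
RES = ["/", "0", "0", "0", "2", "2", "5", "5"]
ALLOWSMALL = ["/", "f", "t", "f", "f", "t", "f", "t"]
ALL = ["/", "f", "f", "t", "f", "f", "f", "f"]


def parse_flag(cell: str) -> Optional[bool]:
    """Map "t"/"f" to True/False and "/" to None (not applicable)."""
    return None if cell == "/" else cell == "t"


def parse_res(cell: str) -> Optional[int]:
    """Map a numeric cell to int and "/" to None (not applicable)."""
    return None if cell == "/" else int(cell)


CONFIGS = [
    MaskingConfig(parse_res(r), parse_flag(a), parse_flag(m))
    for r, a, m in zip(RES, ALLOWSMALL, ALL)
]
```

The list has eight entries, the same as the number of value columns in each results table, which suggests, but does not prove, that the results columns follow this order.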
Figure 3. Average compression ratios of different masking techniques applied to synthetic benchmarks.
Results: average and standard deviation for significance ratio (R) and correlations between R and the parameters of Monte Carlo simulations.
| Avg. | 0.57 | 0.64 | 0.66 | 0.51 | 0.66 | 0.60 | 0.66 | 0.54 |
| St. D. | 0.01 | 0.01 | 0.01 | 0.02 | 0.01 | 0.01 | 0.01 | 0.01 |
| Avg. | 0.58 | 0.65 | 0.67 | 0.52 | 0.66 | 0.60 | 0.65 | 0.55 |
| St. D. | 0.06 | 0.05 | 0.06 | 0.05 | 0.05 | 0.05 | 0.05 | 0.04 |
| −0.05 | −0.07 | −0.08 | −0.11 | −0.08 | −0.11 | −0.07 | −0.09 | |
| −0.76 | −0.77 | −0.74 | −0.73 | −0.79 | −0.77 | −0.80 | −0.77 | |
| −0.38 | −0.38 | −0.32 | −0.37 | −0.38 | −0.25 | −0.37 | −0.24 | |
| −0.18 | −0.16 | −0.06 | 0.02 | −0.18 | −0.10 | −0.21 | −0.12 | |
| −0.04 | −0.07 | −0.20 | −0.26 | −0.06 | −0.25 | −0.06 | −0.26 | |
| −0.01 | −0.01 | −0.04 | 0.01 | −0.01 | −0.04 | −0.01 | −0.05 | |
| −0.06 | −0.10 | −0.08 | −0.07 | −0.10 | −0.09 | −0.10 | −0.09 | |
| −0.25 | −0.27 | −0.35 | −0.12 | −0.17 | −0.26 | −0.10 | −0.09 | |
| −0.15 | −0.10 | −0.10 | −0.21 | −0.11 | −0.19 | −0.13 | −0.27 | |