Marie Lisandra Zepeda Mendoza, Sanne Nygaard, Rute R da Fonseca.
Abstract
BACKGROUND: Sequence alignments are used to find evidence of homology, but they sometimes contain regions that are difficult to align, which can degrade the quality of subsequent analyses. Although problematic regions can be removed manually, this is impractical in large genome-scale studies, and the results are hard to reproduce because the choices are subjective. Some automated alignment trimming methods have been developed to remove problematic regions, but these mostly act by removing complete columns or complete sequences from the multiple sequence alignment (MSA), discarding many informative sites.
Year: 2014 PMID: 25403086 PMCID: PMC4240845 DOI: 10.1186/1756-0500-7-806
Source DB: PubMed Journal: BMC Res Notes ISSN: 1756-0500
Figure 1. Example of the windows identified by DivA. Outlier windows determined by DivA are shown in black boxes. MEGA5 [10] was used to display the alignment view (sites with a conservation score above 80% were toggled off to make the outlier amino acids easier to see).
Figure 2. DivA’s workflow. Using a sliding window approach, four parameters are calculated for every sequence in every window. If there are conserved sites at the edges of a window, those are trimmed and the parameter values for the window are recalculated. The threshold values for each parameter are then calculated and used to classify each window in each sequence as very divergent (potentially non-homologous) or truly homologous. Overlapping windows classified as outliers are merged, and the final coordinates are provided in the output file. The user can also obtain a new alignment file in which the outlier windows are masked.
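The last two steps of the workflow (merging overlapping outlier windows and masking them) can be illustrated with a minimal sketch. This is not DivA's own code: the function names `merge_windows` and `mask_sequence`, the 1-based inclusive coordinates, the mask character `X`, and the choice to leave gap characters untouched are all assumptions made for illustration.

```python
from typing import List, Tuple

Window = Tuple[int, int]  # (start, end); assumed 1-based and inclusive


def merge_windows(windows: List[Window]) -> List[Window]:
    """Merge windows that overlap; windows that do not overlap stay separate."""
    merged: List[Window] = []
    for start, end in sorted(windows):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous window: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def mask_sequence(seq: str, windows: List[Window], mask_char: str = "X") -> str:
    """Replace residues inside the merged windows with mask_char.

    Gap characters ('-') are left as they are (an assumption of this sketch).
    """
    chars = list(seq)
    for start, end in merge_windows(windows):
        for i in range(start - 1, min(end, len(chars))):
            if chars[i] != "-":
                chars[i] = mask_char
    return "".join(chars)


if __name__ == "__main__":
    # Toy input, not taken from the paper.
    print(merge_windows([(5, 10), (8, 12), (20, 25)]))
    print(mask_sequence("ACDEFGHIK", [(2, 4), (3, 5)]))
```

Masking individual windows in individual sequences, rather than deleting whole columns or whole sequences, is what lets the remaining sites stay available for downstream analyses.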
Example of output from DivA for the Test alignment
| Alignment | Sequence | Start | End | Z | | Z | |
|---|---|---|---|---|---|---|---|
| Test.fasta | Mesite | 318 | 337 | 0.19 | 3.78 | 0.94 | 3.89 |
| Test.fasta | Mesite | 458 | 469 | 0.06 | 3.58 | −0.80 | 3.86 |
| Test.fasta | Mesite | 872 | 882 | 0.22 | 3.68 | 0.20 | 3.92 |
| Test.fasta | Duck | 63 | 66 | 0.07 | 3.91 | 0.75 | 4.12 |
| Test.fasta | Duck | 564 | 621 | 0.17 | 4.01 | 0.49 | 4.02 |
| Test.fasta | Duck | 626 | 659 | 0.25 | 4.00 | 0.95 | 4.04 |
| Test.fasta | Woodpecker | 823 | 858 | 0.21 | 3.89 | 0.10 | 3.93 |
| Test.fasta | Kea | 768 | 781 | 0.24 | 3.97 | 0.27 | 4.01 |
| Test.fasta | Ostrich | 291 | 309 | 0.24 | 4.08 | 0.45 | 4.09 |
The output of the method gives the name of the sequence in the alignment file, the start and end positions of each very divergent window, and the four parameter values.
Figure 3. Distributions of the parameter values calculated for the 200 bird-only alignments. The ticks on the upper X-axis show the values of the homologous windows, and the ticks on the lower X-axis show the values of the outlier windows. Panels A to D each show the distribution of one of the four parameters.
Dataset impact on model accuracy
| MSAs dataset | TPR | FDR | PPV |
|---|---|---|---|
| 50 only-bird | 0.7970402 | 0.6267327 | 0.3732673 |
| 100 only-bird | 0.812071 | 0.4976526 | 0.5023474 |
| 200 only-bird | 0.810281 | 0.3775697 | 0.6224303 |
| 200 all species | 0.469429 | 0.5588211 | 0.4411789 |
The table shows the results of the efficiency tests on datasets of different sizes (50, 100, and 200 MSAs) and divergence (only birds, and birds plus distant species). True positives (TP) are the alignment positions included in outlier windows by DivA that were also marked as outliers in the manual annotation. False positives (FP) are positions within DivA's outlier windows that were not marked in the manual annotation. False negatives (FN) are positions manually annotated as outliers that DivA did not detect. True negatives (TN) are positions outside the outlier windows in both the manual annotation and DivA's output. TPR: true positive rate, FDR: false discovery rate, PPV: positive predictive value.
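The three reported rates follow from the position counts defined above using the standard formulas TPR = TP / (TP + FN), PPV = TP / (TP + FP), and FDR = FP / (TP + FP), so FDR and PPV always sum to 1, as they do in each row of the table. The sketch below shows the calculation; representing positions as (sequence name, position) pairs and the function name `outlier_metrics` are assumptions for illustration, and the input is a toy example rather than data from the paper.

```python
from typing import Dict, Set, Tuple

Position = Tuple[str, int]  # (sequence name, alignment position); assumed layout


def outlier_metrics(predicted: Set[Position], annotated: Set[Position]) -> Dict[str, float]:
    """Compute TPR, FDR and PPV from predicted and manually annotated outlier positions."""
    tp = len(predicted & annotated)   # flagged by the method and by the annotator
    fp = len(predicted - annotated)   # flagged by the method only
    fn = len(annotated - predicted)   # flagged by the annotator only
    nan = float("nan")
    return {
        "TPR": tp / (tp + fn) if (tp + fn) else nan,
        "FDR": fp / (tp + fp) if (tp + fp) else nan,
        "PPV": tp / (tp + fp) if (tp + fp) else nan,
    }


if __name__ == "__main__":
    # Toy example: the method flags positions 63-66, the annotator 65-70.
    predicted = {("seqA", i) for i in range(63, 67)}
    annotated = {("seqA", i) for i in range(65, 71)}
    print(outlier_metrics(predicted, annotated))
```

True negatives do not enter any of the three rates, which is why the comparison is not dominated by the many positions that neither the method nor the annotator flags.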
Efficiency tests of Guidance and DivA
| Method | TPR (200 only birds) | TPR (200 all species) | FDR (200 only birds) | FDR (200 all species) | PPV (200 only birds) | PPV (200 all species) |
|---|---|---|---|---|---|---|
| Guidance score ≤ 0.4 | 0.3687836 | 0.3687836 | 0.9084643 | 0.9084643 | 0.09153575 | 0.09153575 |
| Guidance score ≤ 0.8 | 0.5439066 | 0.7788185 | 0.928089 | 0.8882818 | 0.07191096 | 0.1117182 |
| DivA | 0.810281 | 0.469429 | 0.3775697 | 0.5588211 | 0.6224303 | 0.4411789 |
The performance of Guidance on the 200 only-bird alignments and on the 200 alignments that include very divergent species was compared with that of DivA. Two score thresholds, 0.4 and 0.8, were used in Guidance to flag a sequence region as potentially non-homologous.