Alvaro Chiner-Oms, Fernando González-Candelas.
Abstract
We present EvalMSA, a software tool for evaluating multiple sequence alignments (MSAs) and detecting outliers in them. The tool identifies divergent sequences in an MSA by scoring the contribution of each row to the quality of the alignment, using a sum-of-pairs-based method and additional analyses. Our main goal is to provide users with objective data so that they can make informed decisions about whether to include or retain a particular sequence in an MSA. EvalMSA is written in standard Perl and also uses some routines from the statistical language R; the R-base package must therefore be installed to get full functionality. Binary packages for Linux and Windows are freely available from http://sourceforge.net/projects/evalmsa/.
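To make the idea of scoring each row's contribution concrete, here is a minimal Python sketch of a sum-of-pairs row weight. It is not EvalMSA's code: the function names `pair_score` and `row_weights`, the match/mismatch/gap values, and the toy alignment are all assumptions chosen for illustration, and a real run would use a substitution matrix rather than a flat match/mismatch scheme.

```python
def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one aligned character pair under a toy scheme (assumed values)."""
    if a == "-" and b == "-":
        return 0  # gap-gap pairs are ignored in this sketch
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def row_weights(alignment):
    """For each row, sum its pairwise scores against every other row."""
    n = len(alignment)
    weights = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            s = sum(pair_score(a, b) for a, b in zip(alignment[i], alignment[j]))
            # The same pairwise score contributes to both rows of the pair.
            weights[i] += s
            weights[j] += s
    return weights


# Toy alignment: the last row is deliberately divergent.
msa = ["ACGT-A", "ACGTTA", "ACGTTA", "T--GCA"]
weights = row_weights(msa)
print(weights)
```

In this toy case the divergent fourth row receives by far the lowest weight, which is the kind of signal a row-wise sum-of-pairs score is meant to expose.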
Keywords: gappiness; multiple sequence alignment; outlier sequence
Year: 2016 PMID: 27920488 PMCID: PMC5127606 DOI: 10.4137/EBO.S40583
Source DB: PubMed Journal: Evol Bioinform Online ISSN: 1176-9343 Impact factor: 1.625
Figure 1. Gap penalty value. Using all the values contained in the scoring matrix, we obtained the distribution represented above. A value lower than all the others in the scoring matrix is used as the gap penalty.
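One simple way to obtain a gap penalty that is lower than every entry of a scoring matrix is to take the matrix minimum and subtract a margin. The sketch below illustrates only that idea. The rule `min - margin`, the function name `gap_penalty_below`, and the toy three-letter matrix are assumptions; the caption does not state how EvalMSA derives the value from the distribution.

```python
def gap_penalty_below(matrix, margin=1):
    """Return a value strictly below every score in a nested-dict matrix.

    Assumption: the penalty is the matrix minimum minus a fixed margin.
    """
    lowest = min(score for row in matrix.values() for score in row.values())
    return lowest - margin


# Toy symmetric scoring matrix (hypothetical values, not a published matrix).
toy_matrix = {
    "A": {"A": 4, "C": -2, "G": -1},
    "C": {"A": -2, "C": 5, "G": -3},
    "G": {"A": -1, "C": -3, "G": 6},
}

penalty = gap_penalty_below(toy_matrix)
print(penalty)
```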
Table 1. MSAs used to benchmark the program.
| ALIGNMENT NUMBER | PFAM FAMILY | PFAM CLAN | OUTLIERS FAMILY | OUTLIERS CLAN |
|---|---|---|---|---|
| 1 | PG_binding_1 (PF01471) | PGBD (CL0244) | PG_binding_2 (PF08823)/PG_binding_3 (PF09374) | PGBD (CL0244) |
| 2 | ParBc (PF02195) | ParB like superfamily (CL0248) | PG_binding_1 (PF01471) | PGBD (CL0244) |
| 3 | ParBc (PF02195) | ParB like superfamily (CL0248) | PG_binding_1 (PF01471)/Hyaluronidase_1 (PF07212) | PGBD (CL0244)/Phage fibre (CL0606) |
| 4 | ParBc (PF02195) | ParB like superfamily (CL0248) | ParBc_2 (PF08857)/DUF262 (PF03235) | ParB like superfamily (CL0248) |
| 5 | Linocin_M18 (PF04454) | Phage-coat (CL0373) | Phage_cap_P2 (PF05125)/DUF2184 (PF09950)/P22_CoatProtein (PF11651) | Phage-coat (CL0373) |
Figure 2. Default output. (A) Preanalysis boxplot showing the original sequence length distribution. (B) Weight score histogram, highlighting the sequence with the highest number of gaps (green line) and the sequence with the largest gappiness value (magenta line). (C) Normalized weight score distribution. Sequence index refers to the position of each sequence in the list ordered by weight. (D) Gappiness values. Sequence index refers to the position of each sequence in the list ordered by gappiness (gpp) value.
Table 2. Summary of the results obtained after running the program with dataset 5 aligned with MUSCLE (see Table 1).
| GENENAME | INDELNUM | WEIGHT | NORMALIZED_WEIGHT | GAPPINESS | NORMALIZED_GAPPINESS |
|---|---|---|---|---|---|
| Outlier2_PF09950 | 169 | −53117 | 0.000 | 0.015 | 0.024 |
| Q97V86_SULSO/96–329 | 183 | −50346 | 0.104 | 0.007 | 0.002 |
| Outlier3_PF11651/1–404 | 13 | −47751 | 0.202 | 0.351 | 1.000 |
| Outlier1_PF05125/8–339 | 85 | −47745 | 0.202 | 0.182 | 0.510 |
| A8F8I8_PSELT/3–247 | 172 | −37039 | 0.605 | 0.009 | 0.007 |
| Q2IH48_ANADE/1–259 | 158 | −36908 | 0.610 | 0.022 | 0.044 |
| A7HIB4_ANADF/2–255 | 163 | −36652 | 0.619 | 0.015 | 0.024 |
| C0R0J8_BRAHW/1–256 | 161 | −36595 | 0.621 | 0.017 | 0.031 |
| Q5L1H9_GEOKA/8–264 | 160 | −36564 | 0.623 | 0.027 | 0.059 |
| B8GHL2_METPE/5–250 | 171 | −36543 | 0.623 | 0.010 | 0.010 |
| B4UA40_HYDS0/1–265 | 152 | −36184 | 0.637 | 0.028 | 0.063 |
| C8WPL7_EGGLE/1–257 | 160 | −36106 | 0.640 | 0.017 | 0.031 |
| C0ZHN4_BREBN/1–265 | 152 | −35900 | 0.648 | 0.030 | 0.068 |
| A7I7A2_METB6/2–249 | 169 | −35839 | 0.650 | 0.012 | 0.017 |
| Q08WR7_STIAD/2–266 | 152 | −35405 | 0.666 | 0.028 | 0.063 |
| A3DFK3_CLOTH/1–257 | 160 | −35058 | 0.679 | 0.017 | 0.031 |
| O67639_AQUAE/1–267 | 150 | −35042 | 0.680 | 0.033 | 0.078 |
| Q7MSM9_WOLSU/1–252 | 165 | −34373 | 0.705 | 0.010 | 0.010 |
| A9BEM3_PETMO/1–251 | 166 | −34166 | 0.713 | 0.010 | 0.010 |
| B2V6Y3_SULSY/1–265 | 152 | −34053 | 0.717 | 0.028 | 0.063 |
| D1B7I4_THEAS/1–249 | 168 | −34049 | 0.717 | 0.010 | 0.010 |
| B2A6K6_NATTJ/8–259 | 165 | −33733 | 0.729 | 0.011 | 0.012 |
| D4H156_DENA2/1–251 | 166 | −33411 | 0.741 | 0.010 | 0.011 |
| MARIT_THEMA/1–251 | 166 | −33062 | 0.754 | 0.010 | 0.011 |
| A6TWC5_ALKMQ/1–250 | 167 | −32960 | 0.758 | 0.009 | 0.006 |
| B8CYH7_HALOH/1–249 | 168 | −32325 | 0.782 | 0.008 | 0.004 |
| C4XPM7_DESMR/1–251 | 166 | −31049 | 0.830 | 0.010 | 0.011 |
| A9FWS5_SORC5/3–254 | 165 | −30219 | 0.861 | 0.011 | 0.012 |
| Q0RH88_FRAAA/1–253 | 164 | −29884 | 0.874 | 0.013 | 0.020 |
| D0LZ74_HALO1/1–251 | 166 | −29877 | 0.874 | 0.009 | 0.008 |
| A8L1F1_FRASN/1–253 | 164 | −29836 | 0.876 | 0.013 | 0.020 |
| B2GID2_KOCRD/1–252 | 165 | −29170 | 0.901 | 0.008 | 0.005 |
| Q2RVS0_RHORT/1–258 | 159 | −28769 | 0.916 | 0.017 | 0.032 |
| Q0SE23_RHOJR/5–254 | 167 | −28767 | 0.916 | 0.007 | 0.000 |
| B1VSP7_STRGG/1–251 | 166 | −28218 | 0.937 | 0.007 | 0.001 |
| A1B987_PARDP/1–251 | 166 | −27970 | 0.946 | 0.007 | 0.001 |
| B2JNZ6_BURP8/1–251 | 166 | −27833 | 0.951 | 0.007 | 0.001 |
| C5B5H8_METEA/1–251 | 166 | −27569 | 0.961 | 0.007 | 0.001 |
| D5UVK7_TSUPD/1–251 | 166 | −27517 | 0.963 | 0.007 | 0.001 |
| B2HH42_MYCMM/1–251 | 166 | −27459 | 0.965 | 0.007 | 0.001 |
| B8EQK3_METSB/1–251 | 166 | −27337 | 0.970 | 0.007 | 0.001 |
| A9H5P1_GLUDA/1–251 | 166 | −27043 | 0.981 | 0.007 | 0.001 |
| C0ZVK4_RHOE4/1–251 | 166 | −26849 | 0.988 | 0.007 | 0.001 |
| B9JHD1_AGRRK/1–251 | 166 | −26780 | 0.991 | 0.007 | 0.001 |
| Q5YPL3_NOCFA/1–252 | 165 | −26532 | 1.000 | 0.008 | 0.005 |
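The NORMALIZED_WEIGHT column is numerically consistent with min-max rescaling of the WEIGHT column, so that the lowest weight maps to 0 and the highest to 1. The Python sketch below checks this on a few rows copied from the table. The function name `min_max` is an assumption, and the source does not state that EvalMSA computes the column this way; the check only shows that the published values agree with this formula.

```python
def min_max(values):
    """Rescale values linearly so that min -> 0.0 and max -> 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # degenerate case: all values equal
    return [(v - lo) / (hi - lo) for v in values]


# WEIGHT values for the first five rows and the last row of the table.
weights = [-53117, -50346, -47751, -47745, -37039, -26532]
normalized = [round(x, 3) for x in min_max(weights)]
print(normalized)  # [0.0, 0.104, 0.202, 0.202, 0.605, 1.0]
```

Applying the same rescaling to the GAPPINESS column gives values close to NORMALIZED_GAPPINESS, with small differences that are expected because the table reports gappiness rounded to three decimals.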
Figure 3. Execution time of the program with different alignment sizes. (A) Execution times for MSAs with different numbers of sequences (sequence length 10 kbp, viral genomes). (B) Execution times for MSAs with different numbers of sequences (sequence length 4.4 Mbp, bacterial genomes).