Sònia Casillas, Antonio Barbadilla.
Abstract
Pipeline Diversity Analysis (PDA) is an open-source, web-based tool that allows the exploration of polymorphism in large datasets of heterogeneous DNA sequences, and can be used to create secondary polymorphism databases for different taxonomic groups, such as the Drosophila Polymorphism Database (DPDB). A new version of the pipeline presented here, PDA v.2, incorporates substantial improvements, including new methods for data mining and grouping sequences, new criteria for data quality assessment and a better user interface. PDA is a powerful tool to obtain and synthesize existing empirical evidence on genetic diversity in any species or species group. PDA v.2 is available on the web at http://pda.uab.es/.
Year: 2006 PMID: 16845088 PMCID: PMC1538800 DOI: 10.1093/nar/gkl080
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Example showing the new algorithm for maximizing the number of informative sites. (1) Input sequences are grouped according to their length, so that sequences within a group cannot differ by more than 20% of their length. In this example, the eight input sequences are split into two groups (group 1 and group 2). (2) Here the number of ‘informative sites’ of a set is defined as the number of non-gapped positions multiplied by the number of sequences in the set (note that this differs from the definition of ‘informative site’ typically used in phylogenetics). PDA v.2 calculates the number of informative sites in each cumulative group of sequences, starting with the group of the longest sequences (group 1 = 168 informative sites) and adding at each step the next group of sequences in order of decreasing length (groups 1 + 2 = 56 informative sites). (3) Finally, PDA v.2 shows the alignment with all the sequences, but uses for the estimations the set of sequences that offers the largest number of informative sites, in some cases discarding the shortest sequences. In this case, PDA v.2 would use only the four longest sequences (group 1). To distinguish the sequences used in the analyses from those discarded, PDA v.2 uses a color code: green for sequences included in the estimates and red for sequences that were not.
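The selection procedure described in the caption can be sketched in a few lines of Python. This is an illustrative sketch only, not the PDA v.2 source code. The function names are hypothetical. The caption does not say exactly how the 20% rule is applied, so the sketch assumes a new group starts whenever a sequence is shorter than 80% of the longest sequence in the current group. The toy alignment is invented so that it reproduces the totals quoted in the caption (168 and 56), under the assumption of four sequences with 42 non-gapped positions and four with 7.

```python
def ungapped_length(seq):
    """Length of an aligned sequence, ignoring gap characters."""
    return len(seq) - seq.count("-")


def group_by_length(aligned, max_diff=0.20):
    """Group sequence indices by ungapped length, longest first.

    Assumption: a sequence joins the current group if its length is at
    least (1 - max_diff) of the longest sequence in that group.
    """
    order = sorted(range(len(aligned)),
                   key=lambda i: ungapped_length(aligned[i]),
                   reverse=True)
    groups = []
    for i in order:
        n = ungapped_length(aligned[i])
        if groups and n >= (1 - max_diff) * ungapped_length(aligned[groups[-1][0]]):
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups


def informative_sites(aligned, members):
    """Non-gapped alignment columns times the number of sequences in the set."""
    columns = zip(*(aligned[i] for i in members))
    non_gapped = sum(1 for col in columns if "-" not in col)
    return non_gapped * len(members)


def best_cumulative_set(aligned, max_diff=0.20):
    """Return (indices, score) of the cumulative set with the most informative sites."""
    best, best_score, current = [], -1, []
    for group in group_by_length(aligned, max_diff):
        current = current + group  # add the next-longest group
        score = informative_sites(aligned, current)
        if score > best_score:
            best, best_score = list(current), score
    return sorted(best), best_score


# Toy alignment (hypothetical data): four long and four short sequences.
alignment = ["A" * 42] * 4 + ["A" * 7 + "-" * 35] * 4
used, score = best_cumulative_set(alignment)
print(used, score)  # [0, 1, 2, 3] 168
```

In this toy case the first cumulative set (group 1) scores 4 × 42 = 168, while adding group 2 reduces the score to 8 × 7 = 56, so only the four longest sequences would be kept for the estimates.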