Jorge A Hongo, Giovanni M de Castro, Leandro C Cintra, Adhemar Zerlotini, Francisco P Lobo.
Abstract
BACKGROUND: Detection of genes evolving under positive Darwinian evolution in genome-scale data is now a prevailing strategy in comparative genomics studies for identifying genes potentially involved in adaptation processes. Despite the large number of studies aiming to detect and contextualize such gene sets, there is virtually no software available to perform this task in a general, automatic, large-scale and reliable manner. This is certainly due to the computational challenges involved, such as the appropriate modeling of the data under analysis, the computation time needed for several of the required steps when dealing with genome-scale data, and the highly error-prone nature of the sequence and alignment data structures needed for genome-wide positive selection detection.
Year: 2015 PMID: 26231214 PMCID: PMC4521464 DOI: 10.1186/s12864-015-1765-0
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 General schema of POTION. Black boxes represent user-provided files and final results, grey boxes indicate filtering steps, and white boxes indicate parallelized steps performed for each valid group of homologs. Filtering steps comprise four sequential conceptual stages (A–D), each composed of one or more sequential filters (numbered steps). Stage “A” comprises four filters for removal of sequence data: (1) absence of valid start and/or stop codons; (2) presence of non-standard nucleotides; (3) length not a multiple of three and (4) lower and upper bounds for sequence length. Stage “B” comprises one filter to remove sequences and groups according to homology relationships within groups, allowing users to analyze the biologically meaningful gene sets of their choice (1-1 orthologs and/or paralogs, for instance). Stage “C” comprises four filters for sequences and groups: (1) mean sequence identity of groups or of individual sequences; (2) removal of groups containing any sequence removed in previous steps, allowing users to analyze only high-quality data from the beginning of the analysis; (3) removal of groups whose sequence and species counts fall outside user-defined ranges and (4) removal of groups with no sequence from a user-defined anchor genome. Stage “D” comprises a filter in which POTION detects groups with evidence of recombination using three methods (Phi, NSS, Max Chi2), followed by multiple hypothesis correction. After the filtering steps, POTION executes the following sequential analyses in parallel for each valid group of homologs: multiple protein sequence alignment using one of three popular sequence aligners (MUSCLE, MAFFT or PRANK); protein-guided codon alignment; alignment trimming using TrimAl; phylogenetic tree reconstruction using proml and dnaml from phylip; and a search for positive selection using codeml (site-model analysis with the nested models M1a/M2, M7/M8 and M8a/M8), followed by multiple hypothesis correction. POTION then parses the output files and writes the final result files (fasta and flat files) for groups with evidence of recombination and positive selection.
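The four stage “A” sequence filters described in the Fig. 1 caption can be illustrated with a minimal Python sketch. This is not POTION's own code: the function name `stage_a_failures`, the failure labels, the default length bounds, and the choice of ATG as the only valid start codon (with the standard stop codons TAA, TAG and TGA) are all assumptions made for illustration.

```python
# Illustrative sketch of the four stage "A" filters; not POTION's implementation.
# Assumed for illustration: ATG as the only start codon, standard stop codons,
# and the default length bounds below.
STANDARD_NUCLEOTIDES = set("ACGT")
START_CODONS = {"ATG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def stage_a_failures(seq, min_len=150, max_len=30000):
    """Return the names of the stage A filters that a coding sequence fails."""
    seq = seq.upper()
    failures = []
    # (1) absence of a valid start and/or stop codon
    if seq[:3] not in START_CODONS or seq[-3:] not in STOP_CODONS:
        failures.append("start_stop")
    # (2) presence of non-standard nucleotides (anything other than A, C, G, T)
    if set(seq) - STANDARD_NUCLEOTIDES:
        failures.append("non_standard")
    # (3) length not a multiple of three
    if len(seq) % 3 != 0:
        failures.append("not_multiple_of_three")
    # (4) length outside the lower and upper bounds
    if not (min_len <= len(seq) <= max_len):
        failures.append("length_bounds")
    return failures
```

A sequence would be kept only when the returned list is empty; returning the failed filter names rather than a boolean makes it easy to report why a sequence was removed.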
Fig. 2 TubercuList categories significantly enriched in positively selected genes in M. tuberculosis. The TubercuList category of PE/PPE paralogs is significantly more represented in the list of positively selected genes in the H37Rv strain than among all coding genes. Count data for positively selected genes were obtained in the FILTER experiment, and count data for the background frequencies were obtained from the set of all genes that remained valid after the filtering procedures and are also assigned to a given functional category as defined in the TubercuList database [45].
Fig. 3 Evaluation of POTION parallelization. a Time to compute the first 300 groups of homologs from the MYC dataset while changing the number of processors. b Time to compute the TRYP dataset while changing the number of processors. Computation time decreases following a power law as the number of processors increases, up to the limits of the algorithm currently implemented in POTION.
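A power-law relationship between run time and processor count, t = a · p^b, can be checked by fitting a straight line in log-log space. The sketch below shows one way to do this in Python. The function name `fit_power_law` is hypothetical, and the timings are synthetic values generated from an exact power law purely to exercise the code; they are not measurements from the MYC or TRYP datasets.

```python
import math


def fit_power_law(processors, times):
    """Least-squares fit of t = a * p**b via linear regression on log-log values."""
    xs = [math.log(p) for p in processors]
    ys = [math.log(t) for t in times]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                       # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)   # intercept in log-log space = log(a)
    return a, b


# Synthetic timings (NOT real benchmark data): an exact power law with
# an assumed prefactor of 100 and an assumed exponent of -0.8.
procs = [1, 2, 4, 8, 16]
synthetic_times = [100.0 * p ** -0.8 for p in procs]
a, b = fit_power_law(procs, synthetic_times)
```

With real timings, an exponent b close to -1 would indicate near-ideal scaling, while values closer to 0 would indicate diminishing returns from additional processors.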