RDP5: a computer program for analyzing recombination in, and removing signals of recombination from, nucleotide sequence datasets.

Darren P Martin, Arvind Varsani, Philippe Roumagnac, Gerrit Botha, Suresh Maslamoney, Tiana Schwab, Zena Kelz, Venkatesh Kumar, Ben Murrell.

Abstract

For the past 20 years, the recombination detection program (RDP) project has focused on the development of a fast, flexible, and easy-to-use Windows-based recombination analysis tool. Whereas previous versions of this tool have relied on considerable user-mediated verification of detected recombination events, the latest iteration, RDP5, is automated enough that it can be integrated within analysis pipelines and run without any user input. The main innovation enabling this degree of automation is the implementation of statistical tests to identify recombination signals that could be attributable to evolutionary processes other than recombination. The additional analysis time required for these tests has been offset by algorithmic improvements throughout the program such that, relative to RDP4, RDP5 will still run up to five times faster and be capable of analyzing alignments containing twice as many sequences (up to 5000) that are five times longer (up to 50 million sites). For users wanting to remove signals of recombination from their datasets before using them for downstream phylogenetics-based molecular evolution analyses, RDP5 can disassemble detected recombinant sequences into their constituent parts and output a variety of recombination-free datasets in an array of alignment formats. For users who are interested in exploring the recombination history of their datasets, all the manual verification, data management and data visualization components of RDP5 have been extensively updated to minimize the amount of time users need to verify and refine the program's interpretation of each recombination event that it detects.
© The Author(s) 2020. Published by Oxford University Press.

Year:  2020        PMID: 33936774      PMCID: PMC8062008          DOI: 10.1093/ve/veaa087

Source DB:  PubMed          Journal:  Virus Evol        ISSN: 2057-1577


1. Introduction

Recombination and genome component reassortment are processes that strongly impact the evolution of many virus species. The patterns of nucleotide sequence variation created in virus genomic sequence datasets by these processes (which are hereafter collectively referred to as recombination) can seriously undermine the accuracy of phylogenetics-based molecular evolution analyses (Schierup and Hein 2000a,b; Scheffler, Martin, and Seoighe 2006; Arenas and Posada 2010). It is therefore frequently desirable to test nucleotide sequence datasets for evidence of recombination and, when such evidence is found, take steps to minimize the impacts of detected recombination events on downstream analyses of these datasets. Of the many available computer programs for analyzing recombination (see http://bioinf.man.ac.uk/robertson/recombination/programs.shtml; Martin et al. 2011), recombination detection program (RDP) is presently one of the most commonly used. Successive versions of RDP have applied an expanding array of recombination event detection, recombination breakpoint demarcation, and recombinant sequence identification methods, all applied in unison, to yield detailed descriptions of how recombination may have impacted the evolution of any given set of aligned nucleotide sequences (Martin et al. 2015). The accuracy of these descriptions, however, frequently depended on the amount of effort users were willing to put into exploring the many similarly plausible ways in which the detected patterns of recombination may have arisen. A guiding principle during the development of RDP5, the latest version of the RDP series, has therefore been a minimization of the amount of time that users need to invest in detecting and removing signals of recombination from nucleotide sequence datasets.

2. Generation of recombination-free datasets

RDP5 can take as input an aligned nucleotide sequence dataset, automatically identify and characterize individual recombination events that are evident within that dataset, and output modifications of the input alignment from which all signals of detectable recombination have been removed. The different types of modified recombination-free multiple sequence datasets that RDP5 can output include alignments where: (1) recombinant sequences have been removed; (2) fragments of sequence derived through recombination have been removed; (3) recombinant sequences are split up into their constituent parts; and (4) the input alignment is divided into multiple different gene/genome sub-region alignments based on the locations of detected recombination breakpoints (Supplementary Fig. S1).
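Output types (1), (2) and (4) can be illustrated with a minimal Python sketch. This is not RDP5's implementation. It assumes a hypothetical in-memory representation in which a non-empty alignment is a dictionary of equal-length strings and each detected event is a `(sequence name, start, end)` tuple with 0-based, end-exclusive coordinates. The function names are invented for illustration, and "removing" a fragment is assumed here to mean overwriting it with gap characters.

```python
from typing import Dict, List, Tuple

Alignment = Dict[str, str]    # sequence name -> aligned sequence (hypothetical layout)
Event = Tuple[str, int, int]  # (recombinant name, start, end), 0-based, end exclusive


def drop_recombinants(aln: Alignment, events: List[Event]) -> Alignment:
    """Output type (1): discard every sequence flagged as recombinant."""
    recombinant_names = {name for name, _, _ in events}
    return {n: s for n, s in aln.items() if n not in recombinant_names}


def mask_fragments(aln: Alignment, events: List[Event], gap: str = "-") -> Alignment:
    """Output type (2): overwrite each recombinant fragment with gap characters."""
    rows = {n: list(s) for n, s in aln.items()}
    for name, start, end in events:
        row = rows[name]
        # Clamp the event to the alignment so out-of-range coordinates are harmless.
        for i in range(max(0, start), min(len(row), end)):
            row[i] = gap
    return {n: "".join(r) for n, r in rows.items()}


def split_by_breakpoints(aln: Alignment, breakpoints: List[int]) -> List[Alignment]:
    """Output type (4): cut the alignment into sub-regions at breakpoint columns."""
    length = len(next(iter(aln.values())))
    cuts = [0] + sorted({b for b in breakpoints if 0 < b < length}) + [length]
    return [{n: s[a:b] for n, s in aln.items()} for a, b in zip(cuts, cuts[1:])]
```

Masking keeps every sequence in the dataset at the cost of missing data, whereas dropping recombinants keeps only unmodified sequences but reduces the sample size. Which trade-off is preferable depends on the downstream analysis.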

3. Query vs reference scans for recombination

In addition to the fully exploratory recombination analysis modes found in previous versions of the program (Martin et al. 2015), RDP5 also includes a new highly automated ‘query vs reference’ analysis mode similar to that found in the programs REGA (de Oliveira et al. 2005) and jpHMM (Schultz et al. 2006). Unlike the default fully exploratory recombination analysis mode, where every sequence is tested for evidence of recombination, the query vs reference mode will test a user-defined set of query sequences for evidence that they originated through recombination between a user-defined set of reference sequences. Such an analysis mode is well suited to analyzing patterns of recombination between two or more groups of viruses that have only recently had the opportunity to start recombining with one another, such as in an individual patient that has been infected with two distinct variants of a virus (Sheward et al. 2018), or within a geographical region where multiple distinct genetic variants of a virus have recently started co-circulating. To minimize the amount of effort needed to define reference sequences and/or groups of reference sequences, RDP5 will automatically identify sequences as queries and references based on simple sequence naming rules (see the manual provided with RDP5 for these rules). These naming rules can also be used to group reference sequences into different reference types.

4. Automated sequence annotation

Given an accessible internet connection, RDP5 will automatically annotate the genomic features of input virus sequence datasets using the curated NCBI virus reference sequence database (https://www.ncbi.nlm.nih.gov/genome/viruses/). Annotation is useful in the context of generating recombination-free datasets because it enables RDP5 to output gene sequence alignments that are suitable for downstream codon-focused analyses of natural selection. The gene sequence alignments that RDP5 produces can account for variation in the positions of gene start and stop codons, intron splicing (wherever intron donor and acceptor sites are annotated in the NCBI reference sequence records), portions of genes that are expressed in two or more different open reading frames (these regions can be excluded) and real and/or artifactual frame-shift mutations (partial codons can be removed). Sequence annotation is also useful for testing the selective processes that impact patterns of recombination detected within sequence datasets. Given a set of annotated genome sequences, recombination breakpoint distribution tests in RDP5 will now automatically detect whether observed recombination breakpoint distributions vary between: (1) coding and non-coding regions; (2) different genes; and (3) the edges and internal regions of genes.
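The idea behind comparison (1) can be sketched as a simple Monte Carlo test. The Python code below is an illustration only, under strong simplifying assumptions that are not taken from RDP5: breakpoints are treated as independent points, the null model places them uniformly at random along the genome, and coding regions are given as hypothetical `(start, end)` intervals with 0-based, end-exclusive coordinates. A realistic test would also need to account for factors such as variation in sequence diversity along the genome, which affects where breakpoints are detectable.

```python
import random
from typing import List, Sequence, Tuple

Region = Tuple[int, int]  # (start, end), 0-based, end exclusive (hypothetical layout)


def count_in_regions(positions: Sequence[int], regions: Sequence[Region]) -> int:
    """Count how many positions fall inside at least one region."""
    return sum(any(start <= p < end for start, end in regions) for p in positions)


def coding_enrichment_p(
    breakpoints: List[int],
    coding_regions: List[Region],
    genome_length: int,
    n_perm: int = 999,
    seed: int = 0,
) -> float:
    """One-sided p-value for an excess of breakpoints in coding regions.

    Null model (an assumption of this sketch): each breakpoint is placed
    uniformly at random along the genome.
    """
    rng = random.Random(seed)  # seeded so that results are reproducible
    observed = count_in_regions(breakpoints, coding_regions)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        simulated = [rng.randrange(genome_length) for _ in breakpoints]
        if count_in_regions(simulated, coding_regions) >= observed:
            at_least_as_extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_perm + 1)
```

The same structure extends to comparisons (2) and (3) by swapping the region set, for example the intervals of a single gene, or windows around gene boundaries.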

5. Detection of potential false-positive recombination signals

A crucial facilitator of the highly automated recombination analyses that RDP5 can perform is the inclusion of tests that are specifically designed to detect and flag potential false-positive signals of recombination. Besides automatically testing whether each of the recombination signals that RDP5 detects might be attributable to sequence misalignment (which is a major contributor to false-positive signals of recombination), the program will also detect and flag as suspicious any detectable recombination signals that may have arisen through evolutionary processes other than recombination. Specifically, RDP5 uses the PHI test (Bruen, Philippe, and Bryant 2006) and 4-gamete test (McVean, Awadalla, and Fearnhead 2002) adapted versions of the homoplasy test (Maynard Smith and Smith 1998) to flag apparent recombination signals that are potentially attributable to a combination of inter-lineage and inter-site mutation-rate variation rather than recombination (Bertrand, Johansson, and Norberg 2016).
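For readers unfamiliar with the 4-gamete criterion that underlies one of these tests, the following Python sketch shows the basic pairwise check. It is the plain textbook criterion, not the adapted homoplasy test that RDP5 applies, and the function names are hypothetical. Two biallelic sites are called incompatible when all four possible allele combinations are observed. Without recombination, this pattern requires at least one repeated or reverse mutation, which is why incompatible site pairs can be generated by mutation-rate variation as well as by recombination.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

NUCLEOTIDES = "ACGT"


def four_gamete_incompatible(col_i: Sequence[str], col_j: Sequence[str]) -> bool:
    """Return True if two alignment columns show all four allele combinations."""
    # Ignore sequences with gaps or ambiguity codes at either site.
    gametes = {
        (a, b) for a, b in zip(col_i, col_j) if a in NUCLEOTIDES and b in NUCLEOTIDES
    }
    alleles_i = {a for a, _ in gametes}
    alleles_j = {b for _, b in gametes}
    # Only biallelic site pairs are considered in this sketch.
    return len(alleles_i) == 2 and len(alleles_j) == 2 and len(gametes) == 4


def incompatible_pairs(seqs: List[str]) -> List[Tuple[int, int]]:
    """List every pair of columns (0-based) that fails the 4-gamete criterion."""
    columns = list(zip(*seqs))
    return [
        (i, j)
        for i, j in combinations(range(len(columns)), 2)
        if four_gamete_incompatible(columns[i], columns[j])
    ]
```

For the four sequences `AAC`, `ATC`, `TAC` and `TTG`, only the first two columns carry all four combinations (`AA`, `AT`, `TA`, `TT`), so `incompatible_pairs` reports the single pair `(0, 1)`.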

6. RDP5CL: a command-line version of RDP5

For instances where users would like to integrate RDP5 into an analysis pipeline, a separate command-line driven version of the program, RDP5CL, is distributed with RDP5. RDP5CL will take as input a multiple sequence alignment in any standard alignment file format and, contingent on command line switches, output any of various different types of recombination-free alignments, recombination breakpoint distribution plots and/or maximum likelihood (Stamatakis 2014) phylogenetic trees accounting for recombination.

7. Improved computational performance

Despite the additional tests that RDP5 carries out during the characterization of detected recombination events, it is still able to analyze any given dataset between two and five times faster than RDP4 could (Supplementary Table S1). To achieve this, RDP5 implements a multitude of algorithmic improvements such as multi-core CPU-level parallelization and intensive use of lookup tables for bootstrap, permutation, likelihood and probability calculations. Additionally, while previous versions of RDP were restricted to using 2 GB of available RAM (which is standard for 32-bit Windows programs), RDP5 can utilize 4 GB of RAM even under 32-bit Windows operating systems. This, together with improvements in memory management, means that RDP5 can analyze datasets that contain up to twice as many sequences that are each five times longer than those that RDP4 could manage. The improved computational performance of RDP5 also extends to the manual recombination-signal verification, data management and data visualization components of the program. These enhancements will substantially decrease the amount of time that it takes users to manually verify and refine RDP5’s interpretation of each of the individual recombination events that it detects.
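As a generic illustration of how a lookup table can speed up repeated probability calculations, the Python sketch below precomputes log-factorials once and reuses them for binomial probabilities. It is an assumption-laden example of the general technique rather than a description of RDP5's code: the function names are hypothetical and the binomial distribution is chosen only because it is simple to verify.

```python
import math
from typing import List


def build_log_factorials(n_max: int) -> List[float]:
    """Precompute log(n!) for n = 0..n_max so later lookups cost O(1)."""
    table = [0.0] * (n_max + 1)
    for n in range(1, n_max + 1):
        table[n] = table[n - 1] + math.log(n)
    return table


def binomial_pmf(k: int, n: int, p: float, log_fact: List[float]) -> float:
    """P(X = k) for X ~ Binomial(n, p), with 0 < p < 1, using the table."""
    log_coeff = log_fact[n] - log_fact[k] - log_fact[n - k]
    return math.exp(log_coeff + k * math.log(p) + (n - k) * math.log(1.0 - p))
```

Building the table costs one pass up to the largest `n` needed. After that, each probability requires a few array reads and one exponentiation, regardless of how large `n` and `k` are, which matters when the same quantities are evaluated many times during permutation or bootstrap procedures.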

8. Operational limits

RDP5 will work on computers with Windows 7/Vista/8/10 operating systems and can also be installed to run under Windows 7 emulators running on computers with Mac OS X and UNIX operating systems. RDP5 can be used to productively analyze datasets containing up to 400 million nucleotides within 24 h on a standard 8-core 2.5 GHz processor with >4 GB of RAM. Such datasets might, for example, consist of 120 3.0-Mb-long bacterial genome sequences, or 4000 10-kb-long viral genome sequences. With default program settings, RDP5 can analyze 100 10-kb-long sequences in under 5 min on a standard desktop computer.

Availability

RDP5 is available for free download from http://web.cbio.uct.ac.za/~darren/rdp.html. It is distributed with an extensive manual that contains (1) detailed descriptions of the various methods implemented in the program; (2) a step-by-step guide describing how to create and analyze datasets for recombination detection; (3) instructions on how to run completely automated analyses from the command-line using RDP5CL; and (4) information on how to run RDP5 on Mac and Linux computers.

Funding

Z.K. was funded by the South African National Research Foundation. B.M. was supported by the Swedish Research Council (2018-02381). D.P.M. was supported by the H3Africa

Supplementary data

Supplementary data are available at Virus Evolution online.

Conflict of interest: None declared.
References

1. M H Schierup; J Hein. Consequences of recombination on traditional phylogenetic analysis. Genetics. 2000-10.

2. Tulio de Oliveira; Koen Deforche; Sharon Cassol; Mika Salminen; Dimitris Paraskevis; Chris Seebregts; Joe Snoeck; Estrelita Janse van Rensburg; Annemarie M J Wensing; David A van de Vijver; Charles A Boucher; Ricardo Camacho; Anne-Mieke Vandamme. An automated genotyping system for analysis of HIV-1 and other microbial sequences. Bioinformatics. 2005-08-02.

3. Konrad Scheffler; Darren P Martin; Cathal Seoighe. Robust inference of positive selection from recombining coding sequences. Bioinformatics. 2006-08-07.

4. Trevor C Bruen; Hervé Philippe; David Bryant. A simple and robust statistical test for detecting the presence of recombination. Genetics. 2006-02-19.

5. Miguel Arenas; David Posada. The effect of recombination on the reconstruction of ancestral sequences. Genetics. 2010-02-01.

6. Darren P Martin; Philippe Lemey; David Posada. Analysing recombination in nucleotide sequences. Mol Ecol Resour. 2011-05-19.

7. J Maynard Smith; N H Smith. Detecting recombination from gene trees. Mol Biol Evol. 1998-05.

8. Anne-Kathrin Schultz; Ming Zhang; Thomas Leitner; Carla Kuiken; Bette Korber; Burkhard Morgenstern; Mario Stanke. A jumping profile Hidden Markov Model and applications to recombination sites in HIV and HCV genomes. BMC Bioinformatics. 2006-05-22.

9. Darren P Martin; Ben Murrell; Michael Golden; Arjun Khoosal; Brejnev Muhire. RDP4: Detection and analysis of recombination patterns in virus genomes. Virus Evol. 2015-05-26.

10. Daniel J Sheward; Jinny Marais; Valerie Bekker; Ben Murrell; Kemal Eren; Jinal N Bhiman; Molati Nonyane; Nigel Garrett; Zenda L Woodman; Quarraisha Abdool Karim; Salim S Abdool Karim; Lynn Morris; Penny L Moore; Carolyn Williamson. HIV Superinfection Drives De Novo Antibody Responses and Not Neutralization Breadth. Cell Host Microbe. 2018-09-27.