Benjamin E R Rubin, Richard H Ree, Corrie S Moreau.
Abstract
Reduced-representation genome sequencing represents a new source of data for systematics, and its potential utility in interspecific phylogeny reconstruction has not yet been explored. One approach that seems especially promising is the use of inexpensive short-read technologies (e.g., Illumina, SOLiD) to sequence restriction-site associated DNA (RAD)--the regions of the genome that flank the recognition sites of restriction enzymes. In this study, we simulated the collection of RAD sequences from sequenced genomes of different taxa (Drosophila, mammals, and yeasts) and developed a proof-of-concept workflow to test whether informative data could be extracted and used to accurately reconstruct "known" phylogenies of species within each group. The workflow consists of three basic steps: first, sequences are clustered by similarity to estimate orthology; second, clusters are filtered by taxonomic coverage; and third, they are aligned and concatenated for "total evidence" phylogenetic analysis. We evaluated the performance of clustering and filtering parameters by comparing the resulting topologies with well-supported reference trees and we were able to identify conditions under which the reference tree was inferred with high support. For Drosophila, whole genome alignments allowed us to directly evaluate which parameters most consistently recovered orthologous sequences. For the parameter ranges explored, we recovered the best results at the low ends of sequence similarity and taxonomic representation of loci; these generated the largest supermatrices with the highest proportion of missing data. Applications of the method to mammals and yeasts were less successful, which we suggest may be due partly to their much deeper evolutionary divergence times compared to Drosophila (crown ages of approximately 100 and 300 versus 60 Mya, respectively). 
RAD sequences thus appear to hold promise for reconstructing phylogenetic relationships in younger clades in which sufficient numbers of orthologous restriction sites are retained across species.
Year: 2012 PMID: 22493668 PMCID: PMC3320897 DOI: 10.1371/journal.pone.0033394
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Species of Drosophila, mammals, and fungi included in this study.
| Species | Genome size (bases) | #50 bp reads | |
| | 151452809 | 3628 | 5230 |
| | 125196136 | 3072 | 4372 |
| | 136438528 | 2166 | 2224 |
| | 120416594 | 2666 | 4000 |
| | 156491326 | 4034 | 3062 |
| | 144081254 | 4636 | 5146 |
| | 136104117 | 4704 | 5114 |
| | 120266937 | 2698 | 4208 |
| | 125504156 | 2672 | 4058 |
| | 155510099 | 4960 | 3176 |
| | 171979099 | 1540 | 2038 |
| | 123618427 | 2866 | 4288 |
Genome size is given as the total number of nucleotides (bases). #50 bp reads is the number of simulated RAD sequences obtained with the given restriction enzyme. The number of 100 bp reads was similar but could differ slightly because sequences that failed the length requirement were not included.
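The read counts above come from an in silico digest. A minimal sketch of such a simulation follows, under several assumptions that are not taken from the study: the function names are hypothetical, each restriction site is assumed to yield one read in each direction starting immediately beside the recognition sequence, and the default site is the SbfI recognition sequence CCTGCAGG. Reads that run off the end of the sequence are dropped, which is one way a length requirement could make 100 bp counts differ from 50 bp counts.

```python
def revcomp(seq):
    """Reverse complement of an uppercase DNA string (unknown bases become N)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp.get(base, "N") for base in reversed(seq))


def simulate_rad_reads(genome, site="CCTGCAGG", read_len=50):
    """Collect fixed-length reads flanking every occurrence of `site`.

    Assumption: reads start right next to the recognition sequence and
    extend outward in both directions; the study's exact offsets may differ.
    """
    reads = []
    start = genome.find(site)
    while start != -1:
        end = start + len(site)
        downstream = genome[end:end + read_len]
        upstream = genome[max(0, start - read_len):start]
        # Length requirement: discard reads truncated by a sequence end.
        if len(downstream) == read_len:
            reads.append(downstream)
        if len(upstream) == read_len:
            reads.append(revcomp(upstream))
        start = genome.find(site, start + 1)
    return reads


if __name__ == "__main__":
    toy_genome = "A" * 60 + "CCTGCAGG" + "C" * 60
    print(len(simulate_rad_reads(toy_genome, read_len=50)))
    print(len(simulate_rad_reads(toy_genome, read_len=100)))
```

In the toy sequence the single site has only 60 bases on either side, so it contributes two 50 bp reads but no 100 bp reads.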
Figure 1. Reference phylogenies of each study group.
All branch lengths are arbitrary and do not indicate evolutionary distance. A) Drosophila phylogeny modified from [4]. The inset shows the two alternative topologies commonly supported by individual gene trees in [32]. B) Reference mammal phylogeny from [20]. C) Reference yeast phylogeny from [3].
Figure 2. The orthology of one replicate of the 100 bp SbfI Drosophila matrices based on the concatenated alignment (701-299,470 bp) of all 12 genomes after restriction cutting and clustering without prior knowledge of orthology.
Each column of square pixels bounded by white lines represents a single cluster (locus) produced by a given set of parameters. Each row within these clusters represents a single taxon. Therefore, between each pair of horizontal white lines is a grid where rows are taxa and columns are clusters. The order of taxa from top to bottom of each cluster is: D. simulans, D. sechellia, D. melanogaster, D. yakuba, D. erecta, D. ananassae, D. pseudoobscura, D. persimilis, D. willistoni, D. virilis, D. mojavensis, and D. grimshawi. The area in the white box is blown up in the inset to show detail. Within a cluster, black indicates that a taxon did not have a sequence in that cluster. Colors in a cluster represent orthologous sequences. For example, the top right cluster (or last column in the top row) in the expanded portion contains orthologous sequences from D. simulans and D. sechellia (yellow), and orthologous sequences from D. melanogaster, D. yakuba, and D. erecta (green), though sequences from the two groups are not orthologous. The cluster immediately to the left contains orthologous sequences from D. pseudoobscura, D. persimilis, D. mojavensis, and D. grimshawi. The values of similarity used for clustering the sequences in each matrix are indicated on the left and the minimum threshold number of taxa (min. taxa) is indicated by the plots on the right. These plots are exactly as in Fig. 3. Note that many parameter combinations yield matrices that span several lines. The boundaries between matrix representations are indicated on the left.
Figure 3. Accuracy of the RAD method for inferring Drosophila phylogeny.
Proportions are indicated on the left axis. The x-axis shows the percent similarity used for clustering, the three rows show each minimum cluster size, and the read lengths and restriction sites used are indicated by column. Gray bars represent total matrix length as represented on the right axis. Black points are the mean proportion of correct nodes in a tree (out of a total of 9), blue points are the mean proportion of correct nodes with bootstrap support greater than 70%, and red points are the mean proportion of incorrect nodes with bootstrap support greater than 70%. Purple points are the proportion of clusters that are orthologous and yellow points are the proportion of invariant sites within clusters. Results from every set of parameters are shown. Points represent the mean ± SE of the five replicates of clustering, filtering, and tree inference for each set of parameters with randomized input order of sequences into UCLUST. However, not all parameter sets produced five usable matrices; a matrix was unusable when one or more taxa had entirely empty sequence. The number of successful replicates is shown in Table S2.
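The summary statistics plotted in Figure 3 can be illustrated with a short sketch. It rests on assumptions not stated in the caption: trees are represented simply as sets of clades (each clade a frozenset of taxon names), the standard error is taken as the sample standard deviation divided by the square root of the number of replicates, and the function names are hypothetical.

```python
import statistics


def proportion_correct(inferred_clades, reference_clades):
    """Fraction of reference clades that also appear in the inferred tree."""
    reference = set(reference_clades)
    return len(reference & set(inferred_clades)) / len(reference)


def mean_and_se(values):
    """Mean and standard error of replicate scores.

    Returns NaN for the standard error when fewer than two replicates
    succeeded, since the sample standard deviation is then undefined.
    """
    mean = statistics.mean(values)
    if len(values) < 2:
        return mean, float("nan")
    return mean, statistics.stdev(values) / len(values) ** 0.5


if __name__ == "__main__":
    reference = {frozenset("AB"), frozenset("ABC")}
    replicates = [
        {frozenset("AB"), frozenset("ABC")},  # both reference clades recovered
        {frozenset("AB"), frozenset("BC")},   # one reference clade recovered
    ]
    scores = [proportion_correct(tree, reference) for tree in replicates]
    print(mean_and_se(scores))
```

Handling fewer than five replicates matters here because some parameter sets did not yield five usable matrices.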
A workflow for phylogenetic inference using RAD sequences.
| Steps to determine if the species you wish to study are appropriate for RAD phylogenetics |
| How much evolutionary divergence time do you expect between taxa? RAD appears to work well for ≤50 million year divergences but consistently fails at ≥100 million years. |
| Collect samples |
| High quality genomic DNA is the required input. It is better if there is a continuum of relatedness between taxa so that each species has at least some close relatives included in the analysis. |
| Prep and sequence DNA |
| This can be done either in-house or by sending samples to a sequencing facility. |
| Filter sequences and call consensus loci |
| Some sequence reads will be ambiguous or of low quality. These should be discarded. High coverage of loci allows for probabilistic analyses of the most likely base at each position. |
| Cluster sequences (Step 1 from Methods) |
| A variety of clustering similarities should be tried to test the consistency and believability of results. UCLUST was used for clustering in this study. |
| Choose minimum taxa cluster sizes (Step 2) |
| Small minimum taxa cluster sizes tend to produce the best topologies, but larger values may be useful with very large datasets. Any cluster smaller than the chosen minimum taxa cluster size is excluded, as are clusters in which any sample is represented by multiple sequences. |
| Align clusters of sequences (Step 3) |
| Each cluster of sequences should be individually aligned using an automated alignment program. The volume of data precludes manual alignment. |
| Concatenate clusters (Step 3) |
| All clusters should be concatenated, filling in missing sequences from each cluster with gaps. There will be many missing sequences. |
| Reconstruct phylogeny |
| RAxML |
| Compare results from different parameters |
| Different sets of reasonably chosen parameters should produce similar topologies. Although low clustering similarities were successful in our study, higher similarities may be more useful for more recently diverged taxa. Low clustering thresholds may allow for more data, but more data may also be discarded if multiple sequences from a single species more often end up clustering together. |
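The filtering and concatenation steps of the workflow can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline used in the study: the function names and data layout are hypothetical, each cluster is a list of (taxon, sequence) pairs, and the sequences within a kept cluster are assumed to have already been aligned to equal length before concatenation.

```python
from collections import Counter


def filter_clusters(clusters, min_taxa):
    """Keep clusters with at least `min_taxa` taxa and one sequence per taxon."""
    kept = []
    for cluster in clusters:
        counts = Counter(taxon for taxon, _ in cluster)
        if len(counts) < min_taxa:
            continue  # too few taxa represented
        if any(n > 1 for n in counts.values()):
            continue  # a taxon appears more than once in this cluster
        kept.append(dict(cluster))
    return kept


def concatenate(aligned_clusters, taxa, missing="-"):
    """Build a supermatrix, filling absent taxa with gap characters."""
    parts = {taxon: [] for taxon in taxa}
    for cluster in aligned_clusters:
        # Assumption: all sequences in an aligned cluster share one length.
        width = len(next(iter(cluster.values())))
        for taxon in taxa:
            parts[taxon].append(cluster.get(taxon, missing * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


if __name__ == "__main__":
    clusters = [
        [("A", "ACGT"), ("B", "ACGA"), ("C", "ACGG")],
        [("A", "TT"), ("B", "TA")],
        [("A", "GG"), ("A", "GC"), ("B", "GG"), ("C", "GG")],  # A duplicated
        [("C", "CC")],                                          # one taxon only
    ]
    kept = filter_clusters(clusters, min_taxa=2)
    for taxon, row in concatenate(kept, ["A", "B", "C"]).items():
        print(taxon, row)
```

Lowering `min_taxa` keeps more clusters and so lengthens the supermatrix, at the cost of more gap-filled cells, which mirrors the trade-off between matrix size and missing data described in the abstract.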