
Anonymous nuclear markers data supporting species tree phylogeny and divergence time estimates in a cactus species complex in South America.

Manolo F Perez, Bryan C Carstens, Gustavo L Rodrigues, Evandro M Moraes

Abstract

Supporting data for the article "Anonymous nuclear markers reveal taxonomic incongruence and long-term disjunction in a cactus species complex with continental-island distribution in South America" (Perez et al., 2016) [1]. Here, we present the results of the pyrosequencing filtering steps, primer sequences, a cpDNA phylogeny, and a species tree phylogeny.

Keywords:  Molecular markers; Next generation sequencing; Non-model species; Phylogeography; Species tree

Year:  2015        PMID: 26900589      PMCID: PMC4716445          DOI: 10.1016/j.dib.2015.12.002

Source DB:  PubMed          Journal:  Data Brief        ISSN: 2352-3409


Specifications Table

Value of the data

The results of the pyrosequencing filtering steps enable comparisons with other genomic studies in non-model species. The primer sequences allow researchers to test and use this genomic information in related taxa. The plastid (cpDNA) and multilocus phylogenies allow the topologies recovered with the two sets of markers to be compared, and also enable comparisons with other codistributed taxa.

Data

The data shared in this article consist of primer sequences designed after filtering two pyrosequencing runs, sequence data from 25 nuclear markers in 40 individuals from four species of the Pilosocereus aurisetus species complex, and the species tree and chloroplast topologies used in Perez et al. [1].

Experimental design, materials and methods

Bioinformatic analysis

The pyrosequencing reads were quality-controlled with the FASTX-toolkit (http://hannonlab.cshl.edu/fastx_toolkit/) and the pyRAD package [2] to recover variable loci with data available across the species and populations analyzed. The following parameters were applied: (1) ≥5 identical sequences for each allele, to minimize the recovery of sequencing errors and homopolymers; (2) ≤2 different bases at a given nucleotide position, as the organisms are diploid and showed no signal of polyploidy [3]; (3) ≤20 polymorphic sites per locus, to avoid the inclusion of paralogous loci, which usually show high levels of variation. The dataset remaining after each quality control step is shown in Table 1. Filtering resulted in 223 loci occurring in at least 10 individuals, which were aligned against GenBank with Blastn (Table 2). All loci that matched cytoplasmic sequences (cpDNA and mtDNA) and retrotransposons were discarded, resulting in 26 loci present in all populations sampled.
Primers were developed for these loci in Primer3 v4.0.0 [4] with the following parameters: (1) primer size between 18 and 23 bp; (2) melting temperature (Tm) between 58 and 63 °C; (3) maximum difference of 2 °C between the Tm of the forward and reverse primers; (4) GC content of 20–70%. All the developed loci showed specific amplification in at least one sample, but one marker was discarded from further analysis owing to amplification and sequencing problems in the outgroup. Sanger sequencing was performed for 117 sequences (both strands), selected to ensure data for at least two individuals per locus. After combining the Sanger and pyrosequencing data for the 25 loci, a total of 687 sequences from 40 individuals were obtained (Supplementary Table 1), with a total of 367 SNPs.
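The three locus-level thresholds described above (allele depth, bases per site, polymorphic sites) can be illustrated with a minimal Python sketch. This is not pyRAD's implementation: the data layout (a dictionary mapping each individual to its aligned reads at one locus) and all function names are assumptions made for illustration only.

```python
from collections import Counter

MIN_ALLELE_DEPTH = 5        # (1) >= 5 identical sequences per allele
MAX_BASES_PER_SITE = 2      # (2) <= 2 different bases per position (diploid)
MAX_POLYMORPHIC_SITES = 20  # (3) <= 20 polymorphic sites per locus

def call_alleles(reads, min_depth=MIN_ALLELE_DEPTH):
    """Keep only sequences observed at least min_depth times in one individual."""
    counts = Counter(reads)
    return [seq for seq, depth in counts.items() if depth >= min_depth]

def max_bases_per_site(alleles):
    """Largest number of distinct bases found at any aligned position."""
    return max((len(set(column)) for column in zip(*alleles)), default=0)

def polymorphic_sites(seqs):
    """Number of aligned positions with more than one base."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)

def locus_passes(reads_by_individual):
    """Apply the three thresholds to one locus (hypothetical data layout)."""
    kept = []
    for reads in reads_by_individual.values():
        alleles = call_alleles(reads)
        if not alleles:
            continue  # no allele reached the depth threshold in this individual
        if max_bases_per_site(alleles) > MAX_BASES_PER_SITE:
            return False  # more than two bases at a site within a diploid
        kept.extend(alleles)
    return bool(kept) and polymorphic_sites(kept) <= MAX_POLYMORPHIC_SITES

# Hypothetical toy locus: the single "TCGT" read is treated as an error and dropped.
example = {
    "ind1": ["ACGT"] * 5 + ["ACGA"] * 6 + ["TCGT"],
    "ind2": ["ACGT"] * 7,
}
print(locus_passes(example))
```

The sketch assumes reads are already aligned and of equal length; gaps and ambiguity codes are compared as ordinary characters.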
The resulting loci were tested for recombination using the DSS method [5] as implemented in the software package TOPALi v2 [6], and for signatures of selection using Tajima's D and Fu and Li's D* and F* in DnaSP [7]. The results of the recombination and selection tests, as well as the main characteristics of each locus, are available in Table 3.
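For readers unfamiliar with the neutrality statistic, the following Python sketch computes Tajima's D from its standard definition (Tajima, 1989). It is not DnaSP's code, does not cover Fu and Li's D* and F*, and the function names are illustrative. Note that it works with the mean number of pairwise differences over the whole locus, not the per-site π reported in Table 3, and that gaps or ambiguity codes are compared as ordinary characters.

```python
from itertools import combinations
from math import sqrt

def summarize_alignment(seqs):
    """Return (n, S, k): sample size, segregating sites, mean pairwise differences."""
    n = len(seqs)
    if n < 2:
        raise ValueError("at least two sequences are required")
    seg_sites = sum(1 for column in zip(*seqs) if len(set(column)) > 1)
    pairs = list(combinations(seqs, 2))
    total_diffs = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return n, seg_sites, total_diffs / len(pairs)

def tajimas_d(n, seg_sites, mean_pairwise_diff):
    """Tajima's D; returns None when the statistic is undefined."""
    if n < 4 or seg_sites == 0:
        return None  # variance term is zero or meaningless in these cases
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    # Difference between the two estimators of theta, scaled by its standard deviation.
    return (mean_pairwise_diff - seg_sites / a1) / sqrt(variance)

# Hypothetical toy alignment (not data from this article).
n, s, k = summarize_alignment(["AAAA", "AAAT", "AATT", "ATTT"])
print(n, s, round(k, 3), tajimas_d(n, s, k))
```

Negative values indicate an excess of low-frequency variants relative to neutral expectations; significance still has to be assessed against the appropriate null distribution, as DnaSP does.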
Table 1

Results from pyrosequencing runs and filtering steps.

| Filtering step | Amount of data |
|---|---|
| Total number of reads | 2,282,266 |
| Barcoded reads >100 bp and QC | 1,511,080 |
| Mean number of aligned loci (95% similarity) | 13,218 |
| Mean number of pre-loci (≥5 similar sequences in one individual) | 892 |
| Paralog filter (≤20 SNPs) | 530 |
| Loci in ≥10 individuals | 223 |
| Loci in all species | 167 |
| Loci in all populations | 48 |
| Manual paralog inspection | 36 |
| Without matches in GenBank | 26 |
| Amplified in all individuals tested (including outgroup) | 25 |

Table 2

Blastn matches for the 223 filtered pyrosequencing loci occurring in at least 10 individuals.

| Marker type | Number of matches |
|---|---|
| ANL (E-value < 10⁻⁴) | 121 |
| cpDNA | 8 |
| mtDNA | 27 |
| Retrotransposon | 5 |
| RNA | 62 |

Table 3

Primers and statistics for each locus.

| Locus | Primer Fwd and Primer Rev (5′–3′), concatenated | Tm | N | bp | S | θw | π | Tajima's D | Fu and Li's D | Fu and Li's F | Model | Strict | LogNormal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PaANL008 | TCCTCTCTTTTCTAGGGACGACCCCCATTCTTTCTTCATTCTATC | 52 | 58 | 497 | 5 | 0.002 | 0.001 | −0.89 | −0.89 | −1.04 | F81 | −776.2717 | −770.3064 |
| PaANL010 | GAGAACGTCAATCCGACAGGGAACATAGGCTGGCCTCTTC | 53 | 70 | 473 | 4 | 0.002 | 0.001 | −1.03 | 0.97 | 0.40 | JC | −738.62 | −743.16 |
| PaANL015 | GACCCTAACGAGGGTGAGACAAATCATTTCATGAGGCATCG | 51 | 56 | 461 | 27 | 0.020 | 0.010 | −1.57 | 1.51 | 0.48 | F81+G | −1011.27 | −1002.16 |
| PaANL017 | TGTCCACCCCATAGAAGAGGTTTAGATGAGTCCCAAAAGATACAC | 55 | 80 | 309 | 31 | 0.020 | 0.013 | −1.11 | 1.93** | 0.92 | K80 | −655.69 | −654.05 |
| PaANL028 | CGTAGCAAACAGACATCCACTTAAGAAATGCAACAAAAGAGTACCA | 54 | 48 | 459 | 13 | 0.009 | 0.003 | −2.01 | 0.50 | −0.40 | F81 | −746.59 | −740.17 |
| PaANL035 | TCCTCTTTCCTACCATTCTTTCTGTTTGAGGAAGGCAGAGGAG | 54 | 44 | 340 | 9 | 0.006 | 0.002 | −1.94 | −0.56 | −1.18 | HKY | −536.95 | −530.30 |
| PaANL046 | ACTTTCCTGTRTCATATGTAACGAACTGGCCTCGGATTC | 50 | 48 | 404 | 25 | 0.014 | 0.006 | −1.87 | −1.61 | −2.02 | F81 | −841.24 | −845.07 |
| PaANL050 | CGGGTCTAACTTGCCTTCAAACCCAACCGGTCAGATTGT | 58 | 52 | 450 | 29 | 0.017 | 0.016 | −0.10 | 1.27 | 0.93 | HKY+I | −942.70 | −941.18 |
| PaANL080 | AAGAAGAACGGGCGAGTTGAGGAGGTGGCAATGCAGTAG | 58 | 80 | 477 | 25 | 0.012 | 0.011 | −0.43 | 1.83** | 1.18 | HKY+G | −1013.49 | −1013.74 |
| PaANL082 | CCAAGCAATATCGCATAAACAAGGCACTAACTGATTCAATAACTGGT | 55 | 64 | 383 | 6 | 0.003 | 0.001 | −1.72 | 1.16 | 0.27 | GTR+I | −674.07 | −664.83 |
| PaANL087 | TCTTTATGGCGTTATTCACTCGCGAAGGCCTAACTTGACAGG | 58 | 46 | 395 | 3 | 0.002 | 0.001 | −1.32 | 0.90 | 0.27 | K80 | −647.56 | −645.56 |
| PaANL096 | AGAAATGTGGGTCAGGAGGAGAAATGCACATGCCTAGTGA | 56 | 44 | 436 | 17 | 0.011 | 0.003 | −2.18 | −2.42 | −2.77 | F81 | −789.03 | −781.24 |
| PaANL123 | TTGCATGTTTATACAATTTTTCTTGTGATAGATGCCAATCAGTCCAC | 55 | 40 | 387 | 18 | 0.011 | 0.006 | −1.36 | 1.25 | 0.45 | HKY | −690.90 | −687.04 |
| PaANL126 | TCCTAAACAAGGGCTACGAAGTGTACCAATGGGCAGCAC | 60 | 52 | 451 | 15 | 0.008 | 0.005 | −1.21 | −0.75 | −1.07 | GTR+I | −901.97 | −893.11 |
| PaANL134 | CGTGGTTTGACAAAACTTACCCTCAGTGTTTCTAAGATGCTGCAC | 58 | 44 | 473 | 17 | 0.009 | 0.005 | −1.35 | −1.21 | −1.49 | HKY | −837.50 | −830.98 |
| PaANL140 | TAGCCTCCTGAGCCCAAGCGTTCATCAATGGGGAAGGTG | 60 | 36 | 478 | 5 | 0.003 | 0.002 | −1.45 | 0.39 | −0.20 | HKY | −759.99 | −752.42 |
| PaANL142 | CAAGCCTCTCCCTATAACTATAGAGTCTAGGCAAGGC | 59 | 36 | 483 | 26 | 0.015 | 0.013 | −0.62 | 0.41 | 0.08 | K80 | −945.42 | −938.42 |
| PaANL147 | CTGTTGGCTCTGCATAGCTGTGCTACACTGGCTTCATTGC | 58 | 36 | 440 | 14 | 0.010 | 0.005 | −1.60 | −0.23 | −0.79 | F81+G | −940.03 | −922.82 |
| PaANL155 | CTTTTCAGTCCAAAGCAAATTCAAGGTCAGTAAGTCAAGCTCCTC | 56 | 60 | 458 | 5 | 0.003 | 0.001 | −1.61 | 1.08 | 0.27 | F81 | −680.40 | −683.15 |
| PaANL160 | CGTGCTTTTACCTCCGTAAAGCTAAGGGCTAATGGTGCTAGG | 56 | 44 | 489 | 26 | 0.014 | 0.010 | −0.93 | 1.86** | 0.96 | HKY | −839.39 | −838.84 |
| PaANL165 | AGCCCTATATGTGGAAGGGGAGTGCTTTCAAGCCTTTG | 58 | 38 | 478 | 37 | 0.024 | 0.013 | −1.59 | 0.62 | −0.17 | GTR | −952.36 | −954.68 |
| PaANL182 | TTCAGGCTTAGGTTGGTGTTCAGGGTCGTCACGATCATCC | 60 | 40 | 476 | 33 | 0.019 | 0.010 | −1.68 | −2.97 | −2.30 | HKY | −945.48 | −945.80 |
| PaANL187 | CCGATTGAGGCTAGAAGCTGTGTCTCTTGGCTTTACTTTAGGG | 58 | 40 | 485 | 28 | 0.015 | 0.007 | −1.92 | 1.24 | 0.20 | GTR | −768.93 | −772.03 |
| PaANL196 | GCTTGGAGGTTTCCAATGAGGAATGCTAAGGCCAAAAAGC | 56 | 38 | 435 | 43 | 0.028 | 0.022 | −0.91 | 1.38 | 0.70 | HKY+I | −818.35 | −816.98 |
| PaANL205 | AAATCGGAGTCACAACAGAGATACCGAGATCTTGCGATGC | 54 | 52 | 382 | 23 | 0.013 | 0.008 | −1.46 | 1.43 | 0.49 | F81 | −819.18 | −807.35 |

Tm – melting temperature (°C) for each primer pair, N – number of sequences obtained, bp – length in base pairs, S – number of segregating sites, θw – Watterson's theta, π – nucleotide diversity, Tajima's D, Fu and Li's D, Fu and Li's F. Numbers in bold represent the model with higher marginal posterior probabilities after the path sampling test.

* Significance is shown at 0.05.

** Significance is shown at 0.02.
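The primer-design constraints listed under Bioinformatic analysis (length 18–23 bp, Tm 58–63 °C, Tm difference ≤2 °C, GC content 20–70%) can be checked for any candidate pair with a short Python sketch. This is not Primer3 code: the function names are illustrative, the primer sequences in the example are hypothetical, and Tm values are taken as inputs (for example from Primer3 output) instead of being calculated here. IUPAC ambiguity codes, such as the R in the PaANL046 primers, are not counted as G or C.

```python
def gc_percent(primer):
    """Percentage of G and C bases; ambiguity codes are not counted as GC."""
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)

def primer_pair_ok(fwd, rev, tm_fwd, tm_rev):
    """Check a primer pair against the four design constraints described in the text."""
    for primer, tm in ((fwd, tm_fwd), (rev, tm_rev)):
        if not 18 <= len(primer) <= 23:          # (1) primer size in bp
            return False
        if not 58.0 <= tm <= 63.0:               # (2) melting temperature in deg C
            return False
        if not 20.0 <= gc_percent(primer) <= 70.0:  # (4) GC content
            return False
    return abs(tm_fwd - tm_rev) <= 2.0           # (3) Tm difference within the pair

# Hypothetical primers and Tm values, for illustration only.
fwd = "ACGTACGTACGTACGTACGT"
rev = "TTGACCATGCAAGTCCTAGA"
print(gc_percent(fwd), primer_pair_ok(fwd, rev, 60.0, 61.5))
```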

Species tree

A species tree was estimated in BEAST 2 [8] using the STRUCTURE groups (Fig. 1 in Perez et al. [1]) as operational taxonomic units (OTUs). We used a Yule speciation prior and, for each locus, the most likely model of sequence evolution identified in jModelTest 2 [9]. Either a strict or a relaxed lognormal clock was assigned to each locus, chosen by comparing the marginal likelihoods of runs under each clock model with a Path Sampling analysis of 8 steps and 500,000 generations after a 50% burn-in. The species tree was obtained from two independent runs of 100,000,000 MCMC generations each, with a 10% burn-in and trees sampled every 5000 steps, using the sequence evolution and clock models recovered for each marker (Table 3). A Maximum Clade Credibility (MCC) tree was generated in TreeAnnotator [10] by combining the trees from the two runs. The XML input file, containing all the sequences used (also deposited in GenBank under accession numbers KU161695–KU162858), is available in Supplementary data 1. The resulting MCC tree is available in Newick format in Supplementary data 2.
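The clock-model choice described above amounts to picking, for each locus, the model with the higher marginal likelihood. The Python sketch below illustrates this for three loci, assuming that the Strict and LogNormal columns of Table 3 are log marginal likelihoods from the Path Sampling runs; the dictionary layout and function name are illustrative and are not part of BEAST 2.

```python
# Log marginal likelihoods copied from Table 3 (Strict and LogNormal columns).
MARGINAL_LNL = {
    "PaANL010": {"strict": -738.62, "lognormal": -743.16},
    "PaANL015": {"strict": -1011.27, "lognormal": -1002.16},
    "PaANL080": {"strict": -1013.49, "lognormal": -1013.74},
}

def select_clock(lnl_by_model):
    """Return the preferred clock model and its log Bayes factor over the runner-up."""
    best = max(lnl_by_model, key=lnl_by_model.get)
    runner_up = max(v for model, v in lnl_by_model.items() if model != best)
    return best, lnl_by_model[best] - runner_up

for locus, lnl in MARGINAL_LNL.items():
    model, log_bf = select_clock(lnl)
    print(f"{locus}: {model} clock (log Bayes factor {log_bf:.2f})")
```

A small log Bayes factor, as for PaANL080, indicates that the two clock models are only weakly distinguished for that locus.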

cpDNA and multiloci data comparison

The plastid dataset (partial trnT-trnL and trnS-trnG data from [11]) and the combined multilocus dataset (Fig. 4a in [1]) were compared by contrasting the topology of the species tree analysis of the nuclear data (Supplementary data 2) with the topology of a BEAST phylogenetic analysis of the plastid data under a relaxed lognormal clock. The cpDNA XML file with the sequences is available in Supplementary data 3, and the cpDNA tree in Newick format in Supplementary data 4. The divergence time estimates (Mya) between the two main lineages were also compared (Table 4) by constraining each lineage to be monophyletic and calculating the time to the Most Recent Common Ancestor (TMRCA) in BEAST for the plastid dataset and for the combined multilocus dataset, including the plastid data (Fig. 4b in [1]). Because substitution rates are not available for the nuclear markers, rates relative to the plastid marker were used, with a prior distribution spanning the minimum and maximum substitution rates observed in the chloroplast sequences of angiosperms [12].
Table 4

Comparison of the divergence times (Mya) estimated for the plastid dataset and the combined multilocus dataset.

| Parameter | cpDNA | Combined |
|---|---|---|
| Mean | 1.7027 | 1.6862 |
| SD | 0.5938 | 0.2515 |
| Variance | 0.3526 | 0.0633 |
| 95% HPD | 0.6915–2.884 | 0.9131–1.766 |

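The summary statistics in Table 4 are of the kind obtained from posterior TMRCA samples. As a rough illustration, the Python sketch below computes a mean, a standard deviation and a highest posterior density (HPD) interval as the shortest window containing the requested fraction of the sorted samples. The sample values are invented for the example, and the function is a generic sketch, not the routine used by BEAST or its companion tools.

```python
from statistics import mean, stdev

def hpd_interval(samples, mass=0.95):
    """Shortest interval containing about `mass` of the samples."""
    xs = sorted(samples)
    n = len(xs)
    k = max(1, int(round(mass * n)))  # number of samples inside the interval
    # Slide a window of k consecutive sorted samples and keep the narrowest one.
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]

# Hypothetical posterior TMRCA samples (Mya), not values from this study.
tmrca = [1.2, 1.4, 1.5, 1.6, 1.6, 1.7, 1.7, 1.8, 1.9, 2.0,
         2.1, 1.3, 1.5, 1.6, 1.8, 1.7, 1.9, 2.4, 1.1, 3.0]
low, high = hpd_interval(tmrca, mass=0.95)
print(round(mean(tmrca), 4), round(stdev(tmrca), 4), (low, high))
```

Because the HPD is the narrowest such interval, it need not be symmetric around the mean when the posterior is skewed.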
Specifications Table

| Field | Description |
|---|---|
| Subject area | Biology, Genetics and Genomics |
| More specific subject area | Phylogenetics and Phylogenomics |
| Type of data | Pyrosequencing filtering steps, primer sequences and characteristics, species tree analysis input and output, species tree and cpDNA phylogenetic tree |
| How data was acquired | Pyrosequencing filtering in pyRAD, primer sequences designed with Primer3, primer characteristics gathered with DnaSP, species tree and cpDNA phylogenetic tree generated with BEAST 2 |
| Data format | Filtered and analyzed |
| Experimental factors | n/a |
| Experimental features | Pyrosequencing of reduced genomic libraries, development of primers, and Sanger sequencing for primer validation and missing data reduction |
| Data source location | n/a |
| Data accessibility | With this article; GenBank accession numbers KU161695–KU162858 |

  11 in total

1.  TOPAL 2.0: improved detection of mosaic sequences within multiple alignments.

Authors:  G McGuire; F Wright
Journal:  Bioinformatics       Date:  2000-02       Impact factor: 6.937

2.  DnaSP v5: a software for comprehensive analysis of DNA polymorphism data.

Authors:  P Librado; J Rozas
Journal:  Bioinformatics       Date:  2009-04-03       Impact factor: 6.937

3.  jModelTest 2: more models, new heuristics and parallel computing.

Authors:  Diego Darriba; Guillermo L Taboada; Ramón Doallo; David Posada
Journal:  Nat Methods       Date:  2012-07-30       Impact factor: 28.547

4.  Rates of nucleotide substitution vary greatly among plant mitochondrial, chloroplast, and nuclear DNAs.

Authors:  K H Wolfe; W H Li; P M Sharp
Journal:  Proc Natl Acad Sci U S A       Date:  1987-12       Impact factor: 11.205

5.  Interglacial microrefugia and diversification of a cactus species complex: phylogeography and palaeodistributional reconstructions for Pilosocereus aurisetus and allies.

Authors:  Isabel A S Bonatelli; Manolo F Perez; A Townsend Peterson; Nigel P Taylor; Daniela C Zappi; Marlon C Machado; Ingrid Koch; Adriana H C Pires; Evandro M Moraes
Journal:  Mol Ecol       Date:  2014-06-09       Impact factor: 6.185

6.  TOPALi v2: a rich graphical interface for evolutionary analyses of multiple alignments on HPC clusters and multi-core desktops.

Authors:  Iain Milne; Dominik Lindner; Micha Bayer; Dirk Husmeier; Gráinne McGuire; David F Marshall; Frank Wright
Journal:  Bioinformatics       Date:  2008-11-04       Impact factor: 6.937

7.  Anonymous nuclear markers reveal taxonomic incongruence and long-term disjunction in a cactus species complex with continental-island distribution in South America.

Authors:  Manolo F Perez; Bryan C Carstens; Gustavo L Rodrigues; Evandro M Moraes
Journal:  Mol Phylogenet Evol       Date:  2015-11-12       Impact factor: 4.286

8.  Primer3--new capabilities and interfaces.

Authors:  Andreas Untergasser; Ioana Cutcutache; Triinu Koressaar; Jian Ye; Brant C Faircloth; Maido Remm; Steven G Rozen
Journal:  Nucleic Acids Res       Date:  2012-06-22       Impact factor: 16.971

9.  BEAST: Bayesian evolutionary analysis by sampling trees.

Authors:  Alexei J Drummond; Andrew Rambaut
Journal:  BMC Evol Biol       Date:  2007-11-08       Impact factor: 3.260

10.  BEAST 2: a software platform for Bayesian evolutionary analysis.

Authors:  Remco Bouckaert; Joseph Heled; Denise Kühnert; Tim Vaughan; Chieh-Hsi Wu; Dong Xie; Marc A Suchard; Andrew Rambaut; Alexei J Drummond
Journal:  PLoS Comput Biol       Date:  2014-04-10       Impact factor: 4.475
