Genome-scale DNA sequence data and the evolutionary history of placental mammals.

Shaoyuan Wu, Scott Edwards, Liang Liu.

Abstract

We present a genomic data set comprising the coding DNA sequences of 5162 loci from 90 vertebrate species, including 82 mammals. The loci were aligned at the protein level, and the aligned protein sequences were then back-translated into their original DNA sequences. The alignments were further filtered to remove individual sequences exhibiting long branches or other unusual features. The data are deposited in figshare (http://figshare.com/articles/cds_5162.zip/6031190) and will be useful as a test data set for large-scale phylogenomic analysis.

Keywords:  Alignment; Mammal; Phylogenomics

Year:  2018        PMID: 29904704      PMCID: PMC5998303          DOI: 10.1016/j.dib.2018.04.094

Source DB:  PubMed          Journal:  Data Brief        ISSN: 2352-3409


Value of the data

The data reveal the evolutionary relationships of placental mammals.
The data, combined with fossil information, can be used to estimate the divergence times of placental mammals.
The data can be used to evaluate the performance of various phylogenetic methods in estimating phylogenies and divergence times.

Data

The data consist of DNA sequence alignments derived from Liu et al. [1], representing the coding sequences (CDS) of 5162 loci from 90 vertebrate species, including 82 mammals. The original data of Liu et al. [1] were unaltered natural sequence alignments, which are processed further here to remove potential low-quality regions and indels. A critique of the original data set [2] was unnecessarily pessimistic about the consequences of any alignment errors. We nevertheless recognize that, although our original alignments correctly designated each codon position in our analyses, the alignments themselves were not guided by amino acid codons and could therefore be improved. Here we present such improved alignments, which should be valuable as a test case in phylogenomic analysis.
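To make concrete what an alignment "guided by amino acid codons" involves, the following is a minimal Python sketch of threading an aligned protein sequence back onto its unaligned coding sequence, so that every gap spans a whole codon. The function name `back_translate` and the handling of a trailing stop codon are illustrative assumptions, not the authors' implementation.

```python
def back_translate(aligned_protein: str, cds: str, gap: str = "-") -> str:
    """Thread an aligned protein sequence back onto its unaligned CDS.

    Each residue is replaced by the codon that encoded it, and each gap
    column is replaced by three gap characters, so the reading frame of
    the DNA alignment is preserved.
    """
    if len(cds) % 3 != 0:
        raise ValueError("CDS length is not a multiple of 3")

    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    n_residues = sum(1 for aa in aligned_protein if aa != gap)

    # Assumption: a CDS may carry one trailing stop codon that has no
    # counterpart in the protein sequence; drop it if present.
    if len(codons) == n_residues + 1:
        codons = codons[:-1]
    if len(codons) != n_residues:
        raise ValueError("protein and CDS lengths do not correspond")

    codon_iter = iter(codons)
    pieces = [gap * 3 if aa == gap else next(codon_iter)
              for aa in aligned_protein]
    return "".join(pieces)


if __name__ == "__main__":
    # Toy example: one gap column in the protein alignment.
    print(back_translate("M-K", "ATGAAA"))
```

Because gaps are only ever inserted in multiples of three, codon positions remain aligned across all sequences of a locus.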

Experimental design, materials and methods

Liu et al. [1] performed a DNA-based alignment. In contrast, we aligned the 5162 loci at the protein level using the program MAFFT v6 [3], with the protein sequences translated from their original DNA sequences using the program EMBOSS [4]. The aligned protein sequences were then back-translated into DNA sequences, with each residue replaced by its original codon. We then trimmed the aligned DNA sequences using the program trimAl [5] with the option -gappyout.

Liu et al. [1] had identified potentially misaligned loci by calculating the ratio of the maximal to the average branch length in each gene tree, and had removed entire alignments from the data set if they failed this long-branch gene tree test. Here we instead applied the protocol of Philippe et al. [6], [7] to identify individual sequences within each locus that might generate long branches because of misalignment or misassembly. We then removed only the individual sequences flagged by this protocol, rather than entire genes.

In the alignment screening protocol, a reference tree was reconstructed from the concatenated sequences of 100 genes. For each gene, the reference tree was pruned to match the taxa of the gene, and the branch lengths of the pruned reference tree were re-estimated from the gene alignment. If a re-estimated branch length was more than 5 times the original length of that branch in the pruned reference tree, the sequence corresponding to the branch was removed from the gene alignment. After applying this protocol, however, we still observed some long branches in the maximum likelihood (ML) gene trees. We therefore further filtered the alignments by comparing ML gene trees with the pruned reference tree.
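The 5-fold branch-length rule used in the screening protocol can be sketched in Python as follows. This is only an illustration under simplifying assumptions: branch lengths are supplied as plain dictionaries keyed by taxon name (one branch per sequence) rather than read from tree files, and the names `flag_long_branches` and `drop_sequences` are hypothetical. Tree pruning and branch-length estimation are assumed to have been done beforehand with phylogenetic software.

```python
from typing import Dict, Iterable, List


def flag_long_branches(ref_lengths: Dict[str, float],
                       gene_lengths: Dict[str, float],
                       max_ratio: float = 5.0) -> List[str]:
    """Return taxa whose gene-based branch is more than `max_ratio`
    times the corresponding branch in the pruned reference tree.

    Assumption: a reference branch of length zero is exceeded by any
    positive gene-based length, so such a taxon is flagged.
    """
    flagged = []
    for taxon, gene_len in gene_lengths.items():
        if taxon not in ref_lengths:
            raise KeyError(f"{taxon} is missing from the reference tree")
        if gene_len > max_ratio * ref_lengths[taxon]:
            flagged.append(taxon)
    return sorted(flagged)


def drop_sequences(alignment: Dict[str, str],
                   flagged: Iterable[str]) -> Dict[str, str]:
    """Remove the flagged sequences but keep the rest of the locus."""
    to_drop = set(flagged)
    return {taxon: seq for taxon, seq in alignment.items()
            if taxon not in to_drop}


if __name__ == "__main__":
    # Toy branch lengths (substitutions per site); values are made up.
    reference = {"human": 0.02, "mouse": 0.08, "dog": 0.05}
    gene = {"human": 0.03, "mouse": 0.50, "dog": 0.10}
    print(flag_long_branches(reference, gene))
```

Removing single sequences in this way keeps the locus in the data set, in contrast to a test that discards the entire alignment.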
In this final filtering step, for each gene, if a terminal branch of the ML gene tree was more than 5 times as long as the corresponding terminal branch in the pruned reference tree, the corresponding sequence was removed.

The filtered alignments of 5162 loci from 90 species are summarized with respect to alignment length, number of missing sequences, and proportion of missing characters (Fig. 1). The length of each alignment, including gaps, ranges from 204 bp to 7017 bp, with an average of 1773 bp (Fig. 1a). The total length of the concatenated alignment is 9,150,597 bp. On average, 6 sequences are missing per gene, and 86% of the 5162 genes contain more than 80 species (Fig. 1b). The average proportion of missing characters across loci is 9% (Fig. 1c). These statistics compare well with those of our original data set, which comprised only 4388 alignments totaling 13,040,111 bp.

These improved alignments are deposited in figshare (https://figshare.com/articles/cds_5162_zip/6031190) for further analyses. We have already demonstrated [8] that re-analysis of a 60-gene subset of this improved data set does not change the results presented in Liu et al. [1], and we plan to present a more comprehensive re-analysis of this data set elsewhere. We hope this data set will be used widely as a test case for phylogenomic analysis and the dating of divergence times.
Fig. 1

The summary of the alignments for 5162 loci. (a) The histogram of the sequence length across 5162 loci. (b) The histogram of the number of species across 5162 loci. (c) The boxplot for the proportion of missing characters across 5162 loci. Missing characters include gaps and ambiguous characters.
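The per-locus quantities summarized in Fig. 1 can be computed with a short Python sketch such as the one below. It assumes each alignment is held as a dictionary mapping species name to aligned sequence, treats every character other than A, C, G or T as missing (gaps and ambiguity codes), and computes the missing proportion over the sequences that are present. The function name `summarize_alignment` and these conventions are assumptions for illustration, since the source does not specify how the statistics were computed.

```python
from typing import Dict


def summarize_alignment(alignment: Dict[str, str],
                        total_species: int = 90) -> Dict[str, float]:
    """Summarize one locus: alignment length (including gaps), number of
    species without a sequence, and proportion of missing characters."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1 or 0 in lengths:
        raise ValueError("alignment must be non-empty with equal-length rows")
    length = lengths.pop()

    # Assumption: anything that is not an unambiguous base counts as missing.
    missing = sum(1 for seq in alignment.values()
                  for char in seq.upper() if char not in "ACGT")

    return {
        "length": length,
        "missing_sequences": total_species - len(alignment),
        "missing_fraction": missing / (length * len(alignment)),
    }


if __name__ == "__main__":
    # Toy locus with three of five hypothetical species present.
    toy = {"sp1": "ATG---", "sp2": "ATGNNA", "sp3": "ATGAAA"}
    print(summarize_alignment(toy, total_species=5))
```

As a consistency check on the reported figures, a concatenated length of 9,150,597 bp over 5162 loci corresponds to a mean of about 1772.7 bp per locus, in line with the stated average of 1773 bp.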

Specifications table

Subject area: Biology
More specific subject area: Molecular Evolution
Type of data: text file
How data was acquired: Sequencing and Blasting in GenBank
Data format: Filtered
Experimental factors: The alignments were trimmed and filtered
Experimental features: Phylogenetic analysis of the alignments
Data source location: GenBank
Data accessibility: Dryad
References

1.  Genomic evidence reveals a radiation of placental mammals uninterrupted by the KPg boundary.

Authors:  Liang Liu; Jin Zhang; Frank E Rheindt; Fumin Lei; Yanhua Qu; Yu Wang; Yu Zhang; Corwin Sullivan; Wenhui Nie; Jinhuan Wang; Fengtang Yang; Jinping Chen; Scott V Edwards; Jin Meng; Shaoyuan Wu
Journal:  Proc Natl Acad Sci U S A       Date:  2017-08-14

2.  Phylogenomic red flags: Homology errors and zombie lineages in the evolutionary diversification of placental mammals.

Authors:  John Gatesy; Mark S Springer
Journal:  Proc Natl Acad Sci U S A       Date:  2017-10-24

3.  Recent developments in the MAFFT multiple sequence alignment program.

Authors:  Kazutaka Katoh; Hiroyuki Toh
Journal:  Brief Bioinform       Date:  2008-03-27

4.  EMBOSS: the European Molecular Biology Open Software Suite.

Authors:  P Rice; I Longden; A Bleasby
Journal:  Trends Genet       Date:  2000-06

5.  trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses.

Authors:  Salvador Capella-Gutiérrez; José M Silla-Martínez; Toni Gabaldón
Journal:  Bioinformatics       Date:  2009-06-08

6.  Resolving difficult phylogenetic questions: why more sequences are not enough.

Authors:  Hervé Philippe; Henner Brinkmann; Dennis V Lavrov; D Timothy J Littlewood; Michael Manuel; Gert Wörheide; Denis Baurain
Journal:  PLoS Biol       Date:  2011-03-15

7.  Phylotranscriptomic consolidation of the jawed vertebrate timetree.

Authors:  Iker Irisarri; Denis Baurain; Henner Brinkmann; Frédéric Delsuc; Jean-Yves Sire; Alexander Kupfer; Jörn Petersen; Michael Jarek; Axel Meyer; Miguel Vences; Hervé Philippe
Journal:  Nat Ecol Evol       Date:  2017-07-24

8.  Reply to Gatesy and Springer: Claims of homology errors and zombie lineages do not compromise the dating of placental diversification.

Authors:  Liang Liu; Jin Zhang; Frank E Rheindt; Fumin Lei; Yanhua Qu; Yu Wang; Yu Zhang; Corwin Sullivan; Wenhui Nie; Jinhuan Wang; Fengtang Yang; Jinping Chen; Scott V Edwards; Jin Meng; Shaoyuan Wu
Journal:  Proc Natl Acad Sci U S A       Date:  2017-10-24