Zhen Li, Amanda R De La Torre, Lieven Sterck, Francisco M Cánovas, Concepción Avila, Irene Merino, José Antonio Cabezas, María Teresa Cervera, Pär K Ingvarsson, Yves Van de Peer.
Abstract
Phylogenetic relationships among seed plant taxa, especially within the gymnosperms, remain contested. In contrast to angiosperms, for which several genomic, transcriptomic and phylogenetic resources are available, there are few, if any, molecular markers that allow broad comparisons among gymnosperm species. With few gymnosperm genomes available, recently obtained gymnosperm transcriptomes are a valuable resource for identifying single-copy gene families to serve as molecular markers for phylogenomic analysis in seed plants. Taking advantage of an increasing number of available genomes and transcriptomes, we identified single-copy genes in a broad collection of seed plants and used these to infer phylogenetic relationships between major seed plant taxa. This study aims to extend the current phylogenetic toolkit for seed plants, assess its ability to resolve seed plant phylogeny, and discuss potential factors affecting phylogenetic reconstruction. In total, we identified 3,072 single-copy genes in 31 gymnosperms and 2,156 single-copy genes in 34 angiosperms. All studied seed plants shared 1,469 single-copy genes, which are generally involved in functions like DNA metabolism, cell cycle, and photosynthesis. A selected set of 106 single-copy genes provided good resolution for the seed plant phylogeny except for gnetophytes. Although some of our analyses support a sister relationship between gnetophytes and other gymnosperms, phylogenetic trees from concatenated alignments without 3rd codon positions and from amino acid alignments under the CAT + GTR model support gnetophytes as a sister group to Pinaceae. Our phylogenomic analyses demonstrate that, in general, single-copy genes can uncover both recent and deep divergences of seed plant phylogeny.
Keywords: angiosperms; gymnosperms; phylogenomics; seed plants; single-copy genes
Year: 2017 PMID: 28460034 PMCID: PMC5414570 DOI: 10.1093/gbe/evx070
Source DB: PubMed Journal: Genome Biol Evol ISSN: 1759-6653 Impact factor: 3.416
Transcriptome Assembly and Open Reading Frame (ORF) Predictions
| Species | # Transcripts | # ORFs | # ORFs with Pfam Domains |
|---|---|---|---|
| | 206,574 | 76,426 | 43,771 (57.3%) |
| | 121,938 | 36,106 | 22,355 (61.9%) |
| | 39,229 | 28,909 | 19,708 (68.2%) |
| | 28,030 | 20,434 | 13,989 (68.5%) |
Fig. 1.—k-means bi-clustering of copy number profiles for single-copy genes in gymnosperms (A) and angiosperms (B). Rows represent species and columns represent gene families. In the copy number profiles, red denotes absence of genes in a gene family; blue denotes one copy; yellow denotes two copies; and orange denotes more than two copies in a gene family. The bar plot next to the copy number profile illustrates the number of proteins in each species with an orange line representing the average number of proteins. The dark and light gray bars distinguish the clusters identified by the k-means clustering.
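The copy number profile described in this caption can be illustrated with a small sketch: count the genes each species contributes to each gene family, then map each count to one of the four colour categories. The input layout and the names `copy_number_profile` and `colour_for` are assumptions made for illustration, not the authors' pipeline, and the k-means clustering step is not shown.

```python
from collections import Counter

def colour_for(copies):
    """Map a copy number to the colour category named in the caption."""
    if copies == 0:
        return "red"      # gene absent from the family
    if copies == 1:
        return "blue"     # one copy
    if copies == 2:
        return "yellow"   # two copies
    return "orange"       # more than two copies

def copy_number_profile(families, species):
    """Return {species: [copy number per family]} (rows = species, columns = families).

    `families` maps a family name to a list of (species, gene_id) pairs.
    This input format is hypothetical.
    """
    names = sorted(families)
    counts = {fam: Counter(sp for sp, _ in families[fam]) for fam in names}
    return {sp: [counts[fam][sp] for fam in names] for sp in species}

# Toy data with made-up identifiers
families = {
    "fam1": [("spA", "g1"), ("spB", "g2")],
    "fam2": [("spA", "g3"), ("spA", "g4"), ("spA", "g5")],
}
profile = copy_number_profile(families, ["spA", "spB"])
colours = {sp: [colour_for(c) for c in row] for sp, row in profile.items()}
```

In this toy example `fam1` is single-copy in both species, while `fam2` has more than two copies in `spA` and is absent from `spB`.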
Fig. 2.—Gene ontology slim (GOSlim) enrichment analysis for single-copy genes in angiosperms, gymnosperms, and seed plants. Dot size represents the statistical significance of overrepresented (green) and underrepresented (red) GOSlim terms. P values were corrected for multiple testing using the Benjamini and Hochberg false discovery rate procedure.
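For readers unfamiliar with the Benjamini and Hochberg correction named in this caption, a minimal pure-Python sketch of the standard step-up adjustment follows. The function name `benjamini_hochberg` and the example P values are illustrative assumptions; the enrichment tool actually used in the study is not stated in this record.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Made-up p-values for four hypothetical GOSlim terms
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
```

Each raw P value is scaled by the number of tests divided by its rank, and the running minimum keeps an adjusted value from exceeding that of a larger raw P value.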
Fig. 3.—Maximum likelihood tree inferred from a concatenated alignment of 106 single-copy genes in seed plants including 3rd codon positions, partitioned by PartitionFinder. Bootstrap values <100% are shown on the specific branches. See supplementary figures S2–S4, Supplementary Material online, for maximum likelihood trees inferred from partitions based on codon positions.
Fig. 4.—Maximum likelihood tree inferred from a concatenated alignment of 1st and 2nd codon positions for 106 single-copy genes in seed plants partitioned by PartitionFinder. Bootstrap values <100% are shown on the specific branches. See supplementary figures S5 and S6, Supplementary Material online, for maximum likelihood trees inferred from partitions based on codon positions.
Fig. 5.—Comparison of GC content in the concatenated alignment (A) and at each codon position (B, C, and D) from 106 genes in 68 species. Dot size correlates with the number of species in each lineage (group) whose GC% differs significantly (Wilcoxon test, P < 1 × 10⁻³) from that of the species being compared (dot colors correspond to the compared lineages). Lines connecting any two species represent a significant difference in GC content, with the most significant in green and the weakest in yellow (1 × 10⁻³). The full species names can be found in supplementary table S3, Supplementary Material online.
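The per-codon-position GC content compared in this figure can be computed along the following lines. This is a minimal sketch that assumes an in-frame aligned coding sequence; the function name `gc_by_codon_position` and the toy sequence are hypothetical, and the significance test mentioned in the caption is not reproduced.

```python
def gc_by_codon_position(seq):
    """Return GC% at codon positions 1, 2 and 3 of an in-frame coding sequence.

    Gaps and ambiguous characters are ignored. A position with no usable
    bases yields NaN.
    """
    result = []
    for offset in range(3):
        # Every third character starting at this offset belongs to one codon position
        sites = [b for b in seq[offset::3].upper() if b in "ACGT"]
        gc = sum(b in "GC" for b in sites)
        result.append(100.0 * gc / len(sites) if sites else float("nan"))
    return result

# Toy alignment row: codons ATG GCC --- GAT
gc1, gc2, gc3 = gc_by_codon_position("ATGGCC---GAT")
```

Ignoring gap columns keeps missing data in one species from deflating its GC% relative to the others.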
Fig. 6.—Lineage-specific branch length estimates from each species to the most recent common ancestor of the five monophyletic groups (angiosperms, cupressophytes, cycads and Ginkgo, gnetophytes, and Pinaceae), in trees inferred from sites at 1st, 2nd, and 3rd codon positions. See text for details.
The Index of Substitution Saturation (Iss) on Concatenated Nucleotide Alignments and Alignments of Each Codon Position
| Dataset | # Sites | Iss | Iss.cSym | Iss.cAsym |
|---|---|---|---|---|
| Alignment with 3rd codon positions (NT123) | 149,679 | 0.612 | 0.820 | 0.605 |
| Alignment with 1st and 2nd codon positions (NT12) | 99,786 | 0.521 | 0.819 | 0.603 |
| Alignment of 1st codon positions (NT1) | 49,893 | 0.551 | 0.818 | 0.598 |
| Alignment of 2nd codon positions (NT2) | 49,893 | 0.494 | 0.818 | 0.598 |
| Alignment of 3rd codon positions (NT3) | 49,893 | 0.796 | 0.818 | 0.598 |
P value < 1 × 10⁻⁴, two-tailed t-test.
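The site counts in this table are consistent with a simple split of the codon alignment by position: the full alignment is three times the length of each single-position alignment, and the alignment of 1st and 2nd positions is twice that length. The sketch below shows one way such a split could be made and checks the arithmetic. The function name `split_codon_positions` is an assumption for illustration, not the authors' tooling.

```python
def split_codon_positions(aligned_cds):
    """Split one in-frame codon alignment row into NT1, NT2, NT3 and NT12 strings."""
    if len(aligned_cds) % 3 != 0:
        raise ValueError("alignment length must be a multiple of 3")
    nt1, nt2, nt3 = (aligned_cds[i::3] for i in range(3))
    # NT12 keeps the 1st and 2nd positions of each codon in their original order
    nt12 = "".join(a + b for a, b in zip(nt1, nt2))
    return {"NT1": nt1, "NT2": nt2, "NT3": nt3, "NT12": nt12}

# Site counts as reported in the table
sites = {"NT123": 149_679, "NT12": 99_786, "NT1": 49_893}
consistent = (
    sites["NT123"] == 3 * sites["NT1"]
    and sites["NT12"] == 2 * sites["NT1"]
)

# Toy example: codons ATG GCC
parts = split_codon_positions("ATGGCC")
```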
Fig. 7.—Internode certainty (IC) and internode certainty all (ICA) estimated from gene trees of 106 phylogenetic markers for the deep divergence of seed plants. (A) The “Gnetales—other gymnosperms” hypothesis; (B) the “Gnepine” hypothesis; and (C) the “Gnetifer” hypothesis. Numbers above branches represent IC and ICA estimated from the gene trees based on alignments with 3rd codon positions; numbers below branches represent IC and ICA estimated from the gene trees based on alignments without 3rd codon positions.
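This record does not give the formulas for IC and ICA. As a hedged sketch, the commonly cited entropy-based definition can be written as 1 + Σ p·log_n(p), where the p are the relative frequencies of a focal bipartition and its conflicting bipartitions among the gene trees and n is the number of bipartitions considered. That definition, the function names, and the example counts below are assumptions for illustration. Software-specific sign conventions for the case where a conflicting bipartition is more frequent than the focal one are not handled.

```python
import math

def _certainty(counts):
    """Compute 1 + sum(p * log_n(p)) over n >= 2 bipartition counts.

    Terms with a zero count contribute nothing (0 * log 0 is taken as 0).
    """
    n = len(counts)
    total = sum(counts)
    value = 1.0
    for c in counts:
        if c > 0:
            p = c / total
            value += p * math.log(p, n)
    return value

def internode_certainty(support, conflicts):
    """IC: focal bipartition versus its single most frequent conflicting bipartition."""
    if not conflicts:
        return 1.0  # no conflict observed among the gene trees
    return _certainty([support, max(conflicts)])

def internode_certainty_all(support, conflicts):
    """ICA: focal bipartition versus all supplied conflicting bipartitions."""
    if not conflicts:
        return 1.0
    return _certainty([support] + list(conflicts))

# Made-up counts: 75 gene trees support the branch, 25 and 10 support two alternatives
ic = internode_certainty(75, [25, 10])
ica = internode_certainty_all(75, [25, 10])
```

Under this definition a value near 1 means the gene trees rarely conflict with the branch, and a value near 0 means the focal and conflicting bipartitions are about equally frequent.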