Kevin Debray, Jordan Marie-Magdelaine, Tom Ruttink, Jérémy Clotault, Fabrice Foucher, Valéry Malécot.
Abstract
BACKGROUND: With an ever-growing number of published genomes, many low levels of the Tree of Life now contain several species with enough molecular data to perform shallow-scale phylogenomic studies. Moving away from using just a few universal phylogenetic markers, we can now target thousands of other loci to decipher relationships among taxa. Selecting the most informative sequences for the taxa under study has emerged as a new issue. Here, we developed a general procedure to mine genomic data, looking for orthologous single-copy loci capable of deciphering phylogenetic relationships below the generic rank. To develop our strategy, we chose the genus Rosa, a rapidly evolving lineage of the Rosaceae family in which the genomes of several species have recently been sequenced. We also compared our loci to conventional plastid markers commonly used for phylogenetic inference in this genus.
Keywords: Conflicting topologies; Nuclear single-copy orthologs; Phylogenetic informativeness; Species-level phylogenomics
Year: 2019 PMID: 31340752 PMCID: PMC6657147 DOI: 10.1186/s12862-019-1479-z
Source DB: PubMed Journal: BMC Evol Biol ISSN: 1471-2148 Impact factor: 3.260
References used for Whole Genome Shotgun data
| Group | Species | Ploidy of the genome sequence | Sample origin | BioProject/SRA code | Original publication |
|---|---|---|---|---|---|
| Ingroup | | 1x | IRHS, Beaucouzé, France | – | [ |
| | | 2x | Jardin expérimental de Colmar, Colmar, France | SRX3286288 | [ |
| | | 2x | Roseraie du Val-de-Marne, L’Haÿ-les-Roses, France | SRX4006790 | [ |
| | | 4x | Bulgaria | PRJNA322107 | – |
| | | 2x | Lyon botanical garden, Lyon, France | SRX3286284, SRX3286283 | [ |
| | | 2x | Roseraie du Val-de-Marne, L’Haÿ-les-Roses, France | SRX4006792 | [ |
| | | 2x | ENS Lyon, Lyon, France | SRX3286287 | [ |
| | | 2x | Roseraie du Val-de-Marne, L’Haÿ-les-Roses, France | SRX4006787 | [ |
| | | 2x | Roses Loubert rose garden, Les Rosiers-sur-Loire, France | SRX4006793 | [ |
| | | 2x | Keisei Rose Nurseries, Chiba, Japan | PRJDB4738 | [ |
| | | 2x | Lyon botanical garden, Lyon, France | SRX3286293 | [ |
| | | 2x | NA | ERS1829481 | [ |
| | | 2x | Lyon botanical garden, Lyon, France | SRX3286278 | [ |
| | | 2x | Roses Loubert nurseries, Les Rosiers-sur-Loire, France | SRX4006789 | [ |
| | | 2x | Roseraie du Val-de-Marne, L’Haÿ-les-Roses, France | SRX4006791 | [ |
| | | 2x | ILVO, Melle, Belgium | PRJNA504542 | – |
| | | 2x | Roses Loubert rose garden, Les Rosiers-sur-Loire, France | SRX4006788 | [ |
| Outgroup | | 1x | NCGR, Corvallis, OR, USA | PRJNA66853 | [ |
| | | 2x | Kagawa University, Kagawa, Japan | PRJDB1478 | [ |
| | | 2x | Kagawa University, Kagawa, Japan | PRJDB1479 | [ |
| | | 2x | NCGR, Corvallis, OR, USA | PRJDB1480 | [ |
| | | 2x | Punnets Town, UK | PRJEB23412 | [ |
| | | 6x | Avala, Serbia | PRJEB18433 | [ |
| | | 2x | Rich Mountain, South Carolina, USA | – | [ |
Bold species indicate unassembled Whole Genome Shotgun data
IRHS: Institut de Recherche en Horticulture et Semences; ENS: École Normale Supérieure; ILVO: Instituut voor Landbouw-, Visserij- en Voedingsonderzoek; NCGR: National Clonal Germplasm Repository
Fig. 1 Data-mining workflow to identify single-copy orthologous tags (SCOTags) for phylogenomics. Single-copy genes (SCGs) from reference genomes are identified using a self-blast procedure (step 1). The two SCG sets are compared to each other to retrieve shared single-copy orthologs (SCOs) (step 2). SCOs are target-assembled from unassembled whole genome shotgun sequencing data using the aTRAM pipeline. Numbers presented in table (1) correspond to the total number of contigs that were assembled for each Rosa species with an unassembled genome (step 3). Contig sequences from each SCO are aligned using mafft, and the resulting alignment is sliced into regions ≥300 bp covered by ≥4 taxa including Rosa ‘Old Blush’ and Rosa persica. For each region, primer pairs are designed on the consensus sequence and the most variable non-overlapping SCOTags are retained (step 4). Additional filtering steps discard SCOTags with non-specific primer pairs (step 5a), SCOTags that do not pass the RBB test of orthology (step 5b) and SCOTags whose number of alleles is inconsistent with the genome ploidy level (step 5c). The SCOTags are then searched for in the whole genome shotgun assemblies of three additional Rosa species (step 5d) and seven outgroups. Numbers in table (2) correspond to the number of SCOTags that were retrieved for each of the four Rosa species with already assembled datasets. The procedure is described in detail in the Methods section. RBB: Reciprocal Best Blast; mcl: Markov CLuster algorithm
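The slicing rule in step 4 (regions of at least 300 bp covered by at least 4 taxa, including two mandatory taxa) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `covered_regions`, the taxon labels, and the assumption that a taxon "covers" a column when it has a non-gap character there are all hypothetical choices made for illustration.

```python
def covered_regions(alignment, required, min_taxa=4, min_len=300, gap="-"):
    """Return half-open (start, end) column ranges of an alignment where
    every column is covered by >= min_taxa taxa, including all `required`
    taxa, and the run is at least min_len columns long.

    `alignment` maps taxon name -> aligned sequence (equal lengths).
    Assumption: a taxon covers a column if its character is not a gap.
    """
    length = len(next(iter(alignment.values())))
    regions = []
    start = None
    # Iterate one column past the end so a run reaching the last column is closed.
    for col in range(length + 1):
        ok = False
        if col < length:
            present = {name for name, seq in alignment.items() if seq[col] != gap}
            ok = len(present) >= min_taxa and required <= present
        if ok and start is None:
            start = col                      # a covered run begins
        elif not ok and start is not None:
            if col - start >= min_len:       # keep only sufficiently long runs
                regions.append((start, col))
            start = None
    return regions


# Toy example with a short minimum length (real data would use min_len=300).
toy = {
    "old_blush": "ACGTAC--",
    "persica":   "ACGTACGT",
    "taxon3":    "ACG-ACGT",
    "taxon4":    "ACGTAC-T",
    "taxon5":    "--GTACGT",
}
print(covered_regions(toy, {"old_blush", "persica"}, min_taxa=4, min_len=3))
```

In the toy alignment, the last two columns are excluded because one of the two mandatory taxa has gaps there, even though four other taxa are present.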
Fig. 2 Characterization of the plastid loci and nuclear SCOTags. a Position of the 1784 single-copy orthologs (SCOs) in the seven pseudo-chromosomes and unanchored scaffolds (Chr00) of the haploid genome sequence of Rosa ‘Old Blush’. b Completeness of SCOs in the 12 unassembled rose genomes. Missing means that no contig matching the reference SCO could be assembled; partial means that only part of the reference SCO was assembled; complete means that the complete reference SCO is covered by at least one assembled contig. c Structural annotation of 1856 SCOTags. d Parsimony-informative site (PIS) content for plastid sequences (psbA-trnH, trnL and trnG) and the nuclear SCOTags. SCOTags are divided into three categories: coding regions (exons), non-coding regions (untranslated regions and introns), and mixed regions (containing both coding and non-coding regions). (*) and (#) denote significant differences between coding and mixed regions and between mixed and non-coding regions, respectively (t-test; p-value < 0.05)
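As a reminder of what panel d measures, a site is conventionally called parsimony-informative when at least two different character states each occur in at least two sequences. The following Python sketch counts such sites under that standard definition. The function name and the choice to ignore gaps, `N` and `?` are assumptions for illustration, not details taken from the article.

```python
from collections import Counter


def count_parsimony_informative_sites(seqs, ignore="-N?"):
    """Count alignment columns in which at least two distinct states
    each appear in at least two sequences.

    `seqs` is a list of equal-length aligned sequences.
    Assumption: gap and ambiguity characters in `ignore` are skipped.
    """
    n_sites = 0
    for column in zip(*seqs):
        counts = Counter(c.upper() for c in column if c.upper() not in ignore)
        # Number of states shared by two or more sequences in this column.
        shared_states = sum(1 for v in counts.values() if v >= 2)
        if shared_states >= 2:
            n_sites += 1
    return n_sites


aligned = ["ACGT", "ACGA", "ATGA", "ATGT"]
print(count_parsimony_informative_sites(aligned))  # columns 2 and 4 qualify
```

Dividing this count by the alignment length gives a PIS proportion that can be compared between loci of different lengths.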
Fig. 3 Net phylogenetic informativeness (PI) profiles compared to species chronograms. a Plastid loci; b 1856 nuclear SCOTags. Taxa are colored as follows: dark blue for taxa from Rosa sect. Chinenses, pink for R. sect. Gallicanae, green for R. sect. Synstylae, light blue for R. sect. Laevigatae, red for R. sect. Rosa (ex. R. sect. Cinnamomeae), orange for R. sect. Carolinae, purple for R. subg. Hesperhodos, yellow for R. sect. Pimpinellifoliae and fuchsia for R. subg. Hulthemia
Fig. 4 Combined ML species tree with summary of conflicting and concordant SCOTags. The ML species tree was constructed from 1526 concatenated rooted SCOTags. Outgroups are not shown. Node names are in bold. For each branch, the three values separated by slashes are the local posterior probability (LPP), the bootstrap support (BS) and the Internode Certainty All (ICA), respectively. The pie chart at each node presents the fraction of SCOTags that support that bipartition (blue), the fraction that support the main alternative bipartition (green), the fraction that support other alternative bipartitions (red) and the fraction that either have less than 70% BS at this bipartition or lack this bipartition because of missing data (gray). To the right of each pie chart, the top and bottom values indicate the numbers of SCOTags that are concordant and in conflict, respectively, with the corresponding bipartition in the species tree. The scatter plot on the left compares values of BS, LPP and ICA at each node. Nodes are ranked from the most ancient (N1) to the most recent (N9) according to Fig. 3b. Stars indicate conflicting nodes with large fractions of alternative bipartitions
Fig. 5 Correlation between phylogenetic informativeness (PI) and the number of (a) concordant nodes and (b) conflicting nodes in SCOTag topologies. Panel (c) shows the PI distribution for unrootable SCOTags that were not analyzed using PhyParts. Categories with fewer than 30 points were plotted but not used in the calculation of correlations. Red dots correspond to mean values. Blue lines correspond to regression lines: y = 4.95 + 0.65x, R² = 0.04 in panel (a) and y = 5.03 + 0.56x, R² = 0.10 in panel (b). The topmost purple dot corresponds to the highest PI profile in Fig. 3b
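Regression lines of the form y = a + bx with an R² value, as reported in this caption, are what an ordinary least-squares fit produces. The Python sketch below shows how such an intercept, slope and R² can be computed. The function name and the input data are hypothetical, and the article does not state which software was used for the fit.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x.

    Returns (intercept, slope, r_squared). Assumes xs and ys have the
    same length and that neither is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For simple linear regression, R² equals the squared Pearson correlation.
    r_squared = (sxy * sxy) / (sxx * syy)
    return intercept, slope, r_squared


# Made-up illustration data, not values from the article.
nodes = [0, 1, 2, 3, 4]
pi_values = [5.1, 5.4, 6.6, 6.5, 7.9]
intercept, slope, r2 = fit_line(nodes, pi_values)
print(f"y = {intercept:.2f} + {slope:.2f}x, R2 = {r2:.2f}")
```

A low R², such as the 0.04 and 0.10 reported in the caption, means the fitted line explains only a small share of the variance in the plotted values.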