Lars Nauheimer, Nicholas Weigner, Elizabeth Joyce, Darren Crayn, Charles Clarke, Katharina Nargar.
Abstract
PREMISE: Hybrids contain divergent alleles that can confound phylogenetic analyses but can provide insights into reticulated evolution when identified and phased. We developed a workflow to detect hybrids in target capture data sets and phase reads into parental lineages using a similarity and phylogenetic framework.
Keywords: Angiosperm353; HybPiper; Nepenthes; alleles; introgression; paralogs; polyploidy; reticulation
Year: 2021 PMID: 34336402 PMCID: PMC8312746 DOI: 10.1002/aps3.11441
Source DB: PubMed Journal: Appl Plant Sci ISSN: 2168-0450 Impact factor: 1.936
FIGURE 1 (A) Illustration of the linkage of heterozygous sites in the assembly of hybrid accessions that contain reads from two divergent haplotypes (blue = TAG, yellow = ACT). The two SNPs on the left can be linked (continuous line), but they cannot be linked to the third SNP on the right (dotted line). Reference mapping can yield a consensus sequence that codes SNPs as ambiguities (WMK) or that represents the most common nucleotide at each site (AAG), generating a chimeric sequence. De novo assembly can either connect phased blocks correctly into phased alleles (TAG/ACT) or generate chimeric sequences (ACG/TAT). (B) Illustration of major parts and concepts of the workflow: (1) assembly in HybPiper; (2) SNP assessment in HybPhaser; (3) phylogenetic analysis; (4) clade association; and (5) phasing.
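The ambiguity coding described in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not part of HybPhaser; the function names `ambiguity_consensus` and `is_chimeric` are hypothetical and chosen only for illustration. The sketch collapses the two haplotypes from the caption (TAG and ACT) into an IUPAC-coded consensus and checks whether a candidate sequence mixes bases from both haplotypes without matching either one.

```python
# Standard IUPAC codes for two-base ambiguities.
IUPAC_PAIRS = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}


def ambiguity_consensus(hap_a: str, hap_b: str) -> str:
    """Collapse two equal-length haplotypes into one IUPAC-coded sequence."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must have equal length")
    consensus = []
    for a, b in zip(hap_a.upper(), hap_b.upper()):
        # Identical bases are kept; differing bases become an ambiguity code.
        consensus.append(a if a == b else IUPAC_PAIRS[frozenset((a, b))])
    return "".join(consensus)


def is_chimeric(seq: str, hap_a: str, hap_b: str) -> bool:
    """True if every base of seq comes from one of the two haplotypes
    at that site, but seq is identical to neither haplotype."""
    if not (len(seq) == len(hap_a) == len(hap_b)):
        raise ValueError("sequences must have equal length")
    from_parents = all(s in (a, b) for s, a, b in zip(seq, hap_a, hap_b))
    return from_parents and seq not in (hap_a, hap_b)


print(ambiguity_consensus("TAG", "ACT"))  # WMK
print(is_chimeric("AAG", "TAG", "ACT"))   # True
print(is_chimeric("TAG", "TAG", "ACT"))   # False
```

The ambiguity-coded consensus keeps the information that a site is heterozygous but loses the linkage between sites, which is why a separate phasing step is needed to recover the parental haplotypes.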
HybPhaser workflow overview containing all steps with a short description, the script/software used, required input, and generated output for each step.
| Part/Step | Description | Script/software | Input | Output |
|---|---|---|---|---|
| 1. Assessment of SNPs | ||||
| Consensus sequence generation | Reads are mapped to the de novo–assembled contigs to generate consensus sequences with SNPs where reads differ. | Bash 1 (BWA, bcftools) | de novo contigs (HybPiper), reads mapped to each locus (HybPiper) | Consensus sequences |
| Consensus sequence assessment | Proportion of SNPs and length of consensus sequences for each locus are collected. | R1a | Consensus sequences | Tables with SNPs/locus and sequence length |
| Data set optimization | Missing data can be reduced and putative paralogs removed. | R1b | Tables with SNPs/locus and sequence length | Lists with samples and loci to be removed |
| Assessment of heterozygosity and allele divergence | Summary tables are generated. | R1c | Tables with SNPs/locus and sequence length (cleaned) | Summary table and graphs for the assessment of heterozygosity and the detection of hybrids |
| Sequence lists generation | Sequences from HybPiper and HybPhaser folders are collated into sequence lists. | R1d | Contigs and consensus sequences and list with samples/loci to be removed | Sequence lists for loci or samples with contigs or consensus sequences, raw or cleaned (optimized) |
| 2. Clade association | ||||
| Phylogenetic analysis | Alignments and phylogenetic analysis | e.g., MAFFT | Sequence lists | Phylogenetic tree |
| Selection of clade references | Taxa that represent major clades are selected by the user. | | Information from phylogeny and summary table | Table (csv) with names of clade references |
| Extraction of mapped reads | Generation of read files that contain only reads that mapped on the target sequences | Bash 2 (samtools) | Bam file from HybPiper | Read files (mapped only) |
| BBSplit script preparation/execution | Generate and run BBSplit script to match reads (mapped only) to clade references | R2a | Table (csv) with names of clade references | BBSplit stats files with proportions of reads mapped to each reference |
| Collation of BBSplit results | Generation of summary table for clade association | R2b | BBSplit stats files, summary table | Clade association summary table |
| 3. Phasing | ||||
| Selection of accessions for phasing | | | Clade association summary table | Table (csv) with names of accessions to phase with respective references |
| BBSplit phasing script preparation and execution | Generate and run BBSplit script to map and phase read files | R3a | Table (csv) with names of accessions to phase with respective references, sequence read files | Read files of phased accessions, BBSplit stats files |
| Collation of BBSplit phasing results | Generation of summary table for phasing | R3b | BBSplit stats files | Summary table for phasing stats |
| 4. Data set merging | ||||
| Assembly of phased accessions | Phased accessions are assembled using HybPiper and HybPhaser (part 1) | HybPiper, HybPhaser | Read files of phased accessions, target sequence list | Sequence lists of phased accessions |
| Merging of data sets | Sequences of phased accessions are merged with sequences of non‐phased accessions | R4 | Sequence lists of phased and non‐phased accessions | Merged sequence lists of phased and non‐phased accessions |
| Phylogenetic analysis | Alignments and phylogenetic analysis | e.g., MAFFT | Merged sequence lists | Phylogenetic tree including phased and non‐phased accessions |
Software marked with an asterisk is not part of the workflow.
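The per-locus SNP and length tables produced in part 1 feed the two summary statistics used later in the figures: locus heterozygosity (LH) and allele divergence (AD). The following Python sketch shows one way these could be computed. It assumes that LH is the percentage of loci containing at least one SNP and that AD is the percentage of SNP sites across the total consensus length; these definitions, the `LocusStats` class, and the example values are illustrative assumptions, not taken from the HybPhaser R scripts.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LocusStats:
    locus: str
    snps: int    # number of ambiguity-coded sites in the consensus sequence
    length: int  # consensus sequence length in base pairs


def locus_heterozygosity(loci: List[LocusStats]) -> float:
    """Percentage of loci that contain at least one SNP (assumed definition)."""
    if not loci:
        return 0.0
    with_snps = sum(1 for entry in loci if entry.snps > 0)
    return 100.0 * with_snps / len(loci)


def allele_divergence(loci: List[LocusStats]) -> float:
    """Percentage of SNP sites over the summed locus length (assumed definition)."""
    total_length = sum(entry.length for entry in loci)
    if total_length == 0:
        return 0.0
    return 100.0 * sum(entry.snps for entry in loci) / total_length


# Made-up values for a single hypothetical sample.
sample = [
    LocusStats("locus_a", snps=0, length=1000),
    LocusStats("locus_b", snps=5, length=500),
    LocusStats("locus_c", snps=10, length=500),
    LocusStats("locus_d", snps=0, length=2000),
]

print(locus_heterozygosity(sample))  # 50.0
print(allele_divergence(sample))     # 0.375
```

Under these assumed definitions, a sample combining high LH with high AD would be a candidate hybrid, because many loci carry SNPs and the two alleles differ at many sites.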
FIGURE 2 (A) Scatterplot displaying the locus heterozygosity and allele divergence of samples. Known hybrids (red dots) and putative hybrid (orange dot) are labeled. (B) Phylogenetic tree of the consensus supermatrix displayed in three parts: the basal grade with two diverging clades, clade 1 below, and clade 2 on the right. Summary statistics for each accession are given: locus heterozygosity (LH), allele divergence (AD), number of loci (loci), and proportion of target sequence recovered (seq.). Clades selected for clade association are shown in gray with bars. Clade references are displayed in blue, known hybrids in red, and the putative hybrid in orange. Node support is shown above the node in bootstrap (BS) (* = BS100), and below the node in gene and site concordance factors (gCF/sCF).
FIGURE 3 Clade association table and heatmap displaying the percentage of reads matching to each of the 44 clade references. The complete table is on the left, with extracts of example rows shown on the right. The table includes locus heterozygosity (LH) and allele divergence (AD), percentages of reads matching to each reference, and the number of clade associations (CA). Known hybrids are in red, and the putative hybrid is in orange.
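The clade association summary in Figure 3 can be approximated as follows. This Python sketch takes per-reference read counts for one accession, converts them to percentages, and counts how many references exceed a cutoff. The function name `clade_association`, the 5% threshold `min_pct`, and the read counts are hypothetical; the source does not state how the number of clade associations (CA) is derived, so the cutoff rule here is an assumption.

```python
from typing import Dict, List, Tuple


def clade_association(
    read_counts: Dict[str, int],
    total_reads: int,
    min_pct: float = 5.0,  # hypothetical cutoff, not taken from the paper
) -> Tuple[Dict[str, float], List[str]]:
    """Return the percentage of reads matching each clade reference and the
    references at or above min_pct, ordered by decreasing percentage."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    pct = {ref: 100.0 * n / total_reads for ref, n in read_counts.items()}
    associated = sorted(
        (ref for ref, value in pct.items() if value >= min_pct),
        key=lambda ref: pct[ref],
        reverse=True,
    )
    return pct, associated


# Made-up counts of reads matched to three hypothetical clade references.
counts = {"clade_1_ref": 4200, "clade_2_ref": 3900, "clade_3_ref": 150}
percentages, associations = clade_association(counts, total_reads=10000)

print(percentages)        # {'clade_1_ref': 42.0, 'clade_2_ref': 39.0, 'clade_3_ref': 1.5}
print(associations)       # ['clade_1_ref', 'clade_2_ref']
print(len(associations))  # 2
```

In this made-up case the accession associates with two clades at similar proportions, the pattern one would expect for a hybrid between lineages from those clades; the two associated references would then be the candidates for phasing.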
FIGURE 4 Phylogenetic tree of the consensus supermatrix including phased haplotype accessions. Phased accessions are displayed in different colors. Clade references are displayed in blue. Node support is shown above the node in bootstrap (BS) (* = BS100) and below the node in gene and site concordance factors (gCF/sCF).