HybPhyloMaker: Target Enrichment Data Analysis From Raw Reads to Species Trees.

Abstract

SUMMARY: Hybridization-based target enrichment in combination with genome skimming (Hyb-Seq) is becoming a standard method of phylogenomics. We developed HybPhyloMaker, a bioinformatics pipeline that performs target enrichment data analysis from raw reads to supermatrix-, supertree-, and multispecies coalescent-based species tree reconstruction. HybPhyloMaker is written in BASH and integrates common bioinformatics tools. It can be launched both locally and on a high-performance computer cluster. Compared with existing target enrichment data analysis pipelines, HybPhyloMaker offers the following main advantages: implementation of all steps of data analysis from raw reads to species tree reconstruction, calculation and summary of alignment and gene tree properties that assist the user in the selection of "quality-filtered" genes, implementation of several species tree reconstruction methods, and analysis of the coding regions of organellar genomes. AVAILABILITY: The HybPhyloMaker scripts, the manual, and a test data set are available at https://github.com/tomas-fer/HybPhyloMaker/. HybPhyloMaker is licensed under the open-source license GPL v.3, which allows further modifications.

Keywords: Target enrichment; genome skimming; locus selection; phylogenomics; species tree

Year: 2018 PMID: 29348708 PMCID: PMC5768271 DOI: 10.1177/1176934317742613

Source DB: PubMed Journal: Evol Bioinform Online ISSN: 1176-9343 Impact factor: 1.625

Introduction

Hybridization-based target enrichment in combination with genome skimming (Hyb-Seq) is becoming a standard method of phylogenomics (in plants see, for example, the works by Mandel et al., Weitemier et al., and Nicholls et al.[1-3]; see also the works by Lemmon and Lemmon and Heyduk et al.[4,5] for a general overview of genome subsampling methods). Up to now, two well-documented data analysis pipelines have been published: PHYLUCE[6] and HybPiper[7]. PHYLUCE has been developed and optimized for working with ultraconserved elements (UCEs)[8,9], but it performs poorly when the targeted sequences consist of multiple exons per gene, which are common targets in plant phylogenetics due to the paucity of UCEs[10] (for locus selection in plants, see, for example, the works by Weitemier et al. and Nicholls et al.[2,11]). PHYLUCE applies a very stringent filter on potentially paralogous loci, which can cause a severe loss of loci when multiple exons per gene are targeted: de novo read assembly often yields multiple contigs per gene, PHYLUCE interprets these as an indication of paralogy, and the respective loci are rejected from phylogenetic reconstruction. This can result in a dramatic decrease in potentially orthologous and phylogenetically informative data[2]. The alternative pipeline, HybPiper, is able to handle not only the exonic probe sequences but also the intronic flanking regions, and it identifies and separates putative paralogs. However, apart from the identification of putative paralogs, it offers no further criteria for locus selection, and gene and species tree reconstruction as well as the reconstruction of organellar phylogenies are not part of HybPiper.
Therefore, phylogeneticists using exonic probe sequences lack a straightforward and well-documented bioinformatics pipeline that performs target enrichment data analysis from raw reads to species trees, including quality filtering of raw reads, read assembly, alignment of loci, evaluation of missing data and phylogenetic utility of loci, phylogenetic reconstruction in the form of gene/species trees and concatenation, as well as phylogenetic reconstruction from organellar data. Plastid reads in particular are often obtained in sufficient quantity as part of the off-target reads (eg, 2%[12]; 5%[13]). Incongruence between the nuclear and plastid trees often gives evidence of hybridization events, and both these data sets are usually used in phylogenetics. Here, we present our pipeline HybPhyloMaker, which carries out all of these tasks.

Implementation

HybPhyloMaker consists of 11 major BASH scripts (HybPhyloMaker0-10) that integrate common bioinformatics tools of high-throughput sequencing and phylogenomics. These scripts perform the various steps of data analysis as separate modules within a particular directory structure that is created by them. HybPhyloMaker has a command line interface and can be run both locally and on a high-performance computer cluster. The modular BASH scripts of HybPhyloMaker enable flexible use. All steps are described in detail in Figure 1.

Figure 1.

HybPhyloMaker processing steps. Input data and intermediate results are displayed in white boxes, modification steps are shown in gray boxes. Each modification step is performed by a particular HybPhyloMaker script (small gray boxes).

Data preparation for phylogenetic tree reconstruction

HybPhyloMaker requires two types of input files: (1) paired-end Illumina reads in the form of two gzipped FASTQ files per sample and (2) sequences of the probes that were used for target enrichment (FASTA file). The script HybPhyloMaker0 prepares the raw reads for HybPhyloMaker use. It gives the paired-end raw reads a unique label, sorts them according to the HybPhyloMaker-specific directory structure, and creates the reference sequence (“pseudoreference”) for the subsequent reference-guided assembly of the enriched nuclear loci: the probe sequences are concatenated, with consecutive probes separated by a string of several hundred Ns (400 Ns are recommended for 2 × 150 bp [base pairs] reads). PhiX read removal, adapter trimming, quality filtering, and duplicate read removal are done with HybPhyloMaker1, using Bowtie 2[14], SAMtools[15], bam2fastq[16], Trimmomatic[17], and FastUniq[18]. In a subsequent step, HybPhyloMaker2 maps the reads to this “pseudoreference” using Bowtie 2 or BWA[19] and calls the consensus sequence with either OCOCO[20] or Kindel[21]. The consensus sequence is called according to an adjustable majority threshold. This results in the reconstruction of the most abundant sequence, which is considered to be the ortholog, as paralogs are usually not enriched in similar quantities compared to orthologs due to a higher sequence dissimilarity to the probe sequences (see Supplement Figure 1). A similar approach was used in recent publications[3,13]. With HybPhyloMaker3, the consensus sequence is fragmented into the exonic parts, which will be called contigs hereafter, and those are matched to the probe sequences using BLAT (BLAST-like alignment tool)[22].
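The construction of the “pseudoreference” can be illustrated with a minimal Python sketch. This is not the HybPhyloMaker0 code (which is written in BASH); the function name `build_pseudoreference`, the probe names, and the toy sequences are hypothetical, and only the spacer of 400 Ns follows the recommendation given above.

```python
def build_pseudoreference(probes, spacer_len=400):
    """Join probe sequences with runs of Ns and record each probe's coordinates.

    probes: list of (name, sequence) tuples in the order they should appear.
    Returns the concatenated sequence and a list of (name, start, end)
    with 0-based, end-exclusive coordinates.
    """
    spacer = "N" * spacer_len
    coords = []
    pos = 0
    for name, seq in probes:
        coords.append((name, pos, pos + len(seq)))
        pos += len(seq) + spacer_len  # the next probe starts after the spacer
    pseudoref = spacer.join(seq for _, seq in probes)
    return pseudoref, coords


# Toy probes (hypothetical names and sequences, for illustration only)
probes = [
    ("gene1_exon1", "ACGTACGT"),
    ("gene1_exon2", "GGCCTTAA"),
    ("gene2_exon1", "TTTTCCCC"),
]
pseudoref, coords = build_pseudoreference(probes, spacer_len=400)

# Each recorded interval points back at the original probe sequence
for name, start, end in coords:
    assert pseudoref[start:end] == dict(probes)[name]
```

Keeping the coordinates alongside the concatenated sequence is a design choice of this sketch: it makes it straightforward to cut a consensus sequence called against the pseudoreference back into per-exon pieces.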
Exonic multiple sequence alignments are constructed with HybPhyloMaker4a, which uses the Python script “assembled_exons_to_fastas.py”[2]; if an exon is missing for a particular accession, Ns are added. HybPhyloMaker4a also aligns the exons using MAFFT[23] and concatenates exons from the same gene using the Perl script “catfasta2phyml.pl”[24]. Optionally, exon and gene alignments can be adjusted to correct the reading frame with HybPhyloMaker4b. This option later allows per-exon and per-codon partitioning to be applied at the same time when gene trees are estimated. With HybPhyloMaker5, the amount of missing data is calculated, and accessions as well as loci that meet user-defined thresholds of missing data are retained for further analysis. In a first step, accessions that equal or exceed the maximum allowed percentage of missing data per locus are omitted from the respective loci. Then, the number of remaining accessions per locus is calculated, and only those loci that exceed the minimum allowed percentage of accessions per locus are retained. In addition, HybPhyloMaker5 uses AMAS[25], MstatX[26], and trimAl[27] to calculate summary statistics of properties of each locus alignment, which will in a subsequent step assist in a more stringent locus selection. Tables that summarize the amount of missing data within both the entire data set and the user-selected loci as well as histograms that show the distribution of alignment properties are provided.
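The two-step missing data filter described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the HybPhyloMaker5 implementation: the function names, the set of characters counted as missing, and the threshold values (70% and 50%) are all hypothetical.

```python
# Characters treated as missing data (an assumption of this sketch)
MISSING_CHARS = set("Nn-?")


def missing_percent(seq):
    """Percentage of positions in an aligned sequence that are missing."""
    if not seq:
        return 100.0
    return 100.0 * sum(ch in MISSING_CHARS for ch in seq) / len(seq)


def filter_loci(alignments, n_accessions, max_missing, min_presence):
    """Step 1: within each locus, omit accessions whose missing data equal or
    exceed max_missing (percent). Step 2: keep only loci in which the
    remaining accessions exceed min_presence (percent of all accessions)."""
    kept = {}
    for locus, aln in alignments.items():
        retained = {acc: seq for acc, seq in aln.items()
                    if missing_percent(seq) < max_missing}
        presence = 100.0 * len(retained) / n_accessions
        if presence > min_presence:
            kept[locus] = retained
    return kept


# Toy data: two loci, four accessions (hypothetical)
alignments = {
    "locusA": {"a": "ACGTACGTAC", "b": "ACGTNNNNNN",
               "c": "NNNNNNNNNN", "d": "ACGT--GTAC"},
    "locusB": {"a": "ACGT", "b": "NNNN", "c": "NNNN", "d": "NNNT"},
}
kept = filter_loci(alignments, n_accessions=4, max_missing=70.0, min_presence=50.0)
print(sorted(kept))            # loci that pass both steps
print(sorted(kept["locusA"]))  # accessions retained in locusA
```

In this toy data set, locusA keeps three of four accessions (75% presence) and is retained, whereas locusB keeps only one (25%) and is dropped.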

Gene tree and species tree reconstruction

Gene trees are reconstructed with HybPhyloMaker6. FastTree[28] and RAxML[29] are the tree-building programs to choose from; both can be run with or without bootstrapping. FastTree is computationally less demanding than RAxML, but it tends to provide higher branch support values (based on the Shimodaira-Hasegawa test[28]) compared to bootstrapping in RAxML[30]. RAxML trees can be estimated from unpartitioned or partitioned (by exon or by codon position) data sets. In addition, HybPhyloMaker6 calculates summary statistics of properties of each gene tree, using the R script “tree_props.R” (modified from the work of Borowiec[31]) and the R packages ape[32] and seqinr[33]. Alignment summary statistics, which were inferred with HybPhyloMaker5, and gene tree summary statistics are combined, and correlations among all properties are calculated and visualized with the R script “plotting_correlations.R” (modified from the work of Borowiec[31]). Based on those summaries and correlations, the user can optimize phylogenetic reconstruction using “quality-filtered” genes with HybPhyloMaker9. In particular, saturated genes (those deviating from simple linear regression on uncorrected p-distances against inferred distances[34]) should be omitted from downstream analysis. However, this step is optional, and users must make themselves familiar with any steps that select particular genes before applying HybPhyloMaker9. With HybPhyloMaker7, all gene trees are combined into one file and optionally rooted using Newick Utilities[35]. Using HybPhyloMaker10, users can also collapse unsupported branches in gene trees by specifying a minimum support value for a branch to be kept and/or subselect trees containing selected samples. Species trees are reconstructed with HybPhyloMaker8. There are several options: ASTRAL[36], ASTRID[37] (both coalescent summary methods), MRL[38] (supertree method using matrix representation with likelihood), and maximum likelihood implemented in FastTree and ExaML[39] (concatenation). In preparation of an ExaML run, the selected loci are concatenated, and gene partition information is provided by AMAS. Partition Finder 2[40,41] is used to find the optimal partitioning scheme.
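The concatenation of selected loci together with their partition boundaries can be sketched in Python as follows. In HybPhyloMaker this task is performed by AMAS; the function below is a hypothetical stand-in for illustration only, and the fill character for accessions missing from a locus (“?”) as well as the toy gene names are assumptions.

```python
def concatenate_loci(loci, fill="?"):
    """Concatenate per-locus alignments into a supermatrix.

    loci: dict mapping locus name -> {accession: aligned sequence}.
    Accessions absent from a locus are padded with the fill character.
    Returns (supermatrix, partitions), where partitions is a list of
    (locus, start, end) with 1-based, inclusive coordinates.
    """
    accessions = sorted({acc for aln in loci.values() for acc in aln})
    parts = {acc: [] for acc in accessions}
    partitions = []
    start = 1
    for name, aln in loci.items():
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"sequences in {name} differ in length")
        length = lengths.pop()
        for acc in accessions:
            parts[acc].append(aln.get(acc, fill * length))
        partitions.append((name, start, start + length - 1))
        start += length
    supermatrix = {acc: "".join(chunks) for acc, chunks in parts.items()}
    return supermatrix, partitions


# Toy loci (hypothetical names and sequences)
loci = {
    "gene1": {"sp1": "ACGTAC", "sp2": "ACGTAA"},
    "gene2": {"sp1": "GGCC", "sp3": "GGCA"},
}
supermatrix, partitions = concatenate_loci(loci)

# Partition lines in the "DNA, name = start-end" style read by RAxML
for name, first, last in partitions:
    print(f"DNA, {name} = {first}-{last}")
```

The recorded boundaries are what a partitioned analysis needs: each gene occupies a contiguous block of columns in the supermatrix, so a separate model can be assigned to each block.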

Organellar reads

HybPhyloMaker also allows working with organellar reads, which are often obtained in sufficient quantity as off-target reads (eg, 2%[12]; 5%[13]); ie, it is possible to work with organellar sequences even if one does not specifically target them. Such a proportion of organellar reads usually provides sufficient sequencing depth, especially for coding regions. For phylogenetic reconstruction based on organellar genomes, the user needs to provide sequences of the coding regions from the target group or from a closely related group. First, the organellar reads are extracted from the total read pool with HybPhyloMaker2 by mapping to an organellar “pseudoreference” (concatenated coding organellar sequences that are separated by a string of several hundred Ns each; prepared using HybPhyloMaker0b). The resulting contigs are matched to the coding sequences with BLAT. The subsequent analysis follows the pipeline of enriched nuclear loci in most instances. Commands for processing organellar data are implemented in HybPhyloMaker2-10.

Computational implementation, performance, and pipeline comparison

HybPhyloMaker runs on major Linux distributions (Debian, Ubuntu, openSUSE, Fedora, CentOS, Scientific Linux) and on MacOS X. Automated installation of the numerous software packages that are required to run HybPhyloMaker (Table 1) is provided by the script “install_software.sh”; smaller scripts and utilities (Perl, Python, Java, and R) are provided with HybPhyloMaker. The cluster version of HybPhyloMaker was optimized on the Smithsonian Institution High Performance Cluster (SI/HPC) and the Czech National Grid Organization MetaCentrum NGI (http://metacentrum.cz/) but could easily be modified for running on any other computer cluster.

Table 1.

List of software that must be installed/must be present on the local computer/cluster before running HybPhyloMaker.

Software	Source	Install (yes/no)	Used command(s)	0: Sample preparation	1: Raw data processing	2: Read mapping	3: Generate pslx	4: Process pslx	4b: Correct frame, translate	5: Missing data handling	6: Build gene trees	7: Root gene trees	8a: ASTRAL	8b: ASTRID	8c: MRL	8e: Concatenated FastTree	8f: ExaML	9: Update	10: Collapse trees and select
GNU parallel	http://www.gnu.org/software/parallel/	y	parallel	x				x
Bowtie 2	http://bowtie-bio.sourceforge.net/bowtie2/index.shtml	y	bowtie2-build, bowtie2		x	x
BWA	http://bio-bwa.sourceforge.net/	y	bwa mem			x
SAMtools	http://samtools.sourceforge.net/	y	samtools		x	x
bam2fastq	https://gsl.hudsonalpha.org/information/software/bam2fastq/	y	bam2fastq		x
Trimmomatic	http://www.usadellab.org/cms/?page=trimmomatic	n	java -jar trimmomatic-0.33.jar		x
FastUniq	https://sourceforge.net/projects/fastuniq/	y	fastuniq		x
JDK/JRE	http://www.oracle.com/technetwork/java/javase/	y	java		x								x	x	x				x
OCOCO	https://github.com/karel-brinda/ococo/	y	ococo			x
Perl	https://www.perl.org/	y	perl		x			x			x					x
BLAT suite	http://genome.ucsc.edu/goldenPath/help/blatSpec.html	y	blat				x
MAFFT	http://mafft.cbrc.jp/alignment/software/	y	mafft					x
Python	https://www.python.org/	y	python					x					x	x
Python3	https://www.python.org/download/releases/3.0/	y	python3							x
AMAS	https://github.com/marekborowiec/AMAS/	n	python3 amas.py					x	x	x	x		x	x		x	x
trimAl	http://trimal.cgenomics.org/	y	trimal							x
MstatX	https://github.com/gcollet/MstatX/	y	mstatx							x
FastTree	http://www.microbesonline.org/fasttree/	y	fasttree								x					x
Newick Utilities	http://cegg.unige.ch/newick_utils/	y	nw_reroot, nw_topology								x	x	x	x	x				x
RAxML	https://sco.h-its.org/exelixis/web/software/raxml/	y	raxmlHPC								x				x		x
R	https://www.r-project.org/	y	R							x	x							x
ASTRAL	https://github.com/smirarab/ASTRAL/	n	java -jar astral.4.11.1.jar										x
ASTRID	https://github.com/pranjalv123/ASTRID/	n	ASTRID											x
p4	http://p4.nhm.ac.uk/	y	p4										x	x
mrpmatrix	https://github.com/smirarab/mrpmatrix/	n	java -jar mrp.jar												x
ExaML	https://sco.h-its.org/exelixis/web/software/examl/index.html	y	examl														x

All of the software can be installed automatically using the script “install_software.sh”. For each program, the source and the specific command for calling it are provided, and it is indicated in which HybPhyloMaker script (HybPhyloMaker0-10) the program is used. Software that needs to be installed/must be present on the computer/cluster is marked with “y”; if marked with “n”, it is provided with HybPhyloMaker and does not need to be installed.

We tested the performance of HybPhyloMaker using Hyb-Seq data sets from 6 samples of the plant genus Oxalis, each containing 1.3 to 1.9 million 2 × 150 bp raw reads. These Hyb-Seq libraries were enriched for 4,926 exons from 1,164 loci[11]. Run time, size of produced data files, and peak of RAM usage were recorded for each HybPhyloMaker script on a computer equipped with Intel Xeon E7-4860 CPU using 4 cores at 2.27 GHz and running CentOS 7.3.1611 (Supplement Table 1). In addition, we compared the number and percentage of mapped reads using Bowtie 2, BWA, and Geneious[42] (Supplement Table 2). Finally, we processed the same samples with HybPiper and PHYLUCE (Table 2). A direct comparison of steps within each of these pipelines (Table 3), regarding, eg, contig number, is not helpful in our opinion, due to different approaches and implementation of different software with noncomparable parameter settings in steps such as assembly (reference-guided versus de novo) and identification of contigs that match to the targeted sequences (as nucleotide sequences with BLAT [HybPhyloMaker], with exonerate[49] [HybPiper], and with LASTZ[50] [PHYLUCE]).
We provide an approximate comparison of the three pipelines by recording the number of genes that were recovered (ie, with ≥25% completeness of each gene in case of HybPhyloMaker and HybPiper) and by indicating the number and percentage of putative paralogs in case of HybPiper and PHYLUCE. Filtering against missing data was not performed in PHYLUCE, thereby providing the most conservative number and percentage of recovered genes. Duplicate read removal was performed in case of HybPhyloMaker and HybPiper. In PHYLUCE, assembly of adapter- and quality-trimmed reads was performed with Velvet[51] using k-mer length k = 35. Matching of contigs to probe sequences was performed with 90% minimum sequence identity.
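The recovery figures reported in Table 2 can be related to the ≥25% completeness criterion with a short Python sketch. How each pipeline measures completeness internally is not specified here, so the definition below (unambiguous bases divided by the target length) is an assumption, and the function names and toy genes are hypothetical; only the totals (1,164 loci; 1,160 and 43 recovered loci for Oxalis blastorrhiza) are taken from the text and Table 2.

```python
def completeness(seq, target_len):
    """Fraction of the target covered by unambiguous bases (assumed definition)."""
    informative = sum(ch in "ACGT" for ch in seq.upper())
    return informative / target_len


def count_recovered(genes, min_completeness=0.25):
    """Number of genes whose completeness is at least min_completeness.

    genes: dict mapping gene name -> (recovered sequence, target length).
    """
    return sum(1 for seq, target_len in genes.values()
               if completeness(seq, target_len) >= min_completeness)


def percent(recovered, total):
    """Percentage of recovered loci, rounded to one decimal place."""
    return round(100.0 * recovered / total, 1)


# Toy genes (hypothetical): completeness 0.5, 0.125, and 0.25
genes = {
    "g1": ("ACGTNNNN", 8),
    "g2": ("ANNNNNNN", 8),
    "g3": ("ACNNNNNN", 8),
}
n_recovered = count_recovered(genes)

# Percentages as reported for Oxalis blastorrhiza J557 in Table 2
print(percent(1160, 1164))  # HybPhyloMaker
print(percent(43, 1164))    # PHYLUCE
```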

Table 2.

Comparison of the performance of the three pipelines PHYLUCE, HybPiper, and HybPhyloMaker when processing 6 samples from the plant genus Oxalis[11].

Name and code	No. of raw reads	PHYLUCE: no. (%) of recovered loci; no filtering against missing data	HybPiper: no. (%) of recovered loci; no filtering against missing data	HybPiper: no. (%) of recovered loci; ≥25% data completeness	HybPiper: no. (%) of putative paralogs; ≥25% data completeness	HybPhyloMaker: no. (%) of recovered loci; ≥25% data completeness
Oxalis blastorrhiza J557	1 905 062	43 (3.7)	1102 (94.7)	1080 (92.8)	11 (0.9)	1160 (99.7)
Oxalis creaseyii J11-961	1 553 282	156 (13.4)	1148 (98.6)	1141 (98.0)	14 (1.2)	1161 (99.7)
Oxalis gracilis J558	1 306 633	125 (10.7)	1147 (98.5)	1139 (97.9)	20 (1.7)	1161 (99.7)
Oxalis helicoides J319	1 847 669	53 (4.6)	1134 (97.4)	1130 (97.1)	15 (1.3)	1161 (99.7)
Oxalis inconspicua J595	1 785 030	84 (7.2)	1118 (96.0)	1108 (95.2)	14 (1.2)	1163 (99.9)
Oxalis polyphylla J11-44	1 818 390	47 (4.0)	994 (85.5)	968 (83.2)	5 (0.4)	1161 (99.7)

The number and percentage of genes that were recovered (ie, with ≥25% completeness of each gene in case of HybPiper and HybPhyloMaker) and the number and percentage of putative paralogs in case of HybPiper are reported. Filtering against missing data was not performed in PHYLUCE, thereby the most conservative number and percentage of recovered genes are provided. Duplicate read removal was performed in case of HybPiper and HybPhyloMaker.

Table 3.

Comparison between the major steps of PHYLUCE, HybPiper, and HybPhyloMaker.

Step	PHYLUCE	HybPiper	HybPhyloMaker
Download from Illumina BaseSpace	No	No	Yes
Input	Paired-end Illumina reads	Paired-end and single-end Illumina reads	Paired-end Illumina reads
Adapter trimming and quality filtering of reads	Yes (Illumiprocessor[43]; pairs with both mates surviving and orphaned reads are used)	No (adapter trimming and quality filtering of reads need to be performed before using HybPiper; pairs with both mates surviving are used as input for HybPiper)	Yes (Trimmomatic; pairs with both mates surviving and orphaned reads are used)
Duplicate read removal	No	Yes (Super deduper[44])	Yes (FastUniq)
Assembly	De novo (Velvet; ABySS[45]; Trinity[46])	De novo (SPAdes[47])	Reference-guided (Bowtie 2/BWA; OCOCO/Kindel)
Identification of sequences that match to the targeted sequences	After assembly: done by matching contigs to the targeted sequences (as nucleotide sequences with LASTZ)	Before assembly: done by matching reads to the targeted sequences (as peptide sequences with BLASTX; as nucleotide sequences with BWA). After assembly: done by matching contigs to the targeted sequences with exonerate	After assembly: done by matching contigs to the targeted sequences (as nucleotide sequences with BLAT)
Filtering against paralogs	Yes (paralogy is indicated if a targeted locus matches multiple contigs or if a contig matches multiple targeted loci; the respective loci are excluded)	Yes (paralogy is indicated if a targeted locus matches multiple long-length contigs; the respective loci are flagged; separation of putative paralogs possible)	No (consensus calling after the reference-guided assembly is according to majority; this results in the reconstruction of the most abundant sequence, which is considered to be the ortholog)
Particularly suitable for exonic probe sequences	No	Yes	Yes
Extraction of flanking intronic regions	No	Yes	No
Missing data calculation	Yes	No	Yes
Calculation of alignment and gene tree properties	No	No	Yes
Flexible handling of excluding accessions and loci	Yes	No	Yes
Gene tree reconstruction	No	No	Yes (RAxML, FastTree)
Concatenation	Yes (ExaBayes[a,48]; RAxML; ExaML[a])	No	Yes (FastTree, ExaML[a])
Species tree reconstruction	No	No	Yes (ASTRAL, ASTRID, MRL)
Organellar phylogeny	No	No	Yes (from coding sequences)

[a] Input file preparation.

Results and Discussion

Performance and pipeline comparison

Computer performance of HybPhyloMaker is summarized in Supplement Table 1. The most time-consuming steps are read mapping and consensus calling, reconstruction of RAxML gene trees, and ExaML analysis of the concatenated and partitioned data set. The most RAM-demanding step is phylogenetic tree reconstruction based on the concatenated data set (both FastTree and ExaML). The largest files are the FASTQ files generated during raw read processing and the BAM files obtained during read mapping. Geneious performed best among the three read mappers compared: BWA and Bowtie 2 mapped 78% to 93% and 67% to 80%, respectively, of the reads that were mapped by Geneious.

HybPhyloMaker is the first data analysis pipeline for hybridization-based target enrichment data generated with exonic probe sequences that performs all relevant steps from raw reads to species and organellar trees. Two alternative, well-documented data analysis pipelines are available, PHYLUCE and HybPiper, and a detailed comparison of the steps of these two pipelines with HybPhyloMaker is provided in Table 3. Major differences between them are as follows: (1) The assembly strategy (de novo in PHYLUCE and HybPiper or reference-guided in HybPhyloMaker): both assembly strategies allow for the assembly of both exonic and intronic regions. In HybPhyloMaker, this is due to the use of a reference sequence that is built from the concatenated exonic probe sequences, which are separated by a string of several hundred Ns each. (2) Paralog identification: both PHYLUCE and HybPiper detect putatively paralogous loci, which are either excluded from subsequent analyses (PHYLUCE) or flagged (HybPiper). In HybPhyloMaker, an adjustable majority consensus sequence is obtained. This results in the reconstruction of the most abundant sequence, which is considered to be the ortholog (Supplement Figure 1). (3) Suitability for exonic probe sequences: both HybPiper and HybPhyloMaker are tailored for exonic probe sequences, whereas PHYLUCE might exclude a large number of loci when multiple exons per gene are targeted (Table 2), as in such cases multiple contigs per gene are often formed, which is an indicator of paralogy in PHYLUCE. HybPiper filters putative paralogs less stringently (Table 2), as multiple contigs per gene are only taken as an indication of paralogy if they exceed a certain minimum length threshold (>85% length of the targeted locus). (4) Extraction of flanking intronic regions: only HybPiper provides a script for that; the other pipelines obtain these intronic regions during assembly but do not process them further. (5) Missing data calculation: PHYLUCE and HybPhyloMaker offer estimation of missing data. (6) Calculation of alignment and gene tree properties: this is only implemented in HybPhyloMaker. The alignment properties comprise number of accessions, alignment length, proportion of variable sites, proportion of parsimony informative sites, GC content, alignment entropy, and conservation distribution. Gene tree properties are as follows: average bootstrap support, average branch length, average uncorrected p-distance, clocklikeness, simple linear regression on uncorrected p-distances against inferred distances, and long-branch score. (7) Flexible handling of excluding accessions and loci: this is possible in both PHYLUCE and HybPhyloMaker. (8) Gene and species tree reconstruction: software for gene tree reconstruction is implemented in PHYLUCE, and software for both gene and species tree reconstruction is implemented in HybPhyloMaker. HybPhyloMaker offers per-exon and per-codon partitioning. (9) Reconstruction of organellar phylogenies: only HybPhyloMaker offers their reconstruction, based on coding regions.

We consider PHYLUCE not well suited for exclusively exonic probe sequences due to the drastic loss of potentially orthologous loci (Table 2). HybPiper has the benefits of extraction of the flanking intronic regions, which are especially needed in the reconstruction of shallow phylogenies, and identification of putative paralogs. The identification of paralogs is essential mainly (1) if putatively paralogous loci are not excluded during probe design (in such a case, the identified paralogs should be excluded from phylogenetic reconstruction) and (2) if the ancestry of an allopolyploid is of interest (in such a case, paralogs can be beneficial for the inference of complex reticulate relationships[52,53]). HybPhyloMaker treats the most abundant sequence of a locus as the ortholog and does not identify putatively paralogous loci, which we consider an appropriate approach except in the two cases just mentioned.

Compared with existing target enrichment data analysis pipelines, HybPhyloMaker offers the following main advantages: (1) It implements all steps of target enrichment data analysis, from raw reads to species tree reconstruction. (2) It provides calculation and summary of many alignment and gene tree properties that assist the user in the selection of appropriate “quality-filtered” genes for species tree reconstruction. This step is optional, and users must make themselves familiar with any steps that select particular genes. (3) It implements several species tree reconstruction methods (ASTRAL, ASTRID, MRL) as well as concatenation (FastTree, ExaML). (4) It allows the analysis of the coding part of organellar genomes, ie, the analysis of a large proportion of the off-target reads, especially plastid reads.

Conclusions

HybPhyloMaker is a user-friendly pipeline that conducts the analysis of phylogenetic Hyb-Seq data sets from raw reads to species tree reconstruction. It is written in BASH and requires prior installation of several other software packages. An install script is provided for easy installation of these software packages. HybPhyloMaker runs on major Linux distributions and MacOS X. The software is open source and available at https://github.com/tomas-fer/HybPhyloMaker/.
