Literature DB >> 34674330

A bioinformatic platform to integrate target capture and whole genome sequences of various read depths for phylogenomics.

Pedro G Ribeiro^1,2, María Fernanda Torres Jiménez^3,4, Tobias Andermann^3,4,5,6, Alexandre Antonelli^3,4,7,8, Christine D Bacon^3,4, Pável Matos-Maraví^1,4.

Abstract

The increasing availability of short-read whole genome sequencing (WGS) provides unprecedented opportunities to study ecological and evolutionary processes. Although loci of interest can be extracted from WGS data and combined with target sequence data, this requires suitable bioinformatic workflows. Here, we test different assembly and locus extraction strategies and implement them into secapr, a pipeline that processes short-read data into multilocus alignments for phylogenetics and molecular ecology analyses. We integrate the processing of data from low-coverage WGS (<30×) and target sequence capture into a flexible framework, while optimizing de novo contig assembly and loci extraction. Specifically, we test different assembly strategies by contrasting their ability to recover loci from targeted butterfly protein-coding genes, using four data sets: a WGS data set across different average coverages (10×, 5× and 2×) and a data set for which these loci were enriched prior to sequencing via target sequence capture. Using the resulting de novo contigs, we account for potential errors within contigs and infer phylogenetic trees to evaluate the ability of each assembly strategy to recover species relationships. We demonstrate that choosing multiple sizes of kmer simultaneously for assembly results in the highest yield of extracted loci from de novo assembled contigs, while data sets derived from sequencing read depths as low as 5× recovers the expected species relationships in phylogenetic trees. By making the tested assembly approaches available in the secapr pipeline, we hope to inspire future studies to incorporate complementary data and make an informed choice on the optimal assembly strategy.

Entities: Chemical

Keywords: zzm321990secaprzzm321990; de novo assembly; loci extraction; low-coverage whole genome sequencing; target sequence capture

Mesh：

Year: 2021 PMID： 34674330 PMCID： PMC9298010 DOI： 10.1111/mec.16240

Source DB: PubMed Journal: Mol Ecol ISSN： 0962-1083 Impact factor: 6.622

INTRODUCTION

Until recently, the most cost‐efficient approaches to obtain genome‐wide data for phylogenomic and molecular ecology studies relied on genomic subsampling using size selection or enrichment prior to the sequencing process (genomic partitioning and reduced‐representation sequencing; Lemmon & Lemmon, 2013); such approaches include restriction site‐associated DNA sequencing (RADseq) and target sequence capture of conserved genetic regions, such as exons or ultraconserved elements (UCEs; e.g., Andermann et al., 2020; Burbano et al., 2010; Davey & Blaxter, 2010; Faircloth et al., 2012; Lemmon & Lemmon, 2013). The continuing decrease of sequencing costs (currently, one megabase [Mbp] of raw DNA sequences costs ~0.01 US$; https://www.genome.gov/sequencingcostsdata) have made low‐coverage whole‐genome sequencing (WGS) more economically feasible, and WGS data are also becoming widely applied in ecological and evolutionary studies (e.g., Li et al., 2019; Olofsson et al., 2019). Although the implementation of flexible and user‐friendly post‐sequencing bioinformatic pipelines has flourished within the past 5 years, there is still a gap when it comes to integrating data coming from different sequencing approaches (e.g., WGS and reduced‐representation sequencing). Loci of interest are often enriched in the laboratory prior to sequencing (in vitro enrichment; Albert et al., 2007; Gnirke et al., 2009) to achieve a sufficiently high read coverage for its processing into multilocus data sets. Moreover, the increasing volume of publicly available low read coverage WGS data provides a source for the bioinformatic (in silico) harvesting of these loci of interest. However, a guideline is still missing on how to assemble short reads most efficiently into contigs de novo (i.e., which parameters to use for contig assembly) when simultaneously working with data derived from low and high read genomic coverages of loci of interest, especially in cases in which the genome size of an organism is unknown or reference genomes are not available. Some bioinformatic pipelines have been developed for the processing of unassembled WGS data into multiple sequence alignments. For example, the Phylogenomics from Low‐coverage Whole‐genome Sequencing pipeline (PLWS; Zhang, Ding, et al., 2019) runs iterative de novo contig assemblies using the minia3 assembler (Chikhi & Rizk, 2013). Although the PLWS pipeline is computationally efficient (e.g., 21 hexapod genomes spanning from 0.1 to 2 gigabases [Gbp] were assembled in a period of 2 to 24 h on 16 [GB] or 32 [GB] of RAM PCs), it is unclear whether other contig assemblers using multi‐kmer strategies (see Box 1) can recover more complete multilocus alignments. Alternatively, aTRAM (Allen et al., 2017) uses iterative blast searches (Altschul et al., 1990) to find matching reads within a library of loci of interest (references) and subsequently assemble them with different contig assemblers relying on single‐kmer strategies (Box 1). Other pipelines can extract and assemble repetitive and high‐copy number genomic regions such as mitochondrial loci and rDNA repeat regions. For instance, GRAbB (Brankovics et al., 2016) uses computationally efficient assemblers such as edena (Hernandez et al., 2008, 2014), whereas mitofinder (Allio et al., 2020) maximizes the use of UCE data by retrieving, assembling, and annotating nonenriched mitochondrial loci using multi‐kmer assemblers such as metaspades (Nurk et al., 2017). There are other software that extract loci of interest from metagenomes (anvi'o; Eren et al., 2021; also see: https://merenlab.org/2019/10/17/export‐locus), from assembled genomes (Costa et al., 2016; Jarvis et al., 2014), or even from DNA data archived in VCF files (seqtailor; Zhang, Boisson, et al., 2019). To our knowledge, however, no studies have yet attempted to integrate WGS data of various underlying read coverages into multilocus data sets that have comparable quality to those resulting from in vitro target capture data, while comparing assemblers and assembly strategies in the same pipeline. Most assembler programs carry out three essential steps in order to assemble short reads into longer contigs: (1) Decomposing read sequences into kmers to improve efficiency of contig assembly by eliminating redundant short read overlaps; (2) Build a de Bruijn graph from the kmer overlapping information, which facilitates the connection of short reads; and (3) de Bruijn graph simplification. Decomposing read sequences into kmers requires the user to specify a kmer‐size lower than the short‐read length. A read is decomposed into n kmers by extracting the substring of kmer size length at each nucleotide position of the read. Then, all possible kmer are evaluated and two kmers connected if both overlap in kmer length size—1. Connections between kmers create a graph (de Bruijn graph) where nodes represent kmer sequences and edges represent connections. In the last step, the graph is simplified using information extracted from the reads and kmers themselves: coverage, distances, and pairing of reads. More details on the challenges of contig assembly and strategies to tackle those challenges are described in Sohn & Nahm (2018) and Liao et al. (2019). abyss (Figure 2, left): Starting from paired‐end reads (PE reads), abyss version 2 (Jackman et al., 2017) implements two modes to extract the kmers using a single kmer‐size provided by the user. The first mode, implemented in abyss version 1 (Simpson et al., 2009), which is the one used in this study, builds a hash table from kmers across all reads. A hash table is a data structure that efficiently stores large amounts of information. Unlike spades, abyss distributes sections of the hash table using a Message‐Passing Interface (mpi) to independent cluster nodes in order to parallelize the process of building the de Bruijn graph. The graph is extended by evaluating the overlap between all kmers and later simplified by using read pairing information. The second mode (abyss version 2, Jackman et al., 2017) stores all possible kmers of length = kmer‐size and their relative position using a bloom filter (Bloom, 1970) with one or more hashing functions used for indexing the information. Then, abyss applies a user‐provided threshold to flag infrequent kmers as an attempt to remove potential sequencing errors. Reads that do not contain flagged kmers are then used to build and extend the de Bruijn graph, which is simplified using the paired‐end information to trim off branches (a series of kmers connected to the graph only at one end) and bubbles (alternative paths on the graph joint at both ends that arise from single nucleotide polymorphisms or sequencing errors).

FIGURE 2

Flowchart summarizing the process of contig assemblage implemented in abyss (left, Jackman et al., 2017; Simpson et al., 2009) and spades (right, Bankevich et al., 2012). The chart briefly describes the de Bruijn graph construction and the differences between both assemblers. See Box 1 for more details

spades (Figure 2, right): Contrary to abyss, spades simultaneously uses all possible kmers extracted from a range of kmer sizes (the default values are 21, 33, 55, 77, 99) to build the initial de Bruijn graph. This multi‐kmer approach capitalizes on the advantages of building a graph from short and long kmer‐sizes. Smaller kmer‐sizes minimize contig fragmentation by increasing the probability of finding overlapping kmers. However, small kmer‐sizes might face difficulties in resolving repetitive regions, and longer kmer‐sizes further improve the graph (Bankevich et al., 2012). spades runs a read correction step before assembling the first de Bruijn. The initial graph architecture is extracted, and a series of graph operations take place on it, leveraging information about kmer coverage, kmer‐to‐kmer distances, and paired information to simplify the graph and remove branches and bubbles. Once the graph is simplified, the reads retained are mapped back onto the graph in order to extract the extended contigs. To improve best‐practices in de novo contig assembly from low‐coverage WGS reads and to demonstrate its integration with other types of reduced‐representation data, we expand the sequence capture processor pipeline (secapr – Andermann et al., 2018) to include the assembler spades (Bankevich et al., 2012). We also implement a new iterative assembly approach with the software abyss (Jackman et al., 2017; Simpson et al., 2009), in which contigs assembled with different sizes of read substrings—kmers—are combined and their orthology with reference sequences assessed to obtain new sets of contigs (which here we call a multi‐kmer approach; Box 1; Figure 2). secapr is a Python pipeline, available as a conda package for Linux, Windows, and MacOS, that automatically installs and executes software dependencies to obtain multilocus alignments from raw short sequencing reads. secapr was originally designed to process target sequence capture data of multisample data sets (see Figure 1 for pipeline workflow), inspired by the Phyluce pipeline workflow for UCEs (Faircloth, 2016).

FIGURE 1

Schematic representation of the workflow implemented in this study using the secapr version 2.2.3 pipeline. The bash commands used in each step are shown inside coloured (blue) boxes

Schematic representation of the workflow implemented in this study using the secapr version 2.2.3 pipeline. The bash commands used in each step are shown inside coloured (blue) boxes Flowchart summarizing the process of contig assemblage implemented in abyss (left, Jackman et al., 2017; Simpson et al., 2009) and spades (right, Bankevich et al., 2012). The chart briefly describes the de Bruijn graph construction and the differences between both assemblers. See Box 1 for more details Starting from unassembled low coverage WGS data, the bioinformatic steps for their processing are: (1) sequence quality filtering and cleaning, (2) de novo contig assembly (when no reference genome is available for read mapping), (3) identification and extraction of loci of interest from assembled contigs and (4) alignment of multiple sequences. In the new version of secapr, we enhance its efficiency for processing WGS data by modifying steps (2)—assembly using abyss and now, spades—and (3) ‐ identification (orthology assessment) and extraction of loci of interest using blastz (Schwartz et al., 2003). We allow for parallelization of multiple jobs in these steps and concomitant processing of short sequencing reads derived from different types of library preparations (target sequence capture and WGS). We tested the updated secapr pipeline (now called secapr version 2.2.3) on target sequence capture and WGS data and assessed the efficiency of using both single‐kmer and multi‐kmer de novo contig assembly (see Box 1 for detailed information on genome assembly using spades and abyss) to recover more, and more complete, multilocus alignments. Our study proposes a way forward to process data from different sequencing approaches in a single bioinformatic pipeline, assessing the performance of different de novo contig assemblers and strategies to enhance the extraction of loci of interest from WGS data.

MATERIALS AND METHODS

Data

We used published low‐coverage WGS data (Li et al., 2019) and newly generated target sequence capture data from skipper butterflies (Lepidoptera: Hesperiidae: Eudaminae). The study group was representative of commonly studied taxa in molecular ecology, given the lack of annotated reference genomes and the need to integrate closely related (species within genera) and highly divergent (lineages within a subfamily) samples. We downloaded low‐coverage whole‐genome sequences of 10 butterfly species representing three Eudaminae tribes (Table S1; Li et al., 2019; Bioproject PRJNA464409). The WGS data were generated via 150 bp paired‐end Illumina sequencing (HiSeq X Ten platform) at an average of ~10× genome coverage, which is about 5 Gbp sequencing data per sample considering a representative genome size of Eudaminae ~500–600 Mbp (Shen et al., 2017). To demonstrate how to integrate data from different sequencing approaches in a single bioinformatic pipeline, we generated target sequence capture data for different individuals representing the same 10 species in the low‐coverage WGS data (Table S1; BioProject PRJNA681152). We targeted 406 exons from protein‐coding genes using the butterfly 1.0 probe kit, which consists of 56,470 baits of 120 bp size each (Espeland et al., 2018). DNA was isolated from two to three butterfly legs using Qiagen DNA extraction kits following the manufacturer's instructions. RapidGenomics (FL, USA) prepared target enrichment libraries and conducted high‐throughput sequencing using paired‐end 150 bp on an Illumina NovaSeq platform. Raw reads from the publicly available WGS and the newly generated target sequence capture data were jointly processed using secapr version 2.2.3. We used a Linux centos version 7.9 system on a dedicated cluster provided by the Czech National Computing Infrastructure Metacentrum. secapr version 2.2.3 can be freely downloaded and installed following the detailed documentation at https://github.com/AntonelliLab/seqcap_processor (Open Research section at the end).

Sequencing quality check and cleaning

The quality of raw Illumina reads from both WGS, and target sequence capture was checked using fastqc version 0.11.9 (Andrews, 2010; https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) via the secapr version 2.2.3 pipeline using the command secapr quality_check. Target sequence capture data were filtered by removing low‐quality reads and trimming Illumina adapters using Trimmomatic (Bolger et al., 2014) via the secapr version 2.2.3 command secapr clean_reads. The average quality of the filtered, adapter‐free sequence capture data was assessed again using the command secapr quality_check.

De novo contig assembly

We evaluated the performance of abyss version 1.3.7 (Jackman et al., 2017; Simpson et al., 2009) and spades version 3.14.1 (Bankevich et al., 2012) with WGS and target sequence capture data. By default, spades uses six kmer values concurrently (21, 33, 55, 77, 99, and 127) and abyss uses a single‐kmer in each run with values of kmer up to 97. To make comparable evaluations of both assemblers, we ran abyss four times, each with a single‐kmer value of 21, 33, 55 or 77 and ran spades for the same kmer values concurrently with the command secapr assemble_reads.

Extraction of contigs containing loci of interest

We created 406 reference sequences representing each of the target sequence capture loci. The references are the consensus sequences of targeted exons from 129 unpublished Eudaminae samples to enhance matching with the assembled contigs. We extracted contigs of interest from the WGS and target sequence capture assemblies using the alignment algorithm blastz (Schwartz et al., 2003) via secapr version 2.2.3 using secapr find_target_contigs. For secapr version 2.2.3, we developed a new approach to automatically extract loci of interest from multiple abyss runs, here termed the multi‐kmer abyss approach. For this, we combined contigs from all individual kmer assemblies with abyss (using kmers = 21, 33, 55, and 77) and used the command secapr find_target_contigs to identify matched contigs against the reference sequences. The approach consists of removing redundant contig matches by selecting the longest contig among multiple single‐kmer runs in abyss, and for each targeted locus using the argument ‐‐keep_paralogs. This approach is different from the automated de novo contig assembly implemented within spades. spades outputs the assembled contigs resulting from a de Bruijn graph incorporating all different kmers across all kmer sizes at once. For abyss multi‐kmer, the selection of targeted contigs from multiple assemblies under different kmer values is done after the assembly run, thus, specifically targeting the step (3) in the bioinformatic pipeline.

DNA alignments and performance assessment

We generated six multilocus data sets, each from a different assembly and locus extraction strategy (spades, four single‐kmer abyss—21, 33, 55, and 77—and multi‐kmer abyss) and then we aligned each locus within the data sets using mafft version 7.130 (Katoh & Standley, 2013) in secapr version 2.2.3 with the command secapr align_sequences. We produced final alignment data sets of the same length as the reference sequences by trimming the exon boundaries using the ‐‐addfull option in mafft. To assess the performance of both assemblers and of their specific assembly strategies, we counted the number of recovered loci per sample across the six multilocus data sets. In addition, we evaluated the completeness of final alignments generated by each assembly strategy by comparing the alignments before and after excluding samples with more than 50% missing data (N bases) from the extracted loci alignments using Sequence_Cleaner (here named processed alignments) (https://github.com/metageni/Sequence‐Cleaner). Alignments for which we did not remove these sequences are named unprocessed alignments.

Producing data sets of different genome coverage

To evaluate the performance of assemblers and assembly strategies under different coverage depths, we randomly subsampled the original average 10× WGS coverage using BBTools/BBMap (Bushnell, 2021). We used the reformat.sh script to subsample 50% and 20% of the original WGS data set to generate new data sets with average genome coverages of 5× and 2×, respectively. These data sets were used in subsequent analyses to estimate the limitations of contig assembly and the variability in recovering loci of interest under lower genome sequencing coverages. We used reduced coverages in which differences in assembly performance may be more evident.

Statistical analysis

To statistically assess significant differences in the recovered number of loci amongst assembly approaches, we used linear mixed models using the r (R Core Team, 2020) package lme4 (Bates et al., 2015). We performed model selection using the corrected Akaike information criterion (AICc) to assess the best‐fit model that explains the estimated marginal means of recovered loci. Our analyses involved two approaches. First, we compared the 10× coverage WGS data set and the target sequence capture data set to test if, overall, there are statistically significant differences between estimated means of recovered loci. We considered the 10 Eudaminae samples as random variables and assembly strategies (single‐kmer abyss—21, 33, 55, and 77, multi‐kmer abyss and spades), sequencing strategy (WGS or target sequence capture), and processing of final alignments (whether sequences with more than 50% missing data are excluded or not) as fixed variables. The number of recovered loci per sample was used as the effect variable. Second, we compared the performance of the assembly strategies under different average coverages of WGS data. For this, we followed the same protocol described above and considered the subsampled coverages (10×, 5× or 2×) as fixed variables. For both approaches, we made a full model considering the whole set of fixed variables, models with every possible combination of fixed variables and a null model with only our random variable. After checking that the residuals fulfilled the assumptions of linear models, we also calculated a full model considering the interactions between the fixed variables.

Phylogenetic inference

We inferred species trees for each sequencing data set and assembly strategy to assess if they produced phylogenetically informative alignments that were congruent with the expected phylogenetic relationships among species. First, we inferred gene trees from the alignment of each targeted locus using iq‐tree version 2.0.7 (Minh et al., 2020). We used ModelFinder (Kalyaanamoorthy et al., 2017) as implemented in iq‐tree to estimate the best substitution models, and we performed 1000 ultrafast bootstrap replicates (Hoang et al., 2017). Second, we used the sets of maximum‐likelihood gene trees to infer coalescent species trees in ASTRAL III (Zhang et al., 2018). Support was calculated as local posterior probabilities from quartet frequencies (Sayyari & Mirarab, 2016). Species trees were also inferred using each of the data sets with different sequencing coverages (WGS under 10×, 5×, and 2×) to evaluate the informativeness of multilocus alignments under decreasing amount of raw WGS data. We considered a species tree as well‐resolved if it successfully recovered the expected species relationships (Li et al., 2019). We also considered branches as well‐supported when they presented a local posterior probability support higher than 0.95. To assess discordance among species trees, we calculated symmetric distances (Robinson & Foulds, 1981) using the r package phangorn (Schliep, 2011). Since the symmetric distances only consider tree topology, we used a reference species tree generated with our WGS data set and under 5× depth of coverage, assembled with the multi‐kmer abyss approach, which retrieved the expected tree topology (as in Li et al., 2019). This reference was then compared against each of the species’ trees obtained with ASTRAL III.

Assessment of potential errors in de novo contig assemblies

To evaluate the accuracy of the de novo contig assembly, we aligned queries of the assembled target loci sequences for each species of our study, against subject reference sequences of the same species using blast (Altschul et al., 1990). Both queries and subject references were generated by our own alignments so we could assess the performance of our own implemented assembly strategies. We then calculated the percentage of errors as the number of nucleotide mismatches, multiplied by 100, and divided by the total length of the alignment for each locus. We used the number of mismatches as reported by blast because ambiguous nucleotides and Ns are not considered errors, and the alignment length only counts the positions in which the queries and the subject sequences match (but it includes potential gaps). This procedure allowed us to interpret any observed differences (mismatches) between alignments as potential assembly errors. Our target sequence capture data set comes from different individuals of the same species present in the WGS data set. Therefore, differences from comparing WGS versus target sequence capture may as well result from within‐species polymorphisms and not necessarily due to assembly errors. We therefore carried out the analyses for the WGS and target sequence capture data sets separately. For the WGS data set, first we created the blast database of subject sequences for each species, each database including all the exons derived from the WGS 10× data set and assembled with either our multi‐kmer abyss approach or spades. We only used the 10× average coverage data as subjects since this data set generated the more complete alignments in terms of total recovered loci and number of present species within alignments, and because they produced well‐resolved species trees (see Section 3.3). Second, we performed per‐species blast alignments between the query sequences and the corresponding species’ subject sequences. The query sequences are those generated by multi‐kmer abyss and spades, and under different depths of coverage (10×, 5×, and 2×). Third, we calculated the percentage of error for each blast alignment (mismatches relative to matches), grouped the results by locus, averaged the percentage of error across species, and plotted the kernel density for every alignment. For the target sequence capture data set, we chose only the sequences assembled with spades to create subject sequences, since only these sequences generated well‐resolved species trees (see Section 3). Query sequences were generated for the multi‐kmer abyss approach and aligned against the subject. We also calculated the percentage of error for the blast alignment and plotted the averaged percentages. For all alignments, we used blast’s default values for blastn searches. Our factorial pairwise comparisons allowed us to quantify any differences in sequences coming from the same individual resulting from differences in assembly strategy (same data set, different assembly approach) and depths of coverage in the case of the WGS data set (same assembly approach, different coverages).

RESULTS

Sequence quality and computing performance

fastqc reports showed that per base sequence quality and per sequence quality score for clean and trimmed reads for both the WGS and target sequence capture data sets were above 28 Phred score. WGS data available from NCBI already contained adapter‐free Illumina reads. In general, de novo contig assemblies for the target sequence capture data were faster in terms of CPU time and wall time in comparison to the WGS data set even at 2× average read depth coverage, and they also required less memory usage. Also, single‐kmer abyss runs used less computational resources than spades for CPU time, wall time and memory usage regardless of the type of data. A comprehensive table with running times and memory usage for assemblies can be found in Table S2.

Recovery of loci of interest and statistical analyses

First, we determined which of the six assembly strategies (four single‐kmer runs with abyss—kmers 21, 33, 55, 77, our novel multi‐kmer approach with abyss, and spades) maximized the recovery of loci of interest from low‐coverage WGS (average 10× read depth) and target sequence capture data. Our multi‐kmer abyss approach significantly recovered more loci of interest per sample in both data sets compared to all other strategies, including spades (Table 1). Nevertheless, spades significantly recovered more loci than all other single‐kmer strategies (Table 1). A summary of estimated marginal means for every assembly strategy and data set can be found in Table S3. AICc‐based model selection indicated that the best fit model considered the interactions among all our fixed variables (Tables S4 and S5).

TABLE 1

Pairwise contrasts of the marginal estimated means of recovered loci per sample between multi‐kmer and single‐kmer strategies and between both multi‐kmer strategies

Pairwise comparisons among strategies	Completeness of alignments	WGS 10× read depth		Target sequence capture
Pairwise comparisons among strategies	Completeness of alignments	Estimate	p‐value	Estimate	p‐value
abyss Multikmer ‐ abyss 21	Unprocessed	136	<.0001	126	<.0001
abyss Multikmer ‐ abyss 33		99	<.0001	120	<.0001
abyss Multikmer ‐ abyss 55		143	<.0001	115	<.0001
abyss Multikmer ‐ abyss 77		191	<.0001	104	<.0001
abyss Multikmer ‐ spades		68	<.0001	85	<.0001
spades ‐ abyss 21		68	<.0001	41	.012
spades ‐ abyss 33		31	.1311	35	.0535
spades ‐ abyss 55		75	<.0001	30	.1386
spades ‐ abyss 77		123	<.0001	19	.6389
abyss Multikmer ‐ abyss 21	Processed	117	<.0001	107	<.0001
abyss Multikmer ‐ abyss 33		81	<.0001	102	<.0001
abyss Multikmer ‐ abyss 55		129	<.0001	100	<.0001
abyss Multikmer ‐ abyss 77		180	<.0001	92	<.0001
abyss Multikmer ‐ spades		27	.2523	44	.006
spades ‐ abyss 21		90	<.0001	63	<.0001
spades ‐ abyss 33		54	.0002	58	<.0001
spades ‐ abyss 55		102	<.0001	56	.0001
spades ‐ abyss 77		153	<.0001	48	.0015

Contrasts are made considering an average of 10× coverage for the low‐coverage WGS and the target sequence capture data sets. Standard error = 12.4 for all pairwise comparisons; degrees of freedom = 477 for all pairwise comparisons

Pairwise contrasts of the marginal estimated means of recovered loci per sample between multi‐kmer and single‐kmer strategies and between both multi‐kmer strategies Contrasts are made considering an average of 10× coverage for the low‐coverage WGS and the target sequence capture data sets. Standard error = 12.4 for all pairwise comparisons; degrees of freedom = 477 for all pairwise comparisons Second, we analysed the performance of our different assembly strategies using low‐coverage WGS data under varying degrees of read depth. Our multi‐kmer abyss strategy significantly recovered more loci of interest per sample than any other strategy for the 10× and the 5× coverage data sets, in both unprocessed and processed final alignments. spades, on the other hand, significantly recovered more loci of interest than any other strategy for the 2× coverage data set, and more than the single‐kmer strategies across all the data sets of varying average coverages (Table 2). Overall, the multi‐kmer strategies (multi‐kmer approach with abyss and spades) were significantly better than single‐kmer strategies in recovering more loci of interest (Tables 1 and 2; Figures 3 and 4).

TABLE 2

Pairwise comparisons among strategies	Completeness of alignments	WGS 10× read depth		WGS 5× read depth		WGS 2× read depth
Pairwise comparisons among strategies	Completeness of alignments	Estimate	p‐value	Estimate	p‐value	Estimate	p‐value
abyss Multikmer ‐ abyss 21	Unprocessed	149.1	<.0001	133.55	<.0001	42.55	.0005
abyss Multikmer ‐ abyss 33		116.02	<.0001	126.17	<.0001	61.62	<.0001
abyss Multikmer ‐ abyss 55		166.22	<.0001	172.62	<.0001	95.97	<.0001
abyss Multikmer ‐ abyss 77		217.75	<.0001	229.5	<.0001	109.85	<.0001
abyss Multikmer ‐ spades		70.38	<.0001	69.28	<.0001	−33.97	.0118
spades ‐ abyss 21		78.72	<.0001	64.27	<.0001	76.52	<.0001
spades ‐ abyss 33		45.63	.0001	56.88	<.0001	95.58	<.0001
spades ‐ abyss 55		95.83	<.0001	103.33	<.0001	129.93	<.0001
spades ‐ abyss 77		147.37	<.0001	160.22	<.0001	143.82	<.0001
abyss Multikmer ‐ abyss 21	Processed	117.8	<.0001	102.25	<.0001	11.25	.8786
abyss Multikmer ‐ abyss 33		82.78	<.0001	92.93	<.0001	28.38	.061
abyss Multikmer ‐ abyss 55		131.48	<.0001	137.88	<.0001	61.23	<.0001
abyss Multikmer ‐ abyss 77		182.45	<.0001	194.2	<.0001	74.55	<.0001
abyss Multikmer ‐ spades		33.12	.0155	32.02	<.0001	−71.23	<.0001
spades ‐ abyss 21		84.68	<.0001	70.23	<.0001	82.48	<.0001
spades ‐ abyss 33		49.67	<.0001	60.92	<.0001	99.62	<.0001
spades ‐ abyss 55		98.37	<.0001	105.87	<.0001	132.47	<.0001
spades ‐ abyss 77		149.33	<.0001	162.18	.022	145.78	<.0001

Standard error for all pairwise contrasts = 10.2, degrees of freedom for all pairwise contrasts = 375.

FIGURE 3

Boxplot of the median recovered loci per sample for each assembly strategy for the 10× average coverage WGS and target sequence capture data sets. (a) Unprocessed data, without the exclusion of sequences with more than 50% missing information – Ns. (b) Processed data (when sequences with more than 50% missing information ‐ Ns ‐ are excluded). Different colours represent the two different types of sequencing approaches that our data are derived from, target sequence capture and low‐coverage WGS at 10× coverage

FIGURE 4

Boxplot of the median recovered loci per sample for each assembly strategy for the three subsets of different average coverages (10×, 5× and 2×). (a) Unprocessed data, without the exclusion of sequences with more than 50% missing information – Ns. (b) Processed data (when sequences with more than 50% missing information ‐ Ns ‐ are excluded). Different colours represent the different average depths of coverage for the WGS data sets

Pairwise contrasts of the marginal estimated means of recovered loci per sample between multi‐kmer and single‐kmer strategies and between both multi‐kmer strategies for each of the subsets of depths of coverage of low‐coverage WGS data set Standard error for all pairwise contrasts = 10.2, degrees of freedom for all pairwise contrasts = 375. Boxplot of the median recovered loci per sample for each assembly strategy for the 10× average coverage WGS and target sequence capture data sets. (a) Unprocessed data, without the exclusion of sequences with more than 50% missing information – Ns. (b) Processed data (when sequences with more than 50% missing information ‐ Ns ‐ are excluded). Different colours represent the two different types of sequencing approaches that our data are derived from, target sequence capture and low‐coverage WGS at 10× coverage Boxplot of the median recovered loci per sample for each assembly strategy for the three subsets of different average coverages (10×, 5× and 2×). (a) Unprocessed data, without the exclusion of sequences with more than 50% missing information – Ns. (b) Processed data (when sequences with more than 50% missing information ‐ Ns ‐ are excluded). Different colours represent the different average depths of coverage for the WGS data sets For the target sequence capture data set, only spades generated alignments that led to well‐resolved species trees with well‐supported branches, that were consistent with the expected phylogenetic hypothesis (Li et al., 2019; Figure 5). For the low‐coverage WGS data sets at 10× and 5× coverage, both multi‐kmer abyss and spades resulted in alignments that recovered species trees with the expected tree topology. Single‐kmer abyss 21, 33, and 55 resulted in well‐resolved trees for the 10× coverage subset, although the single‐kmer abyss 55 strategy did not recover a well‐resolved species tree for the processed alignments (with sequences with more than 50% missing data removed). Only abyss 33 and 55 recovered well‐resolved species trees for the 5× coverage subset. Single‐kmer abyss 77 did not produce useful alignments for any of the data sets and did not result in well‐resolved species trees. None of our implemented assemblers and strategies using the 2× WGS coverage subset resulted in well‐supported branches in species trees nor recovered the expected tree topology in terms of species relationships. Finally, Robinson‐Fould distances showed that the trees with most incongruences were obtained from the 2× coverage subset. Single‐kmer abyss 77 also presented high levels of incongruences in comparison with the expected tree topology (Table S6). All species trees inferred in this study can be found at the Zenodo repository (https://doi.org/10.5281/zenodo.5515798).

FIGURE 5

Species tree obtained with ASTRAL III (Zhang et al., 2018) by using gene trees estimated with iq‐tree version 2.0.7 (Minh et al., 2020). Tips represent each of the studied species and numbers represent local posterior probabilities inferred by ASTRAL III for the specific node. This species tree was obtained using the alignment derived from our abyss multi‐kmer strategy for the WGS data set with an average of 10× read depth coverage. Only this tree is shown since topology is the same for other well‐resolved species tree When estimating the errors introduced by each assembly strategy using multi‐kmer abyss 10× as a query and spades 10× as the subject reference, most alignments resulted in a low percentage of errors (<1%; Figure 6). The same was observed when multi‐kmer abyss was used as query and spades as subject reference but considering the target sequence capture data set. This suggests that both assembly strategies produced very similar sequences for both data sets. When using the WGS 10× average coverage as subject reference (multi‐kmer or spades), and reduced depth coverage sequences as queries, more alignments resulted in a higher percentage of errors (~10%), both for 5× and 2× coverage and using either assembly strategy (Figure 6). This indicates that errors are more likely to appear as coverage decreases, regardless of the assembly strategy (Figure 6). Averaged results for the performed blast alignments can be found in Table S8.

FIGURE 6

Kernel density of the percentage of error per locus, averaged across species. Percentage of error was calculated from the mismatches and alignment lengths in the blast results. (a) Comparisons between spades and our multi‐kmer abyss approach, showing the density of the average percentage of error in assembly strategy (same data set, different assembly approach); WGS indicates our whole genome sequence data set with 10× average depth of coverage and TC indicates the target sequence capture data set. (b) Comparisons showing the density of the average percentage of error from comparing the same assembly approach but different coverages. Multi‐kmer indicates the multi‐kmer abyss approach. In each density graph, the term before “vs” indicates the subject and the term after “vs” indicates the query. Numbers under each density graph indicate the average number of loci aligned per species in the blast alignments. Graphs show that the percentage of errors tend to increase when using reduced read coverage, regardless of the used assembly strategy

DISCUSSION

Our study exemplifies how short reads from WGS can be efficiently processed and integrated in multiple sequence alignments. By doing this within a single pipeline, we provide a way forward for the integration of sequencing strategies and the use of low‐coverage genomic data in phylogenomics and molecular ecology. We compared the performance of two de novo contig assembly methods and assembly strategies (single‐kmer vs. multi‐kmer) to recover 406 loci that represent exons of protein‐coding genes from low‐coverage WGS and target sequence capture data. We also developed a new approach to extract loci of interest from multiple single‐kmer contig assemblies and compared its performance with other assembly strategies including single and multi‐kmer de novo contig assembly. We implemented this approach in an expanded secapr pipeline, which was originally designed to process target sequence capture data, but it is now able to jointly process different types of sequencing approaches including low‐coverage WGS.

Multi‐kmer approaches most efficiently recover loci of interest

We showed that multi‐kmer approaches significantly recovered more loci of interest from our set of targeted loci than single‐kmer approaches from both low‐coverage WGS (average depth of coverages ~10×, 5×, and 2×) and target sequence capture data. More loci, however, does not always translate into better phylogenetic inferences. For example, for the target sequence capture data, the multi‐kmer approach with abyss recovered more loci than spades (Table 1), but only the alignments from spades assemblies recovered well‐resolved trees with the expected phylogenetic relationships among species (Li et al., 2019). Since the percentage of errors within target sequence capture alignments was similar to the percentage of errors for the 10× WGS data set (Figure 6), the well‐resolved trees only for spades may be the result of spades assemblies generating less N bases within sequences in the target sequence capture data (Table S7). Alternatively, our target sequence capture may have recovered a smaller number of alignments compared to the WGS at 10× average coverage, due to the unspecific nature of probe sequences used for in vitro capture. The butterfly probe kit aims to target protein‐coding genes of all major butterfly lineages (families); thus, probe sequences were not designed specifically to target our study organisms, species within the butterfly family Hesperiidae. This scenario is widespread in target sequence capture studies where the design of sequence probes tends to be more universal to increase the cost‐efficiency of designing baits for library preparation. spades performs better than multi‐kmer abyss with decreasing amounts of genomic coverage, since this assembler statistically outperformed all the other strategies for the 2× coverage subset in recovering more loci of interest per sample. Nevertheless, all the alignments for the 2× average coverage data set contained a large proportion of missing data in terms of nucleotide calls (more N bases in the alignments), and a more pronounced reduction in the mean of recovered loci per sample after processing of the data (exclusion of sequences with more than 50% N), especially for the abyss strategies (Table S7). In fact, it has been shown that abyss consistently generates more incomplete sequences with reduced genomic coverage (Allen et al., 2017). Since spades simultaneously uses different kmer sizes during the building of the de Bruijn graph, performing multiple graph reduction and correction steps, the final contig assemblies are expected to be more complete compared to single‐kmer approaches (Bankevich et al., 2012). We advocate the use of multi‐kmer approaches, which retrieve significantly more loci of interest from assembled contigs, resulting in more complete alignments from both target sequence capture and low‐coverage WGS data. We highlight that our new implementation of the multi‐kmer abyss strategy was the most efficient approach for WGS with averages of 10× and 5× read depth for both extraction of loci and phylogenetic inference. A similar multi‐kmer approach (post contig assembly) was described by Zhang, Ding, et al. (2019) using the minia3 assembler. However, it is unclear how the performance of this approach compares to multi‐kmer approaches when all kmers are processed during the contig assembly step as in spades. The new expansion of secapr version 2.2.3 including spades (Bankevich et al., 2012) and our newly developed multi‐kmer abyss strategy, therefore, represent a significant way forward for molecular ecology and phylogenomics.

Sequencing approaches and their impact on phylogenetic trees

Average coverages of ~10× from small to medium sized genomes (<1 [Gbp]) have been shown to be optimal for extracting single‐copy orthologs amenable to phylogenomics (Allen et al., 2017), while a coverage of 5× is sufficient for retrieving UCEs (Zhang, Ding, et al., 2019). We showed that multi‐kmer approaches significantly recover more target loci from both 10× and 5× genomes, and that those loci were useful for the recovery of expected and well‐supported phylogenetic relationships compared to single‐kmer assembly approaches. On average, sequencing a 1 [Gbp] genome at 5× is ~65% the cost of sequencing the same genome at 10× read depth and is similar to target sequence capture of hundreds of loci for phylogenomics. However, library preparations for WGS are more straightforward (Allen et al., 2017; Lemmon & Lemmon, 2013) and require less initial DNA material compared to target sequence capture (Zhang, Ding, et al., 2019). Nevertheless, in cases where the expected genome size is large (e.g., 2, 3 [Gbp]), target sequence capture might still be the most cost‐efficient approach to obtain phylogenomic markers. Taken together, this shows that low‐coverage WGS aiming for at least 5× read depths and for genomes as large as 1 [Gbp] is currently the most cost‐efficient approach for phylogenomics.

Potential errors within de novo contig assemblies

We found that with decreasing depths of coverage in WGS data, the percentage of errors generated during de novo contig assembly increases, which might ultimately generate biases in branch length and divergence time estimations (Andermann et al., 2019; Simion et al., 2020). At lower coverages, individual reads containing sequencing errors and those resulting from contamination have a higher impact on the assembled contigs, leading to the observed increase of assembly errors in low coverage samples. We advise caution when using 5× or lower coverages for analyses that require the estimation of within‐species polymorphisms, variant calling, and population genetic studies (Lou et al., 2021; Menelaou & Marchini, 2013), unless a reference genome is provided (e.g., Bizon et al., 2014; Rustagi et al., 2017). However, we demonstrate that 5× is sufficient for phylogenomic inference in terms of retrieving accurate species relationships. Finally, our results add further evidence showing that a minimum of 10× average coverage is suitable for obtaining high numbers of single‐copy target loci shared between samples of varying evolutionary distances (Zhang, Ding, et al., 2019) and to enable accurate phylogenetic inference (Allen et al., 2017; Li et al., 2019). Low‐coverage WGS data might bias the estimation of population‐level parameters, genotyping and phasing of alleles in which cases the use of genomic imputation to estimate missing genotypes more confidently is needed (Lou et al., 2021). Although it is possible to phase alleles with secapr version 2.2.3, the implementation of new models and software dealing with genotyping and phasing of alleles at low WGS coverage (e.g., Lou et al., 2021; Menelaou & Marchini, 2013; Rubinacci et al., 2021; Zan et al., 2019) would represent an important advance to fully integrate different short‐read library preparations and research scopes (e.g., phylogenomics and population genomics using the same WGS data in a single bioinformatic pipeline).

CONCLUSION

Our assessment of assemblers, assembly strategies and WGS sequencing depth of coverage provides a guide for improving the extraction of more loci of interest from WGS and target sequence capture data. With further increases in the cost‐benefit of low‐coverage WGS sequencing, researchers are now able to address questions in molecular ecology and evolutionary biology using more taxa, even in the absence of reference genomes. By using multi‐kmer approaches, either spades or by merging the assembled contigs of interest from different single‐kmer abyss assemblies (multi‐kmer abyss), we were able to generate alignments with more samples and of better quality to infer robust species trees. Also, other available types of data, such as RADseq and UCEs, can be tested within our pipeline so users are able to recycle these data by extracting other specific regions of genomes from them. For years to come, both types of sequencing techniques will likely remain at the center of a myriad of questions in evolutionary biology and molecular ecology. Our freely available bioinformatic platform and guidelines allow researchers to make informed choices on the generations of contigs, be them from low‐coverage WGS data, specifically enriched target sequence capture data, or a combination of both.

AUTHOR CONTRIBUTIONS

Pedro G. Ribeiro, María Fernanda Torres Jiménez, Tobias Andermann, Christine D. Bacon, and Pável Matos‐Maraví conceived/designed the study, Alexandre Antonelli and Pável Matos‐Maraví secured funding, Tobias Andermann created the bioinformatic pipeline, Pedro G. Ribeiro, María Fernanda Torres Jiménez and Pável Matos‐Maraví conducted analyses, Pedro G. Ribeiro and María Fernanda Torres Jiménez interpreted the results, Pedro G. Ribeiro and María Fernanda Torres Jiménez created figures, Pedro G. Ribeiro wrote the first draft, and all authors contributed to the final version of the article. Table S1‐S7 Click here for additional data file. Table S8 Click here for additional data file.

45 in total

1. ABySS: a parallel assembler for short read sequence data.

Authors: Jared T Simpson; Kim Wong; Shaun D Jackman; Jacqueline E Schein; Steven J M Jones; Inanç Birol
Journal: Genome Res Date: 2009-02-27 Impact factor: 9.043

2. De novo finished 2.8 Mbp Staphylococcus aureus genome assembly from 100 bp short and long range paired-end reads.

Authors: David Hernandez; Ryan Tewhey; Jean-Baptiste Veyrieras; Laurent Farinelli; Magne Østerås; Patrice François; Jacques Schrenzel
Journal: Bioinformatics Date: 2013-10-15 Impact factor: 6.937

3. Phylogenomics using low-depth whole genome sequencing: A case study with the olive tribe.

Authors: Jill K Olofsson; Isabel Cantera; Céline Van de Paer; Cynthia Hong-Wa; Loubab Zedane; Luke T Dunning; Adriana Alberti; Pascal-Antoine Christin; Guillaume Besnard
Journal: Mol Ecol Resour Date: 2019-05-13 Impact factor: 7.090

4. Phylogenomics from Whole Genome Sequences Using aTRAM.

Authors: Julie M Allen; Bret Boyd; Nam-Phuong Nguyen; Pranjal Vachaspati; Tandy Warnow; Daisie I Huang; Patrick G S Grady; Kayce C Bell; Quentin C B Cronk; Lawrence Mugisha; Barry R Pittendrigh; M Soledad Leonardi; David L Reed; Kevin P Johnson
Journal: Syst Biol Date: 2017-09-01 Impact factor: 15.683

5. Targeted investigation of the Neandertal genome by array-based sequence capture.

Authors: Hernán A Burbano; Emily Hodges; Richard E Green; Adrian W Briggs; Johannes Krause; Matthias Meyer; Jeffrey M Good; Tomislav Maricic; Philip L F Johnson; Zhenyu Xuan; Michelle Rooks; Arindam Bhattacharjee; Leonardo Brizuela; Frank W Albert; Marco de la Rasilla; Javier Fortea; Antonio Rosas; Michael Lachmann; Gregory J Hannon; Svante Pääbo
Journal: Science Date: 2010-05-07 Impact factor: 47.728

6. A Comprehensive and Dated Phylogenomic Analysis of Butterflies.

Authors: Marianne Espeland; Jesse Breinholt; Keith R Willmott; Andrew D Warren; Roger Vila; Emmanuel F A Toussaint; Sarah C Maunsell; Kwaku Aduse-Poku; Gerard Talavera; Rod Eastwood; Marta A Jarzyna; Robert Guralnick; David J Lohman; Naomi E Pierce; Akito Y Kawahara
Journal: Curr Biol Date: 2018-02-15 Impact factor: 10.834

7. A beginner's guide to low-coverage whole genome sequencing for population genomics.

Authors: Runyang Nicolas Lou; Arne Jacobs; Aryn P Wilder; Nina Overgaard Therkildsen
Journal: Mol Ecol Date: 2021-08-31 Impact factor: 6.185

8. Genotyping by low-coverage whole-genome sequencing in intercross pedigrees from outbred founders: a cost-efficient approach.

Authors: Yanjun Zan; Thibaut Payen; Mette Lillie; Christa F Honaker; Paul B Siegel; Örjan Carlborg
Journal: Genet Sel Evol Date: 2019-08-14 Impact factor: 4.297

9. Trimmomatic: a flexible trimmer for Illumina sequence data.

Authors: Anthony M Bolger; Marc Lohse; Bjoern Usadel
Journal: Bioinformatics Date: 2014-04-01 Impact factor: 6.937

10. Fast Coalescent-Based Computation of Local Branch Support from Quartet Frequencies.

Authors: Erfan Sayyari; Siavash Mirarab
Journal: Mol Biol Evol Date: 2016-04-15 Impact factor: 16.240

1 in total

1. A bioinformatic platform to integrate target capture and whole genome sequences of various read depths for phylogenomics.

Authors: Pedro G Ribeiro; María Fernanda Torres Jiménez; Tobias Andermann; Alexandre Antonelli; Christine D Bacon; Pável Matos-Maraví
Journal: Mol Ecol Date: 2021-10-31 Impact factor: 6.622

1 in total