Brian Tilston Smith1, William M Mauck1,2, Brett W Benz3, Michael J Andersen4. 1. Department of Ornithology, American Museum of Natural History, New York, New York. 2. New York Genome Center, New York, New York. 3. Museum of Zoology and Department of Ecology and Evolutionary Biology, University of Michigan. 4. Department of Biology and Museum of Southwestern Biology, University of New Mexico.
Abstract
The resolution of the Tree of Life has accelerated with advances in DNA sequencing technology. To achieve dense taxon sampling, it is often necessary to obtain DNA from historical museum specimens to supplement modern genetic samples. However, DNA from historical material is generally degraded, which presents various challenges. In this study, we evaluated how the coverage at variant sites and missing data among historical and modern samples impacts phylogenomic inference. We explored these patterns in the brush-tongued parrots (lories and lorikeets) of Australasia by sampling ultraconserved elements in 105 taxa. Trees estimated with low coverage characters had several clades where relationships appeared to be influenced by whether the sample came from historical or modern specimens, which were not observed when more stringent filtering was applied. To assess if the topologies were affected by missing data, we performed an outlier analysis of sites and loci, and a data reduction approach where we excluded sites based on data completeness. Depending on the outlier test, 0.15% of total sites or 38% of loci were driving the topological differences among trees, and at these sites, historical samples had 10.9× more missing data than modern ones. In contrast, 70% data completeness was necessary to avoid spurious relationships. Predictive modeling found that outlier analysis scores were correlated with parsimony informative sites in the clades whose topologies changed the most by filtering. After accounting for biased loci and understanding the stability of relationships, we inferred a more robust phylogenetic hypothesis for lories and lorikeets.
The resolution of the Tree of Life has accelerated with advances in DNA sequencing technology. To achieve dense taxon sampling, it is often necessary to obtain DNA from historical museum specimens to supplement modern genetic samples. However, DNA from historical material is generally degraded, which presents various challenges. In this study, we evaluated how the coverage at variant sites and missing data among historical and modern samples impacts phylogenomic inference. We explored these patterns in the brush-tongued parrots (lories and lorikeets) of Australasia by sampling ultraconserved elements in 105 taxa. Trees estimated with low coverage characters had several clades where relationships appeared to be influenced by whether the sample came from historical or modern specimens, which were not observed when more stringent filtering was applied. To assess if the topologies were affected by missing data, we performed an outlier analysis of sites and loci, and a data reduction approach where we excluded sites based on data completeness. Depending on the outlier test, 0.15% of total sites or 38% of loci were driving the topological differences among trees, and at these sites, historical samples had 10.9× more missing data than modern ones. In contrast, 70% data completeness was necessary to avoid spurious relationships. Predictive modeling found that outlier analysis scores were correlated with parsimony informative sites in the clades whose topologies changed the most by filtering. After accounting for biased loci and understanding the stability of relationships, we inferred a more robust phylogenetic hypothesis for lories and lorikeets.
Historical and ancient DNA from museum specimens is widely employed for incorporating rare and extinct taxa into phylogenetic studies (e.g., Thomas et al. 1989; Mitchell et al. 2014; Fortes et al. 2016). The inclusion of these samples has helped discover and delimit species (Helgen et al. 2013; Paijmans et al. 2017), resolve phylogenetic relationships (Mitchell et al. 2016), and clarify biogeographic history (Kehlmaier et al. 2017; Yao et al. 2017). DNA sequences obtained from dry and alcohol-preserved museum specimens have been collected using a range of techniques, including Sanger sequencing (Sorenson et al. 1999), restriction site-associated DNA sequencing (Tin et al. 2015; Ewart et al. 2019), and sequence capture of reduced (McCormack et al. 2016; Linck et al. 2017; Ruane and Austin 2017) or whole genomes (Enk et al. 2014; Hung et al. 2014). However, DNA sequences collected from these museum specimens are subject to errors associated with contamination (Malmström et al. 2005), DNA degradation (Briggs et al. 2007; Sawyer et al. 2012), and low coverage in read depth (Tin et al. 2015), which all present challenges in distinguishing evolutionary signal from noise.Sequence capture of ultraconserved elements (UCEs) is a popular approach for collecting orthologous genomic markers in phylogenomic studies (Faircloth et al. 2015; Chakrabarty et al. 2017; Esselstyn et al. 2017) and is increasingly used for historical specimens (Hosner et al. 2016; McCormack et al. 2016; Ruane and Austin 2017). A common finding is that the loci recovered are typically shorter in older samples (Hosner et al. 2016; McCormack et al. 2016; Ruane and Austin 2017). Shorter loci are potentially problematic because the sequence capture approach targets the invariable UCE core, limiting the portion of the flanking region that contains polymorphic sites. Another factor that may cause differences among historical and modern samples is that phylogenomic pipelines that do not involve variant calling typically employ read-specific filtering, where the average read depth across all positions along a locus is used to determine whether the locus is excluded (e.g., Faircloth 2016). Under this type of scenario, low coverage characters may pass typical filters, exacerbating differences among historical and modern samples. Although some studies only use DNA sequences collected from historical or ancient samples (e.g., Hung et al. 2014), most phylogenetic approaches involving noncontemporaneous samples combine with those from modern samples. For those that do use DNA from both sample types, additional challenges in downstream analyses may arise due to an asymmetry in the phylogenetic signal caused by nonrandom missing data (e.g., Hosner et al. 2016).The impact of missing data on phylogenetic inference remains contentious (Lemmon et al. 2009; Wiens and Morrill 2011; Simmons 2012, 2014; Hovmöller et al. 2013; Jiang et al. 2014; Streicher et al. 2016). Missing data have been shown to bias phylogenetic relationships, particularly when the missing characters are nonrandomly distributed (e.g., Lemmon et al. 2009; Simmons 2012, 2014) However, findings also suggest that even when some taxa have a large proportion of characters with no data, phylogenetic signal is retained if enough characters are present (Philippe et al. 2004; Roure et al. 2013; Shavit Grievink et al. 2013; Molloy and Warnow 2018). Bias may manifest as inflated support values and erroneous branch lengths, or as inconsistencies between optimality criteria or phylogenomic approaches (i.e., concatenation vs. the multispecies coalescent). The increased availability of phylogenomic data has provided a more nuanced look at missing data’s effect on phylogenetic inference (Philippe et al. 2004; Huang and Knowles 2016; Streicher et al. 2016; Xi et al. 2016). One means of dealing with missing data in phylogenomic data sets is to filter loci based on the proportion of either missing characters or missing species in the data set (Hosner et al. 2016). However, this approach may not directly target problematic regions of an alignment, and phylogenetically informative signal may be discarded unnecessarily. A more direct approach would entail identifying which specific sites or genes are influenced by missing data.Analyses of outlier sites or loci in phylogenomic data indicate that a few genes can have a large impact on a topology (Arcila et al. 2017; Brown and Thomson 2017; Shen et al. 2017; Walker et al. 2018). These conflicting genealogies can be due to biological processes (e.g., incomplete lineage sorting, introgression, and horizontal gene transfer) or to spurious phylogenetic signal caused by poor alignments, paralogy, and/or sequencing error. Putative outlier loci have been identified using topology tests (Arcila et al. 2017; Esselstyn et al. 2017), Bayes factors (Brown and Thomson 2017), and site/locus-wise log-likelihood differences among alternative topologies (Shen et al. 2017; Walker et al. 2018). Support for particular phylogenetic hypotheses may be driven by a small subset of loci (Brown and Thomson 2017; Walker et al. 2018), and the targeted removal of outlier loci can reconcile differences among topologies. Outlier analyses provide a framework for assessing how differences between historical and modern DNA sequences impact phylogenetic inference. In this study, we performed site and locus likelihood outlier analyses to evaluate whether sequence coverage and missing data impact phylogenetic relationships in our focal group, the Loriini.Lories and lorikeets, commonly known as the brush-tongued parrots, are a speciose clade (Tribe: Loriini) of colorful birds that are widely distributed across the Australasian region (Forshaw and Cooper 1989). The characterization of distributional ranges, phenotypic variation, and systematics of the clade was the product of expansive biological inventories that peaked during the early 1900s (Mivart 1896; Forshaw and Cooper 1989). The geographical extent of this work encompasses thousands of large and small islands spread across many countries in the Australasian region. Given these immense logistical constraints, modern collecting expeditions that aim to produce voucher specimens with genetic samples for continued systematic work (e.g., Kratter et al. 2006; Andersen et al. 2017) have been much more focused in scope relative to the pioneering work of the 20th century that produced extensive series of specimens across species’ entire ranges (e.g., Mayr 1933, 1938, 1942; Amadon 1943). Thus, the lack of modern genetic samples means that phylogenetic relationships in many groups, like the Loriini, remain unresolved. To get around this constraint, phylogenomic studies have sourced DNA from historical specimens to fill modern sampling gaps (Moyle et al. 2016; Andersen et al. 2018).Prior phylogenetic work on the Loriini showed evidence for at least three paraphyletic genera (Trichoglossus, Psitteuteles, and Charmosyna) and highlighted the need for increased taxon and genomic sampling to fully resolve relationships among taxa (Schweizer et al. 2015). To this end, we collected UCEs from 105 described taxa in the Loriini, including species and subspecies. Our sampling design used DNA isolated from fresh tissues (hereafter modern) and historical specimens, including some over 100 years old (hereafter historical; supplementary fig. S1 and table S1, Supplementary Material online). We anticipated challenges with processing, recovering, and analyzing UCEs from historical specimens and expected that biases in the DNA sequence data might yield misleading relationships. To evaluate biased phylogenetic signal and explore options for maximizing the amount of data recovered, we produced alignments using different site coverage thresholds that produced alignments with varying levels of missing data. We then estimated phylogenies with particular sites and loci removed, and with varying percentages of data completeness to evaluate topological stability. To target specific sites or loci that may be influencing relationships, we used site-wise and locus-wise likelihoods to identify which portions of the alignment drive topological differences among trees estimated with and without low coverage characters, and with and without missing data. From these analyses, we produced a series of trees with different subsets of putative outliers removed and quantified the change in topology using a tree distance metric and support values and summarized the information content of each locus. Next, we assessed whether the likelihood scores from the outlier analyses could be predicted by locus-specific alignment statistics. Finally, we took a more general approach evaluating how data completeness impacted the estimated topology by producing a series of alignments with varying levels of missing data. The alternative data reduction approaches we employed allowed us to compare the utility of precise versus general filtering of missing data on phylogenetic inference. After rigorously assessing potential biases in the data, we propose a phylogenetic hypothesis for lories and lorikeets.
Materials and Methods
We sampled all 12 genera, 58/59 species, and 102/112 named taxa (species and subspecies; Clements et al. 2019) within the Loriini, and three additional subspecies (Glossopsitta concinna concinna, Glossopsitta concinna didimus, and Trichoglossus haematodus caeruleiceps) recognized by Gill and Donsker (2019) and Forshaw (2010), respectively. In total, we sampled 105 taxa within Loriini. Charmosyna diadema is the only species not included in our study, which is extinct and known from a single female specimen (Forshaw and Cooper 1989). Two additional taxa (Eos histrio talautensis and Eos squamata riciniata) produced few loci with high missing data in those loci and were excluded from final analyses. We did not obtain samples from the following taxa: Charmosyna rubronotata kordoana, Psitteuteles iris rubripileum, Neopsittacus pullicauda socialis, Eos histrio challengeri, Trichoglossus haematodus brooki, and Trichoglossus moluccanus septentrionalis. We treated Trichoglossus haematodus rosenbergii as Trichoglossus rosenbergii and Trichoglossus haematodus intermedius as Trichoglossus haematodus haematodus following Gill and Donsker (2019). We also followed Gill and Donsker (2019) and used Parvipsitta for P. pusilla and P. porphyrocephala. When possible, we sampled more than one individual per species to verify the phylogenetic position of a taxon. For outgroups, we used Melopsittacus undulatus, Psittaculirostris edwardsii, and Cyclopsitta diophthalma, which together with the Loriini form the clade Loriinae (Joseph et al. 2012; Provost et al. 2018). Sampling map and specimen details and locality information are available in supplementary figure 1 and table S1, Supplementary Material online.We extracted total genomic DNA from muscle tissue using QIAamp DNeasy extraction kits (Qiagen, Valencia, CA). For historical samples, we used a modified DNeasy extraction protocol that used QIAquick PCR filter columns that size selected for smaller fragments of DNA. The modified protocol also included washing the sample with H2O and EtOH prior to extracting as well as extra time for digestion. DNA extraction from historical samples was done in a dedicated lab for working with degraded samples to reduce contamination risk. We quantified DNA extracts using a Qubit 2.0 Fluorometer (Thermo Fisher Scientific). Library preparation of UCEs and enrichment, and Illumina sequencing were performed by RAPiD Genomics (Gainesville, FL). The Tetrapod UCE 5K probe set was used to enrich 5,060 UCE loci (Faircloth et al. 2012). Variant bases increase with distance from the UCE core and these variant sites are phylogenetically informative (Faircloth et al. 2012). Even at shallow phylogenetic scales (i.e., within species), the majority of loci have been shown to be polymorphic with an average of two to three variant sites per locus (Smith et al. 2014). The wet-lab component of this study was carried out over 3 years and the number of individuals multiplexed per lane ranged from 48 to 384. Sequencing was done on an Illumina HiSeq 2500 PE 125 or HiSeq 3000 PE 150. Fastq files are available on the Sequence Read Archive (SRA Bioproject ID: 498485).We used a modified data-processing pipeline that incorporated PHYLUCE (Faircloth 2016), a software package developed for analyzing UCE data, and seqcap_pop (Smith et al. 2014; Harvey et al. 2016). We used FastQ Screen to map raw reads to bacterial genomes and filter contaminant DNA (Wingett and Andrews 2018). Low-quality bases and adapter sequences were trimmed from multiplexed fastq files using Illumiprocessor v1 (Faircloth 2013; Bolger et al. 2014). Next, reads were assembled into contigs with Trinity v2.0.6 (Grabherr et al. 2011) and contigs were mapped to UCE probes. We chose the sample that produced the largest number of UCEs as the reference for subsequent mapping for all individuals. We generated an index of the reference sequence and independently mapped reads from each sample to the same reference sequence using BWA v0.7.13-r1126 (Li and Durbin 2009). SAM files produced from the BWA mapping were converted to BAM files, which were sorted with SAMtools (Li et al. 2009), and cleaned with Picard v1.106 (http://broadinstitute.github.io/picard). Then, we used the mpileup function in SAMtools to call variant sites and produce a VCF file (-C 30; -Q 20), vcfutils to convert from VCF to fastq (excluding sites with quality scores <20), and seqtk (github.com/lh3/seqtk) to convert fastq to fasta. From this last step, we produced two sets of DNA sequences that were analyzed independently. One data set retained all variant sites irrespective of coverage (hereafter Low Coverage data set), and a second set that excluded variant sites with <6× coverage using bcftools (hereafter Filtered data set). These collective steps produced single fasta files containing all UCE loci for each individual sample. The following steps were independently performed for both data sets (Low Coverage and Filtered). Loci with >30% missing characters were removed from each individual before alignment. In PHYLUCE, we concatenated fasta files of each sample, aligned sequences in MAFFT (Katoh and Standley 2013), and retained loci where 75% of the samples were present in a locus for the final concatenated alignment. Both the conserved UCE core and variable flanking region of each locus were retained for all analyses.
Low Coverage and Filtered Outlier Analyses
We estimated trees with 171 tips, which included multiple individuals per taxon. This data set was used to check if samples from the same taxon grouped together as a means of identifying problematic samples. To focus on phylogenetic relationships among named taxa, all subsequent analyses are based on a reduced data set that contained one sample per taxon with 105 ingroup taxa and three outgroup samples. We estimated phylogenomic trees for both the Low Coverage and Filtered concatenated alignments containing only unique taxa in IQ-TREE (Nguyen et al. 2015) using ModelFinder (Kalyaanamoorthy et al. 2017) to select the best-fit substitution model for each locus partition (Chernomor et al. 2016). To assess support, we estimated 1,000 rapid bootstraps (BSs). The trees from the two different alignments did not produce the same phylogenetic relationships, so we performed an outlier site/locus analysis to identify which sites were causing the topologies to be different. We performed a two-topology, site-specific log-likelihood test that estimated the site-likelihoods based on the locus partition of the Low Coverage alignment using topologies estimated from the Low Coverage (T1) and Filtered alignments (T2) in RAxML (Stamatakis 2014). We then estimated the change in site-wise log-likelihoods (hereafter Δ s-lk = T1 site log-likelihood − T2 site log-likelihood). We binned putative outlier sites into bins representing Δ s-lk: >20, >10, >2, <−2, <−10, and <−20. We produced new concatenated alignments that corresponded to each Δ s-lk threshold bin, for which outlier sites were converted to ambiguous characters (N) in all individuals. This approach allowed us to estimate trees with different levels of outlier sites removed from the alignment. Next, we converted the DNA sequence of each locus alignment into only parsimony informative sites using FASconCAT-G (Kück and Longo 2014) and summarized the amount of parsimony informative sites and missing data at these sites for modern and historical samples. To visualize how different the trees were, we measured the distance among 100 BS trees using Robinson–Foulds distances (Robinson and Foulds 1981) with the multiRF function in phytools (Revell 2012) and used multidimensional scaling to plot the distances in two-dimensional space. All trees were processed and visualized using phytools and ape (Paradis et al. 2004) in R (R Core Team 2019). We classified samples into two categories: 1) samples that were collected within the last 30 years and came from frozen or ethanol preserved tissue (hereafter Modern) and 2) samples that came from dry museum skins with ages ranging from the late 1800s through the 1960s (hereafter Historical). To visualize the distribution of each sample type on the tree, we colored tips blue (historical) or red (modern).
Subclade Outlier Analyses
We performed a complementary outlier analysis assessing subclades, but in this set of analyses, we compared trees estimated from alignments with and without missing data. By performing this analysis on subclades, we were able to examine how missing data impacted different portions of the tree. This approach could not be applied to alignments containing all clades because at least one individual had missing data at every site in the alignment. The six clades including a single outgroup sample were based on preliminary phylogenetic analysis and were 1) Eos, Trichoglossus, Glossopsitta concinna, and Psitteuteles iris (n = 22), 2) Parvipsitta and Psitteuteles (n = 7), 3) Neopsittacus (n = 7), 4) Chalcopsitta and Pseudeos (n = 14), 5) Lorius (n = 18), and 6) Charmosyna, Vini, and Phigys (n = 32). To produce a concatenated alignment for a clade, we followed the same steps listed above. To retain more characters in the larger clades we did not include redundant taxa or samples. We further reduced the sample size from 58 to 22 in the diverse clade containing Eos, Trichoglossus, Ps. iris, and Glossopsitta because the amount of missing data in this clade was high. We estimated subclade trees in IQ-TREE following the same procedures described above to produce alternative topologies (T1 and T2) estimated from alignments with (T1) and without missing data (T2). Each tree was rooted with a single outgroup (Oreopsittacus arfaki for Charmosyna, Vini, and Phigys; Psitteuteles goldiei for all other clades) using phyx (Brown et al. 2017). We performed the same site-specific log-likelihood procedure described above for the two alternative topologies (T1 and T2), except that site-likelihoods were converted to locus-wise log-likelihoods to assess the impact of missing data across an entire locus by summing the site-likelihoods for each locus using the scripts in Walker et al. (2018). We then estimated the Δ locus-wise log-likelihood (hereafter Δ l-lk).To explore how these putatively biased loci impacted phylogenetic inference, we grouped loci into bins representing Δ l-lk scores of >2, >10, >20, <−2, and <−10 (there were no genes where Δ l-lk < −20 in the Filtered data set). We followed the same procedure for producing concatenated alignments corresponding to each likelihood threshold bin, but for this step, we excluded the outlier loci from the global concatenated alignment. We then estimated phylogenies from each alignment to assess how sensitive phylogenetic relationships were to excluding loci in each of these approaches. If missing data bias phylogenetic relationships then the exclusion of loci with positive Δ l-lk should alter relationships driven by missing data. The removal of loci with negative Δ l-lk will enhance relationships driven by biases in missing data.
Trees with Varying Levels of Complete Data
To determine which percentage of data completeness was necessary to produce a topology similar to the Filtered tree, we generated a series of alignments from the Low Coverage data set with increasing levels of complete data. Using trimal (Capella-Gutiérrez et al. 2009), we converted all missing characters to gaps (-) to conform to the software requirements and trimmed alignments by setting the percentage of individuals required to have an unambiguous site to retain the position in the alignment. In increments of 10%, we removed all sites where 0–100% of the sites had no missing data. This approach produced a range of alignments keeping all sites (0%) through no missing data (100%). We estimated phylogenies for each of the 11 data sets in IQ-TREE using the same approach previously described, except that we estimated the best-fit model across the entire alignment because the locus partitions were not retained after filtering.
Manipulating Modern Samples to Mimic Historical Samples
To provide a complementary approach for assessing whether missing data versus data quality were biasing phylogeny, we converted a percentage of characters in five modern samples (Trichoglossus rubritorquis KU22839, Trichoglossus chlorolepidotus DOT2422, Trichoglossus ornatus DOT7930, Phigys solitarius KU22543, and Charmosyna placentis pallidior DOT20055) to missing data. If the position of these samples was sensitive to the addition of missing data and the samples clustered with historical samples then missing data are the more likely culprit of the bias. We explored different percentages of missing data (50–99.9%) and converting random characters versus only parsimony informative sites. We found that the position of these samples only changed when we converted 99.9% of parsimony informative sites to missing data, and we do not present results from the other thresholds. The reason why such a high percentage was necessary is because the positions of these taxa were supported by a small number of parsimony informative sites, and the only way to influence the sites driving their relationships was to use a high threshold. We converted sites independently ten times and estimated phylogenies in IQ-TREE from the alignments using the same approach previously described.
Summarizing Phylogenetic Signal in Modern and Historical Samples
To explicitly compare the information content in DNA sequence from historical and modern samples, we calculated alignment statistics for each locus and for alignments of only parsimony informative sites partitioned into sample types using AMAS (Borowiec 2016). To determine if older samples had more missing data, we also regressed the amount of missing data in each sample versus the age of the sample. Because we sequenced samples over multiple years, batch effects or biases attributable to differences among sequencing runs could also bias our results (Leigh et al. 2018). To provide a qualitative assessment of batch effects, we provide plots of trees where tips have been colored according to one of three plates that they were sequenced on. If there were substantial batch effects in the data then phylogenetic relationships could be, in part, due to whether samples were sequenced together.We built a neural network in the R package caret v. 6.0.79 (Kuhn 2008) to test whether the Δ l-lk of each locus partition could be predicted by the alignment statistics. The alignment statistics (alignment length, the number of undetermined characters, the number of parsimony informative sites, the number of variable sites, and GC content) were specified as the input neurons, and the output neuron was the Δ log-likelihood. The input data were scaled to the minimum and maximum for each statistic, and the percentage of training/test data was set to 75%/25%, respectively. We produced 100 training/test data sets, independently ran each analysis, and reported mean R2, root-mean-square-error, and variable importance. We performed this analysis on the Low Coverage alignment that included all taxa and independently on the six subclades using the data from both filtering schemes.
Results
Data Characteristics
We sequenced 176 unique samples, including 16 that were resequenced to improve the amount of data recovered. We dropped five individuals that had aberrant relationships and long branches in the tree, patterns that were presumably driven by limited data. The final data set comprised 171 individuals (168 in the ingroup; three outgroups) where 54% and 46% were from historical and modern samples, respectively. Of the 58 species sampled, 27 had intraspecific sampling that included historical and modern samples. Historical samples on average had more reads (mean = 5.5 million; SD = 4.6 million) than modern samples (mean = 3.5 million; SD = 2.3 million), but a higher percentage of the reads in modern samples mapped to the reference (modern: mean = 87.1%; SD = 13.3%; historical: mean = 52.0%; SD = 21.7%). In modern samples, a greater number of positions were masked for having coverage <6× (modern: mean = 443,180; SD = 202,764; historical: mean = 359,958; SD = 152,304). The mean per-site coverage across individuals was similar between the two sample types (modern: mean = 67.9; SD = 25.2; historical: mean = 72.6; SD = 28.2). Additional read and locus statistics are available in supplementary tables S2 and S3, Supplementary Material online. We produced a Low Coverage data set that included all variant sites irrespective of coverage, and Filtered data set that excluded variant sites with <6x coverage. In the Low Coverage and Filtered data sets, loci had a mean length of 498 bp (range: 140–1,708 bp) and 482 bp (range: 105–1,413 bp), respectively. The mean and range number of taxa per gene was 164 for the Low Coverage (128–171) and 152 for the Filtered (5–171) data sets. After retaining loci where 75% of the individuals were present in any one locus, the Low Coverage data set had 4,208 loci, 2,105,994 bp, and 47,338 parsimony informative sites, whereas the Filtered concatenated alignment had 3,765 loci, 1,917,997 bp, and 39,404 parsimony informative sites. Additional supplementary data are available at https://doi.org/10.5061/dryad.n5tb2rbsp.Overall, the Low Coverage data set had more parsimony informative sites than the Filtered data set (fig. 1). A comparison of sample types shows that the range in the number of parsimony informative sites among loci was lower in the modern samples in contrast to the historical samples (fig. 1). In the Filtered data set (fig. 1), there was greater variability in the number of samples per locus in historical samples than the Low Coverage data set (fig. 1). For each alignment type, the modern samples contained a greater number of parsimony informative sites and less missing data than the historical samples (1.7× and 2.1× more parsimony informative sites in the Low Coverage and Filtered data sets, respectively; fig. 1). In the Filtered data set, the number of parsimony informative sites dropped, and the range of the number of individuals in each locus alignment increased. Plotting nonparsimony informative sites and missing data at those positions showed a similar pattern where there was more missing data in historical samples (supplementary fig. S2, Supplementary Material online). The percentage of missing data in samples in the Filtered data set decreased with specimen age (adjusted R2 = 0.31; n = 144; P value < 0.0001; supplementary fig. S2, Supplementary Material online).
. 1
Modern samples have more parsimony informative sites (PIS), less missing data at PIS, and less variation in number of samples among loci. Shown are histograms of the number of samples per locus in the Low Coverage (A) and Filtered (B) alignments. (C, D) Boxplots showing the number of parsimony informative sites (C) and number of missing characters at parsimony informative sites (D) in the ingroup samples. The data are partitioned into the modern versus historical samples, and Low Coverage versus Filtered alignments. In all plots, modern samples are shown in red and historical samples in blue.
Modern samples have more parsimony informative sites (PIS), less missing data at PIS, and less variation in number of samples among loci. Shown are histograms of the number of samples per locus in the Low Coverage (A) and Filtered (B) alignments. (C, D) Boxplots showing the number of parsimony informative sites (C) and number of missing characters at parsimony informative sites (D) in the ingroup samples. The data are partitioned into the modern versus historical samples, and Low Coverage versus Filtered alignments. In all plots, modern samples are shown in red and historical samples in blue.
Resolving Phylogenomic Relationships among Lorikeets
The backbone phylogeny we inferred for the Loriini generally had high support and the placement of genera was stable (supplementary figs. S3 and S4, Supplementary Material online). Summarizing higher-level relationships, Oreopsittacus was sister to all other ingroup taxa, then Charmosyna was sister to the clade containing Neopsittacus, Lorius, Pseudeos, Chalcopsitta, Psitteuteles, Glossopsitta, Eos, Trichoglossus, and Parvipsitta. The placements of Neopsittacus, Lorius, Pseudeos, and Chalcopsitta were well supported in the tree, and each of these genera was monophyletic. Trichoglossus, Charmosyna, and Psitteuteles were not monophyletic. Psitteuteles was found in three separate places in the tree: Psitteuteles versicolor was sister to the recently erected genus Parvipsitta; Ps. iris was nested within a clade of Trichoglossus taxa that are from Indonesia; and Psitteuteles goldei was sister to the clade containing Glossopsitta, Eos, Trichoglossus, and Ps. iris. Vini and Phigys are strongly supported as nested within Charmosyna. Relationships within Charmosyna (including Vini and Phigys) and Chalcopsitta were generally stable across filtering schemes, as were relationships of the less diverse clades (Oreopsittacus, Neopsittacus, and Parvipsitta). Within the remaining clades, there were several notable differences in topological relationships among the Low Coverage and Filtered trees.The Filtered tree has four clades containing Trichoglossus, Eos, Ps. iris, and G. concinna with varying levels of support (fig. 2). Glossopsitta concinna was sister to a clade containing a monophyletic Eos, Trichoglossus, and Ps. iris. Within this tree, Eos was monophyletic and sister (BS = 100%) to a clade containing Trichoglossus taxa that occur in Indonesia and the Philippines (T. ornatus, Trichoglossus flavoviridis, and Trichoglossus johnstoniae) and Ps. iris. The Eos, Trichoglossus, and Ps. iris clade was sister (BS = 87%) to a clade containing the remaining Trichoglossus, which was supported by a BS value of 92%. This Trichoglossus clade had several short internodes and poorly supported relationships, particularly among Trichoglossus haematodus subspecies, which primarily came from historical samples. Trichoglossus euteles, Trichoglossus forsteni, Trichoglossus capistratus, Trichoglossus weberi, and Trichoglossus rubritorquisare nested within T. haematodus. Trichoglossus forsteni stresemanni was more closely related to Trichoglossus capistratus than to other T. forsteni taxa. In contrast, the Low Coverage tree has two well-supported (BS ≥ 95%) clades composed of Trichoglossus, Eos, Ps. iris, and G. concinna (fig. 2). One clade consists of entirely historical samples (N = 23), whereas the other was primarily modern samples (13/16). Within each of these clades, tips have similar relationships among taxa as seen in the Filtered tree. Trichoglossus that occur in Indonesia or the Philippines and Ps. iris are sister to Eos and the remaining Trichoglossus form a clade with the exception of one historical sample (T. haematodus haematodus). Support values are higher in the clade composed of mostly modern samples. In both trees (Low Coverage and Filtered), Charmosyna was composed of four clades, Charmosyna wilhelminae was sister to all other taxa in the clade, Charmosyna rubronotata and Charmosyna placentis are sister and form a clade, Charmosyna multistriata was sister to Charmosyna josefinae and Charmosyna papou, and the remaining Charmosyna taxa (Charmosyna margarethae, Charmosyna rubrigularis, Charmosyna meeki, Charmosyna palmarum, and Charmosyna amabilis) and Ph. solitarius and Vini form a clade. The position of Charmosyna pulchella and Charmosyna toxopei will be discussed below.
. 2
Alternative topologies for the subclade that differs the most among filtering schemes. Shown is the subclade containing Trichoglossus/Eos/Psitteuteles iris/Glossopsitta from trees estimated without (A: Filtered Tree) and with low coverage characters (B: Low Coverage Tree). In the Low Coverage tree are clades composed of mostly historical versus modern samples. Bootstrap nodes are colored on a gradient from 100% (black) to <70% (gray). Taxon names are colored according to whether their DNA came from modern tissues (red) or historical specimens (blue).
Alternative topologies for the subclade that differs the most among filtering schemes. Shown is the subclade containing Trichoglossus/Eos/Psitteuteles iris/Glossopsitta from trees estimated without (A: Filtered Tree) and with low coverage characters (B: Low Coverage Tree). In the Low Coverage tree are clades composed of mostly historical versus modern samples. Bootstrap nodes are colored on a gradient from 100% (black) to <70% (gray). Taxon names are colored according to whether their DNA came from modern tissues (red) or historical specimens (blue).Similar clustering patterns based on sample type (historical vs. modern) are observed in Lorius, Vini, and Charmosyna in the Low Coverage trees (supplementary fig. S3, Supplementary Material online). The two subspecies of Lorius lory that come from modern samples are sister taxa. The one modern sample of C. placentis was sister to Charmosyna rubronotata. Charmosyna palmarum (a modern sample) was strongly supported as sister to Phigys and Vini. The three Vini from historical samples group together. The two C. papou subspecies that come from historical samples are sister to a clade containing the remaining C. papou subspecies. None of these relationships are observed in the Filtered tree.A qualitative assessment of batch effects, by coloring each tip in the tree according to sequencing run, did not detect biases whereby samples would have clustered together based on sequencing plate (supplementary fig. S15, Supplementary Material online). In the Low Coverage tree (supplementary fig. S6, Supplementary Material online), biases in clustering were more apparent when tips are colored according to whether the sample came from a modern or historical source, which was not observed in the Filtered tree (supplementary fig. S6B, Supplementary Material online).
Outlier Sites and Loci
The outlier analyses assessing the change in site-likelihoods scores between the Low Coverage versus the Filtered topology identified 3,084 (3,084 sites: Δ s-lk > 2; 473 sites: Δ s-lk > 10 = 473; and 112 sites: Δ s-lk > 20) and 1,925 (1,925 sites: Δ s-lk < −2; 89 sites: Δ s-lk < −10; and three sites Δ s-lk, −20) outlier sites in the alignment (1,980,082 bp) with positive and negative Δ s-lk values, respectively (fig. 3). Higher and more positive Δ s-lk are sites that better support the topology estimated from the Filtered alignment, and lower and more negative values favor the tree estimated from the Low Coverage alignment. The 1,925 outlier sites with negative Δ s-lk were found on 1,381 loci, and the 3,084 outlier sites with positive values were on 1,878 loci. The 3,084 sites with Δ s-lk > 2, which favored the topology of the Filtered tree, exhibited a disproportionate number of missing sites in the historical versus modern samples (fig. 3). We plotted Δ s-lk scores versus the best-fit nucleotide substitution models from IQ-TREE to assess whether there was a relationship between particular models and the extent of the score but we observed that high and low Δ s-lk were found across a wide array of models (supplementary fig. S7, Supplementary Material online).
. 3
Outlier sites have high missing data in historical samples. (A) Outlier site plot showing Δ sites-wise log-likelihoods (Δ s-lk) for topologies estimated with and without low coverage sites. The y axis is the Δ s-lk score and the x axis represents individual sites in the concatenated alignment, where K and M represent thousand and million, respectively. Points are colored according to the magnitude of the Δ site-wise log-likelihood scores according to a gradient reflecting the different likelihood thresholds (>2, >10, >20, <−2, <−10, and <−20). (B) Boxplot of historical (blue) and modern (red) samples showing the amount of missing data in the 3,084 outlier sites (Δ s-lk > 2) identified in plot A.
Outlier sites have high missing data in historical samples. (A) Outlier site plot showing Δ sites-wise log-likelihoods (Δ s-lk) for topologies estimated with and without low coverage sites. The y axis is the Δ s-lk score and the x axis represents individual sites in the concatenated alignment, where K and M represent thousand and million, respectively. Points are colored according to the magnitude of the Δ site-wise log-likelihood scores according to a gradient reflecting the different likelihood thresholds (>2, >10, >20, <−2, <−10, and <−20). (B) Boxplot of historical (blue) and modern (red) samples showing the amount of missing data in the 3,084 outlier sites (Δ s-lk > 2) identified in plot A.Overall, the subclade outlier analyses for the Low Coverage alignment identified more outlier loci (fig. 2 and supplementary fig. S8, Supplementary Material online). There were 61/2 (Low Coverage/Filtered), 396/47, and 1,608/431 loci in the three bins (Δ l-lk of >20, >10, and >2), respectively (fig. 4). There were 121/11 (Low Coverage/Filtered) and 1,309/255 loci in the three bins (Δ l-lk of <−10 and <−2), respectively. The maximum and minimum Δ l-lk were much higher in the Low Coverage (in Trichoglossus/Eos/Psitteuteles/Glossopsitta: Δ l-lk = −33.089 to 394.392) versus the Filtered data set (in Trichoglossus/Eos/Psitteuteles/Glossopsitta: Δ l-lk = −15.742 to 33.106). There were 1,164 loci identified by both the Low Coverage versus Filtered tree and subclade outlier analyses. In the Low Coverage and Filtered analyses, the outlier sites were found on 444 loci uniquely identified, and 740 loci identified by the subclade clade analyses.
. 4
Likelihood plots showing Δ locus-wise log-likelihood (Δ l-lk) for topologies estimated with and without missing data for the Low Coverage data set. The y axis is the Δ l-lk and the x axis represents individual loci across the full alignment. Shown are the results for six subclades assessed within Loriini using the Low Coverage data set: (A) Parvipsitta and Psitteuteles, (B) Chalcopsitta and Pseudeos, (C) Neopsittacus, (D) Charmosyna, Vini, and Phigys, (E) Eos, Trichoglossus, Glossopsitta concinna, and Psitteuteles iris, and (F) Lorius. Points are colored according to the magnitude of the Δ l-lk scores according to a gradient ranging from >20 (blue) through <−10 (orange).
Likelihood plots showing Δ locus-wise log-likelihood (Δ l-lk) for topologies estimated with and without missing data for the Low Coverage data set. The y axis is the Δ l-lk and the x axis represents individual loci across the full alignment. Shown are the results for six subclades assessed within Loriini using the Low Coverage data set: (A) Parvipsitta and Psitteuteles, (B) Chalcopsitta and Pseudeos, (C) Neopsittacus, (D) Charmosyna, Vini, and Phigys, (E) Eos, Trichoglossus, Glossopsitta concinna, and Psitteuteles iris, and (F) Lorius. Points are colored according to the magnitude of the Δ l-lk scores according to a gradient ranging from >20 (blue) through <−10 (orange).We found that by converting parsimony sites to missing data, modern samples could cluster with historical samples. The extent of the shift of the sample and the position of the manipulated sample in the tree varied across the trees (supplementary fig. S9, Supplementary Material online). For example, T. ornatus, which was strongly supported as sister to Trichoglossus flavoviridis in the Low Coverage Tree (supplementary figs. S3 and S9, Supplementary Material online), was nested within the clade containing only historical samples in the Trichoglossus/Eos/Psitteuteles clade in some of the trees with manipulated sequences (supplementary fig. S9E and I, Supplementary Material online). Charmosyna placentis pallidior was strongly supported as sister to Charmosyna rubronotata rubronotata, and when most of its parsimony informative sites are converted to missing data it is nested within in its correct position in C. placentis (supplementary fig. S9D, E, and I, Supplementary Material online). In some trees, the position of taxa (e.g., Trichoglossus chlorolepidotus) did not change at all (supplementary fig. S9B–F, Supplementary Material online), and in others, the same taxon placed well outside their clade (supplementary fig. S9I and J, Supplementary Material online).According to our neural network, alignment statistics predicted ∼4% of the variation of Δ l-lk scores in the Low Coverage versus Filtered trees (mean and SD; R2 =0.04 [0.02]; RMSE = 0.02 [0.006] Δ l-lk scores). In the model, GC content (mean variable importance: GC content = 31.04) and the number of parsimony informative sites (23.07) were more important than the other statistics (alignment length = 18.90; no. of taxa = 18.90; undetermined characters = 11.59; and no. variable sites = 1.30). For the neural networks on the six subclades (supplementary table S4, Supplementary Material online), the models for the Eos/Trichoglossus/Glossopsitta/Psitteuteles predicted ∼9% of the variation in Δ l-lk scores (R2 = 0.088) and the most important variable in the model was parsimony informative sites (Low Coverage: 52.57; Filtered: 60.72). The Eos/Trichoglossus/Glossopsitta/Psitteuteles clade had the most variable topology among filtering schemes. For the remaining subclades, the neural nets performed poorly (supplementary table S4, Supplementary Material online) or had positive R2 values for clades with limited variation in Δ l-lk scores.
Impacts of Filtering Sites and Loci
The removal of outlier loci with positive Δ l-lk scores broke up some of the same-type clusters, particularly at a threshold value of all loci with Δ > 2 (supplementary fig. S3B–D, Supplementary Material online). However, the removal of this many loci (n = 1,608) also reduced the support for other nodes in the tree. Trees estimated with the removal of negative outlier loci retained the apparent sample-type clusters (supplementary fig. S3, Supplementary Material online). Individual taxa whose position varied the most among filtering schemes were G. concinna and Trichoglossus rubiginosus. In the Filtered data set, which did not exhibit the sample-type clusters, the removal of outlier loci (Δ l-lk > 2; supplementary fig. S4D, Supplementary Material online) increased the support for the placement of G. concinna as sister to the clade containing Trichoglossus, Eos, and Ps. iris. In contrast, the removal of outlier loci (Δ l-lk < −2) placed G. concinna within the clade containing Trichoglossus, Eos, and Ps. iris. This placement received moderate BS support for either being sister to the clade containing T. haematodus and allies or the entire clade containing Trichoglossus, Eos, Ps. iris, and G. concinna. Lorius lory has seven subspecies, which formed a well-supported clade in the Filtered tree (supplementary fig. S9, Supplementary Material online), with the exception of Lorius lory viridicrissalis, whose placement was equivocal. The Low Coverage tree has L. lory viridicrissalis within the L. lory clade with low support (supplementary fig. S3, Supplementary Material online). Filtering of outlier loci changed support values but never unequivocally placed L. lory viridicrissalis within L. lory. Charmosyna pulchella and C. toxopei are sister taxa, however, their position within Charmosyna varied across trees. Trees estimated with all loci or loci with negative Δ l-lk scores excluded had these taxa as sister (often with high support) to the clade containing the subclades Charmosyna multistriata; C. josefinae and C. papou; and Charmosyna margarethae, Charmosyna rubrigularis, Charmosyna meeki, C. palmarum, Charmosyna amabilis, Ph. solitarius, and Vini (supplementary fig. S3E and F, Supplementary Material online). Alternatively, trees where positive Δ l-lk scores >2 were excluded had these taxa as sister, albeit with lower support (BS = 63%) to a clade containing Charmosyna, Phigys, and Vini (supplementary fig. S3D, Supplementary Material online).Examining the differences among topologies in multidimensional space showed distances among trees change across filtering schemes (fig. 5). In the trees where outlier sites were excluded (fig. 5), the Robinson–Foulds distances among the Low Coverage and Filtered trees decreased (supplementary fig. S10, Supplementary Material online). At a filtering threshold of Δ s-lk > 2 (3,084 sites), the distance between the two trees was minimal (fig. and supplementary fig. S10D, Supplementary Material online). Filtering sites with negative Δ s-lk values maintained a topology similar to the Low Coverage tree, except at a threshold of Δ s-lk < −2, which was distant from all other trees in multidimensional space. In the subclade analyses, the filtering of loci did not yield similar topologies between the Low Coverage and Filtered trees (fig. 5). However, the Low Coverage tree where loci were excluded at a threshold of Δ s-lk > 2 was the least distant from the Filtered tree (fig. 5). The Low Coverage trees with all loci and Δ s-lk < −2 produced similar topologies and were the most distant from the Filtered trees (fig. 5). Despite some differences in the placement of taxa across the Filtered trees, the Robinson–Foulds distances among trees were comparatively low.
. 5
Multidimensional scaling of Robinson–Foulds distances among 100 bootstrap trees with differing levels of outlier sites or loci excluded. (A) Compares distances among Filtered and Low Coverage trees where outlier sites have been removed at different increments. Outlier sites were excluded in the Low Coverage alignment using Δ site-wise log-likelihood (Δ s-lk) thresholds of >20, >10, >2, <−2, <−10, and <−20. (B) The distances among trees produced from the subclade outlier analyses. Shown is a comparison of the Low Coverage and Filtered trees with topologies estimated with outlier loci excluded using Δ locus-wise log-likelihood (Δ l-lk) thresholds of >2 and <−2.
Multidimensional scaling of Robinson–Foulds distances among 100 bootstrap trees with differing levels of outlier sites or loci excluded. (A) Compares distances among Filtered and Low Coverage trees where outlier sites have been removed at different increments. Outlier sites were excluded in the Low Coverage alignment using Δ site-wise log-likelihood (Δ s-lk) thresholds of >20, >10, >2, <−2, <−10, and <−20. (B) The distances among trees produced from the subclade outlier analyses. Shown is a comparison of the Low Coverage and Filtered trees with topologies estimated with outlier loci excluded using Δ locus-wise log-likelihood (Δ l-lk) thresholds of >2 and <−2.
Impact of Data Completeness
The alignment length ranged from 2,105,994 bp (0% or all sites) through 41,504 bp (100% or no sites with missing data [table 1]), and the trees estimated from these alignments are in supplementary figure S11, Supplementary Material online. Across this same range of filtering, there were 30,380 (0%) through 373 (100%) parsimony informative sites (table 1). At 60% completeness, the sample-type clusters started to break-up (supplementary fig. S11G, Supplementary Material online), and at 70% the tree was similar to the Filtered tree (supplementary figs. S4, Supplementary Material online). By 90% completeness, some relationships differed from the Filtered tree (supplementary figs. S4, Supplementary Material online), and by 100% the tree had lower resolution and support (supplementary fig. S11K, Supplementary Material online). Between the 70% and the 90% data completeness threshold, the alignment was reduced from 1,186,107 to 800,137 bp and 15,404 to 8,632 bp parsimony informative sites.
Table 1
Data Completeness, Alignment Length, and Number of Parsimony Informative Sites at Differing Thresholds of Missing Data Allowance
% of Data Completeness
Alignment Length (bp)
PIS
0
2,105,994
30,382
10
1,901,418
30,162
20
1,786,228
28,755
30
1,744,235
27,936
40
1,661,160
25,796
50
1,498,337
21,765
60
1,354,025
18,811
70
1,186,107
15,404
80
1,024,281
12,341
90
800,137
8,632
100
41,504
372
Note.—Shown are the percentage of individuals at each site with nonambiguous characters across the Low Coverage alignment. As the alignment length and number of parsimony informative sites (PIS) decrease, the percentage of data completeness increases and more characters are excluded.
Data Completeness, Alignment Length, and Number of Parsimony Informative Sites at Differing Thresholds of Missing Data AllowanceNote.—Shown are the percentage of individuals at each site with nonambiguous characters across the Low Coverage alignment. As the alignment length and number of parsimony informative sites (PIS) decrease, the percentage of data completeness increases and more characters are excluded.
Discussion
We showed that systematic bias caused by missing informative sites between DNA sequences from modern versus historical specimens can produce aberrant or unstable phylogenetic relationships. To obtain dense taxon sampling in our focal group, the Loriini, we leveraged samples collected over the last 100+ years and assessed how this sampling scheme impacted phylogenetic relationships by producing alignments with low coverage characters included and excluded. These two trees exhibited some striking differences. In the Low Coverage tree, there were numerous cases where historical or modern samples clustered together that were not observed in the Filtered tree (e.g., fig. 2). We employed a targeted and general approach to assess how missing data were influencing these unexpected relationships. The targeted method using a site outlier analysis showed that a small number of sites were driving the topological differences, and at these sites, historical samples had substantially more missing data (fig. 1). Excluding low coverage characters reduced the discrepancy in missing data between historical and modern samples, and at this level of disparity, the tree did not contain sample-type clusters and was similar in topology to the Filtered tree (fig. 5). A more nuanced look at outlier loci within subclades showed that many loci supported alternative topologies when sites with missing data were excluded, and the position of some branches shifted when these loci were excluded (supplementary fig. S3, Supplementary Material online). Biases in the historical samples could also be observed in modern samples by dropping the majority of their parsimony informative sites, which produced similar sample-type clusters observed in the Low Coverage tree (supplementary fig. S9, Supplementary Material online). The neural network was able to show that the number of parsimony informative sites could predict likelihood scores for the clade most impacted by missing data (Eos/Trichoglossus/Glossopsitta/Psitteuteles; supplementary table S4, Supplementary Material online), but for most clades, which did not have as many outlier loci, the models were poor fits to the data. The more general approach of data reduction using the percentage of data completeness indicated that sites with high data completeness were necessary to avoid spurious relationships, but more stringent conditions of data completeness produced less-resolved trees. After accounting for biased loci and understanding the stability of nodes, we inferred a more robust phylogenetic hypothesis for the Loriini. Taxonomic relationships within the clade can now be revised to reflect natural groupings, but for some groups, additional work is still necessary.
Asymmetric Information Content among Sample Types
We found that alignments with high missing data produced biased phylogenetic relationships. In these trees, subclades consisting of mostly modern samples presumably formed because there was not enough information to place historical samples among the modern samples. Our analyses suggest that an asymmetry in phylogenetic information content among sample types is the primary culprit of the bias because only 3,084 sites (0.15% of total sites) drove the topological differences among trees, and the historical samples had 10.9× more missing data at these sites (fig. 3). By filtering for data completeness, we produced a similar result and inferred the expected phylogeny by including only sites where 70% of the individuals had unambiguous characters. Previous work has shown that ambiguous characters can bias the probability of taxa being sister (Lemmon et al. 2009) and increase the resolution and support of clades (Simmons 2012, 2014). These previous studies did not deal with historical versus modern samples and did not have the magnitude of characters in our data set, but a similar mechanism is likely operating. Although we accounted for among-site rate variation, which has been shown to lead to biases in missing data (Lemmon et al. 2009), we did not evaluate how multispecies coalescent approaches would deal with our data set. We concentrated instead on a concatenated approach because our data met two criteria in which species-tree summary methods perform poorly (Molloy and Warnow 2018); namely, our data comprised 1) many poorly resolved gene trees with high missing data from 2) loci with low information content found in UCEs.By including low coverage characters, we were able to explore potential biases that can arise between historical and modern samples. Filtering according to a read coverage threshold at each variant site is common practice in population genomic studies (e.g., Thom et al. 2018), but this approach is less frequently employed in phylogenomic bioinformatic pipelines (e.g., Faircloth 2016). In the Low Coverage tree, we found clusters of historical or modern samples that were not present in the Filtered tree (fig. 2 and supplementary figs. S3 and S4, Supplementary Material online). Besides an asymmetry in informative sites, these clusters could be caused by sequencing errors present in one sample type, batch effects, or contamination. We address sample type in detail below, but we suspect that biases of batch effects and contamination were minimal. For example, we had limited power to test for batch effects because we did not randomly and evenly sequence samples across runs, therefore, there are portions of the tree where clades are composed almost entirely of samples from the same sequencing run (supplementary fig. S5, Supplementary Material online). In these cases, we do not interpret these patterns as batch effects because the tips occur in their expected topological position and samples from different sequencing lanes are distributed throughout the tree. We took great care to avoid contamination during wet-lab procedures (Mundy et al. 1997) and we have no strong reason to suggest that contamination is driving the observed pattern, particularly after exploring the impacts of missing data on the topology. The impact of contamination may have been more pronounced on low-quality characters, which were filtered out in all treatments because unreported preliminary trees estimated with these low-quality characters produced trees with long branches. However, more subtle effects of contamination on a small number of characters may not be directly detectable in the approaches we employed. Although we cannot rule out additional artifacts caused by contamination or sequencing error, the topology within each of the most apparent sample-type clusters in the Trichoglossus/Eos/Psitteuteles clade exhibited the expected relationships among taxa.The outlier analysis on subclades also found loci that were impacted by missing data. In the Low Coverage tree, the sample-type clusters were broken up when outlier loci with positive values were excluded but also reduced support values (supplementary fig. S3, Supplementary Material online). The exclusion of outlier loci with negative values retained the biased relationships. These loci had negative values because the topologies estimated with all sites with missing data removed were more likely given the alignment. The removal of these loci produced alignments that only included loci that were either biased or not impacted by missing data. In the Filtered data set, the number of identified outlier loci was reduced and exclusion of outlier loci was less profound. Nonetheless, the removal of outlier loci in the Filtered data set showed how the placement of G. conccina, Trichoglossus rubiginosus, and the clade containing C. pulchella and C. toxopei was sensitive to missing data (supplementary fig. S4, Supplementary Material online). Interestingly, about 72% of the loci identified by the subclade outlier analyses were the same loci of the outlier sites identified by the Low Coverage versus Filtered outlier analysis. The information content of the nonoverlapping loci is important because the more targeted site-wise outlier analysis was better at reconciling topological differences among the Low Coverage and Filtered trees than was the subclade approach.There was a tendency for historical samples to fall outside of their clade or even the ingroup, as evident in previous phylogenomic studies on birds (Hosner et al. 2016; Moyle et al. 2016; Andersen et al. 2019; McCullough et al. 2019). This was the case for seven of our excluded samples, which produced limited data and could not be accurately placed in their genus or higher-level clade. The sample-type cluster within Trichoglossus/Eos/Psitteuteles is an extreme example of this pattern, and the pattern is so striking because of the high number of historical samples in this particular clade (fig. 2). In the trees wherein a subset of modern samples we converted most parsimony informative sites to missing data, we observed the same pattern whereby some of the manipulated samples were inferred outside of their expected clade (e.g., supplementary fig. S9J, Supplementary Material online). Without prior information on whether a taxon is sister or falls outside of a clade of closely related taxa, samples with high missing data in large alignments may not be able to be accurately placed on a phylogeny.
Identifying Biased Samples and Loci
Our neural network models were good predictors of Δ l-lk scores in some tests, but not others. A factor to consider for interpreting our model results is that the range in Δ l-lk scores varied substantially among clades, and typically the models that performed poorly were for the clades with low variation in Δ l-lk scores. In contrast, the Eos/Trichoglossus/Glossopsitta/Psitteuteles clade, which had the widest range in Δ l-lk scores, had the best performing models. In both the Filtered and Low Coverage data sets, parsimony informative sites were the most important variable in the models for Eos/Trichoglossus/Glossopsitta/Psitteuteles, suggesting that missing information at these sites in historical samples influenced the topological differences. The neural network for Δ l-lk values estimated between the Low Coverage and Filtered trees explained 4% of the variation in outlier scores, and GC content was the most important variable in the model, followed by parsimony informative sites. However, the outlier sites on these loci had high missing data, and when these sites were removed, the estimated phylogeny was similar to relationships in the Filtered tree. Because the magnitude of the Δ l-lk score is going to be partially dictated by how much information there is at a site or across a locus, the outlier analysis is expected to identify sites or loci that have enough information to distinguish alternative trees. Missing data at less informative sites is also known to bias phylogenetic inference (Simmons 2012, 2014), and the outlier analysis we used may not capture the full extent of missing data on our inferred phylogenies.Preferentially selecting phylogenetically informative loci is expected to produce trees with better support (Gilbert et al. 2018), but our results suggest that this practice can produce less reliable relationships when the data content dramatically varies among samples. Other work has shown that filtering phylogenomic markers by information content had mixed results in terms of resolving discordance among trees estimated with different phylogenetic methods (Mclean et al. 2019). Outlier analysis using site and locus likelihood scores (Shen et al. 2017; Walker et al. 2018) provides a rapid means of identifying loci that have a large impact on phylogeny reconstruction, but, as we showed, the resolution of this approach will depend on the trees that are available for comparison (e.g., the a priori expected phylogeny vs. an alternative phylogeny). As mentioned above, a targeted outlier approach will not address all potential biases that missing data can cause, but it can identify sites that are having a strong influence on the phylogeny. Despite the limitations of site-likelihoods, the precision of identifying specific sites/loci may be the more favorable option to filtering data because the alternative of using percentage of data completeness to remove sites resulted in removing positions in the alignment that were important for other portions of the tree. This idiosyncratic behavior of filtering for data completeness to achieve higher topological support for one recalcitrant historical sample occurred in a recent study of honeyeaters. Andersen et al. (2019) increased the filtering stringency toward more complete data sets to improve support for Gymnomyza aubryana, however, previously well-supported nodes elsewhere in the tree were negatively impacted due to a reduction in total parsimony informative sites. The optimal percentage of data completeness will vary among data sets and depend on how asymmetric the information content is among sample types. For our data set, there was a narrow window for when data completeness produced a reliable phylogeny (e.g., 70% vs. 90%; supplementary fig. S11H and J, Supplementary Material online) because data completeness >70% led to a less-resolved tree.
Taxonomic Implications
Our study builds on previous phylogenetic work on the Loriini by further clarifying relationships and adding 64 previously unsampled taxa (fig. 6). We inferred a backbone phylogeny of relationships among genera that was fairly well resolved with the exception of the clade containing Trichoglossus, Ps. iris, Eos, and Glossopsitta, and some nodes in Charmosyna. Our analyses corroborated recently proposed taxonomic changes where Pseudeos cardinalis was moved into Pseudeos from Chalcopsitta, and Parvipsitta was resurrected to contain P. pusilla and P. porphyrocephala, which were previously placed in Glossopsitta (Schweizer et al. 2015). In all of our trees, Pseudeos fuscata and Pseudeos cardinalis were sisters and were in turn sister to Chalcopsitta. Parvipsitta pusilla and P. porphyrocephala were sisters and not closely related to G. concinna. However, we found strong support for P. pusilla and P. porphyrocephala being sister to Ps. versicolor, a novel result. Psitteuteles versicolor and Parvipsitta could be subsumed under a single genus. Irrespective of this taxonomic decision, the polyphyly of Psitteuteles will require that Ps. goldei and Ps. iris be moved into new genera. Psitteuteles goldei is sister to the clade containing Trichoglossus, Eos, Ps. iris, and Glossopsitta. The taxonomic revision of Ps. iris will depend on how Trichoglossus is treated because Ps. iris is nested within a geographically coherent clade of taxa distributed largely to the west of New Guinea. The clade containing Charmosyna, Phigys, and Vini represents a deep, diverse, and geographically widespread group. The species in these genera are varied in terms of body size and shape, tail length, plumage color, and sexual dimorphism (Forshaw and Cooper 1989; Merwin et al. 2020), and these morphological traits are not found in monophyletic groups in our phylogeny. Species-level relationships among species in Charmosyna were well supported and stable with the exception of the placement of C. toxopei and C. pulchella. Overall, the taxonomic revision of this clade will present challenges regarding when and where to split or lump taxa and how best to circumscribe genera.
. 6
Maximum likelihood tree containing unique taxa in Loriini. The tree was inferred from a concatenated alignment where loci identified with the locus likelihood analysis with Δ locus-wise log-likelihood (Δ l-lk) values of >10 were excluded. On each node are shown rapid bootstrap values and the taxon names are colored according to whether their DNA came from modern tissues (red) or historical specimens (blue). Bootstrap nodes are colored on a gradient from 100% (black) to <70% (gray).
Maximum likelihood tree containing unique taxa in Loriini. The tree was inferred from a concatenated alignment where loci identified with the locus likelihood analysis with Δ locus-wise log-likelihood (Δ l-lk) values of >10 were excluded. On each node are shown rapid bootstrap values and the taxon names are colored according to whether their DNA came from modern tissues (red) or historical specimens (blue). Bootstrap nodes are colored on a gradient from 100% (black) to <70% (gray).Relationships among subspecies within species varied substantially among taxa. Support for relationships among species within Lorius were generally stable except the placement of L. lory viridicrissalis, which was only nested within L. lory in the Low Coverage tree. Our L. lory viridicrissalis was a historical sample with a high degree of missing parsimony informative sites and its position as sister to L. lory and L. hypoinochrous is most likely an artifact. There were also varying levels of support for relationships among the other subspecies in L. lory, the most diverse species in the genus. Relationships among subspecies in C. papou, C. josefinae, and C. placentis had high support. Our analyses inferred a paraphyletic T. haematodus (with low support) and T. forsteni, the latter of which is still included in T. haematodus by some taxonomic checklists (Dickinson and Remsen 2013; Clements et al. 2019). This clade had many historical samples, which likely contributed to the clade’s low support, but even several of the taxa from modern samples were not placed with high support in the clade. Resolving these challenging relationships within Trichoglossus will likely require finer-resolution genetic data and expanded population-level sampling.
Conclusion
Next-generation sequencing has provided systematists with an unprecedented amount of information for inferring phylogenetic relationships (McCormack et al. 2013). However, phylogenomic data sets are being produced faster than the development of best practices for assembling, processing, and analyzing large data sets for phylogenetic inference, particularly as the use of low-quality museum samples increases. Alignments produced without careful inspection may harbor biased loci that can have a large impact on downstream analyses (Springer and Gatesy 2018). Our findings have general implications for phylogenomic studies where there is an asymmetry in parsimony informative sites among closely related taxa. Although missing data have shown ambiguous impacts on phylogenetic inference (Lemmon et al. 2009; Wiens and Morrill 2011; Simmons 2012, 2014; Hovmöller et al. 2013; Streicher et al. 2016), the combination of a much higher number of informative sites in contemporary phylogenomics with an asymmetry between samples of different quality warrants new investigations on biases that can arise in alignments. The magnitude of biases will likely vary according to clade diversity and age and the number of loci collected. We found that the bias was most extreme in a diverse and rapid radiation where there was likely limited information, even in complete loci, for teasing apart relationships. Shallow systematic and phylogeographic studies are expected to be the most difficult temporal scale for resolving relationships when there are high missing data associated with particular samples. Moving forward, having an understanding of the informational content of a locus, and how that information affects genealogy, will help avoid inferring dubious phylogenomic relationships.Click here for additional data file.
Authors: Hervé Philippe; Elizabeth A Snell; Eric Bapteste; Philippe Lopez; Peter W H Holland; Didier Casane Journal: Mol Biol Evol Date: 2004-06-02 Impact factor: 16.240
Authors: Peter A Hosner; Brant C Faircloth; Travis C Glenn; Edward L Braun; Rebecca T Kimball Journal: Mol Biol Evol Date: 2015-12-29 Impact factor: 16.240
Authors: Adrian W Briggs; Udo Stenzel; Philip L F Johnson; Richard E Green; Janet Kelso; Kay Prüfer; Matthias Meyer; Johannes Krause; Michael T Ronan; Michael Lachmann; Svante Pääbo Journal: Proc Natl Acad Sci U S A Date: 2007-08-21 Impact factor: 11.205
Authors: Christian Kehlmaier; Axel Barlow; Alexander K Hastings; Melita Vamberger; Johanna L A Paijmans; David W Steadman; Nancy A Albury; Richard Franz; Michael Hofreiter; Uwe Fritz Journal: Proc Biol Sci Date: 2017-01-11 Impact factor: 5.349
Authors: Michael G Harvey; Brian Tilston Smith; Travis C Glenn; Brant C Faircloth; Robb T Brumfield Journal: Syst Biol Date: 2016-06-10 Impact factor: 15.683
Authors: Derek D Houston; Jordan D Satler; Taylor K Stack; Hannah M Carroll; Alissa M Bevan; Autumn L Moya; Kevin D Alexander Journal: Mol Phylogenet Evol Date: 2021-10-07 Impact factor: 5.019
Authors: Rahul Jamdade; Maulik Upadhyay; Khawla Al Shaer; Eman Al Harthi; Mariam Al Sallani; Mariam Al Jasmi; Asma Al Ketbi Journal: Plants (Basel) Date: 2021-12-13