James G Baldwin-Brown1, Scott M Villa1,2, Anna I Vickrey1, Kevin P Johnson3, Sarah E Bush1, Dale H Clayton1, Michael D Shapiro1. 1. School of Biological Sciences, University of Utah, Salt Lake City, UT 84112, USA. 2. Department of Biology, O. Wayne Rollins Research Center, Emory University, Atlanta, GA 30322, USA. 3. Illinois Natural History Survey, Prairie Research Institute, University of Illinois, Champaign, IL 61820, USA.
Abstract
The pigeon louse Columbicola columbae is a longstanding and important model for studies of ectoparasitism and host-parasite coevolution. However, a deeper understanding of its evolution and capacity for rapid adaptation is limited by a lack of genomic resources. Here, we present a high-quality draft assembly of the C. columbae genome, produced using a combination of Oxford Nanopore, Illumina, and Hi-C technologies. The final assembly is 208 Mb in length, with 12 chromosome-size scaffolds representing 98.1% of the assembly. For gene model prediction, we used a novel clustering method (wavy_choose) for Oxford Nanopore RNA-seq reads to feed into the MAKER annotation pipeline. High recovery of conserved single-copy orthologs (BUSCOs) suggests that our assembly and annotation are both highly complete and highly accurate. Consistent with the results of the only other assembled louse genome, Pediculus humanus, we find that C. columbae has a relatively low density of repetitive elements, the majority of which are DNA transposons. Also similar to P. humanus, we find a reduced number of genes encoding opsins, G protein-coupled receptors, odorant receptors, insulin signaling pathway components, and detoxification proteins in the C. columbae genome, relative to other insects. We propose that such losses might characterize the genomes of obligate, permanent ectoparasites with predictable habitats, limited foraging complexity, and simple dietary regimes. The sequencing and analysis for this genome were relatively low cost, and took advantage of a new clustering technique for Oxford Nanopore RNAseq reads that will be useful to future genome projects.
The pigeon louse Columbicola columbae is a longstanding and important model for studies of ectoparasitism and host-parasite coevolution. However, a deeper understanding of its evolution and capacity for rapid adaptation is limited by a lack of genomic resources. Here, we present a high-quality draft assembly of the C. columbae genome, produced using a combination of Oxford Nanopore, Illumina, and Hi-C technologies. The final assembly is 208 Mb in length, with 12 chromosome-size scaffolds representing 98.1% of the assembly. For gene model prediction, we used a novel clustering method (wavy_choose) for Oxford Nanopore RNA-seq reads to feed into the MAKER annotation pipeline. High recovery of conserved single-copy orthologs (BUSCOs) suggests that our assembly and annotation are both highly complete and highly accurate. Consistent with the results of the only other assembled louse genome, Pediculus humanus, we find that C. columbae has a relatively low density of repetitive elements, the majority of which are DNA transposons. Also similar to P. humanus, we find a reduced number of genes encoding opsins, G protein-coupled receptors, odorant receptors, insulin signaling pathway components, and detoxification proteins in the C. columbae genome, relative to other insects. We propose that such losses might characterize the genomes of obligate, permanent ectoparasites with predictable habitats, limited foraging complexity, and simple dietary regimes. The sequencing and analysis for this genome were relatively low cost, and took advantage of a new clustering technique for Oxford Nanopore RNAseq reads that will be useful to future genome projects.
Parasites represent a large proportion of eukaryotic biodiversity, and it is estimated that 40% of insect diversity is parasitic (de Meeûs and Renaud 2002). Parasitic lice (Insecta: Phthiraptera) comprise a group of about 5000 species that parasitize all orders of birds and most orders of mammals (Mullen and Durden 2009; Clayton ). Two thirds of louse species are associated with only a single host species (Durden and Musser 1994; Smith 2004). The genus Columbicola comprises 91 known species, all found on pigeons or doves (Bush ; Gustafsson ; Adly ); most of these louse species are found on a single host species (Johnson , 2009).Like all feather lice (suborder Ischnocera), Columbicola are “permanent” parasites that complete their entire life cycle on the body of the host (Marshall 1981). Feather lice feed primarily on feathers, which they metabolize with the assistance of endosymbiotic bacteria (Fukatsu ; Smith ). The feather damage caused by lice has a chronic effect that leads to reduced host survival (Clayton ) and mating success (Clayton 1990). Birds are able to defend themselves against feather lice by preening them with the beak. However, Columbicola lice escape from preening by hiding in grooves between feather barbs, and the sizes of these grooves scale with host body size. In micro-evolutionary time, the result is stabilizing selection on body size of lice (Clayton ; Bush and Clayton 2006). In macro-evolutionary time, the result is that host defense (preening) and body size interact to reinforce the host specificity and size matching of Columbicola species to their hosts (Clayton ). Similarly, selection for visual crypsis drives the evolution of color similarities between Columbicola species their hosts (Bush , 2019).Within the feather lice, the biology of C. columbae (Figure 1) is better known than that of any other louse species, including details about its morphology, physiology, ecology, and behavior (Martin 1934; Stenram 1956; Rakshpal 1959; Nelson and Murray 1971; Eichler ; Rudolph 1983; Clayton 1990, 1991; Clayton and Tompkins 1995; Clayton , 2003, 2008; Bush and Clayton 2006; Bush ; Harbison and Clayton 2011). A unique feature of the C. columbae study system is that its host, the rock pigeon Columba livia, has been under artificial selection by pigeon breeders for millennia, resulting in dramatic phenotypic variation (Darwin 1868; Shapiro and Domyan 2013), similar to that seen across the 300+ other species of pigeons and doves (Gibbs ). This variation makes it possible to transfer C. columbae among diverse size and color phenotypes within the single native host species. Recently, we showed that switching lice to pigeons of different sizes and colors elicits rapid population-level changes in louse size and color (Bush ; Villa ). Despite the wealth of phenotypic data about real-time adaptation of C. columbae to changes in host environment, the underlying molecular mechanisms remain unknown.
Figure 1
Slender pigeon louse (Columbicola columbae)—about 2 mm in length—clinging to a rock pigeon feather. The thumblike processes on the antennae of the male louse shown here are used to grasp a female when mating. Dubbed “the bird louse par excellence” (Eichler ), C. columbae has long been a model for studies of host-parasite coevolution. Photo by Scott Villa and Juan Altuna.
Slender pigeon louse (Columbicola columbae)—about 2 mm in length—clinging to a rock pigeon feather. The thumblike processes on the antennae of the male louse shown here are used to grasp a female when mating. Dubbed “the bird louse par excellence” (Eichler ), C. columbae has long been a model for studies of host-parasite coevolution. Photo by Scott Villa and Juan Altuna.A deeper understanding of louse evolution and genetics is limited largely by a paucity of genomic resources. The louse with the best available genomic resources is the blood-feeding human body lousePediculus humanus, the draft genome of which was assembled using low-coverage shotgun sequencing (Kirkness ). Pediculus humanus had the smallest insect genome known at that time (108 Mb), with a repertoire of 10,773 annotated genes. Presently, what we know about the genomic signatures of parasitism in Phthiraptera is largely limited to this one species. Prior work on C. columbae has generated valuable DNA sequence datasets for phylogenetics (Sweet and Johnson 2018; Sweet ) and studies of mitochodrial evolution (Sweet ), but whole-genome data are still lacking.Here, we report a high-quality draft genome assembly and annotation for C. columbae that incorporates short-read Illumina (Bennett 2004) sequences, long-read Oxford Nanopore (Jain ) sequences, and scaffolding using Hi-C data (van Berkum ). These new resources will enable genomic approaches to understanding the molecular basis of rapid adaptation in C. columbae. More generally, the C. columbae genome provides comparative genomic data to understand the molecular basis of traits associated with parasitism that are shared among lice.
Materials and methods
Animal tissue samples
All lice used in this study were drawn from natural populations infesting wild-caught feral rock pigeons (C. livia) from Salt Lake City, UT. We maintained 15–20 infested pigeons in cages to provide a reliable source of C. columbae for sequencing.We reduced the nucleotide heterozygosity of our colony by creating a partially inbred population of lice. Initially, a single pair of lice (1 male and 1 female) was arbitrarily drawn from the pigeon colony and allowed to reproduce on a new, individually caged louse-free feral pigeon. After a period of 21 days, all immature lice were removed from the pigeon using CO2. At this point, these F1 lice were all full siblings. All offspring were then individually placed in glass vials with pigeon feathers for food, and allowed to mature. Rearing lice individually in vials ensured that F1s could not mate. Once mature, a single pair of unmated F1 adults (1 male and 1 female) were arbitrarily chosen and placed on a new, louse-free feral pigeon to mate and reproduce. Thus, all offspring on this new pigeon were the product of full-sibling mating and represented the first generation of inbreeding. These methods were repeated for eight generations.After eight rounds of full-sibling inbreeding, the partially inbred lice were transferred to a new louse-free pigeon and left to mature and produce offspring. We left the lice on this pigeon for 4 months, which allowed the population to grow enough to provide sufficient numbers for sequencing. The lice used for Illumina genomic DNA sequencing were derived from this partially inbred population. Reduced heterozygosity should have resulted in higher quality polishing of the Oxford Nanopore-derived contigs with our Illumina data (see below). We pooled 100 adult lice for Illumina genomic DNA sequencing, 2000 adult lice for Oxford Nanopore genomic DNA sequencing, and 100 adult lice for Illumina RNA sequencing. All lice were drawn from the same partially inbred laboratory population, except for the lice used in Oxford Nanopore DNA sequencing, which were drawn from the main laboratory population from which the inbred population was derived. Ultimately, the inbred louse population was too small to provide sufficient material for Oxford Nanopore DNA sequencing. We generated Oxford Nanopore RNA sequencing reads from four different life stages of lice, all drawn from the inbred population (100 lice each from nymphal instars 1, 2, and 3 adults).
Isolation of genetic material
DNA was isolated by grinding with a disposable homogenizer pestle (VWR, Radnor PA, USA) on ice for 30 min followed by DNA extraction with the Qiagen DNeasy extraction kit (Qiagen, Germantown, MD, USA). DNA for long read sequencing was extracted using the Qiagen DNA Blood and Tissue Midi kit. RNA was isolated using the Qiagen Oligotex mRNA mini kit.
Illumina genomic DNA and RNA sequencing
Illumina DNA sequencing was performed using an Illumina HiSeq 2500 sequencer at the University of Utah High Throughput Genomics Core. We generated four libraries with mean insert sizes of 180, 500, 3500, and 8200 bp. Genomic DNA was sequenced with paired-end 125-bp reads. cDNA sequencing was also performed on the Illumina HiSeq 2500 sequencer, producing paired end reads with a read length of 125 bp.
Oxford Nanopore genomic DNA and RNA sequencing
We generated long read genomic data using Oxford Nanopore MinION sequencers and a custom library preparation designed to increase read length. This protocol followed the standard procedure for producing 1d2 reads with kit LSK308 (Oxford Nanopore community, https://community.nanoporetech.com/protocols/), with the following modifications. (1) During all alcohol washes of magnetic SPRI beads, an additional wash was performed using Tris-EDTA to remove small DNA fragments. This step was performed quickly and without disturbing the beads to avoid dissolving all available DNA into solution. (2) All elutions from magnetic SPRI beads were performed after an incubation in elution buffer at 37° for 30 min. These practices improve the length of Oxford Nanopore sequencing reads (Urban ).We generated long mRNA reads using Oxford Nanopore MinION sequencers and a standard cDNA PCR-based sequencing method (PCS109, Oxford Nanopore community, https://community.nanoporetech.com/protocols/).
Genome size estimation
We used the following formula (Liu ) to estimate genome size from 21-mers counted from the Illumina sequencing data using jellyfish (Marçais and Kingsford 2011):
where G is the genome size, n is the total number of sequenced bases, c is the expected sequence coverage depth, L is the average sequencing read length, and K is the k-mer length.
Genome assembly
We used Trimmomatic version 0.36 (Bolger ) to trim Illumina input reads using the following settings: ILLUMINACLIP: adapters.fa: 2:30:10 LEADING: 20 TRAILING: 20 MINLEN: 30 CROP: 85. We then used fastq-join from ea-utils version 1.1.2-537 (Aronesty 2011) to join all short reads into pair joined reads, and used these throughout the assembly process. We used Canu version 1.6 (Koren ) with the parameter genome Size = 220m to assemble Oxford Nanopore genomic DNA reads, then polished the assembled contigs using pilon v1.22 (Walker ) and the Illumina genomic DNA reads. The pilon software was run with the following switches: –changes –vcf –vcfqe –tracks –fix all.The polished draft assembly was scaffolded by Phase Genomics using their proprietary scaffolding software (Burton ; Bickhart et al. 2017; Peichel ). We supplied Phase Genomics with approximately 1600 lice preserved at −80° for high molecular weight DNA extraction, Hi-C library preparation, and sequencing (Belton ).
Transcript selection and assembly
Standardized pipelines do not yet exist for selecting transcripts from raw Oxford Nanopore RNAseq reads. Therefore, we produced a custom pipeline that identifies putatively full-length transcripts to serve as evidence for genome annotation. In short, we aligned all RNAseq reads using Minimap (Li 2018), then clustered these alignments into sets that represent a gene using Carnac-LR (Marchet ). We wrote a program, wavy_choose, that extracts the aligned reads from the original data, then identifies reads that likely represented full-length transcripts using scipy’s function scipy.signal.find_peaks_cwt() (Du ; Figure 2). A more detailed version of the pipeline follows, below.
Figure 2
wavy_choose identifies likely full-length transcripts from clustered Oxford Nanopore reads. Depicted here is a histogram of read lengths (blue) for one carnac-LR-clustered set of reads. wavy_choose is able to identify two length peaks (red lines) in this transcript set, and discards all reads of other lengths. This process simplifies the transcriptome evidence dataset for MAKER, which uses the identified reads for gene annotation.
wavy_choose identifies likely full-length transcripts from clustered Oxford Nanopore reads. Depicted here is a histogram of read lengths (blue) for one carnac-LR-clustered set of reads. wavy_choose is able to identify two length peaks (red lines) in this transcript set, and discards all reads of other lengths. This process simplifies the transcriptome evidence dataset for MAKER, which uses the identified reads for gene annotation.Minimap aligns long, low-quality reads against one another, and can do so in an all-by-all comparison. Carnac-LR then clusters these reads into groups according to their alignments. Each Carnac-LR-clustered group of mRNA sequencing reads should represent all of the reads associated with a single gene, but if a gene has multiple alternative transcripts, Carnac-LR will not distinguish between them. The custom tool wavy_choose takes all of the clustered reads identified by Carnac-LR and identifies clusters within clusters that are most similar in both length and sequence. Because Oxford Nanopore reads are generally long enough to span an entire mRNA transcript, wavy_choose identifies the reads most likely to be complete transcripts by identifying the most common read lengths. It then removes all nonfull-length reads from the analysis. This tool is especially well suited to transcript discovery, as multiple alternative transcripts may be identified from a single cluster of reads with overlapping sequence, and wavy_choose makes no assumptions as to the number of transcripts to identify.The function find_peaks_cwt() uses continuous wavelet transformation, a technique from signal processing (Grossmann and Morlet 1984) to identify peaks in a 2-dimensional dataset. It does this by first convolving (transforming) the dataset to amplify the portion of the dataset that matches a wavelet with specified parameters (here, the default “Mexican hat” wavelet) and dampens the portions of the dataset that do not match the wavelet. The program then identifies local relative maxima that appear at the specified peak widths (here, 50–200 bp) and have sufficiently high signal-to-noise ratio (here 1.0). This widely accepted technique is straightforward to apply in this context, but it is limited to detecting transcripts that have unique lengths. Two alternative transcripts of matching lengths would appear as a single peak in the length histogram. In these cases, reads from both alternative transcripts were retained in the final dataset. We kept at least one read per cluster of reads.Untrimmed Illumina cDNA reads were assembled using Trinity with the –jaccard_clip setting (Grabherr ).
Genome annotation
We used a combination of wavy_choose-selected Oxford Nanopore-derived transcripts, Illumina RNAseq-derived Trinity assemblies, and orthology information from Swissprot as evidence for gene models in MAKER (Cantarel ), a widely used genome annotation tool. We used AUGUSTUS 3.3.1 to perform the gene finding portion of the MAKER pipeline. BUSCO (Simão ) trains AUGUSTUS as part of the BUSCO pipeline, so we ran BUSCO on the genome assembly and used its AUGUSTUS training model during gene finding. We used both WU BLAST (Chao ) and InterProScan (Jones ) to match genes to their orthologs in the Uniprot-Swissprot database, and to provide the GO terms associated with genes in the final annotation set.
Feature density analysis
We used bedops (Neph ) to generate a.bed file of sliding windows across all chromosomes, then used bedmap (Neph ) to count genes and repetitive sequences in these windows. Sliding windows were 1 Mb in width with a step length of 100 kb. For genes, we counted the total number of features identified by MAKER as “gene”s in its output.gff file. For repeats, we counted all MAKER-identified repeatmasker “match”es.
Detection of bacterial contaminants
After assembly and annotation, we manually checked the louse genome for contamination with bacterial genomic sequences by identifying regions with unusually high gene density, repeatmasker-identified (Chen 2004) artifacts, and contiguous runs of bacterial genes. We also used kraken (Wood and Salzberg 2014) with the DustMasked MiniKraken DB database (https://ccb.jhu.edu/software/kraken/dl/minikraken_20171101_8GB_dustmasked.tgz) to identify known bacterial kmer contaminants.We identified two sections of the genome that likely contained bacterial contamination, and removed them from the final assembly. The first section, at the beginning of chromosome 4, had a higher density of genes than any other region of the genome (280 genes per 10 kb, vs. 64 genes per 10 kb in the bacteria-free genome). It also had a paucity of repetitive elements (262 repeats per 10 kb, as opposed to 800 elsewhere). MAKER’s annotation (see below) indicated that the majority of the region’s genes were bacterial in origin, and BLASTn searches (Zhang ) against the NCBI nr database (https://blast.ncbi.nlm.nih.gov/) confirmed this, as did kraken. The region also contained the annotation’s only instance of an explicit bacterial artifact identified by repeatmasker. The second region, on chromosome 8, was flagged as containing bacterial content by kraken. Both the chromosome 4 and 8 regions contained genes annotated by MAKER as similar to genes from the Sodalis clade, which contains the endosymbiont of the tsetse fly and a known bacterial endosymbiont of C. columbae (Fukatsu ; Smith ). Two hundred nineteen of the 554 genes in the chromosome 4 section are annotated as being Sodalis-related, as are 3 of the 4 genes in the chromosome 8 section. Thus, the totality of evidence led us to conclude that these regions on chromosomes 4 and 8 of our preliminary C. columbae genome assembly were bacterial contaminants from a known Sodalis-clade endosymbiont.Lice were starved for 24 h and the transparent gut was checked visually for content before DNA and RNA extraction, reducing the likelihood of contamination due to eukaryotic tissue in the gut. Nevertheless, these measures do not completely rule out sequence contamination from the pigeon host, humans, or other eukaryotes. We searched for contamination from eukaryotes by performing BLASTn searches (Zhang ) against the human reference genome (Schneider ), the C. livia reference genome (Holt ), and the NCBI nr nucleotide database. We did not find any regions in the C. columbae genome greater than 3 kb in length and with identity greater than 90% to any of the target sequence databases. Therefore, we concluded that there is not substantial eukaryotic contamination in the final assembly.
Data availability
Raw sequence data for this project are publicly available through NCBI SRA (SAMN16076762-SAMN16076765). All analysis scripts are available through GitHub at https://github.com/jgbaldwinbrown/jgbutils. The genome assembly and annotation are available at NCBI GenBank (PRJNA662097).
Results and discussion
We generated 2.92 × 1010 bases of genomic sequence using the Illumina short-read platform (mean read length after trimming = 107.2 bp). We estimated the genome size via k-mer counting (Liu ) using jellyfish (Marçais and Kingsford 2011) (Figure 3). Using a k-mer size of 21, we estimate the genome size of C. columbae to be 230 Mb, within the range expected for insects.
Figure 3
Jellyfish-derived (Marçais and Kingsford 2011) 21-mer histogram based on Illumina reads from the C. columbae genome.
Jellyfish-derived (Marçais and Kingsford 2011) 21-mer histogram based on Illumina reads from the C. columbae genome.
Genome assembly summary
We generated a high-quality draft genome assembly using a combination of Illumina and Oxford Nanopore sequencing data, and Hi-C scaffolding (Table 1). Our initial, unscaffolded assembly with Canu consists of 1193 contigs with a total length of 206 Mb, and an N50 contig length of 511 kb. We scaffolded the assembly using Hi-C data, producing chromosome-size scaffolds from the initial contigs. The final assembly comprises 12 chromosome-sized scaffolds and 380 small scaffolds, totaling 208 Mb of sequence. The N50 scaffold length for the final assembly is 17.7 Mb. Karyotyping evidence (Ries 1932) indicates that the C. columbae genome consists of 12 holocentric chromosomes. Based on this physical evidence, and the striking difference in size between the 12 largest scaffolds and all other scaffolds in the assembly (Figure 4), we predict that each of the 12 largest scaffolds in the assembly represents one of the 12 karyotyped chromosomes.
Table 1
Assembly and annotation statistics
Genome size
208 Mb
Illumina sequencing coverage
102
Oxford nanopore sequencing coverage
35
Pre-scaffolding contigs
—
Total number of contigs
1,193
Contig N50
511 kb
Contig N90
93 kb
Contig L50
103
Contig L90
466
Scaffolds
—
Chromosome-size scaffolds (≥ 12 Mb)
12
Total number of scaffolds
386
Scaffold N50
17.6 Mb
Scaffold N90
13.7 Mb
Scaffold L50
6
Scaffold L90
11
Annotation
—
Annotated genes
13362
Annotated transcripts
19140
Annotated genes (AED ≥ 0.5)
1972
Repeat content
9.70%
BUSCO score
96.4%
Figure 4
Cumulative coverage of initial and final (scaffolded) C. columbae genome assemblies, illustrating the improvement in contiguity by scaffolding with Hi-C data. All scaffolds in the assembly are plotted largest to smallest, from left to right. The x-axis indicates cumulative length of an assembly, and the y-axis corresponds to the cumulative portion of the genome covered by initial contigs (red dots) and final scaffolds (blue dots).
Assembly and annotation statisticsCumulative coverage of initial and final (scaffolded) C. columbae genome assemblies, illustrating the improvement in contiguity by scaffolding with Hi-C data. All scaffolds in the assembly are plotted largest to smallest, from left to right. The x-axis indicates cumulative length of an assembly, and the y-axis corresponds to the cumulative portion of the genome covered by initial contigs (red dots) and final scaffolds (blue dots).
Annotation
We annotated the genome using the MAKER pipeline, with transcriptome evidence from Trinity-assembled Illumina RNAseq reads and wavy_choose-selected Oxford Nanopore RNAseq reads (Figure 2). We identified 19,139 transcripts from 13,362 genes. 13,246 of these genes are functionally annotated by BLAST using the Swissprot database, 8354 are functionally annotated by similarity to InterPro or Pfam, and 13,248 are functionally annotated by either Swissprot, InterPro, or Pfam. MAKER produces a combined quality statistic called Annotation Edit Distance (AED); Eilbeck ; Holt and Yandell 2011). Perhaps owing to our use of long-read transcriptome sequencing, 10.3% of our annotated transcripts have ideal AED scores of 0 (Figure 5), and only 5.6% of annotated transcripts have low-quality AED scores above 0.5. The abundance of low AED scores and relative dearth of low scores are indicators of a high-quality annotation.
Figure 5
Cumulative annotation edit distance (AED) for all genes in the MAKER-derived annotation. 10.3% of genes have an AED of 0, while only 5.6% of genes have an AED above 0.5 (vertical dashed line).
Cumulative annotation edit distance (AED) for all genes in the MAKER-derived annotation. 10.3% of genes have an AED of 0, while only 5.6% of genes have an AED above 0.5 (vertical dashed line).
Genome completeness
We used BUSCO (Simão ) to measure completeness of the genome by counting the number of highly conserved, single copy genes that should be present in insects (Table 2). The reference genome, transcriptome, and translated transcriptome contain complete copies of 96%, 90%, and 87% of insect BUSCOs, respectively. These high values indicate that the C. columbae genome assembly is sufficiently complete for downstream comparative genomic analyses.
Table 2
BUSCO results for genome completeness for the reference genome assembly, the annotated transcriptome, and the predicted proteome
Count
Genome
Transcriptome
Proteome
Complete, single-copy BUSCOs
1,593
1,440
1,438
Complete, duplicated BUSCOs
6
50
12
Fragmented BUSCOs
25
54
55
Missing BUSCOs
34
114
153
Complete BUSCOs (%)
96.44
89.86
87.45
Complete, single-copy BUSCOs (%)
96.07
86.85
86.73
Complete, duplicated BUSCOs (%)
0.36
3.01
0.72
Fragmented BUSCOs (%)
1.50
3.25
3.31
Missing BUSCOs (%)
2.05
6.87
9.22
Total BUSCO groups searched
1,658
1,658
1,658
BUSCO results for genome completeness for the reference genome assembly, the annotated transcriptome, and the predicted proteome
Repetitive elements
Repeatmasker identified 20.2 Mb (9.70%) of the genome as repetitive content. Of this 20.2 Mb, 65.8% is DNA transposons, 14.8% is LINEs, 8.6% is simple repeats, and 5.7% is LTR transposons (Wicker et al. 2007). The remainder (5.1%) is an assortment of transposable elements, low-complexity regions, and satellites (Table 3). Repetitive content in the C. columbae genome is, therefore, considerably higher than in P. humanus. In the latter species, only 1% of the genome is annotated as class I (retrotransposons, including LTR, LINE, and SINE) or class II (DNA transposons) transposable elements, and 6.9% is tandem repeats (Kirkness ). One caveat to this conclusion is the lower contiguity of the earlier P. humanus assembly. Because genomes often fail to assemble at repetitive sites, the P. humanus assembly may have captured a smaller proportion of repetitive sequences than the more contiguous C. columbae assembly.
Table 3
Repetitive elements in the C. columbae genome
Identity
Number of bases
Percent of all bases
Percent of repetitive elements
DNA
13,283,184
6.39
65.9
LINE
2,980,785
1.43
14.8
Low_complexity
506,296
0.244
2.51
LTR
1,156,688
0.556
5.74
Other
684
0.000329
0.00339
RC
126,151
0.0607
0.626
Retroposon
749
0.000360
0.00371
RNA
16,720
0.00804
0.0829
rRNA
33,934
0.0163
0.168
Satellite
48,279
0.0232
0.239
Simple_repeat
1,745,924
0.840
8.660
SINE
112,911
0.0543
0.560
snRNA
24,864
0.0120
0.123
tRNA
54,873
0.0264
0.272
Unknown
66,922
0.0322
0.332
Repetitive elements in the C. columbae genomeKirkness predicted that the monophagous, permanently parasitic lifestyle of lice should lead to reduced genomes due to the reduced need to seek food and avoid enemies compared to free-living species. While Kirkness et al. identified a reduction in gene families related to sensing, their conclusion that overall genome size is affected by lifestyle is not supported by the genome size of C. columbae, which has a genome size and number of genes that are more typical for a free-living insect. Indeed, both C. columbae and P. humanus appear to have a full complement of genes, and while P. humanus has a small genome and a reduction in transposable elements, C. columbae has neither of these. The pattern of reduced genome size and reduction in TE content without loss of genes is characteristic of high-population-size species (Lefébure ). However, a robust estimate of the population size of P. humanus, combined with evidence ruling out alternative hypotheses, would be necessary to demonstrate that population size drove the reduced genome size in P. humanus. Other authors (Oliver ) have hypothesized that large populations may not actually be under selection to have smaller genomes.
Genomic evidence for the lack of centromeres
Centromeres are characterized by a depletion of genic content and an increase in repetitive content (Jain ). Based on these criteria (Figure 6), we find no evidence for centromeres in any of the C. columbae chromosomes. Presence of genes is moderately anti-correlated with presence of simple repetitive sequences (r = −0.28, 1 Mb sliding windows). Still, the overall repeat density is not correlated with gene density, and both measures are relatively consistent across the genome. Many chromosomes (cf., Figure 6, chromosomes 6 and 7) have a twin-peaked pattern of simple repeats, in which chromosome ends and centers have high genic content and low repeat content, but the genomic segments between the ends and the center have high repeat content and low-genic content. It is possible that these twin peaks of simple repeat content are the centromeres in a polycentromeric chromosome, and that the chromosomes were actually misclassified as holocentric based on karyotyping evidence.
Figure 6
Chromosome-wide feature density in the C. columbae genome. Gene and repeat density in 1 Mb-wide sliding windows across the C. columbae genome show that there are no clear centromeres, and gene and simple repeat density are negatively correlated.
Chromosome-wide feature density in the C. columbae genome. Gene and repeat density in 1 Mb-wide sliding windows across the C. columbae genome show that there are no clear centromeres, and gene and simple repeat density are negatively correlated.
Comparisons to the closest sequenced relative
The closest relative of C. columbae with an assembled genome is the human body louseP. humanus. C. columbae, and P. humanus are thought to have diverged 65 million years ago (Johnson ). Pediculus humanus has five metacentric chromosomes and one telocentric chromosome (Kirkness ), in contrast to the 12 putatively holocentric chromosomes described here. Pediculus humanus has a genome assembly size of 108 Mb, approximately half that of the 208-Mb C. columbae genome assembly. The C. columbae genome has a typical genome-wide GC content of 36%, while P. humanus has an extremely AT-rich genome with 28% GC content, making C. columbae the more typical insect genome of the two.
Synteny analysis
We used the default settings of SynIma (Farrer 2017) to identify synteny between C. columbae and P. humanus (Figure 7). We were unable to test for chromosome-scale syntenic blocks between P. humanus and C. columbae due to the low contiguity of the P. humanus genome. However, we found very few locations in which synteny is broken between a P. humanus scaffold and a C. columbae scaffold, showing that short-range synteny is almost entirely conserved between these species.
Figure 7
Short range synteny is largely conserved between C. columbae (bottom) and P. humanus (top) genomic scaffolds. Lines connecting scaffolds from each genome assembly represent the positions of orthologous genes. P. humanus contigs were aligned to the C. columbae genome in order and orientation using SynIma. Chromosome-size scaffolds in C. columbae are labeled 1–12.
Short range synteny is largely conserved between C. columbae (bottom) and P. humanus (top) genomic scaffolds. Lines connecting scaffolds from each genome assembly represent the positions of orthologous genes. P. humanus contigs were aligned to the C. columbae genome in order and orientation using SynIma. Chromosome-size scaffolds in C. columbae are labeled 1–12.
Functional annotation reveals depletion of environmental sensing and metabolic genes
Pediculus humanus has a small complement of opsins (3, as opposed to 275 in D. melanogaster) and G protein-coupled receptors (GPCR, 104, as opposed to 408 in D. melanogaster) (Kirkness ; Thurmond ). Similarly, we find that only 2 annotated genes in C. columbae are associated with the opsin gene ontology term (GO:00007602) and only 107 genes are associated with the GPCR GO category (GO:00004930). This reduced repertoire of sensory system genes supports the hypothesis that the relatively static environments encountered by lice and other ectoparasites relaxes selection on the ability to sense and respond to stimuli in more variable environments (Kirkness ). Columbicola columbae is incapable of surviving off of its obligate host, so there might be little selection to retain complex visual, olfactory, or other complex sensory acuity. We find support for the hypothesis that specific gene families, such as those relating to sensory capabilities and metabolism, are reduced in obligate parasites (Jackson 2015).Pediculus humanus is massively depleted in terms of odorant receptors, gustatory receptors, and chemosensory proteins, and C. columbae shows the same pattern. For example, C. columbae has only 13 genes with olfactory receptor activity (GO: 0004984) and P. humanus has only 10, compared with 152 in D. melanogaster (Thurmond ). Columbicola columbae has 2 genes associated with taste receptor activity (GO: 0008527) and P. humanus has 6, yet D. melanogaster has 150. We speculate that this dramatic depletion of taste receptor genes is due to the homogeneous diet of ectoparasitic lice. The diet of C. columbae, for instance, consists entirely of pigeon feathers and flakes of dead skin (Ash 2008; Nelson and Murray 1971; Singh ).Another highly depleted gene functional category in P. humanus is the insulin signaling/TOR pathway. Kirkness show that the canonical pathway appears nonfunctional in P. humanus, with only one gene having P. humanus EST-derived evidence for its expression. BLAST evidence indicates that other TOR pathway genes are reduced to a single copy in P. humanus, including genes such as dilps and eIF-4E (class I), which respectively have 6 and 7 copies in D. melanogaster (Kirkness ). We find the same qualitative result in C. columbae, with no annotated genes associated with the insulin receptor signaling pathway (GO:0008286). Finally, the complement of detoxification genes is depleted in both P. humanus and C. columbae, with C. columbae having no annotated genes associated with detoxification (GO:0098754).The striking reduction in sensory and metabolic gene categories in C. columbae and P. humanus could be due to independent gene loss in each lineage, inheritance of a depleted repertoire from a common ancestor, or a combination of the two. Loss of the same suite of genes in each species would be consistent with inheritance of a reduced sensory repertoire from a common ancestor, while loss of different genes in each species would indicate independent reductions. Reciprocal best BLAST hits of C. columbae and P. humanus genes to a shared outgroup, Drosophila melanogaster, indicate that the identities of the lost and retained genes are mostly the same between the two louse species (Table 4), thereby supporting the hypothesis of ancestral loss. We note the possibility that these “missing” genes are not actually absent from the genomes of C. columbae and P. humanus, but are simply not annotated in their respective genomes. However, the BUSCO completeness score of 96.4% for the C. columbae genome renders large-scale incompleteness and misannotation less likely.
Table 4
Reciprocal best-hit BLAST of the proteomes of C. columbae and P. humanus against D. melanogaster reveals the identity of the retained genes in depleted gene families is largely the same in both species
Gene family
D. melanogaster genes
C. columbae hits
P. humanus hits
Shared hits
Opsin
275
40
40
28
GPCR
408
66
69
50
Olfactory receptor activity
152
6
8
6
Taste
150
4
3
3
Odorant binding
248
6
7
4
Insulin
349
59
61
46
Tor
225
51
57
47
Chemosensory behavior
441
54
57
44
Detoxification
132
16
18
16
The first column is the tested family of genes. The second column is the number of genes assigned the corresponding GO term in the D. melanogaster proteome. The third and fourth columns, respectively, are the numbers of reciprocal best BLAST hits with D. melanogaster genes by genes from either C. columbae or P. humanus. The fifth column is the number of reciprocal best BLAST hits that had the same D. melanogaster-derived identity when BLASTing against C. columbae or P. humanus.
Reciprocal best-hit BLAST of the proteomes of C. columbae and P. humanus against D. melanogaster reveals the identity of the retained genes in depleted gene families is largely the same in both speciesThe first column is the tested family of genes. The second column is the number of genes assigned the corresponding GO term in the D. melanogaster proteome. The third and fourth columns, respectively, are the numbers of reciprocal best BLAST hits with D. melanogaster genes by genes from either C. columbae or P. humanus. The fifth column is the number of reciprocal best BLAST hits that had the same D. melanogaster-derived identity when BLASTing against C. columbae or P. humanus.
Authors: Scott M Villa; Juan C Altuna; James S Ruff; Andrew B Beach; Lane I Mulvey; Erik J Poole; Heidi E Campbell; Kevin P Johnson; Michael D Shapiro; Sarah E Bush; Dale H Clayton Journal: Proc Natl Acad Sci U S A Date: 2019-06-10 Impact factor: 11.205
Authors: Kevin P Johnson; Nam-Phuong Nguyen; Andrew D Sweet; Bret M Boyd; Tandy Warnow; Julie M Allen Journal: Biol Lett Date: 2018-05 Impact factor: 3.703
Authors: Carson Holt; Michael Campbell; David A Keays; Nathaniel Edelman; Aurélie Kapusta; Emily Maclary; Eric T Domyan; Alexander Suh; Wesley C Warren; Mark Yandell; M Thomas P Gilbert; Michael D Shapiro Journal: G3 (Bethesda) Date: 2018-05-04 Impact factor: 3.154
Authors: Stephany Virrueta Herrera; Kevin P Johnson; Andrew D Sweet; Eeva Ylinen; Mervi Kunnasranta; Tommi Nyman Journal: Mol Ecol Date: 2022-07-01 Impact factor: 6.622