A High-Quality Genome Assembly of the North American Song Sparrow, Melospiza melodia.

Swarnali Louha1, David A Ray2, Kevin Winker3, Travis C Glenn4,5.   

Abstract

The song sparrow, Melospiza melodia, is one of the most widely distributed species of songbirds found in North America. It has been used in a wide range of behavioral and ecological studies. This species' pronounced morphological and behavioral diversity across populations makes it a favorable candidate in several areas of biomedical research. We have generated a high-quality de novo genome assembly of M. melodia using Illumina short read sequences from genomic and in vitro proximity-ligation libraries. The assembled genome is 978.3 Mb, with a physical coverage of 24.9×, N50 scaffold size of 5.6 Mb and N50 contig size of 31.7 Kb. Our genome assembly is highly complete, with 87.5% full-length genes present out of a set of 4,915 universal single-copy orthologs present in most avian genomes. We annotated our genome assembly and constructed 15,086 gene models, a majority of which have high homology to related birds, Taeniopygia guttata and Junco hyemalis. In total, 83% of the annotated genes were assigned putative functions. Furthermore, only ∼7% of the genome is found to be repetitive; these regions and other non-coding functional regions are also identified. The high-quality M. melodia genome assembly and annotations we report will serve as a valuable resource for facilitating studies on genome structure and evolution that can contribute to biomedical research and serve as a reference in population genomic and comparative genomic studies of closely related species.
Copyright © 2020 Louha et al.

Keywords:  Dovetail genomics; Melospiza melodia; Passeriformes; de novo assembly; whole genome sequencing

Year:  2020        PMID: 32075855      PMCID: PMC7144075          DOI: 10.1534/g3.119.400929

Source DB:  PubMed          Journal:  G3 (Bethesda)        ISSN: 2160-1836            Impact factor:   3.154


The oscine passerines (Order Passeriformes) are songbirds having specialized vocal learning capabilities (Liu ). Many species of songbirds have been widely used by neuroscientists to study the processes underlying memory and learning and social interactions (Doupe and Kuhl 1999, White 2010). The song sparrow (Melospiza melodia) is one of the most morphologically diverse songbirds found in North America, with 26 recognized subspecies (Pruett ). It has been recognized as a model vertebrate species for field studies of birds and has been the subject of extensive research integrating behavioral and ecological studies over the last 70 years (Arcese ). The species is widespread across North America, occupying diverse ecosystems and exhibiting pronounced phenotypic variation in plumage color, seasonal migration and sedentariness, body size, and bill size (Arcese , Pruett & Winker 2010, Greenberg ). Though several species of songbirds have been sequenced and studied (Warren , Jarvis ), few offer the plethora of biomedical research potential presented by the song sparrow. This species might serve as a model system in areas such as hepatic lipogenesis (through phenotypic variation in seasonal fat deposition for migration; Gosler 1996, Schubert ), craniofacial development (through variation in bill size and shape; Brugmann , Powder ), and variations in body size (Sutter , Lango Allen et al. 2010). The latter is a polygenic trait, and elucidation of the underlying gene network affecting different metabolic pathways can help clarify several biological phenomena, including human diseases. Other areas of interest are differences in neural growth and song-center brain development among different song sparrow populations and potential applications in brain neurogenesis (NIH 2001), and also the regeneration of “hair” cells in the song sparrow auditory system and potential therapies useful in hearing loss (Hawkins , Hawkins & Lovett 2004). 
Given its significant biomedical potential and experimental tractability in the field and aviary, the song sparrow will continue to be used for answering research questions related to mechanisms causing variation in behavior, morphology, and demographics across populations (Arcese , Nietlisbach ). Prior work on song sparrows in Alaska has shown how the song sparrow population in the Aleutian Archipelago is thought to have colonized from the mainland since the last glacial maximum and undergone a series of population bottlenecks to give rise to a naturally inbred population with large body size (Pruett and Winker 2005). The lower genetic variability in this naturally inbred population makes song sparrows from the Aleutian islands a favorable resource for generating a reference genome assembly, because lower levels of polymorphism between both copies of a diploid genome can improve assembly quality. Previous work has also been done on the song sparrow transcriptome, developing genomic markers to screen at population levels (Srivastava ). A high-quality genome assembly of M. melodia furthers the development of genomic markers to screen loci associated with phenotypic traits of interest. An ever-growing number of songbirds have sequenced genomes, but relatively few have been published so far, including the American crow (Corvus brachyrhynchos), golden-collared manakin (Manacus vitellinus; Jarvis ), zebra finch (Taeniopygia guttata; Warren ), medium ground finch (Geospiza fortis; Parker ) and the dark-eyed junco (Junco hyemalis; Friis ). In this study, we provide the genome assembly of Melospiza melodia, a member of the family Passerellidae. This genome assembly will serve as a reference genome for this species and will facilitate genomic and phylogenetic comparisons among songbirds and other taxa.
Our high-quality draft genome assembly of M. melodia was created by combining both traditional Illumina paired-end libraries and a de novo proximity-ligation Chicago library. The Chicago library method together with Dovetail Genomics’ HiRise software pipeline is designed to significantly reduce gaps in alignment arising from repetitive elements in the genome (Putnam ) and to increase assembly contiguity. The draft genome was annotated using transcribed RNA and protein sequences from M. melodia and related songbird species, Junco hyemalis and Taeniopygia guttata. Genomic features of interest other than coding sequences, such as microsatellites, repeat elements, transposable elements, and non-coding RNA, were also annotated and the genome assembly was evaluated for quality by comparing it to related avian species.

Methods

Library preparation and de novo shotgun assembly

The de novo assembly of the song sparrow genome was constructed using Illumina paired end libraries. A blood sample from a single male song sparrow was obtained from the wild in the Aleutian Islands of Alaska (Coordinates: 52.8275 / 173.206) on 16 Sep 2003 and archived as a voucher specimen at the University of Alaska Museum (http://arctos.database.museum/guid/UAM:Bird:31500). We chose a male because females are the heterogametic sex in birds and sex chromosomes are known to have highly repetitive DNA content. This together with the selection of an individual from a population known to have lower genetic variation can improve the quality of our assembled genome, without changing the genome structurally. Whole blood was preserved during specimen preparation and shipped overnight in lysis buffer to UGA, where PCI extraction of DNA was performed. We sheared the genomic DNA using a Covaris S2 (Covaris, Woburn, MA, USA) targeting a 600bp average fragment size. The sheared DNA was end-repaired, adenylated, and ligated to TruSeq LT adapters using a TruSeq DNA PCR-Free Library Preparation Kit (Illumina, San Diego, CA, USA). We purified the ligation reaction using a Qiaquick Gel Extraction Kit (Qiagen, Venlo, The Netherlands) from a 2% agarose gel. We sequenced the library on an Illumina HiSeq 2500 at the HudsonAlpha Institute for Biotechnology (Huntsville, AL, USA) to obtain paired-end (PE) ∼100 bp reads. The sequence data consisted of 276 million read pairs sequenced from a total of 41.3 Gbp of paired-end libraries (∼49× sequencing coverage). Reads were trimmed for quality, sequencing adapters, and mate pair adapters using Trimmomatic (Bolger ). The reads were assembled at Dovetail Genomics (Santa Cruz, CA, USA) using Meraculous 2.0.4 (Chapman ) with a k-mer size of 29. This yielded a 972.4 Mbp assembly with a contig N50 of 22.5 Kbp and a scaffold N50 of 33 Kbp.

Chicago library preparation and scaffolding the draft genome

To improve the de novo assembly, a Chicago library was prepared at Dovetail Genomics using previously described methods (Putnam ). In brief, about 500 ng of high-molecular-weight genomic DNA (mean fragment length = 50 kbp) was used for chromatin reconstitution in vitro and fixed with formaldehyde. Fixed chromatin was digested with DpnII, the 5′ overhangs filled in with biotinylated nucleotides, and free blunt ends were ligated together. After ligation, crosslinks were reversed and DNA was purified from protein. Purified DNA was treated to remove biotin that was not internal to ligated fragments. Next, DNA was sheared to ∼350 bp mean fragment size and sequencing libraries were generated using NEBNext Ultra enzymes (New England Biolabs, Ipswich, MA, USA) and Illumina-compatible adapters. Biotin-containing fragments were isolated using streptavidin beads before PCR enrichment of the library. The Chicago library was sequenced on an Illumina HiSeq 2500 to produce 47 million 150 bp paired end reads (1-50 kb pairs). Dovetail Genomics’ HiRise scaffolding software pipeline (Putnam ) was used to map the shotgun and Chicago library sequences to the draft de novo assembly using a modified SNAP read mapper (http://snap.cs.berkeley.edu). The separations of Chicago read pairs mapped within draft scaffolds were analyzed by HiRise to produce a likelihood model for genomic distance between read pairs, and the model was used to identify and break putative misjoins, to score prospective joins, and make joins above a threshold. After scaffolding, shotgun sequences were used to close gaps between contigs.

Identification of microsatellites and transposable elements

Transposable elements (TEs) in the song sparrow genome were identified using a combination of de novo and homology-based TE identification methods, in addition to a manual curation step (Platt ). First, we used RepeatModeler v1.0.11 (Smit and Hubley 2008-2015) with default parameters (File S1) to generate a custom repeat library consisting of 672 consensus repeat sequences. RepeatModeler uses two de novo repeat identification programs, RECON v1.08 (Bao and Eddy 2002) and RepeatScout v1.0.6 (Price et al. 2005), for identifying repetitive elements from sequence data. To ensure accurate and complete representation of putative TEs, the RepeatModeler derived consensus sequences were filtered for size (>100 bp), and then subjected to iterative homology-based searches against the genome, followed by manual curation (Platt ). The final set of manually curated TEs was queried against CENSOR (Kohany ) and TEclass (Abrusan ) for classification. TEs not identifiable in CENSOR were also searched against the NCBI nucleotide and protein databases using BLASTN and BLASTX respectively. Finally, a custom repeat library consisting of 900 repeat elements (File S24) comprising song sparrow-specific TEs and existing repeats in other related avian species was used to screen for repeats in the song sparrow genome assembly with RepeatMasker v4.0.9. Microsatellites in the song sparrow genome were identified and described with GMATA v2.01 (Wang and Wang 2016) with sequence motifs ranging in length from 2-20 bp, and each motif repeated at least 5 times (File S2).
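The microsatellite criteria stated above (motifs of 2-20 bp, each repeated at least 5 times) can be illustrated with a short regular-expression scan. This is only a minimal sketch of the search criteria and is not GMATA's algorithm; the function names and the example sequence are hypothetical.

```python
import re


def is_primitive(motif: str) -> bool:
    """Return True if the motif is not itself a repetition of a shorter unit."""
    n = len(motif)
    for k in range(1, n):
        if n % k == 0 and motif[:k] * (n // k) == motif:
            return False
    return True


def find_microsatellites(seq, min_len=2, max_len=20, min_repeats=5):
    """Return (start, end, motif, repeats) tuples, 0-based with end exclusive.

    Illustrative only: perfect tandem repeats, forward strand, no mismatches.
    """
    # Lazy motif quantifier so the shortest repeating unit is tried first.
    pattern = re.compile(
        rf"([ACGT]{{{min_len},{max_len}}}?)\1{{{min_repeats - 1},}}"
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        motif = match.group(1)
        if not is_primitive(motif):
            # e.g. "AA" is a homopolymer run, not a 2-mer microsatellite
            continue
        repeats = len(match.group(0)) // len(motif)
        hits.append((match.start(), match.end(), motif, repeats))
    return hits


if __name__ == "__main__":
    # Hypothetical toy sequence: (AT)6, a poly-A run, and (CAG)5
    toy = "GGG" + "AT" * 6 + "CCGT" + "A" * 12 + "T" + "CAG" * 5 + "T"
    for hit in find_microsatellites(toy):
        print(hit)
```

The primitive-motif check is a design choice of this sketch: without it, a homopolymer run would be reported as a 2-mer repeat.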

De novo gene annotation and function prediction

Genes were predicted in the song sparrow genome with the MAKER v2.31.9 genome annotation pipeline (Campbell ). A custom repeat library of 900 repeat sequences (File S24) consisting of TEs identified in the song sparrow genome and other existing avian repeat elements was used to soft mask the genome. Transcriptome evidence sets for MAKER included the assembled song sparrow transcriptome (Srivastava ) and Trinity (v2.4.0) mRNA-seq assemblies from multiple tissues of Junco hyemalis (Peterson , NCBI BioProject Accession: PRJNA256328). Protein evidence sets used by MAKER included annotated proteins for song sparrow, Junco hyemalis, and Taeniopygia guttata from the NCBI Protein database. The MAKER pipeline consisted of the following steps: 1) Transcriptomic and protein evidence sets were used to make initial evidence-based annotations with MAKER; 2) the initial annotations were used to train two ab initio gene predictors: Augustus (Stanke ), which was trained once, and SNAP (Korf 2004), which was iteratively trained twice; and 3) the trained gene prediction tools SNAP and Augustus were used to generate the final set of gene annotations (Files S3-S8). Functional annotations of the predicted genes were obtained by making homology-based searches with BLASTP against the Uniprot/Swiss-Prot protein database (Pundir , File S9). InterProScan v5.29 (Zdobnov and Apweiler 2001) was used to find protein domains associated with the genes. The putative functions and protein domains were added to the gene annotations using scripts provided with MAKER (File S9). To quantitatively assess the completeness of the song sparrow genome assembly and annotated gene set, we ran BUSCO (Benchmarking Universal Single-Copy Orthologs) v3.0.2 (Waterhouse ) with 4,915 single-copy orthologous genes in the Aves lineage group (Aves_odb9; https://busco.ezlab.org/), using “chicken” as the Augustus reference species (File S10).
The 4,915 orthologous genes are present in at least 90% of the 40 species included within the Aves lineage group, and thus are likely to be found in the genome of related species. Additionally, we used the JupiterPlot pipeline (https://github.com/JustinChu/JupiterPlot) to visually compare the zebra finch (T. guttata) genome assembly (Warren ) to our assembly in a Circos plot, using the largest scaffolds making up 85% of our genome assembly, and all scaffolds greater than 100 kbp in the zebra finch genome (File S11). We also used the JupiterPlot pipeline to compare our assembly to the genome assemblies of the collared flycatcher (Ficedula albicollis), great tit (Parus major) and house sparrow (Passer domesticus). These birds were selected for comparison because they have highly complete genomes, and are often used for comparative genomic studies in birds.

Non-coding RNA prediction

Transfer RNAs (tRNAs) were predicted in the song sparrow genome with tRNAscan-SE v2.0 (Lowe and Chan 2016, File S12). A training set comprising eukaryotic tRNAs was used to train the covariance models employed by tRNAscan-SE, and tRNAs were searched against the genome with Infernal v1.1.2 (Nawrocki 2014). tRNAscan-SE also provides functional classification of tRNAs based on a comparative analysis using a suite of isotype-specific tRNA covariance models. A random sample of 10 predicted tRNAs was selected and searched against the tRNA databases GtRNAdb (Chan and Lowe 2016) and tRNAdb (Jühling ). Identification of miRNAs (microRNAs), snoRNAs (small nucleolar RNAs), snRNAs (small nuclear RNAs), rRNAs (ribosomal RNAs), and lncRNAs (long non-coding RNAs) was achieved by using a homology-based prediction method. Structural homologs to eukaryotic ncRNA covariance models from the Rfam database v14.1 (Gardner et al. 2009) were searched against the song sparrow genome using Infernal’s (v1.1.2) “cmscan” program (File S13). All low-scoring overlapping hits and hits with an E-value greater than 10⁻⁵ were discarded, and the remaining ncRNAs were grouped into different classes. Lastly, we compared the predicted classes of different ncRNAs in the song sparrow genome to those reported in the genomes of related birds, Taeniopygia guttata and Ficedula albicollis (collared flycatcher).
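The hit-filtering step described above (discard hits with an E-value above 10⁻⁵, and among overlapping hits keep only the best-scoring one) might be sketched as follows. The `Hit` record, its field names, and the example family labels are hypothetical and do not reproduce Infernal's output format; strand is ignored for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical record for one covariance-model hit (1-based, inclusive)."""
    scaffold: str
    start: int
    end: int
    family: str
    score: float
    evalue: float


def overlaps(a: Hit, b: Hit) -> bool:
    """True if two hits share at least one base on the same scaffold."""
    return a.scaffold == b.scaffold and a.start <= b.end and b.start <= a.end


def filter_hits(hits, max_evalue=1e-5):
    """Drop hits above the E-value cutoff, then resolve overlaps by bit score."""
    passing = [h for h in hits if h.evalue <= max_evalue]
    kept = []
    # Visit hits from best to worst score; keep a hit only if it does not
    # overlap a better hit that was already kept.
    for hit in sorted(passing, key=lambda h: h.score, reverse=True):
        if not any(overlaps(hit, better) for better in kept):
            kept.append(hit)
    return sorted(kept, key=lambda h: (h.scaffold, h.start))


if __name__ == "__main__":
    example = [
        Hit("scaffold_1", 100, 200, "family_A", 60.0, 1e-10),
        Hit("scaffold_1", 150, 220, "family_B", 40.0, 1e-7),   # overlaps family_A
        Hit("scaffold_1", 500, 600, "family_C", 30.0, 1e-3),   # fails E-value
        Hit("scaffold_2", 150, 220, "family_D", 35.0, 1e-6),
    ]
    for hit in filter_hits(example):
        print(hit.scaffold, hit.start, hit.end, hit.family)
```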

Data availability

Raw reads have been deposited in the NCBI Sequence Read Archive (SRR10491484 and SRR10451714 for the Meraculous assembly, and SRR10424475 for the Chicago HiRise assembly). The M. melodia Chicago HiRise genome sequence (Mmel_1.0), and annotations are available in GenBank under the accession RZID00000000 (NCBI BioProject accession: PRJNA511035). Supplemental File S1 contains submission script for RepeatModeler. Supplemental File S2 contains primary configuration file used to run GMATA (default_cfg.txt). Supplemental File S3 contains submission script for MAKER. Supplemental File S4 contains MAKER executable file (maker_exe.ctl). Supplemental File S5 contains specifications for downstream filtering of BLAST and Exonerate alignments (maker_bopts.ctl). Supplemental File S6 contains primary configuration of MAKER specific options (maker_opts.ctl). Supplemental File S7 contains scripts for training SNAP. Supplemental File S8 contains scripts for training Augustus. Supplemental File S9 contains scripts for running BLASTP and InterProScan for functional annotation of predicted genes; and scripts for adding the functional annotations to gene annotation files. Supplemental File S10 contains submission script for BUSCO. Supplemental File S11 contains submission scripts for JupiterPlot pipeline. Supplemental File S12 contains submission script for tRNAscan-SE. Supplemental File S13 contains submission script for Infernal. Supplemental File S14 contains classification of predicted transposable elements. Supplemental File S15 contains annotation of microsatellites with their genomic locations. Supplemental File S16 contains percentage of different microsatellites present in the genome. Supplemental File S17 contains frequency of occurrence of microsatellites in each scaffold of the genome. Supplemental File S18 contains the distribution of the length of microsatellites. Supplemental File S19 contains predicted function of annotated genes by BLASTP. 
Supplemental File S20 contains prediction of protein domains, GO annotations and pathway annotations of predicted genes by InterProScan. Supplemental File S21 contains sequence and structure of tRNAs identified in the song sparrow genome. Supplemental File S22 contains classification of predicted tRNAs. Supplemental File S23 contains classification of different ncRNAs predicted in the genome with Infernal. Supplemental File S24 contains custom repeat library used to screen for repeats in the song sparrow genome. Supplemental Table S1 contains genome sizes of birds related to M. melodia. Supplemental Figure S1 contains the distribution of the percentage of annotated genes with their corresponding AED scores. Supplemental Figure S2 contains the distribution of the top base-pair composition of microsatellite motifs in the M. melodia genome. Supplemental Figure S3 contains comparison of the M. melodia genome assembly with genome assemblies of related birds. Supplemental material available at figshare: https://doi.org/10.25387/g3.11676441.

Results And Discussion

Assembly

We produced a de novo genome assembly of the song sparrow, with a total length of 978.3 Mb, using a Chicago library and the HiRise assembly pipeline. The scaffold N50 was 5.6 Mb and the contig N50 was 31.7 Kb. This assembly showed significant improvement over the initial shotgun assembly, with a 169-fold increase in scaffold N50 and a 60-fold increase in scaffold N90 (Table 1). These increases in scaffold size were also accompanied by an increase in assembly contiguity, with the total number of scaffolds decreasing from 74,832 to 13,785 (Figure 1, Table 1).
Table 1

A comparison of assembly quality statistics from the initial shotgun sequencing assembled by Meraculous and the final HiRise assembly

| Statistic | Meraculous Assembly | Chicago HiRise Assembly |
| --- | --- | --- |
| Total length | 972.4 Mb | 978.3 Mb |
| Scaffold N50 | 33 kb | 5.58 Mb |
| Scaffold N90 | 5 kb | 303 kb |
| Scaffold L50 | 7,552 scaffolds | 48 scaffolds |
| Scaffold L90 | 35,731 scaffolds | 324 scaffolds |
| Longest scaffold (bp) | 366,149 | 26,942,064 |
| Number of scaffolds | 74,832 | 13,785 |
| Number of scaffolds > 1 kb | 74,806 | 13,768 |
| Contig N50 | 22.5 kb | 31.7 kb |
| Number of gaps | 53,577 | 95,490 |
| Percent of genome in gaps | 1.427% | 1.847% |
| Number of N’s per 100 kbp | 1427.15 | 1847.03 |
| GC content | 41.07% | 41.08% |
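For readers less familiar with the N50/L50 and N90/L90 statistics in Table 1, the following minimal sketch shows how they are conventionally computed from a list of scaffold lengths, and recomputes the fold improvements from the rounded Table 1 values. The function name and the toy lengths are illustrative assumptions, not part of the authors' pipeline.

```python
def nx_lx(lengths, fraction=0.5):
    """Return (Nx, Lx) for a collection of scaffold or contig lengths.

    Nx is the length of the shortest sequence in the smallest set of longest
    sequences that together cover at least `fraction` of the total length;
    Lx is the number of sequences in that set.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("lengths must not be empty")
    threshold = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count
    # Only reachable if fraction > 1
    raise ValueError("fraction must be between 0 and 1")


if __name__ == "__main__":
    toy_lengths = [100, 80, 60, 40, 20]      # hypothetical scaffold lengths
    print("N50, L50:", nx_lx(toy_lengths, 0.5))
    print("N90, L90:", nx_lx(toy_lengths, 0.9))

    # Fold changes recomputed from the rounded values in Table 1
    print("Scaffold N50 fold change:", round(5_580_000 / 33_000))
    print("Scaffold N90 fold change:", 303_000 // 5_000)
```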
Figure 1

Comparison of assembly contiguity.


Microsatellites and transposable elements

In total, 88 as yet unnamed TEs were identified in the song sparrow genome. Fifty-five of these did not have any significant matches in CENSOR (Kohany ) and are considered novel (File S14). A TE was considered to have a significant match to a known element in CENSOR only when it had a length of at least 80 bp and 80% identity to the known element over 80% of its length, the 80-80-80 rule (Wicker et al. 2007). The predicted TEs were classified into DNA transposons and retrotransposons (i.e., LINEs, LTRs, and SINEs) using CENSOR and TEclass (File S14). Approximately 7.4% of the genome comprises repeats, most of which are TEs (6.25% of the assembly; ∼48% of all annotated repeat copies). Among TE copies, LINEs (∼49%) and LTRs (∼40%) were the most abundant (Table 2). The song sparrow genome assembly was found to be less repetitive when compared to sequenced genomes of related songbirds, primarily due to the lower content of LTRs and LINEs than other songbirds (Figure 2).
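The 80-80-80 rule stated above can be written as a simple predicate. This is a minimal sketch under the assumption that the alignment is summarized by the query length, the aligned length, and the percent identity; the function and parameter names are hypothetical and do not correspond to CENSOR output fields.

```python
def passes_80_80_80(query_length, aligned_length, percent_identity):
    """Return True if a TE match satisfies the 80-80-80 rule.

    query_length     -- length of the candidate TE in bp
    aligned_length   -- number of query bases aligned to the known element
    percent_identity -- identity of the alignment, in percent (0-100)
    """
    if query_length < 80:
        return False
    coverage = aligned_length / query_length
    return percent_identity >= 80.0 and coverage >= 0.80


if __name__ == "__main__":
    # Hypothetical candidates: (query length, aligned length, % identity)
    for candidate in [(500, 450, 92.0), (500, 300, 95.0), (60, 60, 99.0)]:
        print(candidate, passes_80_80_80(*candidate))
```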
Table 2

Number and percentage of repeats in the M. melodia genome assembly

| Classification | Number of copies | Percentage of assembly |
| --- | --- | --- |
| LINEs | 104,032 | 3.01 |
| LTRs | 85,276 | 2.83 |
| SINEs | 6,695 | 0.08 |
| DNA Transposons | 13,521 | 0.21 |
| Unclassified | 4,884 | 0.12 |
| Total transposable elements | 214,408 | 6.25 |
| Satellites | 569 | 0.00 |
| Low complexity repeats | 38,561 | 0.20 |
| Microsatellites | 192,996 | 0.90 |
| Total | 446,534 | 7.35 |
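As a worked check of Table 2, the short sketch below recomputes the totals and the relative shares of each repeat class from the tabulated values. The shares by copy number and by percentage of the assembly differ, so both are shown; the variable names are illustrative.

```python
# Values transcribed from Table 2: class -> (number of copies, % of assembly)
TE_CLASSES = {
    "LINEs": (104_032, 3.01),
    "LTRs": (85_276, 2.83),
    "SINEs": (6_695, 0.08),
    "DNA Transposons": (13_521, 0.21),
    "Unclassified": (4_884, 0.12),
}
OTHER_REPEATS = {
    "Satellites": (569, 0.00),
    "Low complexity repeats": (38_561, 0.20),
    "Microsatellites": (192_996, 0.90),
}

te_copies = sum(copies for copies, _ in TE_CLASSES.values())
te_percent = round(sum(pct for _, pct in TE_CLASSES.values()), 2)
all_copies = te_copies + sum(copies for copies, _ in OTHER_REPEATS.values())
all_percent = round(te_percent + sum(pct for _, pct in OTHER_REPEATS.values()), 2)

print("TE copies:", te_copies, "| TE % of assembly:", te_percent)
print("All repeat copies:", all_copies, "| repeat % of assembly:", all_percent)

# Share of TEs among all repeats, by copy number and by assembly percentage
print("TE share of repeat copies (%):", round(100 * te_copies / all_copies, 1))
print("TE share of repeat bases (%):", round(100 * te_percent / all_percent, 1))

# Share of each class among TE copies
for name, (copies, _) in TE_CLASSES.items():
    print(f"{name}: {round(100 * copies / te_copies, 1)}% of TE copies")
```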
Figure 2

Comparison of percentages of transposable elements (TEs) among related songbird genome assemblies. * Data from: Zhang Science. 346: 1311-1320.

Overall, 112,419 microsatellites with motifs ranging in size from 2-20 bp were found in the song sparrow genome (File S15 contains all microsatellites with their genomic locations). The majority of the microsatellites were made up of 2-, 3-, 4-, and 5-mers, with 2-mers making up about 71% of all microsatellites identified (Figure 3, File S16). The distribution of the top base-pair composition of microsatellite motifs present in the genome is shown in Fig S2. The frequency of occurrence of microsatellites in every scaffold and a distribution of their lengths are provided in Files S17 and S18, respectively.
Figure 3

Abundance of microsatellite repeat motif size classes in the M. melodia genome assembly (details are given in Supplemental File S16).


Gene annotation and function prediction

The MAKER genome annotation pipeline predicted 15,086 genes and 139 pseudogenes in the song sparrow genome, fewer than in T. guttata, F. albicollis, and M. vitellinus, but more than in G. fortis (Table 3). The average gene length, exon length, intron length, and the total number of exons and introns predicted are also lower than in closely related species (Table 3). Of the 15,086 predicted genes, 12,541 genes were assigned putative functions with BLASTP (File S19). InterProScan assigned functional domains to 11,298 (74.9%) predicted genes (File S20). A total of 7,010 genes obtained GO annotations. Pathway annotations were assigned to 2,716 genes.
Table 3

Characteristics of genes predicted in the M. melodia genome compared to Taeniopygia guttata (zebra finch), Ficedula albicollis (collared flycatcher), Manacus vitellinus (golden-collared manakin) and Geospiza fortis (medium ground finch)

| Characteristic | M. melodia | T. guttata¹ | F. albicollis² | M. vitellinus³ | G. fortis⁴ |
| --- | --- | --- | --- | --- | --- |
| Number of genes | 15,086 | 17,561 | 16,763 | 18,976 | 14,388 |
| Mean gene length (bp) | 14,457 | 26,458 | 31,394 | 27,847 | 30,164 |
| Mean CDS length (bp) | 1,325 | 1,677 | 1,942 | 1,929 | 1,766 |
| Number of exons | 131,940 | 171,767 | 189,043 | 190,390 | 164,721 |
| Mean exon length (bp) | 153 | 225 | 253 | 264 | 195 |
| Mean number of exons/gene | 8.67 | 10.25 | 12.22 | 11.51 | 11.41 |
| Number of introns | 116,724 | 153,909 | 171,236 | 171,089 | 149,563 |
| Mean intron length (bp) | 1,695 | 2,930 | 3,257 | 3,294 | 2,813 |

¹ https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Taeniopygia_guttata/103/

² https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Ficedula_albicollis/101/

³ https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Manacus_vitellinus/102/

⁴ https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Geospiza_fortis/101/

Annotated genes were assigned annotation edit distance (AED) scores with values ranging from 0 to 1. AED is a distance metric score that signifies how closely gene models match transcript and protein evidence. Gene models with AED scores closer to 0 have better alignment with the evidence provided in the MAKER pipeline. A distribution of the percentage of genes with their corresponding AED scores shows close similarity of the annotated genes with the transcript and protein evidence provided in the MAKER pipeline (Fig S1). The song sparrow genome assembly contained 4,318 complete universal single-copy orthologs (BUSCOs; 87.9%) from a total of 4,915 BUSCO groups searched. Among all complete BUSCOs, 99.4% were present as single-copy genes and 0.6% were duplicated. About 7.4% (356) of the orthologous gene models were partially recovered, and 4.9% (241) had no significant matches. Fragmented and missing gene models may reflect genes that are only partially assembled or absent from the assembly, or genes that are too divergent or too structurally complex to be predicted reliably. Incomplete and missing gene models could also suggest problems associated with the genome assembly and gene annotation. The results from the BUSCO analysis are in agreement with the Circos plot (Figure 4), in which few scaffolds in the T. guttata genome assembly are not represented in our assembly and very few inconsistent arrangements of scaffolds exist between the two genome assemblies. Comparison of our assembly to F. albicollis, P. major, and P. domesticus genome assemblies showed many more inconsistencies in the arrangements of scaffolds between the genomes of these birds and M. melodia (Fig S3) than between T. guttata and M. melodia.
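To make the AED score mentioned above more concrete, the sketch below uses the commonly cited definition AED = 1 - (SN + SP)/2, where SN is the fraction of evidence bases overlapped by the gene model and SP is the fraction of gene-model bases overlapped by the evidence. This is an illustrative base-level calculation with hypothetical function names and toy coordinates; it is not MAKER's implementation.

```python
def interval_bases(intervals):
    """Expand 1-based inclusive (start, end) intervals into a set of positions."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end + 1))
    return bases


def annotation_edit_distance(model_exons, evidence_blocks):
    """Base-level AED between a gene model and its aligned evidence.

    Returns 0.0 for perfect agreement and 1.0 when there is no overlap
    (or when either the model or the evidence is empty).
    """
    model = interval_bases(model_exons)
    evidence = interval_bases(evidence_blocks)
    if not model or not evidence:
        return 1.0
    overlap = len(model & evidence)
    sensitivity = overlap / len(evidence)   # evidence bases recovered by the model
    specificity = overlap / len(model)      # model bases supported by evidence
    return 1.0 - (sensitivity + specificity) / 2.0


if __name__ == "__main__":
    # Hypothetical single-exon model compared against three evidence sets
    model = [(1, 100)]
    print(annotation_edit_distance(model, [(1, 100)]))     # identical
    print(annotation_edit_distance(model, [(51, 150)]))    # half overlapping
    print(annotation_edit_distance(model, [(201, 300)]))   # disjoint
```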
Figure 4

Jupiter plot correlating zebra finch and song sparrow genome assemblies, considering scaffolds greater than 100 kbp in the reference zebra finch genome and the largest scaffolds representing 85% of the song sparrow genome.


Non-coding RNA prediction and identification

A total of 267 tRNAs were detected in the song sparrow genome by tRNAscan-SE (see File S21 for sequence and structure of tRNAs), of which 129 were found to code for the standard twenty amino acids. The predicted output from tRNAscan-SE (File S22) contained 114 tRNAs with low Infernal as well as Isotype scores; these were characterized as pseudogenes lacking tRNA-like secondary structures (Lowe and Chan 2016). Two tRNAs had undetermined isotypes and 22 were chimeric, with mismatched isotypes. Chimeric tRNAs contain point mutations in their anticodon sequence, rendering different predicted isotypes than those predicted by structure-specific tRNAscan-SE covariance models. Among all predicted tRNAs, 11 contained introns within their sequences. No suppressor tRNAs or tRNAs coding for selenocysteine were predicted. The subset of 10 randomly selected tRNAs was also predicted in many other species in both GtRNAdb and tRNAdb databases. Infernal searches predicted a total of 364 ncRNAs in the song sparrow genome, comprising 166 miRNAs, 8 rRNAs, 154 snoRNAs, 16 snRNAs, and 20 lncRNAs (File S23). Compared to the genomes of related avian species (T. guttata and F. albicollis), the song sparrow genome has the highest number of predicted tRNAs, but fewer other ncRNAs (Table 4).
Table 4

Number of ncRNAs predicted in the Melospiza melodia genome compared to Taeniopygia guttata (zebra finch) and Ficedula albicollis (collared flycatcher)

| ncRNA class | M. melodia | T. guttata¹,² | F. albicollis¹,³ |
| --- | --- | --- | --- |
| tRNA | 267 | 184 | 179 |
| miRNA | 166 | 302 | 510 |
| snRNA | 16 | 44 | 32 |
| snoRNA | 154 | 241 | 199 |
| rRNA | 8 | 100 | 22 |
| lncRNA | 20 | 908 | 1473 |

¹ http://useast.ensembl.org/info/data/ftp/index.html

² https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Taeniopygia_guttata/103/

³ https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Ficedula_albicollis/101/

Conclusion

The Chicago and shotgun sequencing libraries along with the HiRise assembly software enabled accurate and highly contiguous de novo assembly of the song sparrow genome. The genome assembly is 978.3 Mb, with 48 scaffolds (L50) making up half of the assembly length. A previous estimate of genome size of M. melodia from densitometry analysis provided a C-value of 1.43 pg (1,398.54 Mb) (Andrews et al. 2009). Our own k-mer based estimate of genome size from paired reads in the shotgun and Chicago libraries using Kmergenie v1.7044 (Chikhi and Medvedev 2014) yielded an estimated size of 1,127.25 Mb. Both these genome size estimates and the genome sizes of related birds (Table S1) are slightly higher than our genome assembly (978.3 Mb). Our small assembly size may be attributed to the compression of repetitive regions, which is generally observed in assemblies generated from short-read sequencing data. This is also consistent with the fact that our genome contains fewer repeats when compared to related songbirds (Figure 2). Although short reads limit our ability to characterize the total number of repeats within long tandem arrays, we have been able to characterize the vast majority of repeats, resolving them into LINEs, SINEs, LTRs, and DNA transposons (Figure 2, Table 2). Our genome is highly complete, with 87.5% full-length genes present out of 4,915 universal orthologous genes in avian species. A large set of genes (15,086) with known homology to related birds was annotated in our study. A majority of these genes (83%) were assigned putative functions. The improved scaffold lengths and gene model annotations will facilitate studies to identify genes responsible for multiple phenotypic traits of interest.
Additionally, longer scaffolds in the Chicago HiRise assembly will help detect regions under selection, including SNPs and structural variants such as insertions/deletions or copy number variations which are potentially responsible for the phenotypic diversity observed in this species. Though we report fewer miRNAs, snRNAs, snoRNAs, rRNAs, and lncRNAs in this genome than in related songbirds, we have high confidence in the predicted ncRNAs we report because we used conservative cutoffs to reduce false positives. Pending the availability of long-read data, this genome assembly provides an excellent reference for a range of genetic, ecological, functional, and comparative genomic studies in song sparrows and other songbirds.
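The genome size comparison in this section can be reproduced with a few lines of arithmetic. The sketch below assumes the widely used conversion of 1 pg of DNA to 978 Mb, which reproduces the 1,398.54 Mb figure quoted for the 1.43 pg C-value; the constant and function names are illustrative.

```python
MB_PER_PG = 978.0  # assumed conversion factor: 1 pg of DNA ~ 978 Mb


def c_value_to_mb(c_value_pg):
    """Convert a haploid C-value in picograms to megabases."""
    return c_value_pg * MB_PER_PG


ASSEMBLY_MB = 978.3          # Chicago HiRise assembly length
KMER_ESTIMATE_MB = 1127.25   # Kmergenie estimate reported in the text
densitometry_mb = c_value_to_mb(1.43)

print(f"Densitometry estimate: {densitometry_mb:.2f} Mb")
print(f"Assembly / densitometry estimate: {100 * ASSEMBLY_MB / densitometry_mb:.1f}%")
print(f"Assembly / k-mer estimate: {100 * ASSEMBLY_MB / KMER_ESTIMATE_MB:.1f}%")
```

Expressing the assembly length as a fraction of each estimate gives a rough sense of how much sequence, presumably collapsed repeats, may be absent from the short-read assembly.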

1.  InterProScan--an integration platform for the signature-recognition methods in InterPro.

Authors:  E M Zdobnov; R Apweiler
Journal:  Bioinformatics       Date:  2001-09       Impact factor: 6.937

2.  Automated de novo identification of repeat sequence families in sequenced genomes.

Authors:  Zhirong Bao; Sean R Eddy
Journal:  Genome Res       Date:  2002-08       Impact factor: 9.043

3.  Annotating functional RNAs in genomes using Infernal.

Authors:  Eric P Nawrocki
Journal:  Methods Mol Biol       Date:  2014

Review 4.  Genes and vocal learning.

Authors:  Stephanie A White
Journal:  Brain Lang       Date:  2009-11-13       Impact factor: 2.381

5.  A single IGF1 allele is a major determinant of small size in dogs.

Authors:  Nathan B Sutter; Carlos D Bustamante; Kevin Chase; Melissa M Gray; Keyan Zhao; Lan Zhu; Badri Padhukasahasram; Eric Karlins; Sean Davis; Paul G Jones; Pascale Quignon; Gary S Johnson; Heidi G Parker; Neale Fretwell; Dana S Mosher; Dennis F Lawler; Ebenezer Satyaraj; Magnus Nordborg; K Gordon Lark; Robert K Wayne; Elaine A Ostrander
Journal:  Science       Date:  2007-04-06       Impact factor: 47.728

6.  Rfam: Wikipedia, clans and the "decimal" release.

Authors:  Paul P Gardner; Jennifer Daub; John Tate; Benjamin L Moore; Isabelle H Osuch; Sam Griffiths-Jones; Robert D Finn; Eric P Nawrocki; Diana L Kolbe; Sean R Eddy; Alex Bateman
Journal:  Nucleic Acids Res       Date:  2010-11-09       Impact factor: 16.971

7.  Meraculous: de novo genome assembly with short paired-end reads.

Authors:  Jarrod A Chapman; Isaac Ho; Sirisha Sunkara; Shujun Luo; Gary P Schroth; Daniel S Rokhsar
Journal:  PLoS One       Date:  2011-08-18       Impact factor: 3.240

8.  AUGUSTUS at EGASP: using EST, protein and genomic alignments for improved gene prediction in the human genome.

Authors:  Mario Stanke; Ana Tzvetkova; Burkhard Morgenstern
Journal:  Genome Biol       Date:  2006-08-07       Impact factor: 13.583

9.  Whole-genome analyses resolve early branches in the tree of life of modern birds.

Authors:  Erich D Jarvis; Siavash Mirarab; Andre J Aberer; Bo Li; Peter Houde; Cai Li; Simon Y W Ho; Brant C Faircloth; Benoit Nabholz; Jason T Howard; Alexander Suh; Claudia C Weber; Rute R da Fonseca; Jianwen Li; Fang Zhang; Hui Li; Long Zhou; Nitish Narula; Liang Liu; Ganesh Ganapathy; Bastien Boussau; Md Shamsuzzoha Bayzid; Volodymyr Zavidovych; Sankar Subramanian; Toni Gabaldón; Salvador Capella-Gutiérrez; Jaime Huerta-Cepas; Bhanu Rekepalli; Kasper Munch; Mikkel Schierup; Bent Lindow; Wesley C Warren; David Ray; Richard E Green; Michael W Bruford; Xiangjiang Zhan; Andrew Dixon; Shengbin Li; Ning Li; Yinhua Huang; Elizabeth P Derryberry; Mads Frost Bertelsen; Frederick H Sheldon; Robb T Brumfield; Claudio V Mello; Peter V Lovell; Morgan Wirthlin; Maria Paula Cruz Schneider; Francisco Prosdocimi; José Alfredo Samaniego; Amhed Missael Vargas Velazquez; Alonzo Alfaro-Núñez; Paula F Campos; Bent Petersen; Thomas Sicheritz-Ponten; An Pas; Tom Bailey; Paul Scofield; Michael Bunce; David M Lambert; Qi Zhou; Polina Perelman; Amy C Driskell; Beth Shapiro; Zijun Xiong; Yongli Zeng; Shiping Liu; Zhenyu Li; Binghang Liu; Kui Wu; Jin Xiao; Xiong Yinqi; Qiuemei Zheng; Yong Zhang; Huanming Yang; Jian Wang; Linnea Smeds; Frank E Rheindt; Michael Braun; Jon Fjeldsa; Ludovic Orlando; F Keith Barker; Knud Andreas Jønsson; Warren Johnson; Klaus-Peter Koepfli; Stephen O'Brien; David Haussler; Oliver A Ryder; Carsten Rahbek; Eske Willerslev; Gary R Graves; Travis C Glenn; John McCormack; Dave Burt; Hans Ellegren; Per Alström; Scott V Edwards; Alexandros Stamatakis; David P Mindell; Joel Cracraft; Edward L Braun; Tandy Warnow; Wang Jun; M Thomas P Gilbert; Guojie Zhang
Journal:  Science       Date:  2014-12-12       Impact factor: 47.728

10.  Trimmomatic: a flexible trimmer for Illumina sequence data.

Authors:  Anthony M Bolger; Marc Lohse; Bjoern Usadel
Journal:  Bioinformatics       Date:  2014-04-01       Impact factor: 6.937

View more