
Quality Assessment of Domesticated Animal Genome Assemblies.

Stefan E Seemann¹, Christian Anthon¹, Oana Palasca¹, Jan Gorodkin¹.

Abstract

The era of high-throughput sequencing has made it relatively simple to sequence genomes and transcriptomes of individuals from many species. To analyze the resulting sequencing data, high-quality reference genome assemblies are required. However, this is still a major challenge, and many domesticated animal genomes still need to be sequenced more deeply to produce high-quality assemblies. Meanwhile, ironically, the amount of RNAseq and other next-generation data produced frequently far exceeds that of the genomic sequence. Furthermore, basic comparative analysis is often affected by the lack of genomic sequence. Herein, we quantify the quality of the genome assemblies of 20 domesticated animals and related species by assessing a range of measurable parameters, and we show that there is a positive correlation between the fraction of mappable reads from RNAseq data and genome assembly quality. We rank the genomes by their assembly quality and discuss the implications for genotype analyses.


Keywords: assembly quality; domesticated animals; genome assembly

Year: 2016 PMID: 27279738 PMCID: PMC4898645 DOI: 10.4137/BBI.S29333

Source DB: PubMed Journal: Bioinform Biol Insights ISSN: 1177-9322

Introduction

Domesticated farm animals are of the highest importance for human food supply. This implies a need for optimized productivity while demanding healthy animals living under justifiable ethical conditions. Some domestic animals are also used as model organisms for human diseases, eg, pig as a model for obesity, cardiovascular disease, gastroenteropathy, and immunological diseases, as well as a pharmacology and toxicology model.1–6 Variations in the genomic sequence are gaining increasing importance for improving strategies for domestic animal studies. However, the assembled genomes are of highly diverse quality. High-quality genome assemblies are a prerequisite for high-quality genomic and transcriptomic analyses, whereas poor genome assemblies increase the risk of poor transcriptome assemblies, which strongly reduces the value of any next-generation sequencing (NGS) experiment. In recent years, various NGS strategies have been widely used to address a range of questions, from differential expression to epigenetic marks such as methylation signatures of RNA transcripts. Although the creativity in the ways to use NGS seems to be unlimited, the usage of NGS often requires a good reference genome to start with. Unfortunately, there have seemingly been few recent advances in improving the reference genome sequences accordingly, although ongoing developments in the field such as PacBio hold the potential for a paradigm shift, depending on the overall costs.7 Currently, genomes such as pig (susScr10.2) and dog (canFam3) have not been improved since 2011, and the horse genome has not been improved since 2007. This leaves an apparent imbalance and a potential waste of resources, because data generated for genome-wide comparison cannot be mapped. It also hinders proper and full analyses. In the best case, this will result in an incomplete analysis, but in the worst case, it will lead to misinterpretation of the data.
We briefly outline some of the genome assemblies in Figure 1A. Most of the species considered were sequenced using a hybrid approach, combining whole-genome shotgun sequencing (WGS) with a hierarchical BAC clone approach, and only a few of them relied solely on WGS (for example, dog, sheep, and goat). Organisms sequenced in the early 2000s benefit from the integration of Sanger-based sequencing, which is characterized by longer read length and better sequence quality, while more recently sequenced organisms are mainly based on short-read NGS.8 The dog genome assembly is based on WGS with Sanger sequencing. CanFam2, on which the current canFam3 is based, was at the time of its release (2004) much better than most other assemblies (eg, the mouse genome), due to high sequence coverage and good data quality in terms of read length and library insert sizes (up to 200 kb insert length).9 The horse genome, also known to be of high quality, is based on Sanger sequencing as well, but supplemented with BAC and fosmid clone maps for better contiguity.10 The cow genome was sequenced using the Sanger method and incorporates BAC clones, accounting for a large proportion of the genome coverage.11 In cow, special attention was given to the genome assembly method, by using improved postprocessing algorithms that took into account synteny with the human genome, among others.11 On the other hand, the pig genome is mainly based on sequences from BAC clones obtained back in the 2000s, to which WGS Illumina sequences were added for resolving gaps.12

Figure 1

Discrepancy between phylogeny and gene annotation.

Notes: (A) The phylogenetic tree of the 21 investigated species is shown, with a clear separation between placental mammals and birds. The tree is a subset of the UCSC-generated 100-way tree. (B) A UCSC genome browser view in human of the genomic region around PROZ. PROZ is missing in the pig assembly susScr102 and in the phylogenetic subtree around dog, but the gene is conserved in the phylogenetic subtree of pig and even in the more distant birds.

The sheep and goat assemblies were obtained by short-read NGS. The sheep genome was iteratively improved with new sequencing data generated in different rounds by Illumina and 454 technologies, in an attempt to cover the numerous gaps in the assembly.13 For the goat assembly based on Illumina reads, the newer optical mapping technology was employed,14 instead of the radiation-hybrid or fluorescence in situ hybridization maps used in most other assemblies for the alignment of chromosomes. The chicken genome was initially sequenced in 2004 using the Sanger technology15 with the support of BAC clone physical maps for assembly scaffolding. The first version was updated with new 454 sequencing data, genetic maps, and better assembly algorithms. The genome assembly of turkey was obtained using Illumina and 454 sequencing and makes use of a BAC clone-based physical map for aligning to chromosomes only.16 Being less than half the size of mammalian genomes, but with a higher number of chromosomes, both chicken and turkey genomes showed particular assembly challenges in certain genomic regions, mainly due to repeats specific to the small chromosomes in birds.15,16 When conducting comparative genome analysis of specific regions of domesticated animal genomes in, eg, the University of California Santa Cruz (UCSC) genome browser, it is apparent that regions are missing with some frequency. For example, vitamin K-dependent plasma glycoprotein (PROZ), a gene encoding a protein with a role in regulating blood coagulation in human, has annotated orthologs in cow, sheep, and horse, as well as mouse, and even xenopus and some birds, but orthologs are missing in pig and dog, among others (see Fig. 1B). However, it is not clear whether this gene is missing because it is simply not in the genome assemblies or whether it was indeed lost in some of the lineages. Elucidation of these loci is highly relevant, so that it can be determined why the gene is missing in the genome.
For example, if the missing genome sequence is a protein-coding gene highly conserved over mammals, one would also expect to find it in the syntenic region and a naive first approach is to look for the corresponding protein isoforms in the relevant databases. However, less conserved genomic loci will leave further analyses open for future considerations, or they would demand large resources to solve a local genomic region experimentally. To investigate the extent of this problem, we assessed the quality of the most recent genome assemblies of 20 domestic animals and related species and compared to the human genome (hg19), which we define as the gold standard assembly. The assembly quality is measured considering a range of parameters, as follows: nucleic acid conservation of highly conserved protein-coding and ultraconserved elements (UCs); amino acid homology of universal single-copy orthologs; structure conservation of housekeeping RNAs; assembly sequence quality; and assembly contiguity. With this information, we can further quantify the imbalance between NGS applications and genome assembly quality.

Methods

Genomes

The genomes investigated in this study are listed in Table 1. We focused on domesticated animals that are part of either the Laurasiatheria or the Aves. We supplemented the domesticated animal genomes with other species within these two phylogenetic classes based on the criteria of maturity of the assembly and the existence of genomic annotation. For comparison, the well-assembled genomes of human (hg19/GRCh37) and mouse (mm10/GRCm38) were added. The genomic sequences were downloaded from the UCSC web servers as so-called 2bit files.

Table 1

Genome assembly quality features of human, domesticated animals, and related species.

SPECIES	ASSEMBLY			PROTEIN-CODING (PCE)				ULTRACONSERVED (UC)			ORTHOLOGS (BUSCO)				rRNA	tRNA	GAPS	CONTIGUITY (N50)
				4,856 EXONS				473 LOCI			3,023 GENES					21 AA		
	VERSION	YEAR	SIZE [MBP]	D [#]	P [#]	S [#]	W [#]	D [#]	P [#]	S [#]	C [%]	CD [%]	F [%]	M [%]	[SCORE]	[#]	[#]	[KBP]
Human	hg19	2009	3,137	0	0	0	0	0	0	0	90	1.7	5.1	4.5	8	20	411	46,396
Laurasiatheria
Mouse	mm10	2012	2,731	17	23	2	5	0	0	0	91	2.2	4.8	3.8	3	21	582	52,589
Panda	ailMel1†	2009	2,300	30	34	11	44	0	0	0	88	0.5	8.0	3.4	0	21	108,147	1,282
Cow	bosTau8	2009	2,670	23	26	6	24	3	1	0	84	1.8	8.6	6.9	6	21	72,051	6,380
Dog	canFam3	2011	2,411	33	22	9	24	6	2	0	89	2.0	6.3	4.2	8	20	23,876	45,877
Domestic goat	capHir1	2013	2,636	126	91	23	79	3	1	0	79	1.0	12	8.2	0	21	260,474	14,391
Horse	equCab2	2007	2,485	76	39	12	27	2	4	1	86	0.5	8.9	4.0	8	21	55,283	46,750
Hedgehog	eriEur2†	2012	2,716	63	28	3	85	2	1	0	86	1.4	8.5	5.3	0	21	219,764	3,265
Cat	felCat8	2014	2,641	34	28	7	22	2	0	0	88	0.7	7.3	4.3	4	21	100,040	18,072
Ferret	musFur1†	2011	2,411	52	26	7	53	2	1	0	89	1.0	6.5	3.8	0	20	109,700	9,335
Microbat	myoLuc2†	2010	2,035	162	36	13	93	7	5	0	83	3.9	9.1	7.4	4	20	61,131	4,293
Sheep	oviAri31	2012	2,619	67	77	14	54	0	0	1	81	1.2	11	7.2	3	21	125,067	100,080
Megabat	pteVam2†	2014	2,198	65	29	17	110	0	0	1	87	0.7	7.7	4.8	4	20	189,339	5,954
Shrew	sorAra2†	2012	2,423	117	33	10	66	6	0	0	85	1.2	7.6	6.9	0	21	188,953	22,794
Pig	susScr102	2011	2,809	210	81	28	213	25	12	1	69	2.4	12	17	7	20	238,439	576
Dolphin	turTru2†	2012	2,552	24	41	23	469	1	3	8	72	1.5	14	13	4	21	313,713	116
Alpaca	vicPac2†	2013	2,172	48	33	19	56	4	0	0	87	1.2	7.9	4.1	0	21	174,225	7,264
Aves
Zebra finch	taeGut2	2013	1,232	816	125	15	81	13	12	6	77	2.0	8.7	13	4	20	87,710	8,237
Mallard duck	anaPla1†	2013	1,105	979	117	19	119	11	14	2	72	0.7	10	16	0	20	125,115	1,234
Chicken	galGal4	2011	1,047	686	79	8	50	10	7	0	85	0.9	5.5	8.8	4	20	11,109	12,877
Turkey	melGal5	2014	1,128	637	96	20	376	7	5	2	74	0.5	10	14	0	20	64,955	3,801

Notes: rRNA is the completeness of one 45S ribosomal DNA cluster consisting of pRNA, 28S, 5.8S, and 18S rRNAs in exactly this 5′ to 3′ order. tRNA is the occurrence of 21 amino acids (aa). Gaps are 10 or more nucleotides long. Contiguity is the scaffold N50. PCE, UC, tRNA, and Gaps are absolute counts [#], BUSCO is in percentage [%], rRNA is presented as a score, and genome size and N50 are sequence lengths. The assembly version is the UCSC Genome Browser assembly ID.

†Assembly level is scaffold; otherwise chromosome.

Abbreviations: PCE, Protein-coding exons could be D, deleted; P, partially deleted; S, split; or in W, wrong order. UC, Ultraconserved elements could be D, deleted; P, partially deleted; or S, split. BUSCO, Universal single-copy orthologs could be C, complete; CD, complete duplicated; F, fragmented; or M, missing.

Genome assembly quality features

Genome assembly quality has previously been assessed in many different ways with focus on methodologies (eg, insert size distributions and sequence coverage), genome biases (eg, k-mer distributions), or fragment length distributions (eg, N50).17,18 Additionally, the completeness of highly conserved orthologous genes in genome assemblies has been investigated to reflect the expected gene content.19,20 In the current study, we combine a number of these previously proposed features along with nucleic acid conservation and synteny of highly conserved genomic loci.

Analysis of conserved genomic features

The analysis of highly conserved genomic features (conserved protein-coding genes and UCs) is based on pairwise sequence alignments of human and the 20 vertebrates. The pairwise alignments were built by lastz21 and the UCSC toolkit22 for chains and nets with human as query. We used the UCSC tool liftOver (parameter minMatch = 0.8) to convert genomic coordinates in human to the other species based on the pairwise alignments. We investigated the conservation of the union of 32 universal genes (COGs) described by Ciccarelli et al.23 and 444 conserved core eukaryotic genes19 with an ortholog in human. The majority of COGs are ribosomal proteins. The 4,856 merged exons from these 463 protein-coding genes were classified as deleted, partially deleted, split, or being in the wrong order compared with the exon order in human. Furthermore, we checked for the presence of 473 UCs (loci of at least 200 bp with 100% identity in rat, mouse, and human),24 which were classified as deleted, partially deleted, or split. Note that these highly conserved genomic features cover both coding and noncoding intergenic regions, in contrast to the Benchmarking Universal Single-Copy Orthologs (BUSCOs; see below), which cover protein-coding genes only.

Benchmarking Universal Single-Copy Orthologs

Sets of BUSCOs are orthologous groups of single-copy genes described by Simão et al.20 Any BUSCO in vertebrates can be expected to be found as a single-copy ortholog in any genome from the phylogenetic clade of vertebrates. In short, for each BUSCO group, an amino acid consensus sequence is generated from its respective hidden Markov model profile, and a block profile is built to guide automated gene predictions with AUGUSTUS.25 During genome assessment, regions in a genome that are likely to encode BUSCO-matching genes are identified by tBLASTn searches; genes are then predicted in these candidate regions using the corresponding BUSCO group’s block profile and default gene finding parameters. Successful AUGUSTUS gene prediction for each BUSCO group produces an initial BUSCO gene set whose protein sequences are then evaluated using the BUSCO-specific cutoffs to determine true orthology and completeness. Finally, significantly matching protein sequences are tested for being likely orthologous or merely homologous by applying the BUSCO group’s hidden Markov model profile. We ran BUSCO version 1.1b1 in genome mode with the lineage-specific profile libraries of vertebrata and used the precomputed metaparameters of human for placental mammals and chicken for birds (parameter --species human/chicken).

45S ribosomal DNA cluster

Ribosomal RNAs (rRNAs) are the primary structural components of the ribosome. The rRNA species 28S and 5.8S from the large ribosomal subunit and 18S rRNAs from the small subunit are encoded by the 45S ribosomal DNA (rDNA) cluster. Transcription by RNA polymerase I yields a primary transcript (45S pre-rRNA), which is processed into the mature 28S, 18S, and 5.8S rRNAs found in cytoplasmic ribosomes. The rRNAs residing in a single 45S transcription unit are separated by spacers and are always arranged in the same 5′ to 3′ order: 18S, 5.8S, and 28S. rDNA silencing is mediated through methylation of the rDNA promoter via DNMT3B and pRNA, a noncoding RNA which has been shown to originate from a spacer promoter located upstream of the pre-rRNA transcription start site.26 We predicted the 28S and 18S rRNAs with RNAmmer27 and searched the 5.8S rRNA Rfam family RF00002 and the pRNA Rfam family RF01518 with Infernal.28 We defined the rRNA score to describe the completeness of the 45S rDNA cluster as 2 × R − S, where R is the count of pRNA, 28S, 18S, and 5.8S rRNA found in the correct order on the same chromosome or scaffold (2 ≤ R ≤ 4), and S is 1 if not all items are located on the same strand and 0 otherwise. The cluster with the highest rRNA score was reported.
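The scoring rule above can be sketched in a few lines of Python. This is only an illustration, not the code used in the study: the function names and the input representation (a count of correctly ordered elements plus a same-strand flag per candidate cluster) are assumptions. The sketch also assumes that a genome without any candidate cluster of at least two elements receives a score of 0, which is consistent with the zeros in Table 1 but is not stated explicitly in the text.

```python
def rrna_score(n_in_order, same_strand):
    """Score one candidate 45S rDNA cluster as 2 * R - S.

    n_in_order  -- R, number of cluster elements (pRNA, 28S, 18S, 5.8S)
                   found in the correct order on one chromosome or scaffold
    same_strand -- True if all of these elements lie on the same strand
    """
    if n_in_order > 4:
        raise ValueError("a 45S cluster has at most four scored elements")
    if n_in_order < 2:
        # Assumption: fewer than two co-located elements is not a cluster.
        return 0
    s = 0 if same_strand else 1
    return 2 * n_in_order - s


def best_cluster_score(clusters):
    """Return the highest score over (n_in_order, same_strand) pairs."""
    return max((rrna_score(r, same) for r, same in clusters), default=0)


# A complete cluster on one strand scores 8; the same four elements with
# one of them on the opposite strand score 7.
print(rrna_score(4, True), rrna_score(4, False))
```

Under these assumptions, a score of 8 corresponds to a complete cluster on a single strand, and odd scores indicate a strand inconsistency.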

Additional features

We counted the presence of tRNAs coding for each of the 20 standard amino acids and selenocysteine and required at least one tRNA coding for each. tRNAs were predicted with tRNAscan-SE.29 Assembly sequence quality was measured by counting gaps of 10 or more nucleotides in the genome assemblies. The assembly contiguity was described by the scaffold N50 metric as documented in the NCBI Assembly database (http://www.ncbi.nlm.nih.gov/assembly). Scaffold N50 is a scaffold size such that scaffolds of this length or longer include half of the bases of the assembly.
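As an illustration of the two sequence-level metrics above, the following Python sketch computes a scaffold N50 from a list of scaffold lengths and counts gaps of at least 10 nucleotides. It assumes that gaps are encoded as runs of the character N, which is common practice but is not stated in the text. The function names are illustrative and are not taken from the study, which used the N50 values documented by NCBI.

```python
import re


def scaffold_n50(lengths):
    """Return the N50 of a collection of scaffold lengths.

    Scaffolds are visited from longest to shortest; the N50 is the length
    of the scaffold at which the running sum first reaches half of the
    total assembly size.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no scaffold lengths given")


def count_gaps(sequence, min_len=10):
    """Count runs of at least min_len N characters (assumed gap encoding)."""
    return len(re.findall(r"[Nn]{%d,}" % min_len, sequence))


# Toy example with made-up scaffold lengths and a made-up sequence.
print(scaffold_n50([2, 2, 2, 3, 3, 4, 8, 8]))
print(count_gaps("ACGT" + "N" * 10 + "AC" + "N" * 9 + "GT" + "n" * 12))
```

Because the gap count scales with genome size, the ranking described in the next section uses the gap count normalized by assembly size rather than the raw count.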

Genome assembly quality ranking

A quality score for genome assemblies has been previously suggested by combining normalized feature scores.30 Besides using some of their presented features, we decided to rank the genome qualities without weighting each of the features used to analyze the genomes. The impact of each feature for describing the assembly quality is unknown, and a perfect vertebrate genome assembly to train the weighting parameters does not exist (even the human assembly is still incomplete). Hence, model training would necessarily result in a score biased toward the defined standard. Instead, we decided to measure the differences between the assembly qualities of the studied species by reducing the dimensionality of the feature space. This was done by a principal component analysis (PCA) of the features (princomp function from the built-in R stats package). Each genome assembly was represented by a vector consisting of 15 z-score normalized features (see Table 1): highly conserved protein-coding exons (PCE; 4 features), ultraconserved elements (UC; 3 features), universal single-copy orthologs (BUSCO; 4 features), 45S rDNA cluster (rRNA; 1 feature), tRNAs (1 feature), assembly size normalized gap count (1 feature), and contiguity (scaffold N50; 1 feature). Then, the ranking of assembly qualities was quantitatively measured in comparison to the human genome assembly, which has been the most intensively investigated of all vertebrate genomes. We calculated the Euclidean distance between each species and human in the space of the first three principal components (PCA score) and ranked the genome qualities accordingly. The number of principal components chosen for the distance measure explained most of the feature variance.
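The ranking procedure can be sketched as follows. The study used the princomp function in R; the Python/NumPy version below is only an approximate re-expression of the same idea (z-score normalization, PCA, Euclidean distance to the reference assembly in the space of the first three components) and not the authors' code. The function name, the handling of constant features, and the toy feature matrix are assumptions made for illustration, and the exact normalization used by the authors may differ in detail.

```python
import numpy as np


def pca_distance_to_reference(features, ref_index=0, n_components=3):
    """Distance of each row to a reference row in PCA space.

    features     -- 2D array, one row per assembly, one column per feature
    ref_index    -- row of the reference assembly (human in the study)
    n_components -- number of leading principal components to keep
    """
    x = np.asarray(features, dtype=float)
    std = x.std(axis=0)
    std[std == 0] = 1.0  # assumption: constant features are left at zero
    z = (x - x.mean(axis=0)) / std  # z-score normalization per feature

    # PCA through the SVD of the centred matrix: the component scores
    # are the columns of U scaled by the singular values.
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    k = min(n_components, s.size)
    scores = u[:, :k] * s[:k]

    # Euclidean distance of every assembly to the reference assembly.
    return np.linalg.norm(scores - scores[ref_index], axis=1)


# Toy data (made up): rows are assemblies, columns are three features.
names = ["reference", "A", "B", "C"]
toy = [[0, 0, 90], [1, 2, 88], [10, 20, 70], [5, 9, 80]]
dist = pca_distance_to_reference(toy)
for i in np.argsort(dist):
    print(names[i], round(float(dist[i]), 2))
```

By construction the reference assembly has a distance of zero, and assemblies are ranked by increasing distance from it.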

Transcriptome data and processing

For comparison of the read mappability to the genome assembly quality, we downloaded paired-end libraries of polyA-selected RNA and total RNA from the sequence read archive.31 All libraries were sequenced on an Illumina HiSeq 2000. Only species with at least three libraries were considered. The applied libraries are listed in Supplementary File 1. For 11 species, we found polyA-selected RNA libraries, and for 4 species, we found total RNA libraries. For human, mouse, and sheep, we studied both polyA-selected and total RNA libraries. We processed the raw reads by removing low-quality reads and adapter sequences with cutadapt (version 1.8.3; parameters -m 30 -q 20).32 Cleaned reads were aligned to their reference genome, which was built without annotations using STAR (version 2.4.0j).33 After aligning, we removed rRNAs from the mapped transcriptomic data based on the rRNA predictions from RNAmmer (8S, 18S, and 28S) and Infernal (Rfam families RF00001:5S and RF00002:5.8S). We counted both uniquely mapped and multimapped reads as mapped reads. For each organism, we documented the mean and standard deviation of mapped reads in the applied libraries.

Results

In this study, we analyzed the genome quality of the latest assemblies (September 2015) of 20 domesticated and phylogenetically related animals from the classes Laurasiatheria (placental mammals) and Aves (birds), including several farm animals. As a gold standard assembly, the human genome (hg19) has been included. The phylogenetic relationship between the species is shown in Figure 1A. The analyzed genome assembly quality features of all 21 species are summarized in Table 1. At first glance, we see that most of the applied features have their best values for the human and mouse genome assemblies, which is in agreement with the extensive efforts undertaken to study these organisms. The efforts to complete the genomes are especially reflected in the gap content, which is much lower for human and mouse than for the other species. Another strong signal is that, in general, features for UCs, conserved PCEs, and universal single-copy orthologs (BUSCOs) are of lowest quality for the bird genome assemblies, which may be partly explained by the evolutionary distance to mammals. However, chicken arguably has an assembly quality better than that of many mammals, and this is likely due to the usage of the Sanger sequencing technology. Nine genome assemblies are still at the scaffold level; however, we do not see a clear quality difference to the assemblies at the chromosome level. Strikingly, all genome assemblies lack a significant number of the 3,023 BUSCOs. Human, mouse, dog, and ferret have the largest number of complete BUSCOs (89%–91%). The least complete genomes in terms of BUSCOs are those of pig, dolphin, mallard duck, turkey, zebra finch, and domestic goat (69%–79%). The sequence conservation and synteny of 4,856 PCEs generally agree with the BUSCO assessment. However, an exception is cow, which performs very well in terms of PCEs but less well in BUSCOs.
The genome-wide alignment-based comparisons to human (PCEs) are likely to perform better than BUSCOs because synteny with human had been used in the build of the cow assembly.11 The 473 UCs are well covered by all genomes. An exception to this trend is the pig assembly with 8% incomplete UCs, which is more than that for the genomes of birds. It has been suggested that during the initial pig genome project, only about 90% of the pig genome was accessible in BAC clones,12 which could explain the incomplete set of intergenic elements in our study. Extreme cases of assembly contiguity are the sheep, pig, and dolphin assemblies. The scaffold N50 of sheep is very high (100,080 Kbp), whereas the contig N50 is much lower (40,376 bp), suggesting issues in the scaffolding. In contrast, the scaffold N50 of pig and dolphin is very low, which is in agreement with the low quality of these two assemblies described by the other features. The 45S rDNA cluster is a highly repetitive genomic region, which makes it hard to assemble in the correct order without very long reads. Besides human, only the genome assemblies of dog and horse have a complete 45S rDNA cluster. All the other genomes completely lack the cluster or contain only a part of it. All genomes have a complete set of tRNAs for the standard amino acids, and only some of them, including human, lack a tRNA for selenocysteine. The pig genome is clearly the least complete assembly of all placental mammals. It misses a substantial number of UCs and a large fraction of PCEs. On the other hand, it is one of the few genomes with an (almost) complete 45S rDNA cluster. 28S, 5.8S, and 18S are located in tandem on chromosome 6, but the rDNA silencing mediator pRNA is positioned almost 100 kb upstream on the opposite (positive) strand. This suggests that the pig genome has been exhaustively assembled from highly incomplete genomic sequence data.
The dolphin genome assembly represents another extreme with almost 10% of the PCEs in a rearranged order in comparison to human, which may be partly explained by the large number of scaffolds (240,901; see also the low scaffold N50). The genome assembly of panda illustrates that the features of genome sequence quality and gene content do not agree in all cases. Whereas the low scaffold N50 and the large number of gaps suggest a low quality, the gene content-based metrics are among the best of all examined assemblies (except for the rRNA feature). Hence, in the following, we propose a ranking of genome assembly quality combining all features.

Quantitative ranking of genome assembly qualities

For the 15 quality features, the first three principal components (PCs) account for 75% of their variance. Figure 2 illustrates important relationships between the assessed features and the three PCs. Almost 50% of the information in the features is reduced into the first PC (PC1), which is primarily composed of features describing misassembled protein-coding genes or UCs (fragmented, split, wrong order). Features describing missing genomic information are primarily represented by PC2 (deleted, partially deleted), and this accounts for 17.2% of the variance. PC3 describes another 10.8% of the feature variance that originates mostly from the complete duplicated BUSCO and the 45S rDNA cluster. The assembly contiguity (N50) cannot be grouped with the other features and, hence, contributes equally to all three PCs.

Figure 2

Relationship between principal components and quality features.

Notes: The first three principal components (PCs) account for 75% of the feature variance (PC1: 47.1%, PC2: 17.2%, and PC3: 10.8%). Rectangular nodes describe the 15 quality features. The edge weight describes how much variance of a feature is explained by the principal component. Green edges connect features negatively related to genome quality, and purple edges connect features positively related to genome quality. Relations (edges) are shown if greater than 0.3 or smaller than −0.3. See Table 1 for abbreviations of the quality features. Gaps are normalized by assembly size. The figure was made using the R qgraph package.

Based on these three PCs, we quantitatively rank the species by their Euclidean distance to human in the three-dimensional space (see PCA score in Figure 3). The PCA score can be interpreted as an assembly quality score. The genome assemblies of dog and mouse are of highest quality followed by cow, horse, cat, ferret, and microbat. The genomes of sheep, megabat, hedgehog, shrew, panda, alpaca, and chicken are all of medium quality, and the genomes of dolphin, mallard duck, pig, turkey, zebra finch, and domestic goat have the lowest PCA scores. Figure 3A shows a weak positive correlation between the PCA score and the divergence time between human and the compared species (Pearson’s correlation coefficient ρ = 0.54). However, after removing human (gold standard) and the bird genomes (which have large phylogenetic distance to the placental mammals), the correlation disappears (ρ = 0.26). Given that the mouse is more evolutionarily distant to human than any of the other mammals considered, and its genome assembly is of high quality, we note that the presented quality score is not biased by phylogeny.

Figure 3

Correlation of genome assembly quality to (A) phylogenetic distance to human and to (B) frequency of mapped reads from RNAseq experiments.

Notes: The genome assembly quality (PCA score) is measured as the Euclidean distance of principal component 1 (PC1), PC2, and PC3 between human and 20 other species. The human genome serves as reference and has a Euclidean distance of zero. RNAseq experiments are divided into polyA-selected RNA (blue circle) and total RNA (green diamond), and the mean and standard deviation of mapped reads are shown. After removing human (reference) and bird genomes (large phylogenetic distance), the Pearson’s correlation coefficient between assembly quality and phylogenetic distance is 0.26, between assembly quality and polyA-selected RNA mapped reads is 0.91, and between polyA-selected RNA mapped reads and phylogenetic distance is 0.43. Only the correlation between assembly quality and polyA-selected RNA mapped reads is significant (P < 0.005).

Mappability of sequencing data

Herein, we address how strongly the genome assembly quality impacts the mappability of RNAseq data to the reference genome (see Supplementary File 1). That is, we are interested in knowing how much information we lose in NGS studies merely due to a suboptimal genome assembly quality. In Figure 3B, we show a positive correlation between assembly quality (measured by the PCA score) and the percentage of mapped reads from polyA-selected RNA libraries (Pearson’s correlation coefficient ρ = 0.86; t-test P < 0.001), which indicates the importance of high-quality genome assemblies for maximal gain from NGS data. Without the human and the bird genomes, the Pearson’s correlation coefficient between mapped reads (polyA RNA) and assembly quality is even higher (ρ = 0.91; t-test P < 0.005), whereas the correlation between mapped reads (polyA RNA) and the evolutionary distance to human is not significant (ρ = 0.43). The trend for total RNA is similar, but a correlation analysis was not possible due to the small number of total RNA libraries in this study. However, the data suggest that the number of mapped reads depends even more on assembly quality for total RNA libraries than for polyA-selected RNA libraries.
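For readers who want to reproduce this kind of analysis on their own data, the following Python sketch computes Pearson's correlation coefficient and the t statistic commonly used to test it against zero, t = r × sqrt((n − 2) / (1 − r²)) with n − 2 degrees of freedom. It is a generic illustration rather than the authors' code; the function names are assumptions, and the example values are made up and are not the values underlying Figure 3B.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r against zero (n - 2 degrees of freedom).

    Requires abs(r) < 1; a perfect correlation gives a division by zero.
    """
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Made-up example: one quality score and one mapped-read percentage
# per species.
quality = [1.0, 2.0, 3.0, 4.0, 5.0]
mapped = [62.0, 70.0, 69.0, 81.0, 88.0]
r = pearson_r(quality, mapped)
print(round(r, 3), round(t_statistic(r, len(quality)), 3))
```

The P value then follows from the t distribution with n − 2 degrees of freedom, for example from a statistical table or a statistics library.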

Cases of missing genotypes in pig

The varying assembly quality of domesticated animals may have a large impact on pathway reconstruction due to missing genes. Below, we describe two examples in pig, which is an important production animal as well as a useful model organism. The first example is the DGAT2 gene, which codes for an enzyme that catalyzes triglyceride synthesis34 in eukaryotes. The gene does not have an annotated ortholog in pig. In our study, the lastz pairwise alignment to human aligns four exons of the human DGAT2 homolog to the DGAT2-like 6 gene in pig, but in the 5′ end of the gene, three of the exons are aligned to three different chromosomes. However, the gene has been isolated in pig by cDNA cloning procedures.35 The polymorphisms of the gene play a role in backfat tissue quality, which is an important trait for the meat product industry.34–36 DGAT2 is also of interest in the study of obesity in humans, and it has been shown to be upregulated in obese pigs, which are used as model organisms for obesity.37 Hence, missing this gene in systems biology analyses could have unfortunate consequences. The second example is the cholesteryl ester transfer protein (CETP), a protein playing a central role in atherosclerosis, the chronic inflammatory condition causing most cardiovascular diseases4 and therefore a leading cause of death worldwide.38 High levels of low-density lipoproteins (LDL) and low levels of high-density lipoproteins (HDL) play a main role in atherosclerosis,39 and the cholesteryl ester transfer protein is specifically the one responsible for controlling HDL-to-LDL ratios.40 Pig is a suitable model organism in the study of atherosclerosis, due to the spontaneous occurrence of the disease, its size, and its human-like cardiovascular anatomy.3,4 However, the CETP gene is not annotated in pig, and it is not clear whether this is a genome assembly issue, or whether the gene is really not present in the animal.
The gene is known to be naturally lacking in mouse,4 despite being present in other mammals such as rabbit or dolphin. A genome analysis study performing de novo genome assembly in mini pig concluded that CETP was among the genes lost in the lineage,41 while the authors of a previous study have supposedly cloned the gene in pig.42 The low levels of the protein, detected in pig by an antibody designed against the human CETP,43 could be explained by the presence of an inhibitor of the protein, a hypothesis supported by a study where a human CETP inhibitor was isolated from pig plasma.44 Due to the low quality of the pig genome, we cannot draw final conclusions about the existence or about the genotypes of these genes, which are obviously important for meat production and disease modeling. In addition, analyses of the epigenetic and regulatory marks of these genes therefore cannot be performed in pig.

Discussion

A key result of our study is the urgent need to reinforce the efforts for improving genome assembly quality in domesticated animals. Using a variety of different quality features, we show that many of the investigated genome assemblies are far from perfect, characterized by missing or fragmented vertebrate-wide conserved genomic loci and low scaffold contiguity. The low cost of short-read NGS-based sequencing has boosted the sequencing of domesticated animal genomes and transcriptomes, among other species. The consequences of poor genome assembly quality become most obvious in the mappability of NGS reads. While we lack sufficient total RNA-sequencing data for domesticated animals, we observe a clear trend of lower mappability of polyA-selected RNAseq in lower quality assemblies. Genome and transcriptome annotations are largely affected by missing or fragmented genomic content, which may lead to wrong conclusions about the genes or transcripts present in the organism. Comparative genetics also relies on correctly sequenced, aligned, and annotated genomes, and we have shown the possible issues in two pig examples. The presented quality measure focuses on contiguity and completeness of genome assemblies. The human genome is the most studied, and hence, we used its assembly as a gold standard for characterizing the completeness of the other assemblies. Ideally, the genome quality should be exclusively based on independent features without a gold standard genome because the genomic difference between human and the analyzed species may introduce a phylogenetic bias. To increase the feature space in this study, we decided to include human-based completeness measures, and the high quality score we obtained for the mouse genome illustrates the usability of the presented approach to quantify assembly quality. In addition, we used the human assembly to rank the quality of the species assemblies, which, however, has no impact on the measured quality features. 
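Scaffold contiguity, one of the two aspects the quality measure focuses on, is commonly summarized by the N50 statistic. The following Python sketch shows how N50 can be computed from a list of scaffold lengths. It is only an illustration: whether N50 corresponds exactly to the contiguity features used in this study is an assumption, and the function name `n50` and the toy lengths are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths.

    N50 is the length L such that scaffolds of length >= L together
    contain at least half of all assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one scaffold length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy values (hypothetical, not taken from any real assembly):
# both assemblies contain 1000 bases, but one is far more fragmented.
contiguous = [500, 300, 200]
fragmented = [100] * 10
print("N50 contiguous:", n50(contiguous))
print("N50 fragmented:", n50(fragmented))
```

Two assemblies with the same total length can differ widely in N50, which is why contiguity is assessed separately from completeness.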
We showed that traditional Sanger sequencing, characterized by longer read length and better quality, led to better assembly quality than short-read NGS-based sequencing. Dog, horse, and cow, all three Sanger based, are top scoring according to our ranking, while sheep and goat, based on short-read NGS, are of worse quality. The BACs used in the pig genome assembly, another assembly of low quality, were sequenced using Sanger technology, whereas the gaps between BACs were closed using short-read NGS. Sheep has very high NGS coverage and its genome has been iteratively refined, which is reflected by higher scoring than both goat and pig. However, high coverage of short-read NGS is rarely enough to achieve high-quality assemblies. This is primarily due to the repetitive content of the genomes, including repetitive DNA near centromeres and telomeres, large paralogous gene families, and retrotransposons such as LINEs and SINEs. An important step toward increased quality and usability of the genomes is the incorporation of new data based on long-read sequencing and mapping technologies. Most recently, long-range sequencing has been dramatically improved by Pacific Biosciences (PacBio) Single Molecule Real Time and Oxford Nanopore, and mapping by the Dovetail Genomics Chicago protocol and the 10X Genomics Chromium instrument. For example, the PacBio RS II technology updated in 2014 is advertised as producing raw reads with mean lengths of 15 kb at the cost of error rates as high as 15% and about 100-fold higher expenses than short-read NGS.45 However, per-nucleotide accuracy of 99.99% can be achieved through algorithmic techniques and sufficient coverage.7 Not surprisingly, several genome assemblies are currently complemented with PacBio sequencing, such as chicken (galGal5) and sheep (oviAri4). Mapping technologies improve scaffold contiguity and synteny by determining the long-range information on the arrangement of DNA without sequencing every base. 
For example, the Dovetail Genomics Chicago protocol,46 introduced in spring 2015, studies the 3D contacts of in vitro reconstituted chromatin through an optimized Hi-C approach. It can link DNA over distances of up to ~150 kb and has been successfully applied to improve the existing assembly of the American alligator.
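The statement above that raw error rates as high as 15% can still yield 99.99% per-nucleotide accuracy given sufficient coverage can be made plausible with a simple calculation. The Python sketch below is a deliberately simplified model and not the consensus algorithm used by any real assembler: it assumes that read errors are independent, that the consensus base is chosen by majority vote, and, pessimistically, that all erroneous reads agree on the same wrong base. The function names are hypothetical.

```python
from math import comb


def consensus_error(coverage, read_error):
    """Probability that a majority vote over `coverage` reads is wrong.

    Assumes independent errors with probability `read_error` per read
    and an odd `coverage`, so that no ties can occur.
    """
    if coverage < 1 or coverage % 2 == 0:
        raise ValueError("coverage must be a positive odd integer")
    majority = coverage // 2 + 1
    # Binomial tail: more than half of the reads are wrong.
    return sum(
        comb(coverage, k) * read_error**k * (1 - read_error) ** (coverage - k)
        for k in range(majority, coverage + 1)
    )


def min_odd_coverage(read_error, target_accuracy, limit=1001):
    """Smallest odd coverage whose consensus accuracy reaches the target."""
    for coverage in range(1, limit + 1, 2):
        if 1 - consensus_error(coverage, read_error) >= target_accuracy:
            return coverage
    raise ValueError("target accuracy not reached within the coverage limit")


if __name__ == "__main__":
    needed = min_odd_coverage(read_error=0.15, target_accuracy=0.9999)
    print("odd coverage needed under this toy model:", needed)
```

Under these assumptions the consensus error falls roughly exponentially with coverage, which illustrates why a noisy long-read technology can still produce accurate sequence. Real error profiles are dominated by insertions and deletions and are not strictly independent, so actual coverage requirements will differ.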

Conclusion

The genomes and transcriptomes of domestic animals deserve optimal exploration for making improvements in productivity without compromising animal welfare, as well as for studying human genetics and diseases. The analyses of a comprehensive list of genome assembly features of domesticated animals and related species illustrate the large discrepancy between their assembly quality and NGS efforts. In particular, the farm animals pig, chicken, sheep, and cow, which are of high economic and ecological importance, lack a significant number of core eukaryotic and universal genes in their current genome assemblies. Our study presents a novel way of ranking the assembly qualities in comparison to a gold standard. The data and pipeline presented in this study can be applied to judge the assembly quality and the number of unmapped reads in an NGS study. We show that the exploitation rate of RNAseq data is correlated with the genome assembly quality. We conclude that more efforts are needed to improve the genome assemblies of domestic animals. Given the increasingly affordable access to the aforementioned new technologies, we expect a significant improvement in the quality of domesticated animal genomes in the near future.

Supplementary File 1. RNAseq libraries used to assess genome assembly quality and read mappability to the reference genome.

41 in total

1. Assembly of large genomes using second-generation sequencing.

Authors: Michael C Schatz; Arthur L Delcher; Steven L Salzberg
Journal: Genome Res Date: 2010-05-27 Impact factor: 9.043

2. Interaction of noncoding RNA with the rDNA promoter mediates recruitment of DNMT3b and silencing of rRNA genes.

Authors: Kerstin-Maike Schmitz; Christine Mayer; Anna Postepska; Ingrid Grummt
Journal: Genes Dev Date: 2010-10-15 Impact factor: 11.361

3. ALE: a generic assembly likelihood evaluation framework for assessing the accuracy of genome and metagenome assemblies.

Authors: Scott C Clark; Rob Egan; Peter I Frazier; Zhong Wang
Journal: Bioinformatics Date: 2013-01-09 Impact factor: 6.937

4. Rapid communication: genetic linkage and physical mapping of the porcine cholesteryl ester transfer protein (CETP) gene.

Authors: X W Shi; Y D Zhang; M F Rothschild; C K Tuggle
Journal: J Anim Sci Date: 2002-05 Impact factor: 3.159

5. Multi-platform next-generation sequencing of the domestic turkey (Meleagris gallopavo): genome assembly and analysis.

Authors: Rami A Dalloul; Julie A Long; Aleksey V Zimin; Luqman Aslam; Kathryn Beal; Le Ann Blomberg; Pascal Bouffard; David W Burt; Oswald Crasta; Richard P M A Crooijmans; Kristal Cooper; Roger A Coulombe; Supriyo De; Mary E Delany; Jerry B Dodgson; Jennifer J Dong; Clive Evans; Karin M Frederickson; Paul Flicek; Liliana Florea; Otto Folkerts; Martien A M Groenen; Tim T Harkins; Javier Herrero; Steve Hoffmann; Hendrik-Jan Megens; Andrew Jiang; Pieter de Jong; Pete Kaiser; Heebal Kim; Kyu-Won Kim; Sungwon Kim; David Langenberger; Mi-Kyung Lee; Taeheon Lee; Shrinivasrao Mane; Guillaume Marcais; Manja Marz; Audrey P McElroy; Thero Modise; Mikhail Nefedov; Cédric Notredame; Ian R Paton; William S Payne; Geo Pertea; Dennis Prickett; Daniela Puiu; Dan Qioa; Emanuele Raineri; Magali Ruffier; Steven L Salzberg; Michael C Schatz; Chantel Scheuring; Carl J Schmidt; Steven Schroeder; Stephen M J Searle; Edward J Smith; Jacqueline Smith; Tad S Sonstegard; Peter F Stadler; Hakim Tafer; Zhijian Jake Tu; Curtis P Van Tassell; Albert J Vilella; Kelly P Williams; James A Yorke; Liqing Zhang; Hong-Bin Zhang; Xiaojun Zhang; Yang Zhang; Kent M Reed
Journal: PLoS Biol Date: 2010-09-07 Impact factor: 8.029

6. Quality scores for 32,000 genomes.

Authors: Miriam L Land; Doug Hyatt; Se-Ran Jun; Guruprasad H Kora; Loren J Hauser; Oksana Lukjancenko; David W Ussery
Journal: Stand Genomic Sci Date: 2014-12-08

7. Chromosome-scale shotgun assembly using an in vitro method for long-range linkage.

Authors: Nicholas H Putnam; Brendan L O'Connell; Jonathan C Stites; Brandon J Rice; Marco Blanchette; Robert Calef; Christopher J Troll; Andrew Fields; Paul D Hartley; Charles W Sugnet; David Haussler; Daniel S Rokhsar; Richard E Green
Journal: Genome Res Date: 2016-02-04 Impact factor: 9.043

8. Analyses of pig genomes provide insight into porcine demography and evolution.

Authors: Martien A M Groenen; Alan L Archibald; Hirohide Uenishi; Christopher K Tuggle; Yasuhiro Takeuchi; Max F Rothschild; Claire Rogel-Gaillard; Chankyu Park; Denis Milan; Hendrik-Jan Megens; Shengting Li; Denis M Larkin; Heebal Kim; Laurent A F Frantz; Mario Caccamo; Hyeonju Ahn; Bronwen L Aken; Anna Anselmo; Christian Anthon; Loretta Auvil; Bouabid Badaoui; Craig W Beattie; Christian Bendixen; Daniel Berman; Frank Blecha; Jonas Blomberg; Lars Bolund; Mirte Bosse; Sara Botti; Zhan Bujie; Megan Bystrom; Boris Capitanu; Denise Carvalho-Silva; Patrick Chardon; Celine Chen; Ryan Cheng; Sang-Haeng Choi; William Chow; Richard C Clark; Christopher Clee; Richard P M A Crooijmans; Harry D Dawson; Patrice Dehais; Fioravante De Sapio; Bert Dibbits; Nizar Drou; Zhi-Qiang Du; Kellye Eversole; João Fadista; Susan Fairley; Thomas Faraut; Geoffrey J Faulkner; Katie E Fowler; Merete Fredholm; Eric Fritz; James G R Gilbert; Elisabetta Giuffra; Jan Gorodkin; Darren K Griffin; Jennifer L Harrow; Alexander Hayward; Kerstin Howe; Zhi-Liang Hu; Sean J Humphray; Toby Hunt; Henrik Hornshøj; Jin-Tae Jeon; Patric Jern; Matthew Jones; Jerzy Jurka; Hiroyuki Kanamori; Ronan Kapetanovic; Jaebum Kim; Jae-Hwan Kim; Kyu-Won Kim; Tae-Hun Kim; Greger Larson; Kyooyeol Lee; Kyung-Tai Lee; Richard Leggett; Harris A Lewin; Yingrui Li; Wansheng Liu; Jane E Loveland; Yao Lu; Joan K Lunney; Jian Ma; Ole Madsen; Katherine Mann; Lucy Matthews; Stuart McLaren; Takeya Morozumi; Michael P Murtaugh; Jitendra Narayan; Dinh Truong Nguyen; Peixiang Ni; Song-Jung Oh; Suneel Onteru; Frank Panitz; Eung-Woo Park; Hong-Seog Park; Geraldine Pascal; Yogesh Paudel; Miguel Perez-Enciso; Ricardo Ramirez-Gonzalez; James M Reecy; Sandra Rodriguez-Zas; Gary A Rohrer; Lauretta Rund; Yongming Sang; Kyle Schachtschneider; Joshua G Schraiber; John Schwartz; Linda Scobie; Carol Scott; Stephen Searle; Bertrand Servin; Bruce R Southey; Goran Sperber; Peter Stadler; Jonathan V Sweedler; Hakim Tafer; Bo Thomsen; Rashmi Wali; Jian Wang; Jun Wang; Simon White; Xun Xu; Martine Yerle; Guojie Zhang; Jianguo Zhang; Jie Zhang; Shuhong Zhao; Jane Rogers; Carol Churcher; Lawrence B Schook
Journal: Nature Date: 2012-11-15 Impact factor: 49.962

9. Efficacy of the porcine species in biomedical research.

Authors: Karina Gutierrez; Naomi Dicks; Werner G Glanzner; Luis B Agellon; Vilceu Bordignon
Journal: Front Genet Date: 2015-09-16 Impact factor: 4.599

10. Altered Methylation Profile of Lymphocytes Is Concordant with Perturbation of Lipids Metabolism and Inflammatory Response in Obesity.

Authors: Mette J Jacobsen; Caroline M Junker Mentzel; Ann Sofie Olesen; Thierry Huby; Claus B Jørgensen; Romain Barrès; Merete Fredholm; David Simar
Journal: J Diabetes Res Date: 2015-12-21 Impact factor: 4.011
