Vidhya Jagannathan, Christophe Hitte, Jeffrey M Kidd, Patrick Masterson, Terence D Murphy, Sarah Emery, Brian Davis, Reuben M Buckley, Yan-Hu Liu, Xiang-Quan Zhang, Tosso Leeb, Ya-Ping Zhang, Elaine A Ostrander, Guo-Dong Wang.
Abstract
The domestic dog has evolved to be an important biomedical model for studies regarding the genetic basis of disease, morphology and behavior. Genetic studies in the dog have relied on a draft reference genome of a purebred female boxer dog named "Tasha" initially published in 2005. Derived from a Sanger whole genome shotgun sequencing approach coupled with limited clone-based sequencing, the initial assembly and subsequent updates have served as the predominant resource for canine genetics for 15 years. While the initial assembly produced a good-quality draft, as with all assemblies produced at the time, it contained gaps, assembly errors and missing sequences, particularly in GC-rich regions, which are found at many promoters and in the first exons of protein-coding genes. Here, we present Dog10K_Boxer_Tasha_1.0, an improved chromosome-level highly contiguous genome assembly of Tasha created with long-read technologies that increases sequence contiguity >100-fold, closes >23,000 gaps of the CanFam3.1 reference assembly and improves gene annotation by identifying >1200 new protein-coding transcripts. The assembly and annotation are available at NCBI under the accession GCF_000002285.5.
Keywords: Canis lupus familiaris; Pacific biosciences; annotation; contiguity; high quality; resource
Year: 2021 PMID: 34070911 PMCID: PMC8228171 DOI: 10.3390/genes12060847
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
Figure 1. Dog10K_Boxer_Tasha_1.0 assembly. (a) Assembly workflow pipeline. The algorithms used in the pipeline are indicated. N50 is the contig or scaffold length such that sequences of that length or longer contain 50% of the assembled sequence. L50 is the smallest number of contigs or scaffolds whose combined length reaches 50% of the assembled sequence. (b) Ideogram showing chromosomes, contigs, and gaps. The grey regions indicate contigs shorter than 3 Mb.
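The N50 and L50 definitions in the caption can be made concrete with a short calculation. The following is a minimal sketch, not the pipeline used by the authors; the function name `n50_l50` and the toy lengths are hypothetical and chosen only for illustration.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig or scaffold lengths.

    N50: length of the sequence at which the running total, taken from
         longest to shortest, first reaches half of the total length.
    L50: how many sequences were needed to reach that point.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)  # longest first
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count


# Toy lengths, invented for illustration (not taken from the assembly).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
n50, l50 = n50_l50(toy_lengths)
print(n50, l50)  # prints "70 2": 80 + 70 = 150, half of the 300 total
```

A higher N50 together with a lower L50 means that fewer, longer sequences cover half of the assembly, which is the sense in which the statistics table reports improved contiguity.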
Summary statistics for the Dog10K_Boxer_Tasha_1.0 genome assembly and comparison with the current dog reference genome, CanFam3.1.
| Statistic | CanFam3.1 | Dog10K_Boxer_Tasha_1.0 |
|---|---|---|
| Total sequence length | 2,410,976,875 | 2,312,802,206 |
| Total ungapped length | 2,392,715,236 | 2,312,743,367 |
| No. of scaffolds | 3310 | 147 |
| No. of unplaced scaffolds | 3228 | 107 |
| Scaffold N50 | 45,876,610 | 63,738,581 |
| Scaffold L50 | 20 | 14 |
| No. of unspanned gaps | 80 | 399 |
| No. of spanned gaps | 23,796 | 621 |
| No. of contigs | 27,106 | 1162 |
| Contig N50 | 267,478 | 27,487,084 |
| Contig L50 | 2436 | 31 |
| No. of chromosomes | 39 | 39 |
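
The headline figures in the abstract can be related to the table above with simple arithmetic. The sketch below is one plausible reading, not the authors' stated derivation: it assumes that ">100-fold" refers to contig N50, and it compares the drop in spanned gaps with the ">23,000 gaps" claim. The dictionary keys are hypothetical names; the values are copied from the table.

```python
# Values copied from the summary statistics table.
canfam31 = {"contig_n50": 267_478, "spanned_gaps": 23_796, "contigs": 27_106}
tasha = {"contig_n50": 27_487_084, "spanned_gaps": 621, "contigs": 1_162}

# Fold change in contig N50 (assumed here to be the contiguity measure
# behind the ">100-fold" statement).
contig_n50_fold = tasha["contig_n50"] / canfam31["contig_n50"]

# Reduction in spanned gaps between the two assemblies.
spanned_gap_reduction = canfam31["spanned_gaps"] - tasha["spanned_gaps"]

print(f"Contig N50 fold increase: {contig_n50_fold:.1f}")  # about 102.8
print(f"Fewer spanned gaps: {spanned_gap_reduction}")      # 23175
```

Under these assumptions both numbers are consistent with the abstract, although the authors may have counted closed gaps by a different procedure.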
Comparison of BUSCO analysis of genomes.
| Statistic | Dog10K_Boxer_Tasha_1.0 | CanFam3.1 |
|---|---|---|
| Complete BUSCOs | 95.3% | 92.2% |
| Complete and single copy BUSCOs | 94.1% | 91.1% |
| Complete and duplicated BUSCOs | 1.2% | 1.1% |
| Fragmented BUSCOs | 2.1% | 4.0% |
| Missing BUSCOs | 2.6% | 3.8% |
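
BUSCO categories are related to each other: complete BUSCOs are the sum of the single-copy and duplicated ones, and complete, fragmented, and missing together account for the whole gene set. The sketch below checks that the percentages in the table above satisfy both relations; the helper name `busco_is_consistent` and the rounding tolerance are assumptions made for this illustration.

```python
# Percentages copied from the BUSCO comparison table.
busco = {
    "Dog10K_Boxer_Tasha_1.0": {
        "complete": 95.3, "single": 94.1, "duplicated": 1.2,
        "fragmented": 2.1, "missing": 2.6,
    },
    "CanFam3.1": {
        "complete": 92.2, "single": 91.1, "duplicated": 1.1,
        "fragmented": 4.0, "missing": 3.8,
    },
}


def busco_is_consistent(b, tol=0.05):
    """Check the two sum relations, allowing for rounding to one decimal."""
    complete_ok = abs(b["single"] + b["duplicated"] - b["complete"]) <= tol
    total_ok = abs(b["complete"] + b["fragmented"] + b["missing"] - 100) <= tol
    return complete_ok and total_ok


for name, values in busco.items():
    print(name, busco_is_consistent(values))
```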
Annotation statistics for NCBI annotation release 106. * Non-coding RNA genes that cannot be classified.
| Feature | Dog10K_Boxer_Tasha_1.0/Annotation Release 106 |
|---|---|
| Protein-coding genes | 20,100 |
| Non-coding genes | 15,306 |
| Small non-coding genes | 2083 |
| Long non-coding genes | 12,667 |
| Miscellaneous * non-coding genes | 10 |
| Pseudogenes | 4887 |
Repeat content of the Dog10K_Boxer_Tasha_1.0 and CanFam3.1 assemblies. Results are shown for the primary chromosome sequences.
| Repeat Class | Dog10K_Boxer_Tasha_1.0 elements | Dog10K_Boxer_Tasha_1.0 bp | CanFam3.1 elements | CanFam3.1 bp |
|---|---|---|---|---|
| DNA | 341,866 | 65,043,282 | 347,025 | 65,997,048 |
| LINE | 1,286,663 | 467,394,285 | 1,307,498 | 470,518,469 |
| LTR | 378,505 | 111,520,139 | 384,551 | 113,151,392 |
| Low_complexity | 123,075 | 6,525,287 | 120,803 | 6,009,804 |
| RC | 1636 | 345,889 | 1649 | 347,342 |
| RNA | 489 | 103,097 | 504 | 105,770 |
| SINE | 1,579,792 | 240,791,186 | 1,605,511 | 244,461,861 |
| Satellite | 5730 | 11,298,647 | 635 | 624,881 |
| Simple_repeat | 891,331 | 40,450,974 | 895,091 | 38,358,719 |
| Unknown | 3449 | 559,562 | 3487 | 565,722 |
| rRNA | 953 | 129,078 | 958 | 115,711 |
| scRNA | 70 | 4996 | 71 | 5156 |
| snRNA | 4492 | 278,022 | 4617 | 285,578 |
| srpRNA | 45 | 8900 | 47 | 9496 |
| tRNA | 35,501 | 2,608,084 | 35,906 | 2,636,278 |
Repeat content for SINE and LINE sequences with low divergence.
| Repeat Class | Dog10K_Boxer_Tasha_1.0 elements | Dog10K_Boxer_Tasha_1.0 bp | CanFam3.1 elements | CanFam3.1 bp |
|---|---|---|---|---|
| SINEC | 1,125,416 | 177,104,238 | 1,146,663 | 180,147,553 |
| SINEC < 10% divergence | 454,869 | 71,490,885 | 464,113 | 72,819,234 |
| LINE/L1 | 853,212 | 379,452,954 | 869,259 | 381,738,114 |
| LINE/L1 < 10% divergence and ≥4 kb | 4805 | 26,935,018 | 4229 | 23,359,516 |
Figure 2. Size distribution of insertion–deletion differences identified between the Dog10K_Boxer_Tasha_1.0 and CanFam3.1 assemblies. The sizes of 22,330 sequences present in CanFam3.1 but absent in Dog10K_Boxer_Tasha_1.0 (red, deletions) and of 32,999 sequences present in Dog10K_Boxer_Tasha_1.0 but absent in CanFam3.1 (blue, insertions) are shown. The bins of each histogram are of equal size on a logarithmic scale.
Figure 3. Discovery of deletion variants using PacBio reads. Deletions were identified based on alignment of PacBio reads to the CanFam3.1 (left) or Dog10K_Boxer_Tasha_1.0 (right) assemblies. The bins of each histogram are of equal size on a logarithmic scale.
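The size histograms in the two captions above use bins of equal width on a logarithmic scale, so each bin spans the same ratio of sizes instead of the same number of base pairs. The following is a minimal sketch of how such bins could be built; the function names, the size range, and the variant sizes are all hypothetical and are not taken from the figures.

```python
import math


def log_bin_edges(min_size, max_size, n_bins):
    """Return n_bins + 1 edges that are evenly spaced in log10(size)."""
    lo, hi = math.log10(min_size), math.log10(max_size)
    step = (hi - lo) / n_bins
    return [10 ** (lo + i * step) for i in range(n_bins + 1)]


def histogram(sizes, edges):
    """Count sizes per bin; bins are [left, right), the last also includes its right edge."""
    counts = [0] * (len(edges) - 1)
    for size in sizes:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= size < edges[i + 1] or (is_last and size == edges[-1]):
                counts[i] += 1
                break
    return counts


# Invented variant sizes in bp, for illustration only.
toy_sizes = [2, 5, 50, 60, 500, 900]
edges = log_bin_edges(1, 1000, 3)  # bins cover 1-10, 10-100, 100-1000 bp
counts = histogram(toy_sizes, edges)
print(counts)
```

With this binning, a 10 bp bin at the small end and a 900 bp bin at the large end occupy the same width on the plot, which keeps both short indels and large structural differences visible in one histogram.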
Figure 4. Structural variation at the amylase locus. A genome browser view illustrating structural variation at the amylase locus in Tasha is shown. The orange bars at the top indicate the locations of tandem duplications identified using the raw PacBio long-read data. These include a large, 1.9 Mbp duplication (chr6:47977592-49898283) as well as a 14.8 kbp duplication (chr6:49729008-49743863). A read depth profile showing copy number estimated from Illumina sequencing data is depicted as a bar plot across the interval. An elevated copy number of 3, corresponding to the 1.9 Mbp duplication, is observed, as well as a spike in copy number overlapping with the AMY2B gene. Mappings of discordant fosmid end sequences are shown in orange below the copy number profile. Each depicted clone has end sequences that align in an everted orientation consistent with the presence of a tandem duplication. The positions of gene models derived from NCBI annotation release 106 are shown at the bottom of the figure. The LOC607460 gene model corresponds to pancreatic α-amylase (AMY2B).
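
The duplication sizes quoted in the caption can be recovered from the coordinates. The sketch below parses the two regions and computes their spans; the helper names are hypothetical. The caption does not state the coordinate convention, so the span is taken as end minus start, which differs by 1 bp from an inclusive count.

```python
def parse_region(region):
    """Split a 'chrom:start-end' string into (chrom, start, end)."""
    chrom, span = region.split(":")
    start, end = (int(value) for value in span.split("-"))
    return chrom, start, end


def span_bp(region):
    """Return end - start for a region string (convention assumed, see lead-in)."""
    _, start, end = parse_region(region)
    return end - start


def contains(outer, inner):
    """True if the inner region lies entirely within the outer region."""
    o_chrom, o_start, o_end = parse_region(outer)
    i_chrom, i_start, i_end = parse_region(inner)
    return o_chrom == i_chrom and o_start <= i_start and i_end <= o_end


large_dup = "chr6:47977592-49898283"
small_dup = "chr6:49729008-49743863"

print(span_bp(large_dup))              # 1920691 bp, about 1.9 Mbp
print(span_bp(small_dup))              # 14855 bp
print(contains(large_dup, small_dup))  # True: the small interval is nested
```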