Ralf C Mueller1,2, Patrik Ellström3, Kerstin Howe4, Marcela Uliano-Silva4, Richard I Kuo5, Katarzyna Miedzinska5, Amanda Warr5, Olivier Fedrigo6, Bettina Haase6, Jacquelyn Mountcastle6, William Chow4, James Torrance4, Jonathan M D Wood4, Josef D Järhult3, Mahmoud M Naguib7, Björn Olsen3, Erich D Jarvis8, Jacqueline Smith5, Lél Eöry5, Robert H S Kraus1,2.
Abstract
BACKGROUND: The tufted duck is a non-model organism that experiences high mortality in highly pathogenic avian influenza outbreaks. It belongs to the same bird family (Anatidae) as the mallard, one of the best-studied natural hosts of low-pathogenic avian influenza viruses. Studies in non-model bird species are crucial to disentangle the role of the host response in avian influenza virus infection in the natural reservoir. Such an endeavour requires a high-quality genome assembly and transcriptome.
Keywords: Aythya fuligula; Iso-Seq; Pacific Biosciences; RNA sequencing; Vertebrate Genomes Project; genome annotation; small RNA; transcriptome sequencing; tufted duck
Year: 2021 PMID: 34927191 PMCID: PMC8685854 DOI: 10.1093/gigascience/giab081
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Assembly statistics of the tufted duck genome
| Statistic | Value |
|---|---|
| Genome coverage (×) | 64.03 |
| Total sequence length (bp) | 1,127,004,725 |
| Ungapped sequence length (bp) | 1,117,587,328 |
| No. of scaffolds | 105 |
| Scaffolds assigned to chromosomes | 36 + 1 mitochondrion |
| Unplaced scaffolds | 68 |
| Contig NG50 (bp) | 17,816,505 |
| Scaffold NG50 (bp) | 85,905,639 |
| Base-call error in 10 kb | <1 nucleotide |
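The figures in this table can be cross-checked with simple arithmetic. The following is a minimal sketch, with variable names chosen here for illustration, that derives the number of gap bases from the total and ungapped lengths and confirms that the scaffold categories add up to the reported scaffold count.

```python
# Values copied from the assembly statistics table above.
total_bp = 1_127_004_725      # total sequence length
ungapped_bp = 1_117_587_328   # ungapped sequence length
n_scaffolds = 105
chromosomal = 36              # scaffolds assigned to nuclear chromosomes
mitochondrial = 1
unplaced = 68

# Bases that fall in gaps, and their share of the whole assembly.
gap_bp = total_bp - ungapped_bp
gap_fraction = gap_bp / total_bp

# The three scaffold categories should account for every scaffold.
categories_sum = chromosomal + mitochondrial + unplaced

print(f"Gap bases: {gap_bp:,} ({gap_fraction:.2%} of the assembly)")
print(f"Scaffold categories sum to {categories_sum} (reported: {n_scaffolds})")
```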
Figure 1: HiGlass Hi-C 2D maps of the tufted duck genome assembly before (left) and after (right) manual curation. Off-diagonal hits indicate missing joins, which have been corrected during curation. Broken patterns within scaffolds (e.g., at the end of the first scaffold before curation) can indicate intra-scaffold misassemblies, which were also addressed during curation. They can, however, also be features of the respective chromosome, as in the fourth post-curation scaffold, the structure of which was corrected and asserted during curation.
Illumina RNA-Seq reads before and after trimming, PacBio Iso-Seq reads before and after error correction, ZMW yield, and the number of FLNC reads
| Platform | Parameter | Brain | Ileum | Lung | Ovary | Spleen | Testis |
|---|---|---|---|---|---|---|---|
| Illumina | No. raw reads (PE) | 71,648,303 | 72,410,462 | 63,191,981 | 72,442,392 | 68,315,029 | 53,747,262 |
| Illumina | No. trimmed (paired) | 70,025,835 | 70,902,250 | 61,059,701 | 70,892,177 | 66,813,691 | 52,652,923 |
| PacBio | No. subreads | 9,249,099 | 12,363,369 | 19,206,097 | 1,279,561 | 8,401,115 | 12,616,953 |
| PacBio | No. CCS | 158,698 | 415,314 | 529,108 | 68,112 | 167,077 | 288,984 |
| PacBio | ZMW yield (%) | 15.87 | 41.53 | 26.46 | 3.41 | 16.71 | 28.90 |
| PacBio | Mean No. of passes | 58.3 | 29.8 | 36.3 | 18.8 | 50.3 | 43.7 |
| PacBio | No. FLNC | 133,684 | 343,634 | 432,817 | 49,887 | 134,124 | 234,423 |
Two ZMW were used for lung and ovary. CCS: circular consensus sequencing; FLNC: full-length, non-chimeric; PE: paired end; ZMW: zero-mode waveguide.
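Two ratios that the table does not state directly can be derived from its counts: the share of Illumina read pairs retained after trimming, and the share of CCS reads classified as FLNC. The sketch below computes both per tissue. The helper `percent` and the dictionary names are illustrative only and are not part of the original pipeline.

```python
# Counts copied from the read statistics table above.
tissues = ["Brain", "Ileum", "Lung", "Ovary", "Spleen", "Testis"]
raw = [71_648_303, 72_410_462, 63_191_981, 72_442_392, 68_315_029, 53_747_262]
trimmed = [70_025_835, 70_902_250, 61_059_701, 70_892_177, 66_813_691, 52_652_923]
ccs = [158_698, 415_314, 529_108, 68_112, 167_077, 288_984]
flnc = [133_684, 343_634, 432_817, 49_887, 134_124, 234_423]


def percent(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


# Illumina: read pairs surviving trimming, relative to raw read pairs.
trim_retention = {t: percent(a, b) for t, a, b in zip(tissues, trimmed, raw)}
# PacBio: FLNC reads relative to CCS reads.
flnc_fraction = {t: percent(a, b) for t, a, b in zip(tissues, flnc, ccs)}

for t in tissues:
    print(f"{t}: {trim_retention[t]}% pairs retained, {flnc_fraction[t]}% of CCS are FLNC")
```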
Transcript model reconstruction per tissue and pipeline
| Parameter | Platform | Brain | Ileum | Lung | Ovary | Spleen | Testis |
|---|---|---|---|---|---|---|---|
| Mapped (%) | Illumina | 92.92 | 92.98 | 93.42 | 94.33 | 92.44 | 93.19 |
| Mapped (%) | PacBio | 96.66 | 98.13 | 97.92 | 99.40 | 95.67 | 96.57 |
| No. genes | Illumina | 22,348 | 20,838 | 20,692 | 32,046 | 21,608 | 29,225 |
| No. genes | PacBio | 15,776 | 10,813 | 12,912 | 8,862 | 6,773 | 11,746 |
| No. transcripts | Illumina | 44,808 | 40,968 | 43,877 | 77,997 | 42,000 | 57,758 |
| No. transcripts | PacBio | 37,601 | 35,284 | 46,587 | 19,513 | 14,030 | 28,852 |
| No. exons | Illumina | 422,741 | 395,719 | 412,225 | 569,566 | 383,679 | 483,444 |
| No. exons | PacBio | 138,038 | 217,391 | 243,566 | 119,080 | 71,211 | 160,380 |
| Exons per gene (mean) | Illumina | 18.9 | 19.0 | 19.9 | 17.8 | 17.8 | 16.5 |
| Exons per gene (mean) | PacBio | 8.7 | 20.1 | 18.9 | 13.4 | 10.5 | 13.7 |
| Transcripts per gene (mean) | Illumina | 2.0 | 2.0 | 2.1 | 2.4 | 1.9 | 2.0 |
| Transcripts per gene (mean) | PacBio | 2.4 | 3.3 | 3.6 | 2.2 | 2.1 | 2.5 |
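The two per-gene means in the table appear to be plain ratios of the listed counts (exons divided by genes, transcripts divided by genes). The sketch below recomputes them under that assumption. The `counts` layout and the `per_gene` helper are illustrative names, not taken from the original analysis.

```python
# Counts copied from the transcript model table above,
# in tissue order: Brain, Ileum, Lung, Ovary, Spleen, Testis.
counts = {
    "Illumina": {
        "genes": [22_348, 20_838, 20_692, 32_046, 21_608, 29_225],
        "transcripts": [44_808, 40_968, 43_877, 77_997, 42_000, 57_758],
        "exons": [422_741, 395_719, 412_225, 569_566, 383_679, 483_444],
    },
    "PacBio": {
        "genes": [15_776, 10_813, 12_912, 8_862, 6_773, 11_746],
        "transcripts": [37_601, 35_284, 46_587, 19_513, 14_030, 28_852],
        "exons": [138_038, 217_391, 243_566, 119_080, 71_211, 160_380],
    },
}


def per_gene(platform, feature):
    """Ratio of a feature count to the gene count, one decimal, per tissue."""
    genes = counts[platform]["genes"]
    return [round(n / g, 1) for n, g in zip(counts[platform][feature], genes)]


for platform in counts:
    print(platform, "exons per gene:      ", per_gene(platform, "exons"))
    print(platform, "transcripts per gene:", per_gene(platform, "transcripts"))
```

Comparing the printed lists with the last four rows of the table is a quick way to catch transcription errors in either the counts or the means.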
Figure 2: Distribution of single- and multi-exon genes per tissue and pipeline. Only the first 50 groups are shown.
Figure 3: Distribution of single- and multi-transcript genes per tissue and pipeline. Only the first 15 groups are shown.
Functional annotation categorized by different matches
| Parameter | Genes | Transcripts | Isoforms per gene (mean) |
|---|---|---|---|
| Total No. of entries | 49,746 | 345,870 | 7.0 |
| UniRef50 total hits | 17,911 | 208,274 | 11.6 |
| UniRef50 match | | | |
| Full | 13,024 | 99,737 | 7.7 |
| 90% | 4,937 | 12,197 | 2.5 |
| 50% | 3,208 | 6,540 | 2.0 |
| <50% | 2,474 | 4,447 | 1.8 |
| ≥50% | 14,731 | 118,474 | 8.0 |
| ≥90% | 14,099 | 111,934 | 7.9 |
| No hit (but full-length) | 27,787 | 78,860 | |
Note: UniRef50 total hits also includes 5′ degraded records, whereas the match classes only include full-length records.
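The cumulative match classes can be checked against the individual ones. In the sketch below, the transcript counts for the ≥90% and ≥50% rows are rebuilt by summing the narrower classes, and the same sum is attempted for genes. The gene counts do not add up in the same way; one plausible reading, which is an assumption here and not stated in the source, is that a single gene can contribute transcripts to more than one match class.

```python
# Counts copied from the functional annotation table above.
transcripts = {"Full": 99_737, "90%": 12_197, "50%": 6_540, "<50%": 4_447}
genes = {"Full": 13_024, "90%": 4_937, "50%": 3_208, "<50%": 2_474}
reported_transcripts = {">=90%": 111_934, ">=50%": 118_474}
reported_genes = {">=90%": 14_099, ">=50%": 14_731}

# Transcript classes are disjoint, so cumulative rows are simple sums.
ge90_transcripts = transcripts["Full"] + transcripts["90%"]
ge50_transcripts = ge90_transcripts + transcripts["50%"]

# The same sum for genes overshoots the reported value.
ge90_genes_naive = genes["Full"] + genes["90%"]
genes_are_additive = ge90_genes_naive == reported_genes[">=90%"]

print("Transcripts >=90%:", ge90_transcripts, "reported:", reported_transcripts[">=90%"])
print("Transcripts >=50%:", ge50_transcripts, "reported:", reported_transcripts[">=50%"])
print("Genes >=90% (naive sum):", ge90_genes_naive, "reported:", reported_genes[">=90%"])
```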
Blastp results (≥90% match) of predicted ORFs from the functional annotation searched against 2 mallard RIG-I/DDX58 isoforms, NP_001297309.1 (933 aa) and XP_038025643.1 (988 aa)
| Mallard | Tufted duck | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| Isoform | Chromosome | ORF | Start/End | nt | Frame | Strand | Exons | aa |
| NP_001297309.1 | NC_045564.1 (6) | G24916.1 | 21,885,153/21,914,958 | 29,805 | F2 | + | 17 | 1,003 |
| NP_001297309.1 | NC_045564.1 (6) | G24916.2 | 21,885,153/21,914,958 | 29,805 | F1 | + | 16 | 1,003 |
| NP_001297309.1 | NC_045564.1 (6) | G24916.3 | 21,887,313/21,915,355 | 28,042 | F1 | + | 17 | 994 |
| NP_001297309.1 | NC_045564.1 (6) | G24916.4 | 21,887,313/21,914,044 | 26,731 | F1 | + | 16 | 1,003 |
| XP_038025643.1 | NC_045564.1 (6) | G24916.7* | 21,887,508/21,915,353 | 27,845 | F1 | + | 16 | 1,044 |
| XP_038025643.1 | NC_045564.1 (6) | G24916.8* | 21,887,508/21,915,503 | 27,995 | F1 | + | 16 | 1,040 |
| NP_001297309.1 | NC_045593.1 (Z) | G46857.2* | 69,123,499/69,145,704 | 22,205 | F3 | + | 18 | 948 |
| NP_001297309.1 | NC_045593.1 (Z) | G46857.3* | 69,123,499/69,147,273 | 23,774 | F3 | + | 18 | 948 |
| NP_001297309.1 | NC_045593.1 (Z) | G46857.4* | 69,123,529/69,145,281 | 21,752 | F3 | + | 18 | 938 |
ORFs marked with an asterisk were flagged with “5prime_degrade,” which means that the start codon was not found in the TAMA ORF/NMD prediction pipeline. aa: amino acid.
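The "nt" column in this table matches the difference between the end and start coordinates of each ORF, which makes it the genomic span and not the spliced transcript length. The sketch below recomputes that span for every row. The `orfs` dictionary and its tuple layout are illustrative.

```python
# (start, end, reported nt) copied from the Blastp results table above.
orfs = {
    "G24916.1": (21_885_153, 21_914_958, 29_805),
    "G24916.2": (21_885_153, 21_914_958, 29_805),
    "G24916.3": (21_887_313, 21_915_355, 28_042),
    "G24916.4": (21_887_313, 21_914_044, 26_731),
    "G24916.7": (21_887_508, 21_915_353, 27_845),
    "G24916.8": (21_887_508, 21_915_503, 27_995),
    "G46857.2": (69_123_499, 69_145_704, 22_205),
    "G46857.3": (69_123_499, 69_147_273, 23_774),
    "G46857.4": (69_123_529, 69_145_281, 21_752),
}

# Genomic span of each ORF model as end minus start.
spans = {name: end - start for name, (start, end, _) in orfs.items()}

# Collect any row where the recomputed span disagrees with the table.
mismatches = [name for name, (_, _, nt) in orfs.items() if spans[name] != nt]

print("Rows with mismatching span:", mismatches)
```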
Figure 4: In the short-read data set, the highest total number of supported genes was found in testis (left panel, bottom), followed by ovary, brain, spleen, ileum, and lung. All 6 tissues intersected in 11,165 genes (main panel, left). The highest number of exclusively supported genes was also found in testis (988), and followed the same order as the total number of genes (main panel, yellow).
Figure 5: In the long-read data set, the highest total number of supported genes was found in brain (left panel, top), followed by lung, ileum, testis, ovary, and spleen. All 6 tissues intersected in 2,475 genes (main panel, left). The highest number of exclusively supported genes was found in brain (779), followed by testis, ileum, lung, ovary, and spleen (main panel, blue).
Small RNA read processing and assembly statistics
| Statistic | Brain | Ileum | Lung | Ovary | Spleen | Testis |
|---|---|---|---|---|---|---|
| No. raw reads (PE) | 78,078,195 | 58,021,264 | 70,381,224 | 79,425,103 | 65,767,638 | 67,436,837 |
| No. trimmed reads (PE) | 73,753,404 | 57,021,189 | 69,326,112 | 77,333,115 | 58,716,681 | 65,395,835 |
| Mapped uniquely (%) | 51.82 | 88.04 | 72.04 | 71.01 | 72.43 | 52.59 |
| Mapped multiply (%) | 44.90 | 9.48 | 24.42 | 26.59 | 18.18 | 31.97 |
| No. genes | 13,606 | 8,441 | 11,899 | 9,903 | 33,133 | 31,205 |
| No. transcripts | 13,685 | 8,520 | 11,995 | 9,954 | 33,761 | 31,342 |
| No. exons | 17,276 | 12,650 | 15,397 | 11,588 | 54,504 | 35,026 |
PE: paired end.
Figure 6: Distribution of single-exon and multi-exon small RNA transcripts for each tissue.
Results of cmscan on assembled small RNA transcripts after filtering
| Parameter | Brain | Ileum | Lung | Ovary | Spleen | Testis |
|---|---|---|---|---|---|---|
| Total | 328 | 294 | 317 | 312 | 345 | 369 |
| Intersection | 310 | 274 | 295 | 293 | 315 | 345 |
| Additional | 18 | 20 | 22 | 19 | 30 | 24 |
Intersection refers to small RNAs predicted by the in silico genome scan. Additional refers to annotated small RNAs that were not detected by cmscan in the reference genome.
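The first data row of this table equals the sum of the Intersection and Additional rows in every tissue. The sketch below verifies that and also reports the share of annotated small RNAs that were not found by the genome scan. The list names are chosen here for illustration.

```python
# Counts copied from the cmscan results table above.
tissues = ["Brain", "Ileum", "Lung", "Ovary", "Spleen", "Testis"]
first_row = [328, 294, 317, 312, 345, 369]      # first data row of the table
intersection = [310, 274, 295, 293, 315, 345]
additional = [18, 20, 22, 19, 30, 24]

# Sum of the two component rows, per tissue.
summed = [i + a for i, a in zip(intersection, additional)]

# Share of annotated small RNAs missing from the in silico genome scan.
additional_pct = {
    t: round(100 * a / s, 1) for t, a, s in zip(tissues, additional, summed)
}

print("Sums match first row:", summed == first_row)
for t in tissues:
    print(f"{t}: {additional_pct[t]}% additional")
```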