Nedda F Saremi, Jonas Oppenheimer, Christopher Vollmers, Brendan O'Connell, Shard A Milne, Ashley Byrne, Li Yu, Oliver A Ryder, Richard E Green, Beth Shapiro.
Abstract
The Andean bear is the only extant member of the Tremarctine subfamily and the only extant ursid species to inhabit South America. Here, we present an annotated de novo assembly of a nuclear genome from a captive-born female Andean bear, Mischief, generated using a combination of short and long DNA and RNA reads. Our final assembly has a length of 2.23 Gb, a scaffold N50 of 21.12 Mb, a contig N50 of 23.5 kb, and a BUSCO score of 88%. The Andean bear genome will be a useful resource for exploring the complex phylogenetic history of extinct and extant bear species and for future population genetics studies of Andean bears. © The American Genetic Association. 2021.
Keywords: Tremarctos ornatus; Andean bear; CHiCago; Oxford nanopore; spectacled bear
Year: 2021 PMID: 33882130 PMCID: PMC8280923 DOI: 10.1093/jhered/esab021
Source DB: PubMed Journal: J Hered ISSN: 0022-1503 Impact factor: 2.645
Programs used in assembly
| Purpose | Program | Version |
|---|---|---|
| Oxford Nanopore basecalling | Albacore | 2.0.2 |
| Illumina read trimming | SeqPrep2 | 1.1 |
| Illumina read trimming | Trimmomatic | 0.33 |
| De novo assembly | Meraculous-2D | 2.2.4 |
| Scaffolding | HiRise | 2.1.1 |
| Nanopore adapter trimming | Porechop | 0.2.3 |
| Gap-filling | PBJelly | 15.8.24 |
| Short-read polishing | Pilon | 1.22 |
| Duplicate removal | Samtools rmdup | 0.18 |
| Assembly validation | BUSCO | 5.0.0 |
| Assembly validation | Merqury | 1.1 |
| Masking of repetitive elements | RepeatMasker | 1.332 |
| Identification of repetitive elements | RepeatScout | 1.0.5 |
| Annotation | Augustus | 3.2.2 |
| RNAseq mapping | TopHat2 | 2.1.0 |
| Protein mapping | Exonerate | 2.2.0 |
| Mitochondrial genome assembly | Unicycler | 0.4.4 |
| Long read mapping | Minimap2 | 2.7-r659 |
| Illumina mapping | BWA | 0.7.12 |
| Mitochondrial genome iterative assembly | MIA | 1.0 |
| Mitochondrial assembly | Jalview | 2.11.0 |
Genome metrics of Andean bear assembly stages
| Assembly version | Meraculous-2D | HiRise | PBJelly | Pilon iteration 1 | Pilon iteration 2 |
|---|---|---|---|---|---|
| Assembly step | Shotgun assembly | Scaffolding | Gap filling | Error correcting | Error correcting |
| Input data | Illumina shotgun | CHiCago | ONT | Illumina shotgun | Illumina shotgun |
| Genome length (bp) | 2 086 202 895 | 2 215 996 330 | 2 232 824 675 | 2 232 974 266 | 2 232 736 973 |
| Contigs | 180 766 | 183 631 | 166 073 | 163 790 | 163 790 |
| Scaffolds | 139 421 | 12 402 | 12 402 | 12 402 | 12 402 |
| Contig N50 (kb) | 20.1 | 20.3 | 21.8 | 23.2 | 23.5 |
| Contig L50 | 31 094 | 30 649 | 28 724 | 27 028 | 26 657 |
| Scaffold N50 (Mb) | 0.0262 | 20.92 | 21.12 | 21.12 | 21.12 |
| Scaffold L50 | 23 560 | 29 | 29 | 29 | 29 |
| Number of gaps | 41 345 | 171 229 | 162 727 | 153 671 | 151 388 |
| Number of Ns | 1 438 142 | 131 322 142 | 125 625 584 | 124 345 965 | 123 688 823 |
| % of Ns in genome | 0.00198 | 5.93 | 5.62 | 5.57 | 5.54 |
| BUSCO complete, C (%) | 54.9 | 87.9 | 88.0 | 88.1 | 88.0 |
| BUSCO single-copy, S (%) | 54.6 | 87.4 | 87.4 | 87.5 | 87.4 |
| BUSCO duplicated, D (%) | 0.3 | 0.5 | 0.6 | 0.6 | 0.6 |
| BUSCO fragmented, F (%) | 16.2 | 3.9 | 4.0 | 3.9 | 4.0 |
| BUSCO missing, M (%) | 28.9 | 8.2 | 8.0 | 8.0 | 8.0 |
Metrics of the Andean bear genome at different stages of the assembly process. The final column represents the final assembly. Scaffolding with HiRise markedly improved the scaffold N50. Gap filling with PBJelly noticeably decreased the number of Ns and of strings of Ns. As a result of this conversion of Ns to useful sequence, the number of Illumina reads that mapped to the assembly increased. Full BUSCO scores are shown at the bottom, indicating the proportion of complete (C), single-copy (S), duplicated (D), fragmented (F), and missing (M) conserved single-copy mammalian orthologs that are present at each stage of the assembly.
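For readers less familiar with the contiguity metrics in this table, the following is a minimal sketch of how N50, L50, and the percentage of Ns are conventionally computed from a list of sequence lengths. The function names `n50_l50` and `percent_n` are illustrative assumptions, not part of the authors' pipeline, and the exact tool used to produce the table values is not stated here.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig or scaffold lengths.

    N50 is the length of the shortest sequence in the smallest set of
    longest sequences that together cover at least half of the total
    assembly length; L50 is the number of sequences in that set.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # Integer comparison avoids floating-point rounding at the midpoint.
        if 2 * running >= total:
            return length, count


def percent_n(n_bases, genome_length):
    """Return the percentage of the assembly made up of N bases."""
    return 100.0 * n_bases / genome_length


# Toy example (hypothetical lengths, not taken from the assembly).
print(n50_l50([8, 8, 4, 3, 3, 2, 2, 2]))

# Final-assembly values from the table: number of Ns and genome length.
print(round(percent_n(123_688_823, 2_232_736_973), 2))
```

Applied to the final column of the table, `percent_n` reproduces the reported 5.54% of Ns.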
Figure 1. Top 20 Gene Ontology term names in Andean bear annotation. Annotation by Augustus predicted a total of 19 289 genes. After filtering, this translated to 15 504 annotated proteins, which we subclassified into the 3 GO domains, biological process (green), molecular function (blue), and cellular component (yellow), and visualized in Blast2GO (Götz et al. 2008). The 20 most common GO term names appearing in the annotated gene set are shown.
Figure 2. NGx plot comparing scaffold-level contiguity of the Andean bear assembly and other bear genomes. The plot shows the fraction of the genome (x-axis) that is covered by scaffolds of at least a certain size (y-axis). The dashed vertical line shows this value for half of the genome (N50). The Andean bear (T. ornatus) genome is shown in dark blue, with the giant panda (Ailuropoda melanoleuca) in dark green, brown bear (U. arctos) in light green, Asiatic black bear (U. thibetanus) in light blue, polar bear (U. maritimus) in red, and American black bear (U. americanus) in pink. The American black bear genome was assembled using short read data alone, whereas all the others were assembled using a combination of short read, long read, and proximity ligation data. The contiguity of the Andean bear assembly is similar to that of the other bear reference genomes that also employed a hybrid assembly approach.
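The curve in an NGx plot can be sketched as follows. This is an illustrative example only: the function names `ngx` and `ngx_curve` are hypothetical, and the reference genome size used for each species in the figure is not given in the caption, so `genome_size` is an assumed input.

```python
def ngx(lengths, genome_size, x):
    """Return the NGx value for scaffold lengths.

    NGx is the length of the scaffold at which the cumulative length of
    scaffolds, taken from longest to shortest, first reaches x percent of
    genome_size. Returns None if the assembly never reaches that fraction.
    """
    if not 0 < x <= 100:
        raise ValueError("x must be in the range (0, 100]")
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer arithmetic avoids floating-point rounding at the threshold.
        if running * 100 >= genome_size * x:
            return length
    return None


def ngx_curve(lengths, genome_size, step=10):
    """Return (x, NGx) pairs for x = step, 2*step, ..., 100."""
    return [(x, ngx(lengths, genome_size, x)) for x in range(step, 101, step)]


# Toy example (hypothetical scaffold lengths and genome size).
scaffolds = [50, 30, 10, 5, 5]
for x, value in ngx_curve(scaffolds, genome_size=100, step=25):
    print(x, value)
```

Plotting the x values against the NGx values for each assembly gives one line per genome; the value at x = 50 corresponds to the dashed vertical line in the figure.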