Yi Huang1, Merly Escalona2, Glen Morrison1, Mohan P A Marimuthu3, Oanh Nguyen3, Erin Toffelmier4,5, H Bradley Shaffer4,5, Amy Litt1.
Abstract
Arctostaphylos (Ericaceae) species, commonly known as manzanitas, are an invaluable fire-adapted chaparral clade in the California Floristic Province (CFP), a world biodiversity hotspot on the west coast of North America. This diverse woody genus includes many rare and/or endangered taxa, and the genus plays essential ecological roles in native ecosystems. Despite their importance in conservation management, and the many ecological and evolutionary studies that have focused on manzanitas, virtually no research has been conducted on the genomics of any manzanita species. Here, we report the first genome assembly of a manzanita species, the widespread Arctostaphylos glauca. Consistent with the genomics strategy of the California Conservation Genomics Project, we used Pacific Biosciences HiFi long reads and Hi-C chromatin-proximity sequencing technology to produce a de novo assembled genome. The assembly comprises a total of 271 scaffolds spanning 547 Mb, close to the genome size estimated by flow cytometry. This assembly, with a scaffold N50 of 31 Mb and a BUSCO complete score of 98.2%, will be used as a reference genome for understanding the genetic diversity and the basis of adaptations of both common and rare and endangered manzanita species. © The American Genetic Association. 2021.
Keywords: CCGP; California Conservation Genomics Project; California Floristic Province; chaparral; conservation; endemic
Year: 2022 PMID: 35575079 PMCID: PMC9113465 DOI: 10.1093/jhered/esab071
Source DB: PubMed Journal: J Hered ISSN: 0022-1503 Impact factor: 2.679
Figure 1. Arctostaphylos glauca, big berry manzanita. (A) A tall, arborescent individual in montane chaparral; (B) range map of A. glauca. The areas colored in red show the approximate range of A. glauca, based on georeferenced herbarium collections in the Consortium of California Herbaria database (CCH2.org). Satellite imagery is from Google Earth. (C) Low shrubby growth form, in a particularly dry habitat; (D) stem, leaves, and fruit. The two detached leaves show the adaxial (left) and abaxial (right) leaf surfaces. Scale bar is 1 cm long, and the coin placed for scale is a US 25 cent piece (quarter), 24.26 mm in diameter. The photograph was taken on a uniform background, which was then digitally removed. Materials were moved slightly in digital editing to condense the image while preserving scale. All photos were taken by, and are the property of, author GM.
Table 1. Assembly pipeline and software usage. Software citations are listed in the text.

| Task | Software | Version |
|---|---|---|
| Filtering PacBio HiFi adapters | HiFiAdapterFilt | Commit 64d1c7b |
| K-mer counting | Meryl | 1 |
| Estimation of genome size and heterozygosity | GenomeScope | 2 |
| | HiFiasm | 0.13-r308 |
| Long read, genome-genome alignment | minimap2 | 2.16 |
| Remove low-coverage, duplicated contigs | purge_dups | 1.0.1 |
| Hi-C mapping for SALSA | Arima Genomics mapping pipeline | Commit 2e74ea4 |
| Hi-C scaffolding | SALSA | 2 |
| Gap closing | YAGCloser | Commit |
| Short-read alignment | bwa | 0.7.17-r1188 |
| SAM/BAM processing | samtools | 1.11 |
| SAM/BAM filtering | pairtools | 0.3.0 |
| Pairs indexing | pairix | 0.3.7 |
| Matrix generation | Cooler | 0.8.10 |
| Matrix balancing | hicExplorer | 3.6 |
| Contact map visualization | HiGlass | 2.1.11 |
| | PretextMap | 0.1.4 |
| | PretextView | 0.1.5 |
| | PretextSnapshot | 0.0.3 |
| Sequence similarity search | BLAST+ | 2.10 |
| Long read alignment | pbmm2 | 1.4.0 |
| Variant calling and consensus | bcftools | 1.11-5-g9c15769 |
| Extraction of sequences | seqtk | 1.3-r115-dirty |
| Circular-aware long-read alignment | racon | 1.4.19 |
| Sequence polishing | raptor | 0.20.3-171e0f1 |
| Sequence alignment | lastz | 1.04.08 |
| Gene annotation | MitoFinder | 1.4 |
| Basic assembly metrics | QUAST | 5.0.2 |
| Assembly completeness | BUSCO | 5.0.0 |
| | Merqury | 1 |
| General contamination screening | BlobToolKit | 2.3.3 |
Table 2. Sequencing and assembly statistics, and accession numbers

| Category | Item | Value |
|---|---|---|
| Bio projects | CCGP NCBI BioProject | PRJNA720569 |
| | Genera NCBI BioProject | PRJNA721387 |
| | Species NCBI BioProject | PRJNA734616 |
| | NCBI BioSample | SAMN19489519 |
| | Specimen identification | UCR ACC. # 292491 |
| Genome sequence | PacBio HiFi reads, run | 1 PACBIO_SMRT (Sequel II) run: 1.8 M spots, 27.3G bases, 8.7Gb downloads |
| | PacBio HiFi reads, accession | SRR14883332 |
| | Hi-C Illumina reads, run | 1 Illumina HiSeq X Ten run: 199.2M spots, |
| | Hi-C Illumina reads, accession | SRR14883331 |
| Genome assembly quality metrics | Assembly identifier (Quality code) | ddArcGlau1 (6.7.Q62) |
| | HiFi read coverage | 45X |
| Organelles | 1 partial mitochondrial sequence | MZ779111 |

| NCBI Genome accession or metric | Primary | Alternate | Full assembly |
|---|---|---|---|
| Assembly accession | GCA_019985065.1 | GCA_019985075.1 | |
| Genome sequences | JAHSPW000000000 | JAHSPX000000000 | |
| Number of contigs | 353 | 2470 | |
| Contig N50 (bp) | 8 041 760 | 1 739 008 | |
| Longest contig (bp) | 22 990 225 | 10 884 557 | |
| Number of scaffolds | 271 | 2350 | |
| Scaffold N50 (bp) | 31 280 158 | 3 804 428 | |
| Largest scaffold (bp) | 45 401 621 | 22 987 546 | |
| Size of final assembly (bp) | 547 548 103 | 556 397 040 | |
| Gaps per Gbp | 150 | 1885 | |
| Indel QV (frame shift) | 48.36 | 47.39 | |
| Base pair QV | 62.36 | 56.24 | 58.28 |
| | 74.39 | 65.01 | 95.59 |

| BUSCO completeness | C | S | D | F | M |
|---|---|---|---|---|---|
| Primary | 98.20% | 95.70% | 2.50% | 0.90% | 0.90% |
| Alternate | 85.90% | 83.30% | 2.60% | 1.30% | 12.80% |
Assembly quality code x.y.Q notation derived from Rhie et al. (2021): x = log10[contig NG50]; y = log10[scaffold NG50]; Q = Phred base accuracy QV (quality value). BUSCO scores: (C)omplete and (S)ingle; (C)omplete and (D)uplicated; (F)ragmented and (M)issing BUSCO genes. n, number of BUSCO genes in the set/database. bp: base pairs.
Read coverage has been calculated based on a genome size of 600 Mb.
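The two footnotes above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' code: it assumes that the logarithms and the QV in the x.y.Q code are truncated to integers, that the reported N50 values can stand in for NG50, and that "27.3G bases" means 27.3 × 10^9 bases. The function names `quality_code` and `read_coverage` are hypothetical.

```python
import math


def quality_code(contig_ng50: int, scaffold_ng50: int, base_qv: float) -> str:
    """Build an x.y.Q code, truncating each log10 value and the QV (assumption)."""
    x = math.floor(math.log10(contig_ng50))
    y = math.floor(math.log10(scaffold_ng50))
    return f"{x}.{y}.Q{math.floor(base_qv)}"


def read_coverage(total_bases: float, genome_size: float) -> float:
    """Average read depth: sequenced bases divided by the assumed genome size."""
    return total_bases / genome_size


# Primary-assembly values from the table; N50 is used in place of NG50 (assumption).
code = quality_code(8_041_760, 31_280_158, 62.36)
# 27.3G bases of HiFi data over the 600 Mb genome size named in the footnote.
coverage = read_coverage(27.3e9, 600e6)

print(code)
print(f"{coverage:.1f}X")
```

Under these assumptions the code reproduces the reported 6.7.Q62 identifier, and the coverage comes out near the 45X listed for the HiFi reads.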
Figure 2. Visual overview of genome assembly metrics. (A) K-mer spectra output generated from PacBio HiFi data without adapters using GenomeScope 2.0. The bimodal pattern observed corresponds to a diploid genome. K-mers covered at lower coverage but higher frequency correspond to differences between haplotypes, whereas the higher coverage but lower frequency k-mers correspond to the similarities between haplotypes; (B) BlobToolKit Snail plot showing a graphical representation of the quality metrics presented in Table 2 for the A. glauca primary assembly (ddArcGlau1). The plot circle represents the full size of the assembly. From the inside-out, the central plot covers length-related metrics. The red line represents the size of the longest scaffold; all other scaffolds are arranged in size-order moving clockwise around the plot and drawn in grey starting from the outside of the central plot. Dark and light orange arcs show the scaffold N50 and scaffold N90 values. The central light grey spiral shows the cumulative scaffold count with a white line at each order of magnitude. White regions in this area reflect the proportion of Ns in the assembly. The dark vs. light blue area around it shows mean, maximum and minimum GC vs. AT content at 0.1% intervals (Challis et al. 2020); (C, D) Hi-C contact maps for the primary (C) and alternate (D) genome assembly generated with PretextSnapshot. Hi-C contact maps translate proximity of genomic regions in 3D space to contiguous linear organization. Each cell in the contact map corresponds to sequencing data supporting the linkage (or join) between two such regions. Scaffolds are separated by black lines and higher density corresponds to higher levels of fragmentation.
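The scaffold N50 and N90 values shown as arcs in the snail plot follow a simple definition: sort the sequences from longest to shortest and report the length at which the running total first reaches 50% (or 90%) of the assembly. A minimal sketch of that calculation follows; the helper name `nx` and the toy scaffold lengths are hypothetical and are not taken from the assembly.

```python
def nx(lengths, fraction):
    """Return the length L such that sequences of length >= L
    together cover at least `fraction` of the total length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length
    raise ValueError("lengths must be non-empty and positive")


# Hypothetical scaffold lengths, for illustration only (total = 150).
toy_scaffolds = [50, 40, 30, 20, 10]

n50 = nx(toy_scaffolds, 0.5)  # 50 + 40 = 90 reaches half of 150
n90 = nx(toy_scaffolds, 0.9)  # 50 + 40 + 30 + 20 = 140 reaches 135

print(n50, n90)
```

Because N90 requires covering more of the assembly, it is never larger than N50, which is why the N90 arc extends further around the plot than the N50 arc.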
Table 3. Classification of repeat elements generated from RepeatMasker
| Repeat class | Number of elements | Length occupied (bp) | Percentage of sequence (%) |
|---|---|---|---|
| Retroelements | 160 425 | 112 810 688 | 20.6 |
| SINEs | 17 580 | 3 242 835 | 0.59 |
| Penelope | 1155 | 97 740 | 0.02 |
| LINEs | 20 303 | 6 704 697 | 1.22 |
| CRE/SLACS | 37 | 1672 | 0 |
| L2/CR1/Rex | 4680 | 678 218 | 0.12 |
| R1/LOA/Jockey | 340 | 22 011 | 0 |
| R2/R4/NeSL | 70 | 3992 | 0 |
| RTE/Bov-B | 3670 | 1 765 787 | 0.32 |
| L1/CIN4 | 9180 | 3 982 099 | 0.73 |
| LTR elements | 122 542 | 102 863 156 | 18.79 |
| BEL/Pao | 622 | 98 414 | 0.02 |
| Ty1/Copia | 49 026 | 41 614 407 | 7.6 |
| Gypsy/DIRS1 | 62 124 | 58 039 401 | 10.6 |
| Retroviral | 3184 | 346 807 | 0.06 |
| DNA transposons | 288 141 | 58 683 503 | 10.72 |
| hobo-Activator | 43 650 | 10 654 515 | 1.95 |
| Tc1-IS630-Pogo | 4515 | 615 692 | 0.11 |
| En-Spm | 0 | 0 | 0 |
| MuDR-IS905 | 54 | 53 928 | 0.01 |
| PiggyBac | 121 | 6007 | 0 |
| Tourist/harbinger | 4780 | 1 700 494 | 0.31 |
| Other (mirage, P-element, Transib) | 695 | 43 698 | 0.01 |
| Rolling-circles | 4313 | 1 802 302 | 0.33 |
| Unclassified | 373 015 | 120 296 966 | 21.97 |
| Total interspersed repeats | | 291 791 157 | 53.29 |
| Small RNA | 19 997 | 17 205 637 | 3.14 |
| Satellites | 2056 | 628 418 | 0.11 |
| Simple repeats | 121 138 | 6 848 295 | 1.25 |
| Low complexity | 17 346 | 833 121 | 0.15 |
Most repeats fragmented by insertions or deletions have been counted as one element.
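The percentages in the repeat table can be reproduced from the lengths it reports. The sketch below assumes that the percentages are taken relative to the primary assembly size of 547 548 103 bp; the source does not state this explicitly, although the numbers are consistent with it. It also shows that the "Total interspersed repeats" length equals the sum of the retroelement, DNA transposon, and unclassified lengths. The names `ASSEMBLY_SIZE_BP` and `percent_of_assembly` are hypothetical.

```python
# Assumed denominator: size of the primary assembly in bp.
ASSEMBLY_SIZE_BP = 547_548_103

# Top-level interspersed repeat classes and their lengths (bp) from the table.
repeat_lengths_bp = {
    "Retroelements": 112_810_688,
    "DNA transposons": 58_683_503,
    "Unclassified": 120_296_966,
}


def percent_of_assembly(length_bp, assembly_bp=ASSEMBLY_SIZE_BP):
    """Share of the assembly occupied by a repeat class, in percent (2 decimals)."""
    return round(100 * length_bp / assembly_bp, 2)


total_interspersed = sum(repeat_lengths_bp.values())
percentages = {name: percent_of_assembly(bp) for name, bp in repeat_lengths_bp.items()}
total_percent = percent_of_assembly(total_interspersed)

for name, pct in percentages.items():
    print(f"{name}: {pct}%")
print(f"Total interspersed repeats: {total_interspersed} bp ({total_percent}%)")
```

The summed length matches the 291 791 157 bp reported for total interspersed repeats, which suggests that the rolling-circle, small RNA, satellite, simple repeat, and low-complexity rows are counted separately from that total.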