Yaoxi He, Xin Luo, Bin Zhou, Ting Hu, Xiaoyu Meng, Peter A Audano, Zev N Kronenberg, Evan E Eichler, Jie Jin, Yongbo Guo, Yanan Yang, Xuebin Qi, Bing Su.
Abstract
We present a high-quality de novo genome assembly (rheMacS) of the Chinese rhesus macaque (Macaca mulatta) using long-read sequencing and multiplatform scaffolding approaches. Compared to the current Indian rhesus macaque reference genome (rheMac8), rheMacS increases sequence contiguity 75-fold, closing 21,940 of the remaining assembly gaps (60.8 Mbp). We improve gene annotation by generating more than two million full-length transcripts from ten different tissues by long-read RNA sequencing. We sequence-resolve 53,916 structural variants (96% novel) and identify 17,000 ape-specific structural variants (ASSVs) based on comparison to ape genomes. Many ASSVs map within ChIP-seq predicted enhancer regions where apes and macaque show diverged enhancer activity and gene expression. We further characterize a subset that may contribute to ape- or great-ape-specific phenotypic traits, including taillessness, brain volume expansion, improved manual dexterity, and large body size. The rheMacS genome assembly serves as an ideal reference for future biomedical and evolutionary studies.
Year: 2019 PMID: 31530812 PMCID: PMC6749001 DOI: 10.1038/s41467-019-12174-w
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Comparison of assembly statistics between rheMac8 and rheMacS
| Assembly | rheMac8 | rheMacS |
|---|---|---|
| Assembly approach | WGS and BAC | WGS and Hi-C |
| Sequencing platform | Sanger, Illumina | PacBio, Bionano, Hi-C, Illumina |
| Number of contigs | 348,493 | 4741 |
| Contig N50 (Mbp) | 0.11 | 8.19 |
| Number of scaffolds | 286,263 | 4543 |
| Scaffold N50 (Mbp) | 4.19 | 13.64 |
| Number of gaps^a | 47,882 | 2858 |
| Total gap length (Mbp)^a | 72.05 | 5.46 |
| Total bases (bp)^b | 2,835,963,390 | 2,955,490,605 |
| Ungapped bases (bp)^c | 2,763,913,633 | 2,950,026,318 |
The assembly statistics of rheMac8 were obtained from NCBI (accession number: GCA_000772875.3). WGS, whole-genome shotgun; BAC, bacterial artificial chromosome.
^a Only N-base regions in the assembled chromosomes were counted.
^b Only chromosome-placed bases were counted.
^c Non-N bases in the assembly.
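The contiguity figures in the table can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: `n50` is a hypothetical helper name, the toy contig lengths are invented for demonstration, and only the base counts and contig N50 values are taken from the table above.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


# Toy contig lengths (hypothetical, not taken from either assembly).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = n50(toy_contigs)  # 80 + 70 = 150 reaches half of 300

# Values copied from the comparison table.
total_bases = {"rheMac8": 2_835_963_390, "rheMacS": 2_955_490_605}
ungapped_bases = {"rheMac8": 2_763_913_633, "rheMacS": 2_950_026_318}
contig_n50_mbp = {"rheMac8": 0.11, "rheMacS": 8.19}

# Gap length implied by total minus ungapped (non-N) bases, in Mbp.
gap_mbp = {
    name: (total_bases[name] - ungapped_bases[name]) / 1e6
    for name in total_bases
}

# Fold change in contig N50, computed from the rounded table values.
fold = contig_n50_mbp["rheMacS"] / contig_n50_mbp["rheMac8"]

print(toy_n50, gap_mbp, round(fold, 1))
```

The gap lengths recomputed this way round to 72.05 and 5.46 Mbp, matching the table, and the ratio of the rounded contig N50 values is about 74, in line with the roughly 75-fold improvement stated in the abstract.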
Fig. 1 Long-read assembly of the Chinese rhesus macaque genome. a Chinese rhesus macaque (Macaca mulatta) (photograph courtesy of Jiaxin Zhao). b Treemaps showing the difference in fragmentation between long-read and short-read rhesus assemblies. The rectangles represent the largest contigs that account for ~300 Mbp (~10%) of the assembly. c The chromosomal distribution of contigs of the rheMacS genome assembly. We compare genome sequence contiguity between rheMac8 and rheMacS. Thousands of gaps in rheMac8 were closed by longer contigs from rheMacS. Contigs are colored by length: > 3 Mbp (red), between 1 Mbp and 3 Mbp (green), and < 1 Mbp (blue). The small contigs (< 1 Mbp) tend to consist of either centromeric or telomeric sequences. The centromeres of each chromosome are indicated based on previous annotation[30]
Fig. 2 Quality assessment of the Chinese rhesus macaque genome assembly. a Comparison of sequence contig length distribution between rheMacS and rheMac8. b Length distribution of the closed gaps in rheMac8. c Comparison of assembly quality among various reported primate genome assemblies. Genomes assembled with PacBio long reads (red) are compared against those assembled using Illumina short-read and Sanger sequencing data (blue). d Comparison of mapping rates when short-read NGS data (upper left), RNA-seq data (by Hisat2: upper right; by Bowtie2: lower left), and Iso-Seq data (lower right) are mapped to rheMacS and rheMac8, respectively. A two-tailed paired t test was used for statistical assessment. ***P < 0.001. The boxplot shows the mean value (central line), upper and lower quartiles (bounds of box), and min/max values (whiskers)
Fig. 3 Structural variants (SVs) in rheMacS. a The distribution of large SVs (≥ 1 Kbp) among the rhesus macaque chromosomes. The histogram marks on each chromosome (light blue on the left) indicate the counts of SVs per 500 Kbp window. The black dot on each chromosome indicates the centromere position. b The percentages of the four SV types: deletions (DEL), insertions (INS), duplications (DUP), and inversions (INV). c The SV distribution as a function of distance from the telomere. The SVs are counted with a sliding window size of 500 Kbp. The multicolor dots show the counts of the four SV types in each 500 Kbp bin, and the solid lines indicate the distribution of average counts. d The length statistics of the rheMacS SVs. e Overlaps of the rheMacS SVs with SVs previously reported from aCGH and NGS data. Two SVs are considered overlapping only if they show > 50% reciprocal overlap of SV length
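The > 50% reciprocal-overlap criterion in panel e can be illustrated with a minimal sketch. It assumes SVs are represented as hypothetical `(chrom, start, end)` tuples with half-open coordinates and a non-zero span; the function names and example coordinates are invented for illustration and do not reproduce the authors' pipeline.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Return the smaller of the two overlap fractions (overlap / length)
    for two half-open intervals; 0.0 if they do not overlap."""
    if a_end <= a_start or b_end <= b_start:
        return 0.0  # zero-length or malformed interval: no meaningful overlap
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a_end - a_start), overlap / (b_end - b_start))


def is_same_sv(sv_a, sv_b, cutoff=0.5):
    """Treat two SVs as matching if they are on the same chromosome and
    each covers more than `cutoff` of the other's length."""
    chrom_a, a_start, a_end = sv_a
    chrom_b, b_start, b_end = sv_b
    if chrom_a != chrom_b:
        return False
    return reciprocal_overlap(a_start, a_end, b_start, b_end) > cutoff


# Hypothetical example intervals (not real SV calls).
new_sv = ("chr1", 1000, 2000)
known_close = ("chr1", 1400, 2400)   # 600 bp shared, 60% of each SV
known_large = ("chr1", 1900, 5000)   # 100 bp shared, 10% of new_sv

print(is_same_sv(new_sv, known_close))
print(is_same_sv(new_sv, known_large))
```

Taking the minimum of the two fractions is what makes the test reciprocal: a small SV nested inside a much larger one fails even though it is fully covered.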
Fig. 4 Summary of the ape-specific structural variants (ASSVs). a The cladogram depicts the phylogenetic relationship among the studied primate species, including human, three great apes, rhesus macaque, and common marmoset. The numbers of SVs identified by pairwise genome comparisons are indicated. The numbers in the blue boxes represent the overlap among the pairwise results. The number in the red box indicates the 17,000 identified ASSVs. Gibbon (lesser ape) was not included due to the poor quality of the published genome assembly. We used the common marmoset genome assembly generated from short-read sequencing data to filter out the SVs that occurred in the rhesus monkey lineage (bottom panel). b Statistics of ASSVs. c Pie plot for ASSV annotation
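The logic of panel a can be sketched with set operations. This is only an assumed reading of the caption, not the authors' implementation: SVs found in every macaque-versus-ape comparison are intersected, and those that also appear in a macaque-versus-marmoset comparison are attributed to the rhesus lineage and removed. All SV identifiers and species keys below are hypothetical placeholders.

```python
# SVs called in each pairwise comparison of macaque against an ape genome.
# Keys and SV identifiers are hypothetical placeholders.
pairwise_svs = {
    "human":       {"sv1", "sv2", "sv3", "sv5"},
    "great_ape_A": {"sv1", "sv2", "sv3", "sv6"},
    "great_ape_B": {"sv1", "sv2", "sv3", "sv4"},
    "great_ape_C": {"sv1", "sv2", "sv3", "sv7"},
}

# SVs shared by all pairwise comparisons (the overlap step).
shared_by_all_apes = set.intersection(*pairwise_svs.values())

# Assumption: if macaque also differs from the marmoset outgroup at the
# same site, the change most likely arose on the macaque lineage.
macaque_vs_marmoset = {"sv2", "sv8"}

# Candidate ape-specific SVs: shared by all apes, not explained by a
# change on the macaque lineage.
candidate_assvs = shared_by_all_apes - macaque_vs_marmoset

print(sorted(candidate_assvs))
```

In practice, matching SVs between comparisons would need a coordinate-based criterion rather than exact identifiers; the sets here only illustrate the filtering order.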
The 16 validated ASSVs located in gene-coding regions
| Chrom^a | Start^a | End^a | Length | Type | Gene | Exon | Consequence | Gene functions^b |
|---|---|---|---|---|---|---|---|---|
| chr1 | 11095424 | 11095738 | 316 | INS | | 4/5 | Splice_acceptor_variant | Spermatogenesis |
| chr3 | 136986715 | 136988148 | 1435 | INS | | 3/7 | Splice_donor_variant | Neuroprotective and intelligence |
| chr3 | 139580804 | 139583878 | 3076 | INS | | 5/8 | Splice_acceptor_variant | Neuroprotective; intelligence, and axonal protection |
| chr4 | 169663104 | 169663246 | 144 | INS | | 2/14 | Coding_sequence_variant | Schizophrenia, autism spectrum disorder; bone pattern, and synaptic transmission |
| chr12 | 109105086 | 109105279 | 195 | INS | | 6/7 | NMD_transcript_variant | Immunodeficiency; epididymitis, and somatic hypermutation |
| chr12 | 122264731 | 122264829 | 100 | INS | | 2/14 | NMD_transcript_variant | Lysosome function; foot abnormality and acetabular dysplasia |
| chr13 | 102742592 | 102742592 | 318 | DEL | | 4/4 | Inframe_deletion | <Unknown> |
| chr16 | 405554 | 405554 | 66 | DEL | | 3/10 | NMD_transcript_variant | Fatty-acid degradation; hemoglobin concentration and body mass index |
| chr17 | 75909754 | 75909754 | 304 | DEL | | 29/29 | NMD_transcript_variant | White matter hyperintensity; brain and eye measurements and cilium assembly |
| chr17 | 75909803 | 75909803 | 631 | DEL | | 29/29 | NMD_transcript_variant | White matter hyperintensity; brain and eye measurements and cilium assembly |
| chr19 | 12318606 | 12318606 | 317 | DEL | | 4/4 | Frame_shift | Intelligence; schizophrenia and educational attainment |
| chr19 | 48874925 | 48874925 | 285 | DEL | | 1/1 | Inframe_deletion | White matter disease and Alzheimer’s disease |
| chr19 | 49004220 | 49004220 | 337 | DEL | | 4/15 | NMD_transcript_variant | Chorionic gonadotropin level; follicle stimulating hormone level; cell cycle and mitotic |
| chr20 | 62965134 | 62965134 | 564 | DEL | | 9/13 | Splice_region_variant | Porokeratosis; cutaneous photosensitivity; papule and pruritus |
| chr20 | 63533689 | 63533689 | 537 | DEL | PTK6 | 4/8 | Frame_shift | Dysgraphia; schizophrenia, autism spectrum disorder and hair shape |
| chr22 | 39964324 | 39964903 | 578 | INS | Z82206.1 | 2/2 | Coding_sequence_variant | <Unknown> |
Chrom, chromosome; INS, insertion; DEL, deletion; NMD, nonsense-mediated decay.
^a The coordinates are based on human GRCh38.
^b The gene functions were collected from the GeneCards database and the literature[35,37–40,43–45].
Fig. 5 ASSVs associate with ape-specific phenotypes (ASPs) or great-ape-specific phenotypes (GASPs). a Heatmap illustrating the ADEs with high-confidence ASSVs in eight brain regions. The nearest genes are indicated and the corresponding brain regions are indicated in red. The neurofunction-related genes are highlighted in red. ASSV deletions (circles) and insertions (triangles) are denoted. The circles/triangles in black and gray refer to high-confidence ASSVs and candidate ASSVs, respectively. b Comparison of H3K27Ac signals of the ADE with an ASSV (587 bp deletion) in ITSN2 among human, chimpanzee, and macaque. The ADE exhibits a significant signal difference between human/chimpanzee and macaque in five brain regions (*P < 0.05; **P < 0.01; ***P < 0.001; NS, not significant, P > 0.05). c A 587 bp deletion within intron-29 of ITSN2 disrupts a putative enhancer sequence in the great-ape lineage, with reduced enhancer activity in human and chimpanzee compared with rhesus macaque. The H3K27Ac signals in PFC and sequence alignments are shown. d ITSN2 exhibits significantly lower expression in chimpanzee and human compared with macaque across 16 neocortical layers (1S–16S) and the adjacent white matter (17S) (left panel). PCR validation is shown (right panel). e A 477 bp deletion located in intron-10 of CDH8, a gene related to tail development. f A 178 bp ape lineage-specific deletion in intron-12 of NALCN, a gene associated with human fetal adducted thumbs. g A dot plot alignment highlights a 587 bp deletion (located in ITSN2) in apes compared with rhesus macaque and marmoset. h Summary of candidate ASSVs/GASSVs located in genes associated with ASPs (in red) or GASPs (in blue). Statistical assessment of the differences in H3K27Ac signal and gene expression was conducted using two-tailed unpaired and paired t tests, respectively. NS: not significant (P > 0.05) and ***P < 0.001