Peter A Larsen, R Alan Harris, Yue Liu, Shwetha C Murali, C Ryan Campbell, Adam D Brown, Beth A Sullivan, Jennifer Shelton, Susan J Brown, Muthuswamy Raveendran, Olga Dudchenko, Ido Machol, Neva C Durand, Muhammad S Shamim, Erez Lieberman Aiden, Donna M Muzny, Richard A Gibbs, Anne D Yoder, Jeffrey Rogers, Kim C Worley.
Abstract
BACKGROUND: The de novo assembly of repeat-rich mammalian genomes using only high-throughput short read sequencing data typically results in highly fragmented genome assemblies that limit downstream applications. Here, we present an iterative approach to hybrid de novo genome assembly that incorporates datasets stemming from multiple genomic technologies and methods. We used this approach to improve the gray mouse lemur (Microcebus murinus) genome from early draft status to a near chromosome-scale assembly.
Keywords: Centromeres; Hi-C; Optical maps; Physical maps; Strepsirrhine primate; Super-scaffolding; de novo assembly
Year: 2017 PMID: 29145861 PMCID: PMC5689209 DOI: 10.1186/s12915-017-0439-6
Source DB: PubMed Journal: BMC Biol ISSN: 1741-7007 Impact factor: 7.431
Fig. 1 Flowchart of the hybrid assembly procedure. The initial assembly was generated using Illumina data and AllPaths-LG, followed by refined scaffolding using Atlas-Link and gap filling using Atlas-GapFill. Further gap filling with PacBio data and PBJelly followed, generating Mmur 2.0. The Mmur 2.0 assembly was then super-scaffolded iteratively, using BNG optical map data to identify conflicts and to break and join scaffolds, and using Lachesis with Hi-C proximity ligation data to super-scaffold further. PBJelly was run a second time to fill gaps in the final super-scaffolds, followed by Pilon error correction, creating the Mmur 3.0 assembly (* indicates that the same PacBio data were used for the second PBJelly analysis)
Fig. 2 Hybrid iterative improvement of the mouse lemur genome assembly. a Graph showing improvement of the de novo mouse lemur genome assembly from draft status (Mmur 1.0) to chromosome-scale (Mmur 3.0) using the methods described herein. PBJelly Lach 1 results are coincident with those of Mmur 3.0. b Graph of the three main mouse lemur genome assemblies: Mmur 1.0 (draft assembly; ~1.93X) released in 2007; Mmur 2.0 (primary assembly; ~190X); Mmur 3.0 (final chromosome-scale assembly). For both panels, the X-axis shows the percent of the genome contained within scaffolds (arranged according to length) and the Y-axis shows scaffold length in Mb
Summary statistics for iterative super-scaffolding of the Microcebus murinus genome
| | Mmur 1.0 | Mmur 2.0 | BNG Round 1 | Lachesis Round 1 | BNG Round 2 | Lachesis Round 2 | Mmur 3.0 |
|---|---|---|---|---|---|---|---|
| Number of scaffolds | 172,937 | 10,311 | 10,161 | 7,813 | 8,134 | 7,679 | 7,679 |
| Total size of scaffolds, bp | 2,910,103,014 | 2,438,804,424 | 2,469,090,855 | 2,492,570,855 | 2,491,435,191 | 2,495,985,191 | 2,487,714,386 |
| Longest scaffold, bp | 2,843,453 | 23,116,325 | 33,906,312 | 151,367,110 | 56,348,711 | 155,649,118 | 155,207,550 |
| N50 scaffold length, bp | 214,914 | 3,711,085 | 6,320,565 | 103,223,157 | 14,483,702 | 93,443,986 | 93,316,391 |
| N50 contig length, bp | 3,511 | 182,929 | 182,011 | 182,011 | 181,924 | 181,924 | 234,304 |
| Percentage of assembly in scaffolded contigs | 95.4% | 99.2% | 99.2% | 99.6% | 99.6% | 99.6% | 99.6% |
| Scaffold, %N | 36.35 | 2.5 | 3.7 | 4.6 | 4.56 | 4.74 | 4.07 |
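The N50 rows above follow the usual definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The sketch below shows that calculation on a toy list of lengths; the function name `n50` and the example values are illustrative assumptions, not taken from the authors' pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold or contig lengths (bp)."""
    ordered = sorted(lengths, reverse=True)   # longest first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running sum reaches half the total is the N50.
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


# Toy example (hypothetical lengths, not from the paper): total = 290, half = 145.
toy_lengths = [80, 70, 50, 40, 30, 20]
print(n50(toy_lengths))
```

Because N50 depends only on the length distribution, joining scaffolds raises it without adding sequence, which is consistent with the scaffold N50 changing across rounds while the contig N50 stays nearly flat until the final gap-filling step.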
Benchmarking Universal Single-Copy Orthologs (BUSCO) results based on 3023 groups searched
| | Mmur 2.0 | BNG Round 1 | Lachesis Round 1 | BNG Round 2 | Lachesis Round 2 | Mmur 3.0 |
|---|---|---|---|---|---|---|
| Complete single-copy BUSCOs | 2708 | 2686 | 2697 | 2706 | 2690 | 2700 |
| Complete duplicated BUSCOs | 75 | 74 | 68 | 73 | 65 | 72 |
| Fragmented BUSCOs | 188 | 206 | 183 | 189 | 189 | 191 |
| Missing BUSCOs | 127 | 131 | 143 | 128 | 144 | 132 |
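To compare assemblies on a common scale, the BUSCO counts can be expressed as percentages of the 3023 groups searched. The sketch below does this for the table values. Note that in every column the complete, fragmented, and missing rows sum to 3023 on their own, which suggests the duplicated count is a subset of the complete count; the code treats it that way as an assumption. The dictionary layout and function name are illustrative.

```python
TOTAL_GROUPS = 3023  # number of BUSCO groups searched, from the table caption

# Counts copied from the table; keys are the assembly stages.
busco = {
    "Mmur 2.0":         {"complete": 2708, "duplicated": 75, "fragmented": 188, "missing": 127},
    "BNG Round 1":      {"complete": 2686, "duplicated": 74, "fragmented": 206, "missing": 131},
    "Lachesis Round 1": {"complete": 2697, "duplicated": 68, "fragmented": 183, "missing": 143},
    "BNG Round 2":      {"complete": 2706, "duplicated": 73, "fragmented": 189, "missing": 128},
    "Lachesis Round 2": {"complete": 2690, "duplicated": 65, "fragmented": 189, "missing": 144},
    "Mmur 3.0":         {"complete": 2700, "duplicated": 72, "fragmented": 191, "missing": 132},
}


def busco_percentages(counts, total=TOTAL_GROUPS):
    """Convert raw BUSCO counts to percentages of the groups searched."""
    return {category: round(100 * n / total, 1) for category, n in counts.items()}


for stage, counts in busco.items():
    # Assumption: "duplicated" is counted inside "complete", so it is left out of the sum.
    assert counts["complete"] + counts["fragmented"] + counts["missing"] == TOTAL_GROUPS
    print(stage, busco_percentages(counts))
```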
Fig. 3 Mouse lemur 3.0 assembly. Circos plots were calculated with a sliding window of 500 kb. a Linear plot of percent of gaps encoded as N’s, plotted inward, where the red horizontal line is 25%. b Histogram of BNG physical map coverage across the scaffold, plotted with three horizontally shaded zones that match the data’s quartiles: 35× coverage and below is red (less than Q1), 35–56× coverage is grey (Q1–Q3), and 56× coverage and above is green (greater than Q3). c Lachesis scaffolds arranged according to length (in base pairs). Blue colored scaffolds represent those assigned to mouse lemur chromosomes (see Fig. 5) and white scaffolds are undetermined. Purple hashes identify regions containing the complete single-copy genes (n = 2628) according to BUSCO analysis. d Histogram of percent of bases that are G + C across the genome. Genome-wide average is 40.98%, regions shaded light green are at least 47.5%, and regions shaded dark green are at least 55% G + C content
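The per-window gap and G + C tracks in panels a and d can be approximated with a simple windowed scan over each scaffold sequence. The sketch below is a minimal version under stated assumptions: the caption gives only the 500 kb window size, so the step size (non-overlapping by default) and the choice to compute G + C over non-N bases are guesses, and `window_stats` is a hypothetical helper name.

```python
def window_stats(seq, window=500_000, step=None):
    """Yield (start, percent_N, percent_GC) for successive windows of a sequence.

    Assumptions: windows do not overlap unless `step` is given, and percent G + C
    is computed over called (non-N) bases; it is None for an all-N window.
    """
    step = step or window
    seq = seq.upper()
    for start in range(0, len(seq), step):
        chunk = seq[start:start + window]
        n_count = chunk.count("N")
        gc_count = chunk.count("G") + chunk.count("C")
        called = len(chunk) - n_count
        percent_n = 100 * n_count / len(chunk)
        percent_gc = 100 * gc_count / called if called else None
        yield start, percent_n, percent_gc


# Toy sequence with a 5 bp window so the arithmetic is easy to follow.
for row in window_stats("GGCCAATTNN", window=5):
    print(row)
```

Windows whose percent N exceeds 25 would sit above the red line in panel a, and windows with G + C of at least 47.5 or 55 would fall in the light or dark green bands of panel d.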
Fig. 5 Centromere discovery using single molecule PacBio and BioNano data. a Graph of repeat units identified within raw PacBio data using Tandem Repeats Finder (see Methods). Each dot represents a repeat unit within a raw PacBio read and is graphed according to monomer length and overall (tandem) repeat length. A distinct distribution surrounding a 53 bp monomer was observed (including tandem repeats divisible by 53 bp). b FISH showed that the 53 bp monomer (Mm53) localizes to nearly all centromeres in the mouse lemur karyotype, the exception being the X chromosome (see Results and Fig. 6). c We mined our genome assembly for the Mm53 monomer associated with M. murinus centromeres. The Mm53 repeat was detected near the ends of scaffolds and/or gaps of Ns (representative in silico physical map shown in green). When aligned to the consensus BioNano (BNG) physical map (blue), a distinct repeat unit was identified, indicating the presence of a BNG label within mouse lemur centromeres, thus providing a measure of higher-order repeat unit (~3.9 kb) and overall alpha-satellite array size within our BNG physical maps
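The pattern in panel a, a cluster at 53 bp plus repeat units divisible by 53 bp, can be illustrated by tallying reported repeat periods by their multiple of the monomer length. This is only a sketch: it assumes the periods have already been extracted from the Tandem Repeats Finder output into a plain list of integers, and the function name, the `tolerance` parameter, and the example values are hypothetical.

```python
from collections import Counter


def multiples_of_monomer(periods, monomer=53, tolerance=0):
    """Count repeat periods that fall on integer multiples of a monomer length.

    Returns a Counter mapping the multiple k (1 = monomer, 2 = dimer, ...) to the
    number of periods within `tolerance` bp of k * monomer.
    """
    counts = Counter()
    for period in periods:
        k = round(period / monomer)            # nearest multiple of the monomer
        if k >= 1 and abs(period - k * monomer) <= tolerance:
            counts[k] += 1
    return counts


# Hypothetical periods, not data from the paper.
example_periods = [53, 106, 54, 159, 20, 53]
print(multiples_of_monomer(example_periods))
print(multiples_of_monomer(example_periods, tolerance=1))
```

A nonzero tolerance may be useful with raw long reads, where insertion and deletion errors can shift the apparent period by a base or two.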
Fig. 4 Macro synteny between mouse lemur and human chromosomes. Broad regions of synteny were identified between 33 M. murinus chromosomes (left) and 23 human chromosomes (right) using MUMmer software and these regions are shown using ribbons colored according to M. murinus chromosome number. Putative identifications for the 33 M. murinus chromosomes were based on comparative cytogenetic data [29] (see Results; Additional file 2: Tables S6 and S7; Additional file 4: Figure S3). Ticks in each chromosome indicate lengths of 10 Mb. Mouse lemur photo courtesy of David Haring and the Duke Lemur Center
Fig. 6 Functional identification of centromeric sequences in M. murinus. a, a’: Female mouse lemur metaphase chromosomes (blue) were hybridized with Mm53 (green), showing that the 53 bp sequence, Mm53, was present at every centromere except for the two metacentric X chromosomes (arrows). Gray-scale image shows the Mm53 fluorescent signal alone, illustrating the vast difference in abundance among the mouse lemur chromosomes. b–b”: Combined immunostaining for the essential centromere protein CENP-A and FISH with the Mm53 probe showed that CENP-A was present at every mouse lemur chromosome, including the two X chromosomes (insets in b). Gray-scale images of fluorescent signals for Mm53 (b’) and CENP-A (b”) are separated out to emphasize relatively equal amounts of CENP-A at each chromosome, despite varying amounts of Mm53 centromeric sequence. The two X chromosomes have functional centromeres but lack Mm53, indicating that the X centromere is defined by a novel sequence that remains unidentified. Multiple colocalization analyses (k1k2 overlap coefficient and Manders’ colocalization coefficient (MCC), without and with thresholding) were performed on individual metaphases (n = 10 for each dot plot) to measure colocalization of red (CENP-A) and green (Mm53) signals. These analyses emphasized that a high proportion of CENP-A overlapped with Mm53
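The colocalization measures named in the Fig. 6 caption have standard textbook definitions, sketched below on flat lists of per-pixel intensities. This is not the authors' analysis code: the caption does not say which software or threshold scheme was used, so the function names, the pixel values, and the simple thresholding rule (a pixel counts as colocalized when the other channel exceeds its threshold) are all assumptions.

```python
def manders(red, green, red_thr=0.0, green_thr=0.0):
    """Return Manders' coefficients (M1, M2) for two equal-length intensity lists.

    M1: fraction of total red intensity found in pixels where green > green_thr.
    M2: fraction of total green intensity found in pixels where red > red_thr.
    """
    pairs = list(zip(red, green))
    m1 = sum(r for r, g in pairs if g > green_thr) / sum(red)
    m2 = sum(g for r, g in pairs if r > red_thr) / sum(green)
    return m1, m2


def overlap_k1_k2(red, green):
    """Return the k1 and k2 overlap coefficients: sum(R*G)/sum(R^2) and sum(R*G)/sum(G^2)."""
    cross = sum(r * g for r, g in zip(red, green))
    k1 = cross / sum(r * r for r in red)
    k2 = cross / sum(g * g for g in green)
    return k1, k2


# Hypothetical four-pixel image: red stands in for CENP-A, green for Mm53.
red_pixels = [10, 20, 0, 5]
green_pixels = [0, 4, 8, 2]
print(manders(red_pixels, green_pixels))
print(overlap_k1_k2(red_pixels, green_pixels))
```

In this framing, a high M1 with red as CENP-A corresponds to the caption's statement that a high proportion of CENP-A overlapped with Mm53, while M2 can be lower when much of the Mm53 signal lies outside the CENP-A domain.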