Theodore S Kalbfleisch, Edward S Rice, Michael S DePriest, Brian P Walenz, Matthew S Hestand, Joris R Vermeesch, Brendan L O'Connell, Ian T Fiddes, Alisa O Vershinina, Nedda F Saremi, Jessica L Petersen, Carrie J Finno, Rebecca R Bellone, Molly E McCue, Samantha A Brooks, Ernest Bailey, Ludovic Orlando, Richard E Green, Donald C Miller, Douglas F Antczak, James N MacLeod.
Abstract
Recent advances in genomic sequencing technology and computational assembly methods have allowed scientists to improve reference genome assemblies in terms of contiguity and composition. EquCab2, a reference genome for the domestic horse, was released in 2007. Although of equal or better quality than other first-generation Sanger assemblies, it had many of the shortcomings common to them. In 2014, the equine genomics research community began a project to improve the reference sequence for the horse, building upon the solid foundation of EquCab2 and incorporating new short-read data, long-read data, and proximity ligation data. Here, we present EquCab3. The count of non-N bases in the incorporated chromosomes is improved from 2.33 Gb in EquCab2 to 2.41 Gb in EquCab3. Contiguity has also been improved nearly 40-fold, with a contig N50 of 4.5 Mb, and scaffold contiguity has been enhanced to the point where all but one of the 32 chromosomes consists of a single scaffold.
Year: 2018 PMID: 30456315 PMCID: PMC6240028 DOI: 10.1038/s42003-018-0199-z
Source DB: PubMed Journal: Commun Biol ISSN: 2399-3642
Resulting contig and scaffold N50s are presented here for each major step in the process of assembling EquCab3.
| Assembly | Sequence composition | Contig N50 | Scaffold N50 |
|---|---|---|---|
| EquCab3 | Sanger + MaSuRCA super reads | 1.2 Mb | 6.6 Mb |
| EquCab3 | Sanger + MaSuRCA super reads + Chicago + Hi-C | 1.2 Mb | 86 Mb |
| EquCab3 | Sanger + MaSuRCA super reads + Chicago + Hi-C + PacBio | 4.5 Mb | 86 Mb |
| EquCab2 | Sanger Fosmid + BAC + Radiation Hybrid Map data | 112 kb | 46 Mb |
For comparison, the contig and scaffold N50s for the final EquCab2 product are also shown. The final EquCab3 product (Sanger + MaSuRCA super reads + Chicago + Hi-C + PacBio) improved the contig N50 about 40-fold (from 112 kb to 4.5 Mb), and the scaffold N50 improved from a chromosome arm-limited 46 Mb to a chromosome length-limited 86 Mb.
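The N50 statistic used in the table can be illustrated with a minimal sketch. It assumes the common definition: the length L such that sequences of length L or longer contain at least half of the total assembled bases. The function name `n50` and the toy contig lengths are hypothetical and are not taken from the EquCab assemblies; only the 112 kb and 4.5 Mb values come from the table.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)  # longest first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running sum covers half
        # of the assembly is the N50.
        if running >= half_total:
            return length


# Hypothetical toy assembly (arbitrary units), not real EquCab data.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = n50(toy_contigs)

# Fold improvement in contig N50 using the values reported in the table.
equcab2_contig_n50 = 112_000      # 112 kb
equcab3_contig_n50 = 4_500_000    # 4.5 Mb
fold_contig = equcab3_contig_n50 / equcab2_contig_n50

print(toy_n50)
print(round(fold_contig, 1))
```

In the toy example the total is 300, so the two longest contigs (80 + 70) already reach half of it and the N50 is 70. The same ratio applied to the table values gives roughly 40, matching the reported fold improvement.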
Fig. 1 Percentages of RNA-seq reads from eight tissues from two horses (designated 683610 and 686521) and genomic reads mapping to EquCab2 vs. EquCab3. We used sequence data from the Functional Annotation of Animal Genomes (FAANG) project for this mapping. More RNA-seq reads map to EquCab3 than to EquCab2 for every tissue in both horses. The percentage of genomic reads (last two rows; WGS) mapping to EquCab3 is also larger than the percentage mapping to EquCab2, but the difference is not as large.
Fig. 2 Number of reads from the Functional Annotation of Animal Genomes (FAANG) project WGS dataset mapping to EquCab2 and EquCab3. Significantly more reads map only to EquCab3 than only to EquCab2 (one-tailed two-sample binomial test, p < 2.2 × 10⁻¹⁶).
Fig. 3 A comparison of equine chromosome 31 between EquCab2 and EquCab3. a Average coverage per 10 kb window across chr31 in EquCab2 and EquCab3, with a large inversion between them highlighted. EquCab3 has fewer coverage drops and more total sequence than EquCab2. b An alignment of chr31 in EquCab2 and EquCab3 shows a large inversion between the two reference genomes. The radiation hybrid (RH) map (c) and Hi-C contact heat maps for EquCab2 (d) and EquCab3 (e) indicate that this discrepancy is the result of a misassembly in EquCab2.
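The per-window averaging described for Fig. 3a can be sketched as follows. This is an illustration only: the record does not say how coverage was computed, so the sketch assumes a plain list of per-base read depths for one chromosome as input. The function name `window_means` and the toy depths are hypothetical, and a small window is used in the example so the result is easy to check by hand.

```python
def window_means(depths, window=10_000):
    """Average per-base depth over consecutive, non-overlapping windows.

    The final window may be shorter than `window` if the chromosome
    length is not an exact multiple of the window size.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    means = []
    for start in range(0, len(depths), window):
        chunk = depths[start:start + window]
        means.append(sum(chunk) / len(chunk))
    return means


# Hypothetical per-base depths: covered, a coverage drop, covered again.
toy_depths = [30] * 4 + [0] * 4 + [28] * 2
toy_means = window_means(toy_depths, window=4)
print(toy_means)
```

A window whose mean falls to zero, as in the middle of the toy example, corresponds to the kind of coverage drop that the caption says is less frequent in EquCab3.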
Fig. 4 Annotation of EquCab2 and EquCab3 with the Comparative Annotation Toolkit shows substantial improvement in EquCab3. a More genes found in related species were annotated in EquCab3 than in EquCab2. b Fewer genes were split between contigs in EquCab3 than in EquCab2. c The gene coverage distribution is better in EquCab3 than in EquCab2.