A Talenti, J Powell, J D Hemmink, E A J Cook, D Wragg, S Jayaraman, E Paxton, C Ezeasor, E T Obishakin, E R Agusi, A Tijjani, W Amanyire, D Muhanguzi, K Marshall, A Fisch, B R Ferreira, A Qasim, U Chaudhry, P Wiener, P Toye, L J Morrison, T Connelley, J G D Prendergast.
Abstract
Despite only 8% of cattle being found in Europe, European breeds dominate current genetic resources. This adversely impacts cattle research in other important global cattle breeds, especially those from Africa for which genomic resources are particularly limited, despite their disproportionate importance to the continent's economies. To mitigate this issue, we have generated assemblies of African breeds, which have been integrated with genomic data for 294 diverse cattle into a graph genome that incorporates global cattle diversity. We illustrate how this more representative reference assembly contains an extra 116.1 Mb (4.2%) of sequence absent from the current Hereford sequence and consequently inaccessible to current studies. We further demonstrate how using this graph genome increases read mapping rates, reduces allelic biases and improves the agreement of structural variant calling with independent optical mapping data. Consequently, we present an improved, more representative, reference assembly that will improve global cattle research.
Year: 2022 PMID: 35177600 PMCID: PMC8854726 DOI: 10.1038/s41467-022-28605-0
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 17.694
Fig. 1 Principal component analysis of the 294 cattle.
The positions of the populations of origin of the five assemblies considered in this study are shown. The source data are provided with the paper.
Fig. 2 Snail plots of the N’Dama (NDA1) and Ankole (ANK1) genome assemblies.
Key metrics are shown for the (A) N’Dama and (B) Ankole genomes: the longest scaffold (red vertical line), N50 (orange track), N90 (light orange track), GC content (external blue track) and BUSCO scores (outer circular pie chart in green). The region of elevated N content in the N’Dama assembly corresponds to a 5 Mb gap in one of the contigs, matching a region of generalised low identity in all five assemblies (Supplementary Fig. 4). Although this region contains an unfilled gap, the regions flanking it align to directly contiguous portions of the genome in the other assemblies, so the gap is potentially smaller than represented here.
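The N50 and N90 tracks in the snail plots summarise assembly contiguity. As a minimal sketch of how such a statistic is commonly computed (the function name `nx` and the scaffold lengths are hypothetical and are not taken from the paper), Nx is the length of the shortest sequence in the smallest set of longest sequences that together cover at least x% of the assembly:

```python
def nx(lengths, percent):
    """Return the Nx statistic for a list of sequence lengths.

    Sequences are visited from longest to shortest; the function returns the
    length at which the running total first reaches `percent` of the assembly.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length
    return min(lengths)  # only reached if percent > 100


# Hypothetical scaffold lengths, for illustration only.
toy_scaffolds = [100, 80, 60, 40, 20]
n50 = nx(toy_scaffolds, 50)
n90 = nx(toy_scaffolds, 90)
print(n50, n90)
```

A larger N50 or N90 indicates that more of the assembly sits in long scaffolds.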
Sequence contribution from the two African genomes.
| | | Angus | Ankole | Brahman | N’Dama | Total |
|---|---|---|---|---|---|---|
| Non-reference nodes (total) | #nodes | 6,188,973 | 14,994,500 | 14,627,206 | 10,338,166 | 29,315,173 |
| | bp | 46,066,551 | 118,203,105 | 60,100,791 | 87,792,217 | 257,235,506 |
| Non-reference nodes (autosomes) | #nodes | 5,823,611 | 11,262,561 | 13,362,852 | 8,832,454 | 23,599,013 |
| | bp | 17,903,582 | 41,317,786 | 39,647,314 | 25,806,882 | 76,660,696 |
| Filtered non-reference nodes (total) | #nodes | 285,307 | 780,815 | 705,024 | 494,781 | 1,008,401 |
| | bp | 4,612,021 | 12,486,639 | 12,023,827 | 6,760,434 | 15,491,621 |
| Filtered non-reference nodes (autosomes) | #nodes | 198,393 | 429,652 | 443,737 | 313,670 | 571,123 |
| | bp | 3,290,022 | 7,093,645 | 7,435,063 | 4,595,327 | 9,046,464 |
| Final set of contigs | Number of contigs | 2,250 | 5,058 | 6,387 | 2,970 | 16,665 |
| | Length (total) | 3,274,775 | 4,508,339 | 10,507,420 | 2,246,905 | 20,537,439 |
| | Length (min) | 61 | 61 | 61 | 61 | 61 |
| | Length (max) | 92,590 | 34,789 | 103,683 | 29,488 | 103,683 |
| | Length (mean) | 1,455.00 | 891.00 | 1,645.00 | 757.00 | 1,232.37 |
| | Length (std) | 5,177.00 | 1,990.00 | 4,957.00 | 1,885.00 | 3,875.06 |
The table shows the amount of sequence from the non-ARS-UCD1.2 genomes and how much of it the two novel assemblies from African breeds contribute.
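The "Final set of contigs" rows can be cross-checked with a little arithmetic. The sketch below uses only the contig counts and total lengths reported in the table. The dictionary layout and variable names are illustrative assumptions, and the African share is a figure derived here, not one stated in the source:

```python
# Contig counts and total lengths (bp) copied from the "Final set of contigs" rows.
contigs = {
    "Angus": {"count": 2250, "total_bp": 3_274_775},
    "Ankole": {"count": 5058, "total_bp": 4_508_339},
    "Brahman": {"count": 6387, "total_bp": 10_507_420},
    "N'Dama": {"count": 2970, "total_bp": 2_246_905},
}

total_count = sum(g["count"] for g in contigs.values())
total_bp = sum(g["total_bp"] for g in contigs.values())

# Mean contig length per genome and overall (total length / number of contigs).
mean_by_genome = {name: g["total_bp"] / g["count"] for name, g in contigs.items()}
overall_mean = total_bp / total_count

# Share of the final contig sequence contributed by the two African assemblies.
african_bp = contigs["Ankole"]["total_bp"] + contigs["N'Dama"]["total_bp"]
african_share = african_bp / total_bp

print(total_count, total_bp, round(overall_mean, 2))
print({name: round(mean) for name, mean in mean_by_genome.items()})
print(f"African assemblies: {african_share:.1%} of final contig sequence")
```

The per-genome sums reproduce the Total column (16,665 contigs; 20,537,439 bp), and the recomputed means agree with the tabulated means once rounded to whole base pairs.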
Fig. 3 Comparison of genomic content across the genomes.
(A) High-quality (NOVEL) sequence specific to, or shared among, each non-reference genome. Numbers represent the kilobases of non-Hereford sequence associated with the set of genomes defined by the group(s) highlighted in green. Each genome is indicated by a number (1 = Ankole, 2 = Angus, 3 = Brahman and 4 = N’Dama). (B) Multiple genome alignments of the MHC region on chromosome 23 generated with AliTV (v1.0.6)[75]. The plot represents the shared sequences among the different genomes; blue to green segments are representative of higher to lower similarity (100 to 70%, respectively); the enlarged region is the MHC region, which shows a large amount of variation between the assemblies.
Fig. 4 Graph genome descriptions and their performances.
(A) A cartoon representation of the four types of graph genomes considered (the linear VG1, VG1 expanded with 11 M short variants (VG1p), the CACTUS VG5 graph and the CACTUS graph expanded with the 11 M short variants (VG5p)). Regions in blue come from the backbone sequence, those in grey are the short variants from Dutta et al. (2020), and those in yellow are the variants derived from the CACTUS graph. (B) The percent enrichment of reads mapped by vg (primary axis) using the different graphs over the bwa mem linear mapper. (C) The allelic balance for the linear callers FreeBayes and GATK HaplotypeCaller compared with vg call, showing how the latter reduces the allelic bias for large variants. For other versions of this plot looking at different sets of known and novel variants, see Supplementary Note 3. (D) The intersection of structural variants longer than 500 bp called using the VG5p graph (blue), Delly V2 (green) and the Bionano optical mapping (orange), showing how most variants called with vg are also confirmed using one of the other methods. Note that an SV called by one method may overlap more than one SV called by a different method. The source data for panels (B), (C) and (D) are provided with the paper.
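Panels (B) and (C) report two simple ratios. The sketch below shows one common way to compute them. Both definitions are assumptions for illustration, since the exact formulas are not given in this section: percent enrichment is taken as the relative gain in mapped reads over the linear mapper, and allelic balance as the fraction of reads supporting the alternate allele at a heterozygous site, where 0.5 indicates no bias. All read counts are hypothetical.

```python
def percent_enrichment(graph_mapped, linear_mapped):
    """Relative gain (in %) of reads mapped with a graph over a linear mapper."""
    if linear_mapped <= 0:
        raise ValueError("linear_mapped must be positive")
    return 100.0 * (graph_mapped - linear_mapped) / linear_mapped


def allelic_balance(ref_reads, alt_reads):
    """Fraction of reads supporting the alternate allele at a heterozygous site."""
    depth = ref_reads + alt_reads
    if depth == 0:
        raise ValueError("site has no read support")
    return alt_reads / depth


# Hypothetical counts, for illustration only.
enrichment = percent_enrichment(graph_mapped=1_010_000, linear_mapped=1_000_000)
biased_site = allelic_balance(ref_reads=12, alt_reads=8)      # pulled towards the reference
balanced_site = allelic_balance(ref_reads=10, alt_reads=10)   # no reference bias

print(enrichment, biased_site, balanced_site)
```

Under these definitions, a caller with less reference bias yields allelic balances closer to 0.5 at heterozygous sites.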
Fig. 5 Example of an insertion in the N’Dama relative to the Hereford reference.
(A) The insertion was detected in both Kenyan N’Dama optical mapping (OM) samples, as represented by an increase in the distance between labels (vertical lines) on each Bionano haplotype (blue rectangles) over that expected given the labels’ in silico locations in the Hereford reference (green rectangle). (B) This SV was identified as homozygous in all three Nigerian N’Dama resequenced genomes when called against the graph genome. (C) A Bandage[76] representation of the graph genome in this region showing the large structural variant (blue loop) in the Hereford genome (grey line).
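The logic of panel (A) can be sketched as a simple difference: if two labels lie further apart on the optical map than their in silico positions in the reference predict, the excess distance approximates the inserted sequence. The function name, the tolerance parameter and all coordinates below are hypothetical and are not values from the paper:

```python
def estimated_insertion_size(observed_distance, expected_distance, tolerance=500):
    """Estimate an insertion size (bp) from the spacing of two flanking labels.

    `observed_distance` is the label spacing on the optical map and
    `expected_distance` the spacing predicted in silico from the reference.
    Differences within `tolerance` bp are treated as measurement noise.
    """
    excess = observed_distance - expected_distance
    return excess if excess > tolerance else 0


# Hypothetical label spacings for two samples against one reference interval.
expected = 20_000
sample_sizes = [
    estimated_insertion_size(32_500, expected),
    estimated_insertion_size(32_300, expected),
]
print(sample_sizes)
```

Agreement between samples, as in the two toy estimates here, is what lends support to a shared insertion.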
Fig. 6 ATAC-seq analyses results.
(A) Enrichment or depletion of the number of ATAC-seq peaks called in the different assemblies relative to the number called in ARS-UCD1.2, showing that more peaks were called using the expanded ARS-UCD1.2+ genome in all samples. (B) Enrichment around the TSS of both the ARS-UCD1.2 annotated genes (left three heatmaps) and the 923 features predicted by Augustus in the novel contigs (right). The source data for panel (A) are provided with the paper.