Adrien Leger1, Ian Brettell1, Jack Monahan1, Carl Barton1, Nadeshda Wolf2, Natalja Kusminski2, Cathrin Herder2, Narendar Aadepu2,3, Clara Becker3, Jakob Gierten3, Omar T Hammouda3, Eva Hasel3, Colin Lischik3, Katharina Lust3, Natalia Sokolova3, Risa Suzuki3, Tinatini Tavhelidse3, Thomas Thumberger3, Erika Tsingos3, Philip Watson3, Bettina Welz3, Kiyoshi Naruse4, Felix Loosli2, Joachim Wittbrodt3, Ewan Birney1, Tomas Fitzgerald1.
Abstract
BACKGROUND: The teleost medaka (Oryzias latipes) is a well-established vertebrate model system, with a long history of genetic research and high-quality reference genomes available for several inbred strains. Medaka has a high tolerance to inbreeding from the wild, which makes it possible to establish inbred lines from wild founder individuals.
Keywords: Genetics; Graph genome; Inbred panel; Long read sequencing; Medaka; Methylation; Nanopore; Pan genome; Structural variation
Year: 2022 PMID: 35189951 PMCID: PMC8862245 DOI: 10.1186/s13059-022-02602-4
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 17.906
Summary statistics of individual MIKK line assemblies
| Line id | Number of contigs | GC (%) | Total length (bp) | Largest contig (bp) | N50 (bp) | Total aligned length (bp) | Largest alignment (bp) | NA50 (bp) |
|---|---|---|---|---|---|---|---|---|
| | 2,886 | 40.66 | 730,816,425 | 5,635,124 | 802,725 | 599,170,757 | 2,340,443 | 259,311 |
| | 2,762 | 40.71 | 737,637,241 | 6,376,848 | 971,613 | 612,988,975 | 2,781,976 | 279,257 |
| | 2,512 | 40.69 | 732,447,291 | 5,851,261 | 942,347 | 608,014,583 | 2,243,072 | 265,102 |
| | 2,892 | 40.69 | 732,448,405 | 4,099,264 | 845,096 | 607,409,964 | 1,761,859 | 253,015 |
| | 3,368 | 40.56 | 728,542,858 | 4,525,370 | 624,727 | 545,652,612 | 1,541,261 | 180,262 |
| | 3,077 | 40.59 | 727,390,278 | 6,080,511 | 708,738 | 573,096,833 | 1,823,612 | 220,342 |
| | 3,053 | 40.62 | 730,357,166 | 6,658,276 | 742,721 | 584,535,579 | 2,384,437 | 235,007 |
| | 4,374 | 40.49 | 720,948,860 | 3,238,833 | 404,886 | 491,501,885 | 1,304,457 | 105,372 |
| | 2,810 | 40.73 | 732,113,747 | 5,059,334 | 903,899 | 620,809,739 | 2,195,961 | 270,130 |
| | 3,651 | 40.72 | 741,963,499 | 5,589,066 | 737,055 | 563,763,739 | 2,231,101 | 180,572 |
| | 3,142 | 40.71 | 731,842,804 | 4,138,056 | 747,064 | 606,908,341 | 2,042,573 | 245,116 |
| | 3,977 | 40.75 | 729,131,624 | 3,835,876 | 475,444 | 621,355,180 | 1,698,873 | 209,441 |
Fig. 1 Quality metrics for individual assemblies. A Normalised distribution of contig lengths for each assembly. Dashed lines represent the N50 values. B Cumulative length of contigs. C Cumulative length of contig blocks aligned to HdrR, compared with the HdrR reference chromosomes (dashed black line). D Distribution of GC content of the assemblies compared with the HdrR reference (dashed black line). E Feature-response curve for the HdrR gene annotation, showing the quality of the assemblies as a function of the maximum number of possible genes allowed in the contigs
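The N50 values reported in the table and marked by the dashed lines in panel A can be reproduced from a list of contig lengths. The following is a minimal sketch using the standard definition of N50 (the length of the shortest contig in the smallest set of longest contigs that covers at least half of the assembly); the function name `n50` is illustrative and not taken from the authors' pipeline. Under the usual convention, NA50 is the same statistic computed on aligned block lengths instead of contig lengths.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or aligned block) lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    # Walk from the longest contig down until half of the total is covered.
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contig lengths given")


if __name__ == "__main__":
    # Toy lengths, not data from the paper.
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

Passing aligned block lengths instead of contig lengths to the same function gives an NA50-style value, assuming the blocks have already been split at misassembly breakpoints.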
Fig. 2 Pangenome graph reference characterisation. A Example section of the graph for chromosome 1 showing different paths through the graph via segments originating from the 4 types of assemblies used to build the graph. B Total length of segments contained in the graph by type of assembly. Dashed areas represent the proportion of bases for segments covered by at least 2 samples with at least 5% of the average coverage over the HdrR reference segments. C Distribution of segment lengths by type of assembly, normalised by the total length of segments. D Kernel density plots of the length of alternative segments according to their divergence when aligned onto the HdrR reference. The quadrants defined by vertical dashed lines (length = 2 kb) and horizontal dashed lines (divergence = 0.5) separate the segments into 4 categories according to their length and divergence score. The numbers displayed correspond to the percentages of segments within each of the 4 quadrants. E Percentages of bases from nanopore reads aligning on each type of assembly for the 12 MIKK samples. F Detailed percentages of bases aligned on alternative segments for each MIKK sample depending on segment cross-usage by the other samples, from 12 (all other samples) to 1 (only the current sample). A segment was considered used by a sample when its coverage was at least 5% of the average HdrR reference coverage
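The segment usage rule described in panels B and F can be sketched as a simple filter. This is an illustration only: the data layout (nested dictionaries of per-sample coverage), the function name `used_segments`, and the choice to compare each sample against its own average reference coverage are assumptions, not details given in the source. The coverage fraction is a parameter because the threshold is what defines "used".

```python
def used_segments(coverage, ref_mean_coverage, min_fraction=0.05, min_samples=2):
    """Select segments used by at least `min_samples` samples.

    coverage: {segment_id: {sample: mean coverage of that segment}}
    ref_mean_coverage: {sample: average coverage over HdrR reference segments}
    A sample "uses" a segment when its coverage on the segment is at least
    `min_fraction` of that sample's average reference coverage (assumed
    to be computed per sample).
    """
    kept = {}
    for segment, per_sample in coverage.items():
        users = sorted(
            sample
            for sample, cov in per_sample.items()
            if cov >= min_fraction * ref_mean_coverage[sample]
        )
        if len(users) >= min_samples:
            kept[segment] = users
    return kept


if __name__ == "__main__":
    # Toy values, not data from the paper.
    cov = {
        "alt1": {"s1": 3.0, "s2": 0.5, "s3": 2.0},
        "alt2": {"s1": 0.0, "s2": 4.0, "s3": 0.1},
    }
    ref = {"s1": 30.0, "s2": 40.0, "s3": 20.0}
    print(used_segments(cov, ref))
```

The length of each returned sample list gives the cross-usage count used to stratify panel F.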
Pangenome graph reference statistics. Segment type indicates which assembly the segments originally come from. For the “Segments used by at least 2 MIKK samples” columns, we defined a segment as being used if its coverage is at least 10% of the average coverage over the HdrR reference segments
| Segment type | All segments in graph: length (bp) | All segments in graph: # segments | Segments used by at least 2 MIKK samples: length (bp) | Segments used by at least 2 MIKK samples: # segments | Median segment length | Longest segment | N50 | Median % identity |
|---|---|---|---|---|---|---|---|---|
| | 734,100,826 | 648,692 | 713,609,808 | 615,564 | 401 | 675,459 | 3,000 | NA |
| | 103,507,879 | 148,689 | 43,204,187 | 47,854 | 239 | 175,667 | 2,003 | 62.9% |
| | 152,043,332 | 100,275 | 49,055,247 | 23,881 | 371 | 236,527 | 5,803 | 60.4% |
| | 318,174,656 | 211,836 | 203,539,620 | 161,533 | 559 | 89,792 | 3,998 | 73.7% |
| | 1,307,826,693 | 1,109,492 | 1,009,408,862 | 848,832 | 389 | 675,459 | 3,342 | NA |
Fig. 3 Example structural variations identified in the pangenome graph. A, B Visualisation of alternative divergent paths in the graph. For both selected examples, the left panel shows a Bandage plot indicating the reference HdrR and the alternative path. Each graph segment is color-coded according to the number of samples with at least 50% of the reference coverage for the ONT DNA-Seq dataset, from white (none) to deep red (all). Blue segments are supported by multiple samples for both Illumina RNA-Seq and ONT DNA-Seq. The right panel shows the linear structure of local assemblies for both the reference and the alternative paths. For the reference, the top blue track represents the existing medaka HdrR annotations. The light and dark green tracks correspond to the segment layout from the graph. Finally, the heatmaps show the RNA expression intensities for all 50 medaka samples sequenced along the represented sections of the graph (grey = not found, white = less than 5 reads, dark red = more than 100 reads). C, D Visualisation of large-scale deletions in the graph. For both selected examples, the left panel shows the medaka HdrR annotations (blue) and the graph segment layout (light and dark green), overlaid with the deletion position (grey rectangle). The Bandage plots on the right are color-coded as previously described. The shaded area indicates the reference sequence deletions robustly supported by a direct connection between distant reference segments (link coverage > 50% of reference coverage for at least 9 samples)
Fig. 4 Polished SVs in 9 MIKK panel lines sequenced with ONT. DEL deletion, INS insertion, TRA translocation, DUP duplication, INV inversion. A Aggregate log10 counts and lengths of distinct SVs by type, excluding TRA. B pLI LOD scores in distinct SVs by SV type. C Histogram of LOD scores by SV type. D Total and singleton counts of SV types per sample. E Circos plot showing per-sample distribution and lengths of DEL variants across the genome. Circos figures for each of the other SV types are included in Additional file 8: Fig. S6
Fig. 5 DNA methylation analysis. A Heatmap of all significant CpG islands differentially methylated across the MIKK panel samples. CpG islands are sorted by genomic position on the X-axis. Samples are ordered by hierarchical clustering on the Y-axis and color-coded so that sibling lines are indicated with the same colors. The color scheme follows the value of the median log-likelihood ratio (llr), from −3 (blue) to 3 (red), with ambiguous values between −1 and 1 in white. B Distribution of distances to the closest gene TSS for significantly differentially methylated CpG islands in red (n=4249) and all non-significant islands in blue (n=8727). C Number of significant CpG islands by genomic window for the 24 HdrR chromosomes, from 0 (white) to 6 (dark red). D Example 100 kb region containing a significant CpG island in red (chr15:1565040-1565987), as opposed to non-significant ones in white. E CpG-level log-likelihood ratio kernel density plot for the CpG island highlighted in panel D. Samples are sorted on the Y-axis by decreasing median llr. Individual CpG values are indicated by dots. F Heatmap of log-likelihood ratio with hierarchical clustering by sample for the CpG island highlighted in panel D. On the X-axis are individual CpGs sorted by genomic position
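The color scheme in panel A summarises each CpG island, per sample, by the median of its CpG-level log-likelihood ratios and treats medians between −1 and 1 as ambiguous. A minimal sketch of that summary step is shown below; the function name `summarise_island` and the labels "positive", "negative" and "ambiguous" are illustrative assumptions, and the sketch does not reproduce the significance testing behind the figure.

```python
from statistics import median


def summarise_island(cpg_llrs, ambiguous_band=1.0):
    """Return (median llr, label) for one CpG island in one sample.

    cpg_llrs: log-likelihood ratios of the individual CpGs in the island.
    Medians within [-ambiguous_band, ambiguous_band] are labelled ambiguous,
    mirroring the white range of the heatmap color scale.
    """
    m = median(cpg_llrs)
    if m > ambiguous_band:
        return m, "positive"
    if m < -ambiguous_band:
        return m, "negative"
    return m, "ambiguous"


if __name__ == "__main__":
    # Toy values, not data from the paper.
    for llrs in ([2.0, 3.0, 1.5], [-0.5, 0.2, 0.9], [-2.0, -4.0]):
        print(summarise_island(llrs))
```

Whether a positive median corresponds to a methylated call depends on the convention of the methylation caller, which this record does not state.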