Guillaume Martin, Franc-Christophe Baurens, Gaëtan Droc, Mathieu Rouard, Alberto Cenci, Andrzej Kilian, Alex Hastie, Jaroslav Doležel, Jean-Marc Aury, Adriana Alberti, Françoise Carreel, Angélique D'Hont.
Abstract
BACKGROUND: Recent advances in genomics indicate the functional significance of a majority of genome sequences and of their long-range interactions. As a detailed examination of genome organization and function requires a very high-quality genome sequence, the objective of this study was to improve the reference genome assembly of banana (Musa acuminata).
Keywords: Bioinformatics tool; GBS; Genome assembly; Genome map; Musa acuminata; Paired-end sequences
Year: 2016 PMID: 26984673 PMCID: PMC4793746 DOI: 10.1186/s12864-016-2579-4
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 Overview of the pipeline used to improve the Musa draft genome sequence. Ellipses correspond to input data; grey ellipses indicate new data acquired to improve the assembly. Boxes correspond to bioinformatics tools; those in blue are new and are made available through the Scaffremodler and Scaffhunter toolboxes (see Additional file 1)
Statistics on scaffold assemblies
| | V1 (D'Hont et al. 2012) | SSPACE | Fusion/joining/splitting/gap re-estimation | IRYS scaffold | GapCloser |
|---|---|---|---|---|---|
| Scaffold number | 7 513 | 2 267 | 1 572 | 1 532 | 1 532 |
| Cumulated size | 472 210 317 | 438 736 528 | 443 852 100 | 450 994 104 | 450 697 673 |
| Unknown sites (%) | 81 728 542 (17.3) | 48 267 272 (11.0) | 53 378 493 (12.3) | 60 520 497 (13.4) | 45 175 659 (10.0) |
| N50 (scaffold number) | 1 311 088 (65) | 1 545 585 (52) | 2 890 075 (28) | 3 014 384 (26) | 3 016 874 (26) |
| N80 (scaffold number) | 316 579 (299) | 370 770 (242) | 491 628 (169) | 578 880 (150) | 579 793 (150) |
| N90 (scaffold number) | 54 335 (647) | 169 980 (416) | 201 127 (305) | 234 686 (268) | 234 825 (267) |
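The N50, N80 and N90 rows follow the usual definition: scaffolds are sorted from longest to shortest, and Nx is the length of the scaffold at which the running total first reaches x% of the cumulated size; the number in parentheses is how many scaffolds were needed to get there. A minimal sketch of that calculation, assuming a plain list of scaffold lengths as input (the function name `nx_stat` and the toy lengths are illustrative and not taken from the paper's toolboxes):

```python
def nx_stat(lengths, percent):
    """Return (Nx length, scaffold count) for an integer percent, e.g. 50 for N50."""
    if not lengths:
        raise ValueError("no scaffold lengths given")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    ordered = sorted(lengths, reverse=True)  # longest scaffold first
    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length, count


# Toy example with hypothetical scaffold lengths (not data from the paper).
toy_lengths = [10, 8, 6, 4, 2]
for pct in (50, 80, 90):
    print(pct, nx_stat(toy_lengths, pct))
```

Applied to the real scaffold lengths of each assembly stage, this kind of function would give values comparable to the table rows, provided the same definition was used by the authors.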
Fig. 2 Example of a clue leading to scaffold splitting. a Genetic markers mapped onto scaffold21 belong to linkage group 7 (red) and linkage group 6 (blue), suggesting a chimeric misassembly. b CIRCOS graphical representation of paired-read mapping in the misassembled region, drawn using Scaffremodler’s tools. In the inner circle, links between read pairs are drawn with the following color code: grey lines correspond to concordant pairs (correct orientation and insert size); orange and red lines correspond to discordant pairs with smaller and greater insert sizes, respectively; purple lines correspond to pairs showing reverse-reverse orientation; green lines to pairs showing forward-forward orientation; and blue lines to pairs with completely reversed orientation relative to the paired library construction. The second circle represents the scaffold in blue, with gaps as black regions. The next circles are scatter plots with a warm-cool color code: the first presents the proportion of discordant reads in windows of one third of the expected read-pair insert size, and the outer circle presents read coverage in windows of 100 bases. The black arrow points to the misassembled region in scaffold21, where two regions that are not linked were assembled together
Fig. 3 Example of a clue leading to scaffold fusion. a Graphical representation of the paired reads that led to identifying the fusion of scaffold1112 into scaffold24, drawn using Scaffremodler’s tools. In the inner circle, links between read pairs are drawn with the color code described in Fig. 2: grey for concordant pairs; red and orange for pairs discordant in size; purple, green and blue for pairs discordant in orientation. The second circle represents the scaffolds in blue, with gaps as black regions. The next circle represents the proportion of discordant reads and the last circle represents read coverage, as in Fig. 2. The red and blue beams linking scaffold1112 and scaffold24 identified the scaffold fusion schematized in (b). Inserting scaffold1112 into scaffold24 resolves the discordant red links and corrects the orientation of the discordant blue links
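The read-pair categories used in the color code of Figs. 2 and 3 can be illustrated with a small classifier. This is only a sketch under stated assumptions, not Scaffremodler's implementation: it assumes a forward-reverse library (leftmost read on the `+` strand, rightmost on the `-` strand), an accepted insert-size range supplied by the caller, and hypothetical label names.

```python
from collections import Counter


def classify_pair(left_strand, right_strand, insert_size, min_insert, max_insert):
    """Label one read pair mapped on a single scaffold.

    left_strand / right_strand: '+' or '-' for the leftmost / rightmost read.
    Assumes a forward-reverse library, so '+' then '-' is the expected layout.
    """
    if left_strand == right_strand:
        # Both reads on the same strand (green or purple links in the figures).
        return "forward-forward" if left_strand == "+" else "reverse-reverse"
    if (left_strand, right_strand) == ("-", "+"):
        # Orientation fully reversed relative to the library (blue links).
        return "reversed"
    if insert_size < min_insert:
        return "insert-too-small"  # orange links
    if insert_size > max_insert:
        return "insert-too-large"  # red links
    return "concordant"  # grey links


def discordant_fraction(labels):
    """Proportion of non-concordant pairs among the labels of one window."""
    counts = Counter(labels)
    total = sum(counts.values())
    return 0.0 if total == 0 else 1 - counts["concordant"] / total


# Hypothetical pairs: (left strand, right strand, insert size in bp).
pairs = [("+", "-", 5000), ("-", "+", 5000), ("+", "-", 5200), ("+", "-", 40000)]
labels = [classify_pair(l, r, size, 4000, 6000) for l, r, size in pairs]
print(labels, discordant_fraction(labels))
```

Computing `discordant_fraction` over successive windows along a scaffold would give a profile similar in spirit to the discordance scatter plot described in the captions; the window size and thresholds here are placeholders.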
Fig. 4 Representation of the new version of the eleven pseudo-molecules of Musa acuminata. Black and white boxes correspond to oriented and unoriented scaffolds, respectively. Genetic marker, gene and unknown sequence (‘N’) densities are represented in grey, blue and green, respectively, based on a window size of 100 kb. The recombination rate (red curve) was calculated from 180 individuals using corrected genetic markers and a sliding window of 500 kb
Statistics on marker density on linkage groups
| Linkage group | Cumulated scaffold size | Marker number | Marker density (number/100 kb) |
|---|---|---|---|
| chr01 | 29 067 552 | 1 384 | 4.76 |
| chr02 | 29 509 134 | 1 502 | 5.09 |
| chr03 | 35 017 413 | 1 920 | 5.48 |
| chr04 | 37 104 143 | 2 489 | 6.71 |
| chr05 | 41 848 132 | 1 924 | 4.60 |
| chr06 | 37 589 864 | 2 234 | 5.94 |
| chr07 | 35 025 021 | 1 744 | 4.98 |
| chr08 | 44 883 571 | 2 728 | 6.08 |
| chr09 | 41 302 925 | 2 136 | 5.17 |
| chr10 | 37 671 811 | 2 023 | 5.37 |
| chr11 | 27 952 850 | 1 519 | 5.43 |
| Total | 396 972 416 | 21 603 | 5.44 |
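The density column is the marker number divided by the cumulated scaffold size expressed in units of 100 kb. A short sketch that reproduces two rows of the table (the function name `marker_density` is illustrative):

```python
def marker_density(marker_count, size_bp, unit_bp=100_000):
    """Markers per `unit_bp` of sequence (default: per 100 kb)."""
    if size_bp <= 0:
        raise ValueError("size_bp must be positive")
    return marker_count / (size_bp / unit_bp)


# Values copied from the chr01 and Total rows of the table above.
print(round(marker_density(1384, 29_067_552), 2))    # chr01
print(round(marker_density(21603, 396_972_416), 2))  # Total
```

The same calculation applied to the other rows should match the listed densities to two decimal places.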
Statistics on Musa acuminata pseudo-molecule assembly between the first and the new version
| | Version 1 | | | | | | Version 2 | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Identifier | Scaffold cumulated size | Nb^a | Scaffold N50 | Nb^a | N in scaffolds | % | Scaffold cumulated size | Nb^a | Scaffold N50 | Nb^a | N in scaffolds | % |
| chr01 | 27 571 529 | 22 | 2 245 470 | 4 | 3 459 727 | 12.5 | 29 067 552 | 30 | 1 394 891 | 2 | 2 151 480 | 7.4 |
| chr02 | 22 052 597 | 22 | 1 755 924 | 3 | 2 961 122 | 13.4 | 29 509 134 | 27 | 2 676 329 | 3 | 3 555 070 | 12.0 |
| chr03 | 30 468 307 | 22 | 3 785 391 | 3 | 3 981 002 | 13.1 | 35 017 413 | 31 | 9 733 574 | 2 | 2 329 119 | 6.7 |
| chr04 | 30 050 316 | 13 | 8 856 836 | 2 | 3 343 441 | 11.1 | 37 104 143 | 17 | 7 838 899 | 3 | 2 076 824 | 5.6 |
| chr05 | 29 375 369 | 21 | 2 773 165 | 4 | 3 488 635 | 11.9 | 41 848 132 | 52 | 2 239 696 | 5 | 3 976 084 | 9.5 |
| chr06 | 34 896 279 | 30 | 7 330 853 | 2 | 4 472 335 | 12.8 | 37 589 864 | 36 | 9 841 105 | 2 | 2 328 163 | 6.2 |
| chr07 | 28 615 304 | 22 | 5 244 634 | 3 | 4 262 894 | 14.9 | 35 025 021 | 31 | 6 378 715 | 3 | 4 518 654 | 12.9 |
| chr08 | 35 437 139 | 27 | 2 556 008 | 3 | 5 002 970 | 14.1 | 44 883 571 | 57 | 9 906 416 | 2 | 3 821 170 | 8.5 |
| chr09 | 34 145 263 | 37 | 1 544 587 | 6 | 5 397 793 | 15.8 | 41 302 925 | 39 | 2 119 922 | 3 | 3 398 494 | 8.2 |
| chr10 | 33 662 572 | 33 | 1 266 487 | 5 | 5 753 963 | 17.1 | 37 671 811 | 31 | 1 798 308 | 3 | 3 318 350 | 8.8 |
| chr11 | 25 512 624 | 15 | 7 530 813 | 2 | 2 838 651 | 11.1 | 27 952 850 | 16 | 7 787 879 | 2 | 1 979 175 | 7.1 |
| Mitochondrion | - | - | - | - | - | - | 7 218 240 | 12 | 616 199 | 4 | 37 503 | 0.5 |
a Scaffold number
Fig. 5 Dot plot comparison of gene order between the initial and the new version of the Musa acuminata genome sequence assembly. Each dot represents the position of a gene in the two assembly versions, with the initial assembly on the x axis and the new one on the y axis. Breaks in the diagonal indicate differences in gene order. Red circles indicate the main differences and green circles indicate variations resulting from the approximate scaffold order in the peri-centromeric regions. For instance, version 2 of the assembly corrects a significant error between chromosomes 1 and 4
Statistics on annotation transfer between the first release of the assembly and the new release
| | First release (D'Hont et al. 2012) | | | New release (version 2) | | | |
|---|---|---|---|---|---|---|---|
| Identifier | Pseudo-molecule size (bp)^c | Number | | Pseudo-molecule size (bp)^c | Number | | |
| | | RefSeq^a | BGH^b | | RefSeq^a | BGH^b | Consensus |
| chr01 | 27 573 629 | 2 407 | 2 836 | 29 070 452 | 2 038 | 2 427 | 2 372 |
| chr02 | 22 054 697 | 1 975 | 2 328 | 29 511 734 | 2 172 | 2 563 | 2 517 |
| chr03 | 30 470 407 | 2 796 | 3 251 | 35 020 413 | 2 991 | 3 443 | 3 371 |
| chr04 | 30 051 516 | 2 850 | 3 368 | 37 105 743 | 3 512 | 4 123 | 4 018 |
| chr05 | 29 377 369 | 2 583 | 2 972 | 41 853 232 | 2 824 | 3 268 | 3 215 |
| chr06 | 34 899 179 | 3 165 | 3 700 | 37 593 364 | 3 425 | 4 003 | 3 896 |
| chr07 | 28 617 404 | 2 447 | 2 764 | 35 028 021 | 2 577 | 2 907 | 2 918 |
| chr08 | 35 439 739 | 2 876 | 3 458 | 44 889 171 | 3 034 | 3 623 | 3 489 |
| chr09 | 34 148 863 | 2 602 | 3 110 | 41 306 725 | 2 752 | 3 318 | 3 157 |
| chr10 | 33 665 772 | 2 677 | 3 157 | 37 674 811 | 2 775 | 3 229 | 3 155 |
| chr11 | 25 514 024 | 2 257 | 2 679 | 27 954 350 | 2 205 | 2 614 | 2 521 |
| chrUn_random | 141 147 818 | 2 081 | 2 927 | 46 622 217 | 344 | 540 | 543 |
| Mitochondrial | N/A | N/A | N/A | 7 218 240 | 25 | 96 | 104 |
| Total | 472 960 417 | 30 716 | 36 550 | 450 848 473 | 30 674 | 36 154 | 35 276 |
a NCBI RefSeq genome annotation released on 7 October 2014 and generated with the NCBI Eukaryotic Genome Annotation Pipeline
b Banana Genome Hub (BGH) annotation performed by [27], plus genes manually curated before 8 December 2014 and available in the Banana Genome Hub
c Including the ‘N’ bases separating scaffolds