Shang-Fang Yang, Chia-Wei Lu, Cheng-Te Yao, Chih-Ming Hung.
Abstract
Trimming low-quality bases from sequencing reads is considered a routine procedure for genome assembly; however, we know little about its pros and cons. Here, we used empirical data to examine how read trimming affects assembled genome quality and computational time for a widespread East Asian passerine, the rufous-capped babbler (Cyanoderma ruficeps Blyth). We found that scaffolds assembled from raw reads were always longer than those from trimmed ones, whereas computational times for the former were sometimes much longer than for the latter. Nevertheless, assembly completeness showed little difference among the trimming strategies. One should determine the optimal trimming strategy based on what the assembled genome will be used for. For example, to identify single nucleotide polymorphisms (SNPs) associated with phenotypic evolution, applying PLATANUS to gently trim reads would yield a reference genome with a slightly shorter scaffold length (N50 = 15.64 vs. 16.89 Mb) than the raw reads would, but would save 75% of computational time. We also found that chromosomes Z, W, and 4A of the rufous-capped babbler were poorly assembled, likely due to a recently fused neo-sex chromosome. The rufous-capped babbler genome, with long scaffolds and quality gene annotation, can provide a good system to study avian ecological adaptation in East Asia.
Keywords: computational time; de novo genome assembly; genome quality; read trimming; rufous-capped babbler
Year: 2019 PMID: 31548511 PMCID: PMC6826712 DOI: 10.3390/genes10100737
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
De novo assembly results of rufous-capped babbler genomes based on three datasets using PLATANUS. The summaries are based on scaffolds ≥ 1000 bp, with exceptions noted in the rows.
| PLATANUS | Raw PE | Trimmed PE | Cut Off PE |
|---|---|---|---|
| # scaffolds (>= 0 bp) | 977,450 | 372,099 | 640,832 |
| # scaffolds | 9926 | 7756 | 8660 |
| # scaffolds (>= 5000 bp) | 660 | 742 | 838 |
| Total length (>= 0 bp) | 1,154,981,082 | 1,092,952,113 | 1,118,080,274 |
| Total length | 1,040,050,024 | 1,041,446,934 | 1,028,622,565 |
| Total length (>= 5000 bp) | 1,025,650,842 | 1,030,254,425 | 1,016,176,548 |
| Largest scaffold | 74,755,707 | 64,119,864 | 50,901,048 |
| GC (%) | 42.07 | 42.14 | 42.05 |
| Scaffold N50 | 16,893,686 | 15,643,638 | 13,305,708 |
| Gaps | 48,535 | 46,333 | 52,286 |
| N_count | 9,597,849 | 8,524,882 | 9,456,690 |
| N per 1000 bp | 9.22 | 8.19 | 9.19 |
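The scaffold N50 reported in the table is the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the total assembly length. A minimal sketch of that calculation follows, assuming the scaffold lengths are already available as a list of integers; the function name `n50` and the toy lengths are illustrative and are not taken from the paper or from any assembly tool.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths (in bp)."""
    ordered = sorted(lengths, reverse=True)  # longest scaffolds first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first scaffold that pushes the cumulative length to at least
        # half of the assembly defines the N50.
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


# Toy example with made-up scaffold lengths (not data from the paper).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = n50(toy_lengths)
print(toy_n50)
```

Because N50 depends only on the length distribution, a few very long scaffolds can raise it substantially, which is why it is read alongside the scaffold counts and the gap statistics in the same table.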
De novo assembly results of rufous-capped babbler genomes based on three datasets using DISCOVARdenovo + SOAPdenovo2. The summaries are based on scaffolds ≥ 1000 bp, with exceptions noted in the rows.
| DIS + SOAP | Raw PE | Trimmed PE | Cut Off PE |
|---|---|---|---|
| # scaffolds (>= 0 bp) | 392,770 | 244,234 | 326,322 |
| # scaffolds | 20,349 | 20,600 | 25,332 |
| # scaffolds (>= 5000 bp) | 2617 | 2942 | 3231 |
| Total length (>= 0 bp) | 1,239,771,083 | 1,212,202,269 | 1,250,242,340 |
| Total length | 1,131,554,057 | 1,138,451,258 | 1,151,285,864 |
| Total length (>= 5000 bp) | 1,100,826,487 | 1,107,888,121 | 1,114,172,265 |
| Largest scaffold | 14,426,870 | 14,721,577 | 10,390,167 |
| GC (%) | 42.44 | 42.42 | 42.39 |
| Scaffold N50 | 2,552,700 | 1,995,139 | 1,419,184 |
| Gaps | 33,342 | 40,433 | 49,601 |
| N_count | 32,412,091 | 39,082,416 | 45,610,276 |
| N per 1000 bp | 28.64 | 34.33 | 39.62 |
Computational times for genome assembly based on three datasets using PLATANUS. The computational times for the three genome assembly procedures (contig assembly, scaffolding, and gap closing) were estimated separately.
| | Raw PE | | | Trimmed PE | | | Cut Off PE | | |
|---|---|---|---|---|---|---|---|---|---|
| | Time (mins) | CPU | Time × CPU | Time (mins) | CPU | Time × CPU | Time (mins) | CPU | Time × CPU |
| Contig assembly | 21,131 | 40 | 845,240 | 5008 | 40 | 200,320 | 3899 | 40 | 155,960 |
| Scaffolding | 622 | 1 | 622 | 507 | 1 | 507 | 500 | 1 | 500 |
| Gap Closing | 262 | 40 | 10,480 | 440 | 40 | 17,600 | 497 | 40 | 19,880 |
| SUM (mins) | | | 856,342 | | | 218,427 | | | 176,340 |
| SUM (hours) | | | | | | | | | |
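The SUM (mins) row is the sum over the three procedures of wall-clock minutes multiplied by the number of CPUs, i.e., CPU-minutes. The sketch below recomputes those totals from the Time and CPU values in the PLATANUS table and derives the relative saving of the trimmed PE run over the raw PE run; the dictionary layout and the helper name `cpu_minutes` are assumptions made for illustration only.

```python
# (wall-clock minutes, number of CPUs) per procedure, copied from the table.
platanus = {
    "Raw PE": {"contig": (21131, 40), "scaffold": (622, 1), "gap_close": (262, 40)},
    "Trimmed PE": {"contig": (5008, 40), "scaffold": (507, 1), "gap_close": (440, 40)},
    "Cut Off PE": {"contig": (3899, 40), "scaffold": (500, 1), "gap_close": (497, 40)},
}


def cpu_minutes(stages):
    """Sum of time x CPU over all assembly procedures of one dataset."""
    return sum(minutes * cpus for minutes, cpus in stages.values())


totals = {name: cpu_minutes(stages) for name, stages in platanus.items()}

# Fraction of CPU time saved by assembling trimmed PE instead of raw PE reads.
saving = 1 - totals["Trimmed PE"] / totals["Raw PE"]

for name, total in totals.items():
    print(f"{name}: {total:,} CPU-min = {total / 60:,.1f} CPU-h")
print(f"Trimmed vs. raw saving: {saving:.1%}")
```

The recomputed totals match the SUM (mins) row, and the saving comes out at just under 75%, consistent with the figure quoted in the abstract.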
Computational times for genome assembly based on three datasets using DISCOVARdenovo + SOAPdenovo2. The computational times for the three genome assembly procedures (contig assembly, scaffolding, and gap closing) were estimated separately.
| | Raw PE | | | Trimmed PE | | | Cut Off PE | | |
|---|---|---|---|---|---|---|---|---|---|
| | Time (mins) | CPU | Time × CPU | Time (mins) | CPU | Time × CPU | Time (mins) | CPU | Time × CPU |
| Contig assembly | 834 | 40 | 33,360 | 1803 | 40 | 72,120 | 1214 | 40 | 48,560 |
| Scaffolding | 205 | 40 | 8,200 | 366 | 40 | 14,640 | 290 | 40 | 11,600 |
| Gap Closing | 387 | 40 | 15,480 | 224 | 48 | 10,752 | 138 | 48 | 6,624 |
| SUM (mins) | | | 57,040 | | | 97,512 | | | 66,784 |
| SUM (hours) | | | | | | | | | |
Assembly completeness of three PLATANUS-assembled genomes based on BUSCO analyses. Types of BUSCOs indicate the assessment output types of benchmarking universal single-copy orthologs.
| | Raw PE | | Trimmed PE | | Cut Off PE | |
|---|---|---|---|---|---|---|
| Types of BUSCOs | Number | % | Number | % | Number | % |
| Complete | 4623 | 94 | 4642 | 94.5 | 4630 | 94.2 |
| Complete and single-copy | 4572 | 93 | 4595 | 93.5 | 4588 | 93.3 |
| Complete and duplicated | 51 | 1 | 47 | 1 | 42 | 0.9 |
| Fragmented | 172 | 3.5 | 155 | 3.2 | 159 | 3.2 |
| Missing | 120 | 2.5 | 118 | 2.3 | 126 | 2.6 |
| Total | 4915 | 100 | 4915 | 100 | 4915 | 100 |
Assembly completeness of three DISCOVARdenovo+SOAPdenovo2-assembled genomes based on BUSCO analyses. Types of BUSCOs indicate assessment output types of benchmarking universal single-copy orthologs.
| | Raw PE | | Trimmed PE | | Cut Off PE | |
|---|---|---|---|---|---|---|
| Types of BUSCOs | Number | % | Number | % | Number | % |
| Complete | 4668 | 94.9 | 4663 | 94.9 | 4607 | 93.8 |
| Complete and single-copy | 4588 | 93.3 | 4575 | 93.1 | 4530 | 92.2 |
| Complete and duplicated | 80 | 1.6 | 88 | 1.8 | 77 | 1.6 |
| Fragmented | 146 | 3 | 160 | 3.3 | 180 | 3.7 |
| Missing | 101 | 2.1 | 92 | 1.8 | 128 | 2.5 |
| Total | 4915 | 100 | 4915 | 100 | 4915 | 100 |
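The BUSCO categories are related arithmetically: complete BUSCOs are the sum of the single-copy and duplicated ones, and complete, fragmented, and missing BUSCOs together make up the total. The following sketch checks that relationship for the DISCOVARdenovo + SOAPdenovo2 counts in the table above; the dictionary keys and the helper name `summarize` are hypothetical and do not correspond to BUSCO's own output format.

```python
# Counts copied from the DISCOVARdenovo + SOAPdenovo2 BUSCO table.
busco = {
    "Raw PE": {"single": 4588, "duplicated": 80, "fragmented": 146, "missing": 101},
    "Trimmed PE": {"single": 4575, "duplicated": 88, "fragmented": 160, "missing": 92},
    "Cut Off PE": {"single": 4530, "duplicated": 77, "fragmented": 180, "missing": 128},
}


def summarize(counts):
    """Derive the complete count, the total, and the complete fraction (%)."""
    complete = counts["single"] + counts["duplicated"]
    total = complete + counts["fragmented"] + counts["missing"]
    return {
        "complete": complete,
        "total": total,
        "complete_pct": 100 * complete / total,
    }


summary = {name: summarize(counts) for name, counts in busco.items()}
for name, row in summary.items():
    print(name, row["complete"], row["total"], f"{row['complete_pct']:.2f}%")
```

A check like this is a cheap way to catch transcription errors when BUSCO counts are copied between tables.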
BRAKER gene prediction results for the PLATANUS-assembled genomes.
| | Raw PE | | | Trimmed PE | | | Cut Off PE | | |
|---|---|---|---|---|---|---|---|---|---|
| | Number | Total length | Mean length | Number | Total length | Mean length | Number | Total length | Mean length |
| Gene | 21,919 | 392,576,027 | 17,910 | 21,712 | 398,950,687 | 18,375 | 20,859 | 391,437,927 | 18,766 |
| Transcript | 25,736 | 586,294,213 | 22,781 | 25,594 | 594,536,470 | 23,230 | 24,651 | 586,027,670 | 23,773 |
| | 24,849 | 74,547 | 3 | 24,708 | 74,124 | 3 | 23,797 | 71,391 | 3 |
| | 24,769 | 74,307 | 3 | 24,676 | 74,028 | 3 | 23,735 | 71,205 | 3 |
| | 245,952 | 42,525,218 | 173 | 249,072 | 42,562,276 | 171 | 244,448 | 41,219,021 | 169 |
| | 221,329 | 543,768,995 | 2,457 | 224,695 | 551,974,194 | 2,456 | 220,985 | 544,808,649 | 2,465 |
Completeness of predicted genes and transcripts in three PLATANUS-assembled genomes based on the presence of start and stop codons. Com_G indicates complete predicted genes. Com_T indicates complete predicted transcripts. %Com_G indicates the percentage of predicted genes that are complete. %Com_T indicates the percentage of predicted transcripts that are complete.
| | Raw PE | Trimmed PE | Cut Off PE |
|---|---|---|---|
| Gene | 21,919 | 21,712 | 20,859 |
| Com_G | 20,369 | 20,220 | 19,382 |
| %Com_G | 92.9% | 93.1% | 92.9% |
| Transcript | 25,736 | 25,594 | 24,651 |
| Com_T | 24,142 | 24,051 | 23,135 |
| %Com_T | 93.8% | 94.0% | 93.9% |
%Com_G = (Com_G / Gene) × 100%; %Com_T = (Com_T / Transcript) × 100%.
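The two percentage formulas can be reproduced directly from the counts in the table. The sketch below does so for all three datasets; the variable names and the helper `pct_complete` are illustrative assumptions, and the counts are copied from the rows above.

```python
# (complete, total) pairs copied from the Com_G/Gene and Com_T/Transcript rows.
genes = {
    "Raw PE": (20369, 21919),
    "Trimmed PE": (20220, 21712),
    "Cut Off PE": (19382, 20859),
}
transcripts = {
    "Raw PE": (24142, 25736),
    "Trimmed PE": (24051, 25594),
    "Cut Off PE": (23135, 24651),
}


def pct_complete(complete, total):
    """Percentage of predicted features with both a start and a stop codon."""
    return 100 * complete / total


pct_genes = {k: round(pct_complete(*v), 1) for k, v in genes.items()}
pct_transcripts = {k: round(pct_complete(*v), 1) for k, v in transcripts.items()}

print(pct_genes)
print(pct_transcripts)
```

Rounded to one decimal place, the results agree with the %Com_G and %Com_T rows, which differ by at most 0.2 percentage points across the trimming strategies.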
Figure 1. D-GENIES plots of the alignments between the rufous-capped babbler and zebra finch genomes. The raw PE (A) and trimmed PE (B) PLATANUS-assembled genomes are mapped to the zebra finch genome. Only the scaffolds of the rufous-capped babbler genomes with lengths > 5000 bp were used. The unlocalized scaffolds of the zebra finch genome were excluded from the analyses.
Figure 2. Jupiter plots of the alignments between the rufous-capped babbler and zebra finch genomes. The raw PE (A) and trimmed PE (B) PLATANUS-assembled genomes are mapped to the zebra finch genome. Only the scaffolds of the rufous-capped babbler genomes with lengths > 5000 bp were used. The unlocalized scaffolds of the zebra finch genome were excluded from the analyses.