Eric J Carpenter1, Naim Matasci2, Saravanaraj Ayyampalayam3, Shuangxiu Wu4, Jing Sun4, Jun Yu4, Fabio Rocha Jimenez Vieira5, Chris Bowler5, Richard G Dorrell5, Matthew A Gitzendanner6, Ling Li7, Wensi Du7, Kristian K Ullrich8, Norman J Wickett9,10, Todd J Barkmann11, Michael S Barker12, James H Leebens-Mack13, Gane Ka-Shu Wong1,7,14.
Abstract
BACKGROUND: The 1000 Plant transcriptomes initiative (1KP) explored genetic diversity by sequencing RNA from 1,342 samples representing 1,173 species of green plants (Viridiplantae).
Keywords: RNA; assemblies; contamination; genes; plants; transcriptome completeness
Year: 2019 PMID: 31644802 PMCID: PMC6808545 DOI: 10.1093/gigascience/giz126
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Distribution of the amount of sequence data per sample library
| Percentile | Dataset size (all base qualities) (Gb) |
|---|---|
| 5th | 1.3 |
| 25th | 1.9 |
| 50th | 2.2 |
| 75th | 2.5 |
| 95th | 3.0 |
Summary percentiles characterizing the sizes of the datasets in gigabase pairs of sequence.
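A percentile summary like the one above can be sketched as follows. This is a minimal illustration, assuming linear interpolation between ranked values; the source does not state which percentile method was used, and the function names and input values are hypothetical, not the 1KP data.

```python
def percentile(values, p):
    """Return the p-th percentile (0-100) by linear interpolation between ranks."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Fractional rank of the requested percentile within the sorted list.
    rank = (len(ordered) - 1) * p / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    frac = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


def summarize(values, levels=(5, 25, 50, 75, 95)):
    """Map each requested percentile level to its value."""
    return {p: percentile(values, p) for p in levels}


# Hypothetical per-library dataset sizes in Gb (illustrative only).
example_sizes_gb = [1.0, 2.0, 3.0, 4.0, 5.0]
summary = summarize(example_sizes_gb)
for level, value in summary.items():
    print(f"{level}th percentile: {value:.1f} Gb")
```

The same helper applies to any per-sample metric reported as percentiles, such as scaffold counts or completeness scores.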
Figure 1: A, Overview of the total sequence percentage verified to be of contaminant origin (red), or inferred to be possible contaminants in other sequence libraries (grey) in all 1KP libraries, and libraries inferred to be contaminated through the 18S phylogenetic placement. B, 21 libraries in which >6% of the total sequences are potential contaminants. C, Heat map of inferred contaminant interactions between pairs of species; contaminated species are shown on the vertical axis and contaminating species on the horizontal axis.
Assembly quality assessment by Transrate
| Percentile | Good scaffolds (all sizes) | Good scaffolds (%) |
|---|---|---|
| 5th | 19,355 | 32.47 |
| 25th | 30,755 | 44.83 |
| 50th | 37,983 | 53.65 |
| 75th | 47,608 | 62.93 |
| 95th | 71,368 | 74.87 |
Characteristic percentiles summarizing the per sample distributions of high-quality scaffolds for both total counts and fractions of the sample.
Figure 2: Fraction of the gene sets found (complete + fragments) vs the number of scaffolds (≥300 bp) in the assemblies. For each sample, the fractions of the eukaryota and embryophyta sets found in the assemblies are calculated with BUSCO and the fraction of the CEGMA 248 set with the CRBB tool. All 3 sets are more completely recovered at higher scaffold counts, but the BUSCO embryophyta set is less complete in our samples.
Completeness of gene sets: characteristic percentiles summarizing the distributions of the CEGMA 248 and BUSCO genome completeness scores
| Percentile | CEGMA 248 (%) | BUSCO Embryophyta* (%) | BUSCO Eukaryota* (%) |
|---|---|---|---|
| 5th | 79.03 | 11.2 (8.5) | 66.0 (37.3) |
| 25th | 89.92 | 44.1 (29.8) | 84.9 (64.4) |
| 50th | 92.34 | 62.5 (48.2) | 90.4 (75.9) |
| 75th | 93.55 | 75.2 (59.6) | 93.7 (84.1) |
| 95th | 94.76 | 82.6 (73.2) | 96.1 (91.0) |
*BUSCO numbers are the sum of the complete and fragment assembly counts reported, with numbers based on the complete sequence numbers alone given in parentheses.
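
The way each BUSCO cell combines the two figures can be sketched as follows. This is a minimal illustration, assuming per-sample counts of complete and fragmented genes and the size of the gene set are available; the function names and the counts are hypothetical and not taken from the 1KP results.

```python
def busco_percentages(complete, fragmented, total):
    """Return (found %, complete-only %) for one sample and one BUSCO gene set."""
    if total <= 0:
        raise ValueError("total must be positive")
    if complete < 0 or fragmented < 0 or complete + fragmented > total:
        raise ValueError("counts must be non-negative and sum to at most total")
    # "Found" counts both complete and fragmented genes.
    found_pct = 100.0 * (complete + fragmented) / total
    complete_pct = 100.0 * complete / total
    return found_pct, complete_pct


def format_cell(complete, fragmented, total):
    """Format the pair as 'found (complete)', matching the table layout."""
    found_pct, complete_pct = busco_percentages(complete, fragmented, total)
    return f"{found_pct:.1f} ({complete_pct:.1f})"


# Hypothetical counts: 120 complete and 30 fragmented out of a 200-gene set.
cell = format_cell(120, 30, 200)
print(cell)  # 75.0 (60.0)
```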