Nicholas Mills, Ethan M Bensman, William L Poehlman, Walter B Ligon, F Alex Feltus.
Abstract
MOTIVATION: As the size of high-throughput DNA sequence datasets continues to grow, the cost of transferring and storing the datasets may prevent their processing in all but the largest data centers or commercial cloud providers. To lower this cost, it should be possible to process only a subset of the original data while still preserving the biological information of interest.
Keywords: FASTQ; RNA-Seq; data transfers; high-throughput DNA sequencing
Year: 2019 PMID: 31236009 PMCID: PMC6572328 DOI: 10.1177/1177932219856359
Source DB: PubMed Journal: Bioinform Biol Insights ISSN: 1177-9322
Figure 1. Detected transcripts by number of records for 4 datasets. Each point indicates the number of transcripts with FPKM > 0 measured at the given number of records. All runs were analyzed identically using the workflow described in the Scientific workflow section. FASTQ files were sampled at 1% to 100% of the records of the original dataset. Dashed lines at the top of each plot indicate the theoretical maximum number of detected transcripts (217 857 for human and 49 558 for pig). The bladder, hypoxia, and nisc2 datasets are from Homo sapiens; the oncopig dataset is from Sus scrofa. FPKM indicates Fragments Per Kilobase of transcript per Million mapped reads.
Figure 2. Transfer times of full and partial FASTQ files from nisc2. FASTQ files were transferred between Clemson and Chicago over the public Internet using FDT. Bars labeled “complete transfer” show the time to transfer a complete dataset. Bars labeled “partial transfer” show the time to transfer a partial dataset satisfying the criteria in the Transfer of partial datasets section. Reported times within a group are the average of 5 trials. Error bars are too small to be visible. The x-axis gives the last 2 digits of the dataset name, where each dataset name begins with the string SRR10748. The rightmost pair of bars shows, for each group, the total time summed over all datasets.
Figure 3. Only low-level transcripts accumulate with more sequence records. (A) The number of genes detected at each of 6 FPKM expression thresholds is shown for the 6 nisc2 datasets at each percent transfer. (B) The amount of gene overlap at each transfer level is shown for a representative nisc2 dataset, SRR1047863.
Estimation of the number of detected transcripts at 30 million records.
| Dataset | Predicted detected transcripts (FPKM > 0) | Percent of whole |
|---|---|---|
| Bladder | 97 872-105 433 | 83%-84% |
| Hypoxia | 100 613-102 279 | 90%-93% |
| Nisc2 | 101 490-110 594 | 71%-76% |
| Oncopig | 26 865-30 658 | 87%-92% |
Abbreviation: FPKM, Fragments Per Kilobase of transcript per Million mapped reads.
For each run in Figure 1, a linear model of the number of detected transcripts was created using the formula detected ~ log(records) (R² = 0.988-1.0). These models were then used to predict the number of detected transcripts at 30 million records for each run. As each dataset contains between 4 and 7 runs, this table lists the range of predicted transcripts for each dataset. The values for percent of whole were calculated by dividing the predicted number of transcripts at 30 million records by the number of detected transcripts measured in the full dataset as plotted in Figure 1.
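The fitting and prediction steps in the table footnote can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes an ordinary least-squares fit with the natural logarithm (the base changes the coefficients but not the predictions). The function names (`fit_log_model`, `predict_detected`, `percent_of_whole`) are hypothetical, and the example values are synthetic, not taken from the paper.

```python
import math


def fit_log_model(records, detected):
    """Least-squares fit of detected = intercept + slope * ln(records).

    Returns (intercept, slope, r_squared).
    """
    xs = [math.log(r) for r in records]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(detected) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, detected))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Coefficient of determination of the fitted line.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, detected))
    ss_tot = sum((y - mean_y) ** 2 for y in detected)
    return intercept, slope, 1.0 - ss_res / ss_tot


def predict_detected(intercept, slope, records):
    """Predicted number of detected transcripts at a given record count."""
    return intercept + slope * math.log(records)


def percent_of_whole(predicted, detected_full):
    """Predicted count as a percentage of the count measured on the full run."""
    return 100.0 * predicted / detected_full


# Synthetic example values (NOT from the paper): record counts for one run
# and transcript counts generated from an exact logarithmic relationship.
records = [1_000_000, 5_000_000, 10_000_000, 50_000_000, 100_000_000]
detected = [1000 + 5000 * math.log(r) for r in records]

intercept, slope, r2 = fit_log_model(records, detected)
predicted_30m = predict_detected(intercept, slope, 30_000_000)
# The last point stands in for the count measured on the full dataset.
percent = percent_of_whole(predicted_30m, detected[-1])
```

Repeating the fit for every run in a dataset and taking the minimum and maximum of the predictions would give a range of the kind listed in the table.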