Andrian Yang, Abhinav Kishore, Benjamin Phipps, Joshua W K Ho.
Abstract
BACKGROUND: Read alignment and transcript assembly are the core of RNA-seq analysis for transcript isoform discovery. Nonetheless, current tools are not designed to scale to the analysis of full-length bulk or single cell RNA-seq (scRNA-seq) data. The previous version of our cloud-based tool, Falco, focuses only on RNA-seq read counting and does not support more flexible steps such as alignment and read assembly.
Keywords: Alignment; Cloud computing; Falco; Single-cell RNA-seq; Transcript assembly
Year: 2019 PMID: 31888474 PMCID: PMC6936136 DOI: 10.1186/s12864-019-6341-6
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 Overview of the Falco framework pipelines. (a) Alignment-only pipeline. The pipeline is composed of the splitting and pre-processing steps from the original Falco framework and the new Spark-based alignment step. The alignment step has two stages: an alignment stage, where read chunks are aligned and stored in a temporary location in HDFS, and a concatenation stage, where alignment chunks from the same sample are concatenated to obtain the full alignment result. (b) Transcript-assembly pipeline. The pipeline is likewise composed of the splitting and pre-processing steps from the original Falco framework, followed by the new Spark-based transcript assembly step. The transcript assembly step has three stages: an alignment stage, which aligns read chunks and bins the alignment results; an assembly stage, which performs transcript assembly in parallel; and a merging stage, where assembled transcripts are merged with the reference annotation to produce an updated annotation.
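The two stages of the alignment-only pipeline described in the caption can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Falco's actual implementation: `align_chunk` is a hypothetical stand-in for invoking an aligner such as STAR or HISAT2, the `(sample, chunk_index, reads)` tuple layout is invented for the example, and ordering chunks by index before concatenation is an assumption. The real framework runs these stages on Spark and keeps intermediate chunks in HDFS.

```python
from collections import defaultdict


def align_chunk(sample, chunk_index, reads):
    """Hypothetical stand-in for running an aligner on one read chunk."""
    return [f"{read}\taligned" for read in reads]


def run_alignment_pipeline(chunks):
    """Align read chunks independently, then concatenate them per sample.

    `chunks` is an iterable of (sample, chunk_index, reads) tuples.
    """
    # Stage 1 (alignment): every chunk is aligned on its own, so this
    # step could be distributed across workers.
    aligned = [
        (sample, index, align_chunk(sample, index, reads))
        for sample, index, reads in chunks
    ]

    # Stage 2 (concatenation): gather the chunks of each sample and join
    # them in chunk order (assumed) to rebuild the full alignment result.
    per_sample = defaultdict(list)
    for sample, _index, records in sorted(aligned, key=lambda t: (t[0], t[1])):
        per_sample[sample].extend(records)
    return dict(per_sample)


if __name__ == "__main__":
    example_chunks = [
        ("cellB", 0, ["read3"]),
        ("cellA", 1, ["read2"]),
        ("cellA", 0, ["read1"]),
    ]
    for sample, records in run_alignment_pipeline(example_chunks).items():
        print(sample, records)
```

Keeping the alignment of each chunk independent of every other chunk is what allows the first stage to scale with the number of nodes; only the concatenation stage needs to see all chunks of a sample together.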
Runtime comparison for alignment of single cell datasets with and without the Falco framework
| System | Nodes | Mouse embryonic stem cell, STAR (hours) | Mouse embryonic stem cell, HISAT2 (hours) | Human brain, STAR (hours) | Human brain, HISAT2 (hours) |
|---|---|---|---|---|---|
| Standalone | 1 (1 process) | 34.9 | 42.7 | 11.1 | 9.8 |
| Standalone | 1 (12 processes) | 20.2 | 14.9 | 5.2 | 3.1 |
| Standalone | 1 (16 processes) | N/A | 14.9 | N/A | 3.0 |
| Falco | 10 | 8.0 | 5.9 | 1.7 | 1.2 |
| Falco | 20 | 4.7 | 3.6 | 1.0 | 0.8 |
| Falco | 30 | 3.8 | 2.9 | 0.8 | 0.6 |
| Falco | 40 | 3.5 | 2.6 | 0.7 | 0.6 |
For standalone runs, the number of processes indicates the number of FASTQ file pairs processed in parallel. Timing for Falco includes initialisation and configuration time, which is approximately 10 min. Runtime for STAR with 16 processes is not available because some STAR processes were killed by the operating system, causing the job to fail.
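To make the scaling in the table above easier to read, the following sketch computes the speedup of each Falco run relative to the single-process standalone run for the mouse embryonic stem cell dataset. The runtimes are copied from the table; the variable and function names are illustrative, and taking the single-process run as the baseline is a choice made for this example rather than one stated in the source.

```python
# Runtimes in hours, copied from the alignment runtime table
# (mouse embryonic stem cell dataset).
STANDALONE_SINGLE_PROCESS = {"STAR": 34.9, "HISAT2": 42.7}
FALCO_BY_NODES = {
    10: {"STAR": 8.0, "HISAT2": 5.9},
    20: {"STAR": 4.7, "HISAT2": 3.6},
    30: {"STAR": 3.8, "HISAT2": 2.9},
    40: {"STAR": 3.5, "HISAT2": 2.6},
}


def speedup(baseline_hours, hours):
    """Ratio of the baseline runtime to the measured runtime."""
    return baseline_hours / hours


def speedup_table(baseline, runs):
    """Speedup per node count and aligner, rounded to one decimal place."""
    return {
        nodes: {
            aligner: round(speedup(baseline[aligner], hours), 1)
            for aligner, hours in by_aligner.items()
        }
        for nodes, by_aligner in runs.items()
    }


if __name__ == "__main__":
    table = speedup_table(STANDALONE_SINGLE_PROCESS, FALCO_BY_NODES)
    for nodes, by_aligner in table.items():
        print(nodes, by_aligner)
```

Because the Falco timings include roughly 10 min of initialisation and configuration, speedups computed this way slightly understate the gain in the alignment work itself.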
Runtime comparison for alignment of the human brain single cell dataset using Rail-RNA and Falco frameworks
| Nodes | Rail-RNA | Falco (STAR) | Falco (HISAT) |
|---|---|---|---|
| 10 | 15.9 | 5.9 | 1.2 |
| 40 | 5.7 | 2.6 | 0.6 |
Accuracy of assembled transcripts for simulated data from single node runs
| Feature | STAR + StringTie (with reference): Sensitivity (%) | STAR + StringTie (with reference): Precision (%) | STAR + Scallop: Sensitivity (%) | STAR + Scallop: Precision (%) | HISAT + StringTie: Sensitivity (%) | HISAT + StringTie: Precision (%) |
|---|---|---|---|---|---|---|
| Base | 97.3 | 99.8 | 87.7 | 85.2 | 80.0 | 93.8 |
| Exon | 97.3 | 98.4 | 57.0 | 75.2 | 63.9 | 89.9 |
| Intron | 96.8 | 99.3 | 70.7 | 99.1 | 85.7 | 97.8 |
| Intron Chain | 93.6 | 84 | 31.9 | 56.7 | 25.7 | 38.9 |
| Transcript | 94.1 | 85.7 | 33.9 | 35.2 | 28.4 | 43.8 |
| Locus | 98.3 | 99.4 | 71.9 | 57.7 | 69.1 | 82.0 |
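The sensitivity and precision columns can be read with the usual definitions in mind: sensitivity is the share of reference features that were recovered, and precision is the share of predicted features that match the reference. The sketch below assumes these standard definitions, since the source does not state how the values were computed, and the counts in the usage example are hypothetical.

```python
def sensitivity(true_positives, false_negatives):
    """Percentage of reference features recovered by the assembly."""
    return 100.0 * true_positives / (true_positives + false_negatives)


def precision(true_positives, false_positives):
    """Percentage of assembled features that match the reference."""
    return 100.0 * true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Hypothetical counts for illustration only: 90 reference transcripts
    # recovered, 10 missed, and 30 assembled transcripts with no match.
    print(sensitivity(true_positives=90, false_negatives=10))
    print(precision(true_positives=90, false_positives=30))
```

Under these definitions, a pipeline with high sensitivity but low precision at the transcript level recovers most reference transcripts while also emitting many assemblies that do not match the reference.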
Accuracy of assembled transcripts for simulated data from Falco-based runs
| Feature | STAR + StringTie (with reference): Sensitivity (%) | STAR + StringTie (with reference): Precision (%) | STAR + Scallop: Sensitivity (%) | STAR + Scallop: Precision (%) | HISAT + StringTie: Sensitivity (%) | HISAT + StringTie: Precision (%) |
|---|---|---|---|---|---|---|
| Base | 96.2 | 99.9 | 88.6 | 85.3 | 81.1 | 93.8 |
| Exon | 96.4 | 97.9 | 59.6 | 74.7 | 65.8 | 86.0 |
| Intron | 95.5 | 99.3 | 74.5 | 99.2 | 87.6 | 97.8 |
| Intron Chain | 92.8 | 83.1 | 33.4 | 50.4 | 26.7 | 33.4 |
| Transcript | 93.3 | 84.9 | 35.2 | 33.4 | 29.3 | 38.0 |
| Locus | 98.1 | 99.3 | 72.5 | 59.7 | 69.7 | 82.4 |
Precision of assembled transcripts for human brain single cell dataset across different transcript assembly approaches
| Feature | STAR + StringTie (with reference): Falco alignment + individual assembly (%) | STAR + StringTie (with reference): Falco alignment + pooled assembly (%) | STAR + StringTie (with reference): Falco transcript assembly mode (%) | STAR + Scallop: Falco alignment + individual assembly (%) | STAR + Scallop: Falco alignment + pooled assembly (%) | STAR + Scallop: Falco transcript assembly mode (%) | HISAT + StringTie: Falco alignment + individual assembly (%) | HISAT + StringTie: Falco alignment + pooled assembly (%) | HISAT + StringTie: Falco transcript assembly mode (%) |
|---|---|---|---|---|---|---|---|---|---|
| Base | 42.7 | 63.0 | 41.8 | 48.9 | 27.9 | 28.9 | 40.6 | 60.1 | 38.9 |
| Exon | 76.5 | 92.3 | 79.0 | 79.4 | 73.1 | 73.4 | 72.9 | 90.2 | 76.0 |
| Intron | 88.0 | 96.9 | 92.3 | 88.7 | 91.7 | 91.1 | 82.9 | 94.6 | 88.2 |
| Intron Chain | 79.1 | 94.5 | 85.9 | 80.1 | 84.4 | 83.3 | 72.1 | 90.7 | 79.6 |
| Transcript | 57.5 | 83.0 | 60.2 | 62.1 | 51.0 | 51.6 | 54.8 | 80.9 | 57.8 |
| Locus | 32.2 | 61.5 | 32.4 | 37.2 | 24.1 | 24.9 | 30.5 | 58.9 | 31.2 |
Precision of assembled transcripts for mouse embryonic stem cell single cell dataset across different transcript assembly approaches
| Feature | STAR + StringTie (with reference): Falco alignment + individual assembly (%) | STAR + StringTie (with reference): Falco alignment + pooled assembly (%) | STAR + StringTie (with reference): Falco transcript assembly mode (%) | STAR + Scallop: Falco alignment + individual assembly (%) | STAR + Scallop: Falco alignment + pooled assembly (%) | STAR + Scallop: Falco transcript assembly mode (%) | HISAT + StringTie: Falco alignment + individual assembly (%) | HISAT + StringTie: Falco alignment + pooled assembly (%) | HISAT + StringTie: Falco transcript assembly mode (%) |
|---|---|---|---|---|---|---|---|---|---|
| Base | 52.8 | 90.3 | 55.0 | 56.2 | 34.5 | 38.4 | 51.1 | 88.0 | 51.0 |
| Exon | 69.8 | 95.9 | 78.3 | 74.6 | 71.7 | 73.1 | 69.1 | 95.3 | 76.2 |
| Intron | 77.0 | 96.9 | 88.0 | 80.9 | 87.1 | 85.6 | 76.0 | 96.4 | 86.3 |
| Intron Chain | 57.0 | 93.4 | 76.1 | 62.9 | 73.0 | 70.5 | 56.4 | 92.3 | 73.1 |
| Transcript | 48.1 | 90.9 | 58.8 | 54.8 | 47.8 | 50.8 | 48.5 | 90.2 | 56.2 |
| Locus | 40.4 | 86.1 | 41.7 | 46.3 | 30.1 | 33.9 | 40.1 | 85.8 | 39.5 |
Runtime comparison for transcript assembly of single cell dataset with and without the Falco framework
| System | Nodes | Human brain, STAR + StringTie (with reference) (hours) | Human brain, STAR + Scallop (hours) | Human brain, HISAT + StringTie (hours) |
|---|---|---|---|---|
| Standalone | 1 (1 process) | 17.2 | 16.2 | 16.3 |
| Standalone | 1 (12 processes) | 4.2 | 5.5 | 5.7 |
| Standalone | 1 (16 processes) | N/A | N/A | 4.1 |
| Falco | 10 | 2.9 | 2.8 | 2.3 |
| Falco | 20 | 1.7 | 1.7 | 1.4 |
| Falco | 30 | 1.3 | 1.3 | 1.1 |
| Falco | 40 | 1.1 | 1.1 | 0.9 |
For standalone runs, the number of processes indicates the number of FASTQ file pairs processed in parallel. Timing for Falco includes initialisation and configuration time, which is approximately 10 min. Runtime for STAR with 16 processes is not available because some STAR processes were killed by the operating system, causing the job to fail.