Ye Yu, Jinpeng Liu, Xinan Liu, Yi Zhang, Eamonn Magner, Erik Lehnert, Chen Qian, Jinze Liu.
Abstract
We present SeqOthello, an ultra-fast and memory-efficient indexing structure to support arbitrary sequence query against large collections of RNA-seq experiments. It takes SeqOthello only 5 min and 19.1 GB memory to conduct a global survey of 11,658 fusion events against 10,113 TCGA Pan-Cancer RNA-seq datasets. The query recovers 92.7% of tier-1 fusions curated by TCGA Fusion Gene Database and reveals 270 novel occurrences, all of which are present as tumor-specific. By providing a reference-free, alignment-free, and parameter-free sequence search system, SeqOthello will enable large-scale integrative studies using sequence-level data, an undertaking not previously practicable for many individual labs.
Keywords: Compression; Gene fusion; Othello; Pan-cancer; Query; RNA-seq; SeqOthello; TCGA
Year: 2018 PMID: 30340508 PMCID: PMC6194578 DOI: 10.1186/s13059-018-1535-9
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1 Overview of SeqOthello structure and query procedure. a An illustration of the SeqOthello indexing structure to support scalable k-mer searching in large-scale sequencing experiments. The bottom level of SeqOthello stores the occurrence maps of individual k-mers, encoded in three different formats and divided into disjoint buckets. The mapping between a k-mer and its occurrence map is achieved by a hierarchy of Othello structures in which the root Othello maps a k-mer to its bucket and the Othello in each bucket maps a k-mer to its occurrence map. b An example illustrating SeqOthello’s sequence query process and output. A sequence query is decomposed into its constituent k-mers. The query result can be either a k-mer hit map, recording each k-mer’s presence/absence along the query sequence, or k-mer hit ratios (i.e., the fraction of query k-mers present in each experiment)
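The query procedure in panel b can be illustrated with a minimal Python sketch. It is not SeqOthello's implementation: an ordinary dictionary stands in for the Othello hierarchy and the encoded occurrence maps, and the function names, the toy experiments, and the value of k are all assumptions made for illustration.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the constituent k-mers of seq, in order along the sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(experiments, k):
    """Map each k-mer to the set of experiment names that contain it.

    Assumption: a plain dict replaces the root Othello, the per-bucket
    Othellos, and the encoded occurrence maps described in the caption.
    """
    index = defaultdict(set)
    for name, sequences in experiments.items():
        for seq in sequences:
            for kmer in kmers(seq, k):
                index[kmer].add(name)
    return index


def query(index, experiment_names, seq, k):
    """Return (hit_map, hit_ratios) for one query sequence.

    hit_map[name] lists presence/absence of each query k-mer in that
    experiment; hit_ratios[name] is the fraction of query k-mers present.
    """
    query_kmers = kmers(seq, k)
    if not query_kmers:
        raise ValueError("query sequence is shorter than k")
    hit_map = {
        name: [name in index.get(kmer, ()) for kmer in query_kmers]
        for name in experiment_names
    }
    hit_ratios = {
        name: sum(hits) / len(query_kmers) for name, hits in hit_map.items()
    }
    return hit_map, hit_ratios


# Toy data (hypothetical): two "experiments" and a small k.
experiments = {"exp1": ["ACGTAC"], "exp2": ["TACGG"]}
index = build_index(experiments, k=3)
hit_map, hit_ratios = query(index, list(experiments), "ACGTACGG", k=3)
```

In this toy run the query has six 3-mers; `exp1` contains five of them and `exp2` contains four, so the two hit ratios differ even though both experiments share part of the query.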
Fig. 2 Comparing query performance for SeqOthello and three SBT-based algorithms: SBT [10], SSBT [11], and SBT-AS [12]. Performance is benchmarked on 2652 human RNA-seq experiments. The query consists of 198,093 human transcripts in Gencode Release 25. a Query response time. b Peak memory
Fig. 3 The distribution of error rate in k-mer hit ratios returned by SeqOthello. A randomly selected set of 150 experiments is extracted from SeqOthello’s result by querying all human transcripts on 2652 human experiments. The error (δ) of a transcript query over an experiment is calculated as the difference between the transcript’s k-mer hit ratio returned by SeqOthello and the k-mer hit ratio obtained by mapping raw k-mers from the same RNA-seq experiment to the transcript sequences. Each bar shows the percentage of transcripts with δ falling in a specific range. The error bar shows the standard deviation of this percentage measured on the 150 experiments
Fig. 4 An illustration of fusion calling criteria using SeqOthello’s query results against TCGA RNA-seq data. a, b Examples of k-mer hit distribution as a result of fusion junction sequence query using SeqOthello. The presence of a small set of k-mers in a large fraction of samples indicates background noise caused by these k-mers being repetitive. For each fusion, we use δ98th, the number of k-mer hits at the 98th percentile, as an estimate of background noise. a Histogram of k-mer hits querying the junction sequence spanning chr21:42880008-chr21:39956869 connecting gene pair TMPRSS2-ERG. The background noise is estimated at δ98th = 2. b Histogram of k-mer hits querying the junction sequence spanning chr5:134688636-chr5:179991489 connecting gene pair H2AFY-CNOT6. The background noise is estimated at δ98th = 6. c Comparison of performance in recovering database-known fusion occurrences and detecting novel occurrences between the noise-aware approach and the SBT-like approach using θ-based containment query. Here μ is the minimum number of k-mer hits required beyond the fusion-specific noise level used in the noise-aware approach. The change in μ between two adjacent points is 1; θ is the minimum fraction of k-mer hits required to call the presence of a query as used in SBT containment query. The change in θ between two adjacent points is 0.05. d The distribution of the actual k-mer hits of all fusion occurrences called with the noise-aware approach
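The two calling rules compared in panel c can be sketched in Python as follows. This is an illustrative reading of the caption and not the authors' code: the nearest-rank percentile, the use of `>=` in both thresholds, and all function and variable names are assumptions.

```python
def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100.

    Assumption: the caption does not say which percentile method is used.
    """
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceil(pct*n/100)
    return ordered[rank - 1]


def noise_aware_calls(hits_per_sample, mu, pct=98):
    """Noise-aware rule: call a sample when its k-mer hits reach the
    fusion-specific noise level (the pct-th percentile) plus mu."""
    noise = percentile_nearest_rank(list(hits_per_sample.values()), pct)
    called = {s for s, h in hits_per_sample.items() if h >= noise + mu}
    return noise, called


def theta_calls(hits_per_sample, n_query_kmers, theta):
    """Containment rule: call a sample when the fraction of query k-mers
    hit is at least theta."""
    return {
        s for s, h in hits_per_sample.items() if h / n_query_kmers >= theta
    }


# Hypothetical k-mer hit counts for one junction query over 100 samples:
# most samples show one repetitive k-mer, two show two, two carry the fusion.
hits = {f"s{i}": 1 for i in range(96)}
hits.update({"s96": 2, "s97": 2, "s98": 20, "s99": 20})

noise, called = noise_aware_calls(hits, mu=10)
loose = theta_calls(hits, n_query_kmers=40, theta=0.05)
```

With these made-up counts the estimated noise level is 2, so the noise-aware rule with μ = 10 calls only the two samples with 20 hits, while a containment threshold of θ = 0.05 over 40 query k-mers also calls the two samples that sit at the noise level.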
Fig. 5 Top ten most recurring gene fusion events queried through SeqOthello indexing 10,113 TCGA RNA-seq experiments across 29 tumor types. Bar plots show the number of occurrences of the top ten recurrent gene fusions detected by SeqOthello over different tumor types. Occurrences of each fusion in each tumor type are classified into novel occurrences (not reported in TCGA Gene Fusion Database) and annotated occurrences (already curated by TCGA Gene Fusion Database)
Hexadecimal encoding for integer values in the delta-list encoding
| Integer value | Encoded binary representation | Hexadecimal value | Encoded length in bits |
|---|---|---|---|
| 0 ≤ x < 8 | (1xxx)2 | 0x8 ∣ x | 4 |
| 8 ≤ x < 64 | (01xxxxxx)2 | 0x40 ∣ x | 8 |
| 64 ≤ x < 512 | (001xxxxxxxxx)2 | 0x200 ∣ x | 12 |
| 512 ≤ x < 4096 | (0001xxxxxxxxxxxx)2 | 0x1000 ∣ x | 16 |
| 4096 ≤ x | (0000xxxxxxxxxx…)2 | 0x0000 ∣ x | 32 |
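The variable-length code in the table can be exercised with a short Python sketch. It covers only the per-integer encoding shown above, not how the delta values themselves are derived. The function names, the hex-string serialization (most significant nibble first), and the 28-bit upper limit inferred from a 32-bit code with a 4-bit zero prefix are assumptions.

```python
def encode_value(x):
    """Return (code, bit_length) for a non-negative integer x, following
    the table: a marker bit after 0 to 3 leading zero bits, then the payload."""
    if x < 0:
        raise ValueError("value must be non-negative")
    if x < 8:
        return 0x8 | x, 4
    if x < 64:
        return 0x40 | x, 8
    if x < 512:
        return 0x200 | x, 12
    if x < 4096:
        return 0x1000 | x, 16
    if x < (1 << 28):  # assumption: 32 bits minus the 4-bit zero prefix
        return x, 32
    raise ValueError("value does not fit in 28 payload bits")


def encode_list(values):
    """Concatenate the codes as a hex string, one character per 4 bits."""
    parts = []
    for x in values:
        code, bits = encode_value(x)
        parts.append(format(code, f"0{bits // 4}x"))
    return "".join(parts)


def decode_list(hex_string):
    """Invert encode_list: the first nibble of each code fixes its length."""
    values = []
    i = 0
    while i < len(hex_string):
        first = int(hex_string[i], 16)
        if first >= 8:
            n = 1
        elif first >= 4:
            n = 2
        elif first >= 2:
            n = 3
        elif first >= 1:
            n = 4
        else:
            n = 8
        code = int(hex_string[i:i + n], 16)
        if n < 8:
            code &= (1 << (3 * n)) - 1  # drop the marker bit, keep 3n payload bits
        values.append(code)
        i += n
    return values


encoded = encode_list([5, 8, 64, 512, 4096])
decoded = decode_list(encoded)
```

Because the position of the first set bit in the leading nibble determines the code length, the codes form a prefix code and a concatenated stream can be decoded without separators. Small values, which the table assigns the shortest codes, cost only 4 bits each.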
A summary of notations used in Section 4
| Notation | Description |
|---|---|
| Root | Othello at the root of SeqOthello |
| | Othello of the bucket |
| | Set of RNA-seq experiments |
| | Set of buckets |
| SeqOthello | Probability of an alien |
| SeqOthello | Probability of an alien query returning experiment |
| root | Probability that query of an alien |
| | Probability that query of an alien |
Estimated probability values computed on SeqOthello constructed for human and TCGA datasets
| | SRA | TCGA |
|---|---|---|
| Number of RNA-seq experiments | 2652 | 10,113 |
| Number of buckets | 105 | 127 |
| SeqOthello | 0.532440 | 0.551722 |
| SeqOthello | 0.000840 | 0.000606 |
| standard deviation of SeqOthello | 0.000684 | 0.000173 |
SBT, SSBT, and SBT-AS version information
| Algorithm | URL | Version |
|---|---|---|
| SBT | | f7986e4511189cb781b4e3517626b396fb11eefa |
| SSBT | | 0fe43f4c0de7a0a452486a252a5e317862c2af45 |
| SBT-AS | | 383d23f17d5537a0abf1436e5d04795ef91950b3 |