Urszula Oleksiewicz, Katarzyna Tomczak, Jakub Woropaj, Monika Markowska, Piotr Stępniak, Parantu K Shah.
Abstract
Our current understanding of cancer genetics is grounded in the principle that cancer arises from a clone that has accumulated the requisite somatically acquired genetic aberrations, leading to malignant transformation. These aberrations also result in aberrant gene and protein expression. Next generation sequencing (NGS), or deep sequencing, platforms are being used to create large catalogues of changes in copy numbers, mutations, structural variations, gene fusions, gene expression, and other types of information for cancer patients. However, inferring different types of biological changes from the raw reads generated in sequencing experiments is algorithmically and computationally challenging. In this article, we outline common steps for the quality control and processing of NGS data. We highlight the importance of accurate and application-specific alignment of these reads, and the methodological steps and challenges in obtaining different types of information. We comment on the importance of integrating these data and building infrastructure to analyse them. We also provide exhaustive lists of available software to obtain information and point readers to articles comparing software for deeper insight into specialised areas. We hope that the article will guide readers in choosing the right tools for analysing oncogenomic datasets.
Keywords: next generation sequencing
Year: 2015 PMID: 25691827 PMCID: PMC4322529 DOI: 10.5114/wo.2014.47137
Source DB: PubMed Journal: Contemp Oncol (Pozn) ISSN: 1428-2526
Software for primary quality control of NGS data
| Method name | Year published | PMID | Data type | Platform | Statistical method | Input requirements |
|---|---|---|---|---|---|---|
| BIGpre | 2011 | 22289480 | Illumina, 454 | Perl | Correlation between forward and reverse reads, trimming low-quality reads | FASTQ |
| FastQC | 2010 | – | any | Java | Sequence length, quality, k-mers presence reports | FASTQ, SAM/BAM |
| HTQC | 2013 | 23363224 | Illumina | C++ | Tail trimming, filter by quality/length/tile | FASTQ |
| QC-Chain | 2013 | 23565205 | any | C++ | Quality assessment, trimming, filtering unknown contamination | FASTQ |
| Qualimap | 2012 | 22914218 | any | Java, R | Alignment biases detection, sample comparison | SAM/BAM |
| PRINSEQ | 2011 | 21278185 | any | Perl | Sequence complexity, duplicates, occurrence of Ns and poly-A/T tails, tag sequences reports | FASTQ, FASTA |
| PIQA | 2009 | 19602525 | Illumina | R | Assess the clusters density per tile, base-calls proportions per tile/cycle | FASTQ |
| FastUniq | 2012 | 23284954 | any | C++ | De novo PCR duplicates removal for paired short reads | FASTQ |
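Several tools in the table above list tail trimming and filtering by quality or length among their functions. The following is a minimal sketch of that idea, not the implementation of any listed tool. It assumes Phred+33 encoded FASTQ quality strings, and the function names and default thresholds are hypothetical choices made for illustration.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    The offset of 33 (Sanger / Illumina 1.8+ encoding) is an assumption.
    """
    return [ord(ch) - offset for ch in quality_line]


def trim_tail(sequence, quality_line, min_quality=20, offset=33):
    """Remove trailing bases whose Phred score is below min_quality."""
    scores = phred_scores(quality_line, offset)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_line[:end]


def passes_filter(quality_line, min_mean_quality=20, min_length=30, offset=33):
    """Keep a read only if it is long enough and its mean quality is high enough."""
    scores = phred_scores(quality_line, offset)
    if len(scores) < min_length:
        return False
    return sum(scores) / len(scores) >= min_mean_quality
```

A read would typically be trimmed first and then passed to `passes_filter`, so that reads shortened too much by trimming are discarded.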
Software for mapping sequence reads to genome
| Method name | Year published | PMID | Data type | Platform | Statistical method | Input requirements |
|---|---|---|---|---|---|---|
| BFAST | 2009 | 19907642 | RNA | C | Based on creating flexible, efficient whole genome indexes to rapidly map reads to candidate alignment locations, with arbitrary multiple independent indexes allowed to achieve robustness against read errors and sequence variants. | FASTQ |
| Bowtie | 2009 | 19261174 | RNA | C++ (SeqAn library) | Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches | FASTQ, FASTA |
| BWA | 2009 | 19451168 | RNA | C | Backward search with Burrows-Wheeler Transform (BWT), allowing mismatches and gaps. | FASTQ |
| BWA-PSSM | 2014 | 24717095 | RNA | C | Probabilistic adaptable alignment based on the use of position specific scoring matrices (PSSM) and BWT | FASTQ |
| CUSHAW2 | | 22576173 | RNA | C++ | Uses Burrows-Wheeler transform (BWT), the Ferragina-Manzini index and CUDA parallel programming model for GPUs. Supports only ungapped alignment | FASTQ |
| DistMap | 2013 | 24009693 | RNA | Perl, Java | Wrapper for many aligners, based on the MapReduce API for parallel processing. Currently does not handle spliced alignments | FASTQ |
| MAQ | 2008 | 18714091 | RNA | C++, Perl | Based on Smith-Waterman gapped alignment and Bayesian statistical model that incorporates the mapping qualities and error probabilities | FASTQ |
| MOSAIK | 2014 | 24599324 | RNA | C++ | Uses hash clustering strategy coupled with the Smith-Waterman algorithm. Detects mismatches, short insertions and deletions | FASTQ |
| PASS | 2009 | 19218350 | RNA | C++ | Based on precomputed score tables (PST) calculated with the Needleman and Wunsch algorithm | FASTQ |
| RMAP | 2009 | 19736251 | RNA | C++ | Uses multiple filtration (Pevzner and Waterman) and approximate pattern matching. | FASTQ, FASTA |
| SOAPaligner/SOAP2 | 2009 | 19497933 | RNA | C | Based on Burrows Wheeler Transformation (BWT) compression index | FASTQ |
| Stampy | 2011 | 20980556 | RNA | Python | Hybrid probabilistic model for mapping quality (measured by Phred score) | FASTQ |
| ZOOM | 2008 | 18684737 | RNA | | Custom filtering model | FASTQ |
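Many of the aligners listed above rely on backward search over a Burrows-Wheeler transform (BWT). The sketch below illustrates exact-match backward search on a toy reference. It is an illustration only: real aligners use compressed indexes and allow mismatches and gaps, and the naive suffix-array construction and the function names here are assumptions made for brevity.

```python
def build_fm_index(text):
    """Build a toy FM-index (suffix array, C table, Occ table) for text."""
    text = text + "$"  # sentinel that sorts before A, C, G, T
    # Naive suffix array construction; fine for short toy references only.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps around for i == 0).
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # first_col[c]: number of characters in text strictly smaller than c.
    first_col = {}
    total = 0
    for c in alphabet:
        first_col[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, first_col, occ


def backward_search(pattern, sa, first_col, occ):
    """Return sorted start positions of exact matches of pattern."""
    lo, hi = 0, len(sa)  # half-open interval of matching suffix-array rows
    for ch in reversed(pattern):
        if ch not in first_col:
            return []
        lo = first_col[ch] + occ[ch][lo]
        hi = first_col[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


sa, first_col, occ = build_fm_index("GATTACATTA")
hits = backward_search("TTA", sa, first_col, occ)
```

The pattern is consumed from its last character to its first, narrowing the suffix-array interval at each step, which is why the technique is called backward search.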
Software for RNA-Seq data analysis
| Method name | Year published | PMID | Data type | Platform | Statistical method | Input requirements |
|---|---|---|---|---|---|---|
| Aldex | 2013 | 23843979 | RNA-seq | R | ANOVA, Dirichlet distribution | Raw counts |
| ALDEx2 | 2014 | 24910773 | RNA-seq | R | Dirichlet distribution, glm | Raw counts |
| ASC | 2010 | 21080965 | RNA-seq | R | Empirical Bayes | Raw counts |
| baySeq | 2010 | 20698981 | RNA-seq | R | Empirical Bayes | Raw counts |
| BBSeq | 2011 | 21810900 | RNA-seq | R | Beta-Binomial, linear model | Raw counts |
| CEDER | 2012 | 22641709 | RNA-seq | R | Negative binomial | Raw counts |
| compcodeR (uses edgeR & DESeq) | 2014 | 24813215 | RNA-seq | R | Empirical Bayes | Raw counts |
| COV2HTML (not working) | 2014 | 24512253 | RNA-seq | Web site | ||
| CPTRA | 2009 | 19811681 | RNA-seq | Python | ??? | Long-read sequence w/ annotation, short-read sequence tag // fasta, fastq |
| CQN | 2012 | 22285995 | RNA-seq | R | Conditional quantile normalisation, robust generalised regression | Raw counts |
| Cuffdiff | 2013 | 23222703 | RNA-seq | Standalone | Beta negative binomial distribution | |
| DegPack | 2013 | 24981075 | RNA-seq | Web site | Non-parametric (ranks) | Raw counts |
| DEGseq | 2010 | 19855105 | RNA-seq | R | MA-plot-based (?) | Raw counts |
| DER Finder | 2014 | 24398039 | RNA-seq | R | Hidden Markov Model | Raw counts |
| DESeq | 2010 | 20979621 | RNA-seq | R | Negative binomial distribution | Raw counts |
| DEXUS | 2013 | 24049071 | RNA-seq | R | Expectation-maximisation algorithm, Bayes | Raw counts |
| EBSeq | 2013 | 23428641 | RNA-seq | R | Empirical Bayes | Raw counts |
| EDASeq (uses edgeR & DESeq) | 2011 | 22177264 | RNA-seq | R | Empirical Bayes | Raw counts |
| edgeR | 2012 | 22287627 | RNA-seq | R | Empirical Bayes, glm | Raw counts |
| edgeR-robust | 2014 | 24753412 | RNA-seq | R | Weights, empirical Bayes | Raw counts |
| GExposer | | | | | Machine learning algorithm | |
| iFad | 2012 | 22581178 | RNA-seq | R | Bayesian sparse factor model | Raw counts |
| MRFSEQ (uses DESeq) | 2013 | 23793751 | RNA-seq | Standalone | Markov random field model | Raw counts, co-expression database |
| Myrna | 2010 | 20701754 | RNA-seq | Cloud-computing, Bowtie, R | ||
| NOISeq | 2011 | 21903743 | RNA-seq | R | Non-parametric | Raw counts |
| NPEBseq | 2013 | 23981227 | RNA-seq | R | Non-parametric Bayesian | Raw counts |
| pairedBayes | | | RNA-seq | R | Empirical Bayes | Raw counts |
| PoissonSeq | 2012 | 22003245 | RNA-seq | R | Poisson goodness-of-fit | Raw counts |
| QuasiSeq | 2012 | 23104842 | RNA-seq | R | Quasi-Poisson, quasi-negative binomial | Raw counts |
| RNASeqGUI (uses edgeR, DESeq, NOISeq, baySeq) | 2014 | 24812338 | RNA-seq | R | Empirical Bayes, negative binomial | Raw counts |
| SAMSeq | 2013 | 22127579 | RNA-seq | Standalone | Non-parametric | Raw counts |
| ShrinkBayes | 2014 | 24766777 | RNA-seq | R | Negative binomial, Poisson-Gaussian, Bayesian GLM | Raw counts |
| sSeq | 2013 | 23589650 | RNA-seq | R | Negative binomial | Raw counts |
| TCC (uses edgeR, DESeq, DESeq2) | 2013 | 23837715 | RNA-seq | R | Negative binomial, empirical Bayes | Raw counts |
| tRanslatome | 2013 | 24222209 | RNA-seq | R | Rank Product, t-test, SAM, limma, ANOTA, DESeq, edgeR | Raw counts |
| TSPM.R | ||||||
| tweeDEseq | 2013 | 23965047 | RNA-seq | R | Poisson-Tweedie distributions | Raw counts |
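Several methods in the table above model raw counts with a negative binomial rather than a Poisson distribution, because count variance across replicates usually exceeds the mean. Under the common parameterisation, variance = mean + dispersion × mean². The sketch below estimates the dispersion for one gene by the method of moments. It is a simplified illustration, not the estimator used by any listed package (those typically share information across genes), and the function name is hypothetical.

```python
from statistics import mean, variance


def moment_dispersion(counts):
    """Method-of-moments negative binomial dispersion for one gene.

    Solves variance = mean + dispersion * mean**2 for dispersion and
    truncates at zero, so Poisson-like data give a dispersion of 0.
    """
    mu = mean(counts)
    if mu == 0:
        raise ValueError("mean count is zero; dispersion is undefined")
    var = variance(counts)  # unbiased sample variance
    return max(0.0, (var - mu) / (mu * mu))


# Replicate counts for a single hypothetical gene.
dispersion = moment_dispersion([10, 20, 30])
```

A dispersion of 0 corresponds to the Poisson case, and larger values indicate stronger overdispersion.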
Alternative splicing algorithm
| Method name | Year published | PMID | Data type | Platform | Statistical method | Input requirements |
|---|---|---|---|---|---|---|
| ABMapper | 2011 | 21169377 | RNA | C++, Perl | Fast suffix-array algorithm and a dual-seed strategy for spliced alignment | FASTA, FASTQ |
| ERANGE | 2008 | 18516045 | RNA | Python | Splice junctions identification relies on reference genome exon positions | FASTQ |
| GEM Mapper | 2012 | 23103880 | RNA | C, Objective Caml | Based on Burrows-Wheeler Transform and custom mapping algorithms. Uses custom mappability concept | FASTQ |
| MapSplice | 2010 | 20802226 | RNA | C++, Python | Algorithm not dependent on splice site features or intron length; consequently, it can detect novel canonical as well as non-canonical splices. This method has tag alignment phase and splice inference phase | FASTQ |
| PALMapper | 2010 | 21154708 | RNA | Web/ Galaxy | Combines GenomeMapper (based on BWT and k-mer indexes) read mapper with the spliced aligner QPALMA | FASTQ |
| QPALMA | 2008 | 18689821 | RNA | C++, Python | SVM-based splice site predictor with the so-called ‘weighted degree’ kernel. Alignment based on extended BWT | FASTQ |
| SpliceMap | 2010 | 20371516 | RNA | C++, Python | Canonical GT-AG splice sites identification using half-read mapping | FASTQ |
| SplitSeek | 2010 | 20236510 | RNA | | Candidate junction reads generation in intermediate BEDPE format feasible for paired-end sequences | FASTQ |
| Subread | 2013 | 23558742 | RNA | R | Seed-and-vote - new multi-seed alignment strategy for overlapping seeds from each read (subreads) | FASTQ |
| TopHat | 2009 | 19289445 | RNA | C++ | Canonical GT–AG splice sites identification | FASTQ |
Methods for finding mutations
| Method name | Year published | PMID | Data type | Platform | Statistical method | Input requirements |
|---|---|---|---|---|---|---|
| ActiveDriver | 2013 | 23340843 | DNA-seq | R | Generalized linear regression | FASTA, TAB |
| CancerMutationAnalysis | 2014 | 24233780 | DNA-seq | R | Empirical Bayes | Non-standard tables |
| CanDrA | 2013 | 24205039 | DNA-seq | Perl | Mann-Whitney U test, AUC (area under curve) | Non-standard tables |
| CanPredict | 2007 | 17537827 | DNA-seq | Web site | SIFT, Pfam-based logR.E-value metric, GOSS | FASTA |
| CAROL | 2012 | 22261837 | DNA-seq | R | SIFT, PolyPhen-2 | Tab-delimited, FASTA |
| CHASM/SNV-Box | 2009 | 19654296 | DNA-seq | Standalone | CHASM, SNV-Box | dbSNP rs#, PubMed ID, VCF, BED |
| CRAVAT | 2013 | 23325621 | DNA-seq | Web site | CHASM, SnvGet | Non-standard tables |
| DDIG-in | 2013 | 23497682 | DNA-seq | Web site | Support vector machine-based method | Non-standard tables |
| DMI | 2012 | 23044540 | DNA-seq | Standalone | Machine learning, discrimination index | Text file |
| DrGaP | 2013 | 23954162 | DNA-seq | Standalone | Chi-square distribution | Non-standard tables |
| e-Driver | 2014 | 25064568 | DNA-seq | Perl | Binomial distribution | Non-standard tables |
| eXtasy | 2013 | 24076761 | DNA-seq | Web site | Variant impact prediction, haploinsufficiency prediction, phenotype-specific gene prioritisation | VCF |
| FATHMM | 2013 | 23620363 | DNA-seq | Web site | Hidden Markov Model | Annotated VCF |
| InVEx | 2012 | 22817889 | DNA-seq | Python | Permutation-based | Non-standard tables, power FASTA |
| MuSIC | 2012 | 22759861 | DNA-seq | Standalone | Fisher p-value, likelihood ratio test, convolution test (summarised log statistic of joint binomial point probability) | BAM, SNV, MAF |
| MutSig | 2013 | 23770567 | DNA-seq | Standalone | MutSigCV (Background mutation rate) | |
| nsSNPAnalyzer | 2005 | 15980516 | DNA-seq | Web site | Machine learning (random forest) | FASTA, SNP |
| Oncodrive-fm | 2012 | 22904074 | DNA-seq | Standalone | SIFT, PolyPhen2, MutationAssessor | TDM, TSV |
| OncodriveCLUST | 2013 | 23884480 | DNA-seq | Python | Clustering | Non-standard tables |
| PANTHER | 2013 | 23193289 | DNA-seq | Web site | subPSEC | FASTA |
| PhD-SNP | 2006 | 16895930 | DNA-seq | Web site | Sequence and Profile-Based | FASTA |
| PROVEAN | 2012 | 23056405 | DNA-seq | Standalone | Alignment-Based | Non-standard tables |
| transFIC | 2012 | 23181723 | DNA-seq | Web site | SIFT, PolyPhen2, MutationAssessor | File w/ chromosome/protein coordinates (hg19) |
Identifying gene fusions
| Method name | Year published | PMID | Data type | Platform | Statistical method | Input requirements |
|---|---|---|---|---|---|---|
| BreakFusion | 2012 | 22563071 | RNA-seq | C++, Perl | A computational pipeline for identifying gene fusions from RNA-seq data | BAM |
| BreakTrans | 2013 | 23972288 | RNA-seq | Perl | Uncovering the genomic architecture of gene fusions | Tab-delimited text files |
| chimerascan | 2011 | 21840877 | RNA-seq | Python | Identifying chimeric transcription in sequencing data | FASTQ |
| comrad | 2011 | 21478487 | RNA-seq, DNA-seq | C++ | Discovery of gene fusions using paired end RNA-Seq and WGSS. | FASTQ |
| deFuse | 2011 | 21625565 | RNA-seq | C++ | Detecting gene fusions from paired-end RNA-seq | FASTQ |
| FusionAnalyser | 2012 | 22570408 | RNA-seq | C# | Detecting gene fusions from paired-end RNA-Seq data | SAM/BAM |
| FusionHunter | 2011 | 21546395 | RNA-seq | Perl | Detecting gene fusions from paired-end RNA-Seq data | FASTQ |
| FusionMap | 2011 | 21593131 | RNA-seq, DNA-seq | C# | Detecting gene fusions from single- and paired-end RNA-Seq and DNA-seq data | FASTQ/BAM with unmapped reads |
| FusionSeq | 2010 | 20964841 | RNA-seq | C | A modular framework for finding gene fusions by analysing Paired-End RNA-Sequencing data | MRF, SAM |
| ShortFuse | 2011 | 21330288 | RNA-seq | C++, Python | Detecting gene fusions from paired-end RNA-Seq data | FASTQ |
| SnowShoes-FTD | 2011 | 21622959 | RNA-seq | Perl | Detecting gene fusions from paired-end RNA-Seq data | FASTQ |
| SOAPfusion | 2013 | 24123671 | RNA-seq | Perl | Detecting gene fusions from paired-end RNA-Seq data | FASTQ |
| TopHat-Fusion | 2011 | 21835007 | RNA-seq | C++ | Detecting gene fusions from single- and paired-end RNA-Seq data | FASTQ |
Available software for purity estimation
| Method name | Year published | PMID | Data type (technique) | Platform | Statistical method | Input requirements |
|---|---|---|---|---|---|---|
| Dsection | 2010 | 20631160 | RNA (Microarray) | Web-based and MATLAB | Bayesian model | Expression and proportion data required |
| csSAM | 2010 | 20208531 | RNA (Microarray) | - | Linear regression-based model | Expression profile of mixed tissue samples |
| mixture_estimation.R | 2010 | 20202973 | RNA (Microarray) | R based | Variation of electronic subtraction method | Expression profile of mixed tissue samples |
| ASCAT | 2010 | 20837533 | DNA (Microarray) | R based | Analytical optimisation method | SNP array data with Log R and B-Allele frequency information |
| PERT | 2012 | 23284283 | RNA (Microarray) | Octave | Perturbation model | Expression data from mixed cell type and expression profile of each homogeneous cell type |
| ABSOLUTE | 2012 | 22544022 | DNA (Microarray and HTS) | R based | Gaussian mixture model | Copy number data in segmentation file |
| JointSNVMix1, JointSNVMix2 | 2012 | 22285562 | DNA (HTS) | Python | Probabilistic graphical model | Sequence data from tumour/normal pairs |
| CNAnorm | 2012 | 22039209 | DNA (HTS) | R based | Analytical optimisation method | Sequencing data of tumour and normal samples in BAM format |
| DeconRNASeq | 2013 | 23428642 | RNA (Microarray and HTS) | R based | Globally optimised nonnegative decomposition algorithm | Expression data from multiple tissue, signature of individual tissue and proportion data required |
| TEMT | 2013 | 23735186 | RNA (HTS) | Python | Probabilistic model including position and sequence-specific biases | Required RNA-seq sequencing data from pure tissue and mixed tissue |
| ESTIMATE | 2013 | 24113773 | RNA (HTS) | R based | Gene signature (ssGSEA) based model | Expression data in Gene Set Enrichment Analysis (GSEA) gct format |
| THetA | 2013 | 23895164 | DNA (HTS) | Python | Explicit probabilistic model | Copy number data in interval count file format |
| ExPANdS | 2013 | 24177718 | DNA (HTS) | R based and MATLAB | Probability distributions model | Somatic mutations and copy number data required |
| Virmid | 2013 | 23987214 | DNA (HTS) | Java based | Probabilistic model and maximum likelihood estimator | Disease and normal sequencing data in BAM format |
| MuTect | 2013 | 23396013 | DNA (HTS) | Java based | Bayesian model | Tumour and normal sequencing data |
| TrAp | 2013 | 23892400 | DNA (HTS) | Java based | Linear mixture model with evolutionary framework | Tumour karyotypes and somatic hypermutation datasets |
| Seo | 2013 | 23650637 | RNA (Microarray) | – | Linear mixture model | Disease-associated variants and expression of heterogeneous normal tissue |
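Several entries in the table above describe linear mixture models, in which the signal measured in a bulk sample is treated as a weighted sum of tumour and normal components. The sketch below illustrates the simplest two-component case by least squares. It is an assumption-laden toy, not the algorithm of any listed tool: it presumes that reference expression profiles for pure tumour and pure normal tissue are available, and the function name is hypothetical.

```python
def estimate_tumour_fraction(mixed, tumour, normal):
    """Least-squares tumour fraction p under mixed = p*tumour + (1-p)*normal.

    Rearranging gives (mixed - normal) = p * (tumour - normal), a regression
    through the origin, so p = sum(dm * dt) / sum(dt * dt) over all genes.
    """
    numerator = 0.0
    denominator = 0.0
    for m, t, n in zip(mixed, tumour, normal):
        diff = t - n
        numerator += (m - n) * diff
        denominator += diff * diff
    if denominator == 0:
        raise ValueError("tumour and normal profiles are identical")
    # Clamp to the valid range for a proportion.
    return min(1.0, max(0.0, numerator / denominator))


# Hypothetical reference profiles and a bulk sample that is 75% tumour.
tumour_profile = [10.0, 2.0, 8.0, 1.0]
normal_profile = [2.0, 6.0, 4.0, 1.0]
bulk_sample = [8.0, 3.0, 7.0, 1.0]
purity = estimate_tumour_fraction(bulk_sample, tumour_profile, normal_profile)
```

Genes with identical expression in both reference profiles contribute nothing to the estimate, which is why informative marker genes matter for methods of this kind.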