
Computational characterisation of cancer molecular profiles derived using next generation sequencing.

Urszula Oleksiewicz1, Katarzyna Tomczak2, Jakub Woropaj3, Monika Markowska4, Piotr Stępniak4, Parantu K Shah5.   

Abstract

Our current understanding of cancer genetics is grounded on the principle that cancer arises from a clone that has accumulated the requisite somatically acquired genetic aberrations, leading to the malignant transformation. It also results in aberrant gene and protein expression. Next generation sequencing (NGS) or deep sequencing platforms are being used to create large catalogues of changes in copy numbers, mutations, structural variations, gene fusions, gene expression, and other types of information for cancer patients. However, inferring different types of biological changes from raw reads generated using the sequencing experiments is algorithmically and computationally challenging. In this article, we outline common steps for the quality control and processing of NGS data. We highlight the importance of accurate and application-specific alignment of these reads and the methodological steps and challenges in obtaining different types of information. We comment on the importance of integrating these data and building infrastructure to analyse them. We also provide exhaustive lists of available software to obtain information and point the readers to articles comparing software for deeper insight in specialised areas. We hope that the article will guide readers in choosing the right tools for analysing oncogenomic datasets.

Keywords:  next generation sequencing

Year:  2015        PMID: 25691827      PMCID: PMC4322529          DOI: 10.5114/wo.2014.47137

Source DB:  PubMed          Journal:  Contemp Oncol (Pozn)        ISSN: 1428-2526


Molecular profiling of cancer genomes

Over the years, individual laboratories and large-scale projects such as TCGA and ICGC have discovered that cancer is a heterogeneous disease with considerable variability within a single tumour type or even within a single tumour [1-4]. Nonetheless, much of our current understanding of cancer genetics is grounded on the principle that cancer arises from a clone that has accumulated the requisite somatically acquired genetic aberrations leading to the malignant transformation [5]. Characterising individual tumours or cohorts at the molecular level has helped in identifying common and type-specific cancer vulnerabilities as well as recording the individual history of tumours [4, 6–8]. This has enabled the creation of drugs that target these molecular vulnerabilities and provide tailored treatments for patients, improving therapy efficacy and minimising side effects [9, 10]. For example, Imatinib specifically targets the BCR-Abl fusion tyrosine kinase that exists only in the cells of chronic myelogenous leukaemia and other tumours but not in healthy cells [11]. Similarly, Herceptin is a monoclonal antibody that is used to target HER2 positive breast tumours [12]. At present, the TCGA and other large-scale projects characterise tumours with microarray and next generation sequencing (NGS) platforms to obtain different types of genetic information at the whole genome level [6–8, 13–15]. The microarray platform has been, and still is, used to measure gene and microRNA expression, alternative splicing, copy number alterations, and DNA methylation, and to identify protein-DNA and protein-RNA interactions [16]. Next generation sequencing platforms are now replacing the microarray platforms for obtaining these data. Moreover, sequence reads from whole exome sequencing, along with DNA and RNA sequencing, also allow detection of mutations and gene fusions in coding and non-coding regions of the genome [17].
While conceptually similar in experiment design, the sequence read information generated using NGS platforms has very different statistical properties to intensity-based information acquired from microarray platforms [18]. Multiple articles have reviewed protocols to generate microarray profiles and their statistical analysis to extract meaningful information [19-22]. Although NGS is relatively new, a vast amount of literature describing statistical methodologies to analyse NGS data already exists [23-28]. Therefore, in this article we focus only on the methodologies to extract meaningful information from NGS reads. Moreover, we discuss only those data types that are generated and analysed by TCGA. Furthermore, for each data type, we describe only the main steps to obtain this information and point the readers to an exhaustive list of methodologies/software and articles for deeper insight. We do not compare these methods ourselves but rather point to articles comparing different methodologies and identifying their strengths and errors [29-32].

Pre-processing and Quality Control of NGS data

To date, several next-generation sequencing platforms are available, including the Illumina Genome Analyser, which is being used extensively by the TCGA consortium for tumour profiling. Each platform has its own method for generating sequencing reads from samples, but in every case the sequence reads obtained are short, typically from 36 to several hundred nucleotides. Furthermore, the sequencing run can be single-end or paired-end, meaning that the reads are sequenced in one or two directions (from the 3′ and 5′ ends). The first tasks in any NGS computational pipeline are: performing primary data acquisition, determining base calls and confidence scores from the fluorescent signals of the sequencer, and converting them to FASTQ files containing the raw sequence reads and per-base quality scores. When multiple samples are pooled in one lane using sample-specific index/barcode adapters, the FASTQ should be demultiplexed and reorganised based on index information, and the adapters ought to be trimmed [33]. Quality control is a very important part of the data preparation (Table 1). There are several kinds of sequencing artefacts that could have a serious negative impact on downstream analyses. The artefacts commonly exist in raw reads, regardless of the sequencing platform. Firstly, sequences may be contaminated with adapters on their 5′ or 3′ ends that were added as part of the sequencing protocol. Secondly, base quality and sequence complexity vary both within and between reads. The qualities of bases on most sequencing platforms degrade as the run progresses, so it is common to see the quality of base calls falling towards the end of the read. It is desirable to remove or trim such sequences with appropriate thresholds. Additionally, NGS reads can be highly redundant, with the same sequence being represented in large numbers, so it is important to reduce these PCR amplification artefacts.
Contamination in the sequencing dataset can also be caused by laboratory factors such as sample preparation, library construction, and other steps of the experiment. Moreover, samples may contain DNA/RNA from other sources including viruses, which are hard to avoid during the sample preparation process. Finally, general statistical methods such as sample clustering, principal component analysis (PCA), and outlier detection can be used for assessment of overall quality and for sample comparison according to the experiment design.
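The trimming and de-duplication steps described above can be illustrated with a minimal Python sketch. It assumes Phred+33 quality encoding and an illustrative quality threshold of 20; the function names are hypothetical and do not correspond to any tool in Table 1, and exact-sequence de-duplication is only a crude stand-in for dedicated PCR duplicate removal.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def trim_low_quality_tail(sequence, quality_string, min_quality=20, offset=33):
    """Remove trailing bases whose Phred score is below min_quality."""
    scores = phred_scores(quality_string, offset)
    end = len(scores)
    # Walk back from the 3' end while base calls stay below the threshold.
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_string[:end]


def remove_exact_duplicates(sequences):
    """Keep the first occurrence of each read sequence, preserving input order."""
    seen = set()
    unique = []
    for seq in sequences:
        if seq not in seen:
            seen.add(seq)
            unique.append(seq)
    return unique
```

For example, a read `ACGTAC` with quality string `IIII##` loses its last two bases, because `#` encodes a Phred score of 2.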
Table 1

Software for primary quality control of NGS data

| Method name | Year published | PMID | Data type | Platform | Statistical method | Input requirements |
|---|---|---|---|---|---|---|
| BIGpre | 2011 | 22289480 | Illumina, 454 | Perl | Correlation between forward and reverse reads, trimming low-quality reads | FASTQ |
| FastQC | 2010 | | any | Java | Sequence length, quality, k-mers presence reports | FASTQ, SAM/BAM |
| HTQC | 2013 | 23363224 | Illumina | C++ | Tail trimming, filter by quality/length/tile | FASTQ |
| QC-Chain | 2013 | 23565205 | any | C++ | Quality assessment, trimming, filtering unknown contamination | FASTQ |
| Qualimap | 2012 | 22914218 | any | Java, R | Alignment biases detection, sample comparison | SAM/BAM |
| PRINSEQ | 2011 | 21278185 | any | Perl | Sequence complexity, duplicates, occurrence of Ns and poly-A/T tails, tag sequences reports | FASTQ, FASTA |
| PIQA | 2009 | 19602525 | Illumina | R | Assess the clusters density per tile, base-calls proportions per tile/cycle | FASTQ |
| FastUniq | 2012 | 23284954 | any | C++ | De novo PCR duplicates removal for paired short reads | FASTQ |

Aligning short reads to the reference genome

Accurate alignment of short sequence reads generated using NGS platforms to a repeat-masked reference genome is the first step in obtaining biological information from NGS data. Since the number of reads generated in any given NGS experiment is very large (typically in the millions), many efficient algorithms have been developed to deal with the alignment process. It is important to note that different read mapping procedures are necessary depending on the needs of downstream analysis, and alignment accuracy has a high impact on the interpretation of the data. We comment on this in the sections that follow. Most applications aim to identify uniquely mapped reads, i.e. reads matching a single “best” genomic position. Non-uniquely mapped reads are filtered using an upper bound on the number of reported mappings. Most short read alignment algorithms use auxiliary data structures (also called indexes) for the reads or the reference sequence. The main indexing methods are based on hash tables, prefix/suffix trees, or merge sorting (Table 2). Such a representation of the entire human genome takes only a few GB of memory and enables exact matches to be found in a short time. Burrows-Wheeler transform and FM-index-based algorithms give better results for reads from repeated regions, but there is no efficient general method for handling errors in the reads for this category. Some hybrid solutions have been proposed, e.g. Stampy (see Table 2). These enhancements result in higher sensitivity and smaller memory requirements of mapping tools. Limiting the number of reported mapping positions is particularly useful as it prevents the result list from being blown up by reads mapping to highly repetitive regions. It is important to note that in the case of paired-end sequencing the paired reads need to be mapped to identical genomic positions to be considered multi-mapping reads. Data from Illumina machines have few substitution errors per read and virtually no insertion or deletion (INDEL) errors [34].
Thus, such data can be mapped efficiently by, for example, Bowtie [35], and the junction-mapping extension can then be done by TopHat [36], which can handle up to three mismatches per sequence and no INDELs.
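To make the idea of index-based exact matching concrete, the following Python sketch builds a toy FM-index from the Burrows-Wheeler transform of a reference and finds all exact occurrences of a read by backward search. It is an illustration under simplifying assumptions (naive suffix sorting, full occurrence tables, no mismatches or gaps, a reference over A/C/G/T), not the implementation used by Bowtie, BWA, or any other tool in Table 2; all function names are hypothetical.

```python
def build_fm_index(reference):
    """Build a toy FM-index: suffix array, first-column offsets, occurrence tables."""
    text = reference + "$"  # sentinel, sorts before A, C, G, T
    # Naive suffix array construction; real aligners use far more efficient methods.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps to "$" for suffix 0).
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # first[c] = number of characters in the text that are smaller than c.
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, first, occ


def exact_match_positions(read, index):
    """Return sorted 0-based reference positions where the read matches exactly."""
    sa, first, occ = index
    lo, hi = 0, len(sa)  # half-open interval of suffix-array rows
    for ch in reversed(read):  # backward search: extend the match to the left
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))
```

A read that occurs more than once in the reference, such as `ACG` in `ACGTACGA`, returns several positions; this is the multi-mapping situation discussed above.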
Table 2

Software for mapping sequence reads to genome

| Method name | Year published | PMID | Data type | Platform | Statistical method | Input requirements |
|---|---|---|---|---|---|---|
| BFAST | 2009 | 19907642 | RNA | C | Based on creating flexible, efficient whole genome indexes to rapidly map reads to candidate alignment locations, with arbitrary multiple independent indexes allowed to achieve robustness against read errors and sequence variants. Final local alignment uses a Smith-Waterman method, with gaps to support the detection of small INDELs | FASTQ |
| Bowtie | 2009 | 19261174 | RNA | C++ (SeqAn library) | Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches | FASTQ, FASTA |
| BWA | 2009 | 19451168 | RNA | C | Backward search with Burrows-Wheeler Transform (BWT), allowing mismatches and gaps | FASTQ |
| BWA-PSSM | 2014 | 24717095 | RNA | C | Probabilistic adaptable alignment based on the use of position specific scoring matrices (PSSM) and BWT | FASTQ |
| CUSHAW | | 22576173 | RNA | C++ | Uses Burrows-Wheeler transform (BWT), the Ferragina-Manzini index and CUDA parallel programming model for GPUs. Supports only ungapped alignment | FASTQ |
| DistMap | 2013 | 24009693 | RNA | Perl, Java | Wrapper for many aligners, based on MapReduce API for parallel processing. Currently not handling spliced alignments | FASTQ |
| MAQ | 2008 | 18714091 | RNA | C++, Perl | Based on Smith-Waterman gapped alignment and Bayesian statistical model that incorporates the mapping qualities and error probabilities | FASTQ |
| MOSAIK | 2014 | 24599324 | RNA | C++ | Uses hash clustering strategy coupled with the Smith-Waterman algorithm. Detects mismatches, short insertions and deletions | FASTQ |
| PASS | 2009 | 19218350 | RNA | C++ | Based on precomputed score tables (PST) calculated with the Needleman and Wunsch algorithm | FASTQ |
| RMAP | 2009 | 19736251 | RNA | C++ | Uses multiple filtration (Pevzner and Waterman) and approximate pattern matching. Incorporates the use of quality scores directly into the mapping process | FASTQ, FASTA |
| SOAPaligner/SOAP2 | 2009 | 19497933 | RNA | C | Based on Burrows-Wheeler Transformation (BWT) compression index | FASTQ |
| Stampy | 2011 | 20980556 | RNA | Python | Hybrid probabilistic model for mapping quality (measured by Phred score) | FASTQ |
| ZOOM | 2008 | 18684737 | RNA | | Custom filtering model | FASTQ |
Transcriptome sequencing produces reads from transcribed sequences with introns and intergenic regions excluded. Standard alignment algorithms, which handle mismatches and gaps, generally do not handle reads that span exon boundaries. Tools for identifying novel splice junctions usually use standard algorithms in the first step and then derive exon positions, e.g. from clustering of mapped reads or from reads mapped into introns at their last few bases. Even in de novo assembly, in some parallel algorithms, if the location of each individual read is not tracked the reads may still need to be aligned back to the assembly. Therefore, sequence mapping is essential to almost all NGS techniques. Quantitation of microRNA expression requires similar steps, but reads are mapped to the mature and precursor sequences of known miRNAs collected in microRNA databases. Prediction of secondary structure and genomic cluster analysis are also useful [37].

Expression quantitation and identification of differential expression

The expression level of each mRNA is measured by the number of sequenced fragments that map to the transcript (or counts and its derivatives), which is expected to correlate directly with its abundance level. Counts usually refer to the number of reads that align to a particular genomic feature. Like gene counts, any other targets may be quantified, including exons, transcripts, and miRNAs. Counts are heavily dependent on RNA sequencing depth and the effective length of the feature. Therefore, counts need to be adjusted for feature length to make the expression comparable. Effective gene counts are adjusted for the amount of bias in the experiment. Counts per million (CPM) mapped reads are counts divided by the total number of sequenced fragments and multiplied by one million. CPM's length-normalised analogues are reads per kilobase per million (RPKM) and fragments per kilobase per million (FPKM). RPKM and FPKM are identical for single-end sequencing but differ for paired-end sequencing, where the two reads of a pair are counted as one fragment. Calculating length-normalised measures makes them comparable within a sample. The RSEM package computes maximum likelihood abundance estimates using the Expectation-Maximisation algorithm and effectively takes care of multi-mapping reads. The RSEM representation is the current standard for reporting expression in the Firehose GDAC pipeline. A deficiency of the RPKM/FPKM approach is that the proportional representation of each gene is dependent on the expression levels of all other genes. Often a small fraction of genes account for large proportions of the sequenced reads, and small expression changes in these highly expressed genes will skew the counts of lowly expressed genes under this scheme. This can result in erroneous calls of differential expression. Therefore, methods for calculating differential expression require counts to begin with.
Thus, RNA-Seq data consist of non-negative counts that follow a discrete distribution, as opposed to the intensities recorded from microarrays, which are treated as continuous measurements and commonly assumed to follow a log-normal distribution. For RNA-Seq data, the Poisson distribution and the Negative Binomial (NB) distribution are the two most commonly used models [38-42] (Table 3). Other distributions, such as beta-binomial [43], have also been proposed.
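The count-based measures defined above can be written out in a few lines of Python. This is a minimal sketch: the function names and the dictionary-based inputs are illustrative assumptions, and feature length is taken in base pairs. Replacing read counts with fragment counts in the second function gives FPKM.

```python
def counts_per_million(counts):
    """CPM: counts divided by the library total, multiplied by one million."""
    total = sum(counts.values())
    return {gene: c * 1e6 / total for gene, c in counts.items()}


def rpkm(counts, lengths_bp):
    """RPKM: counts per kilobase of feature length per million mapped reads."""
    total = sum(counts.values())
    # 1e9 combines the per-million (1e6) and per-kilobase (1e3) scaling factors.
    return {
        gene: c * 1e9 / (total * lengths_bp[gene])
        for gene, c in counts.items()
    }
```

With counts of 500 and 1500 for two genes of 1000 bp and 3000 bp, the CPM values differ threefold while the RPKM values are equal, which shows how length normalisation makes features comparable within a sample.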
Table 3

Software for RNA-Seq data analysis

| Method name | Year published | PMID | Data type | Platform | Statistical method | Input requirements |
|---|---|---|---|---|---|---|
| Aldex | 2013 | 23843979 | RNA-seq | R | ANOVA, Dirichlet distribution | Raw counts |
| ALDEx2 | 2014 | 24910773 | RNA-seq | R | Dirichlet distribution, glm | Raw counts |
| ASC | 2010 | 21080965 | RNA-seq | R | Empirical Bayes | Raw counts |
| baySeq | 2010 | 20698981 | RNA-seq | R | Empirical Bayes | Raw counts |
| BBSeq | 2011 | 21810900 | RNA-seq | R | Beta-Binomial, linear model | Raw counts |
| CEDER | 2012 | 22641709 | RNA-seq | R | Negative binomial | Raw counts |
| compcodeR (uses edgeR & DESeq) | 2014 | 24813215 | RNA-seq | R | Empirical Bayes | Raw counts |
| COV2HTML (not working) | 2014 | 24512253 | RNA-seq | Web site | | |
| CPTRA | 2009 | 19811681 | RNA-seq | Python | ??? | Long-read sequence w/ annotation, short-read sequence tag // fasta, fastq |
| CQN | 2012 | 22285995 | RNA-seq | R | Conditional quantile normalisation, robust generalised regression | Raw counts |
| Cuffdiff | 2013 | 23222703 | RNA-seq | Standalone | Beta negative binomial distribution | |
| DegPack | 2013 | 24981075 | RNA-seq | Web site | Non-parametric (ranks) | Raw counts |
| DEGseq | 2010 | 19855105 | RNA-seq | R | MA-plot-based (?) | Raw counts |
| DER Finder | 2014 | 24398039 | RNA-seq | R | Hidden Markov Model | Raw counts |
| DESeq | 2010 | 20979621 | RNA-seq | R | Negative binomial distribution | Raw counts |
| DEXUS | 2013 | 24049071 | RNA-seq | R | Expectation-maximisation algorithm, Bayes | Raw counts |
| EBSeq | 2013 | 23428641 | RNA-seq | R | Empirical Bayes | Raw counts |
| EDASeq (uses edgeR & DESeq) | 2011 | 22177264 | RNA-seq | R | Empirical Bayes | Raw counts |
| edgeR | 2012 | 22287627 | RNA-seq | R | Empirical Bayes, glm | Raw counts |
| edgeR-robust | 2014 | 24753412 | RNA-seq | R | Weights, empirical Bayes | Raw counts |
| GExposer | | | | | Machine learning algorithm | |
| iFad | 2012 | 22581178 | RNA-seq | R | Bayesian sparse factor model | Raw counts |
| MRFSEQ (uses DESeq) | 2013 | 23793751 | RNA-seq | Standalone | Markov random field model | Raw counts, co-expression database |
| Myrna | 2010 | 20701754 | RNA-seq | Cloud-computing, Bowtie, R | | |
| NOISeq | 2011 | 21903743 | RNA-seq | R | Non-parametric | Raw counts |
| NPEBseq | 2013 | 23981227 | RNA-seq | R | Non-parametric Bayesian | Raw counts |
| pairedBayes | | | RNA-seq | R | Empirical Bayes | Raw counts |
| PoissonSeq | 2012 | 22003245 | RNA-seq | R | Poisson goodness-of-fit | Raw counts |
| QuasiSeq | 2012 | 23104842 | RNA-seq | R | Quasi-Poisson, quasi-negative binomial | Raw counts |
| RNASeqGUI (uses edgeR, DESeq, NoiSeq, BaySeq) | 2014 | 24812338 | RNA-seq | R | Empirical Bayes, negative binomial | Raw counts |
| SAMSeq | 2013 | 22127579 | RNA-seq | Standalone | Non-parametric | Raw counts |
| ShrinkBayes | 2014 | 24766777 | RNA-seq | R | Negative binomial, Poisson-Gaussian, Bayesian GLM | Raw counts |
| sSeq | 2013 | 23589650 | RNA-seq | R | Negative binomial | Raw counts |
| TCC (uses edgeR, DESeq, DESeq2) | 2013 | 23837715 | RNA-seq | R | Negative binomial, empirical Bayes | Raw counts |
| tRanslatome | 2013 | 24222209 | RNA-seq | R | Rank Product, t-test, SAM, limma, ANOTA, DESeq, edgeR | Raw counts |
| TSPM.R | | | | | | |
| tweeDEseq | 2013 | 23965047 | RNA-seq | R | Poisson-Tweedie distributions | Raw counts |
The Poisson distribution has the advantage of simplicity and has only one parameter, but it constrains the variance of the modelled variable to be equal to the mean. The Negative Binomial distribution has two parameters, encoding the mean and the dispersion, and hence allows modelling of more general mean-variance relationships. For RNA-seq, it has been suggested that the Poisson distribution is well suited for analysis of technical replicates, whereas the higher variability between biological replicates necessitates a distribution incorporating overdispersion, such as the Negative Binomial [28, 44, 45]. Analogous to microarray data analysis, borrowing variance information from other genes helps to better estimate the variation in read counts for a gene and condition. This overcomes the common problem of variance being underestimated when it is based on a low number of observations. The most commonly used parametric methods, including edgeR, DESeq, and baySeq, use the negative binomial distribution. Other methods such as Cuffdiff2 use a beta-negative binomial distribution, which is a combination of the beta and negative binomial distributions. Non-parametric methods like SAMSeq also work relatively well on count data.
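The mean-variance relationship can be made concrete with a small sketch. Under the commonly used Negative Binomial parameterisation the variance is mu + alpha * mu^2, where alpha is the dispersion, and alpha = 0 recovers the Poisson case. The function below is a per-gene method-of-moments estimate, shown for illustration only: it is an assumption of this sketch and not the estimator used by edgeR, DESeq, or baySeq, which share information across genes as described above.

```python
from statistics import mean, variance


def moment_dispersion(counts):
    """Method-of-moments dispersion for var = mu + alpha * mu**2.

    Returns 0.0 when the sample variance does not exceed the mean,
    i.e. when the replicates show no overdispersion relative to Poisson.
    """
    mu = mean(counts)
    if mu == 0:
        return 0.0
    var = variance(counts)  # unbiased sample variance across replicates
    return max(0.0, (var - mu) / mu ** 2)
```

For replicate counts of 50, 100, and 150 the sample variance (2500) far exceeds the mean (100), giving a dispersion of 0.24, whereas identical replicate counts give 0. With only a handful of replicates such per-gene estimates are very noisy, which is the motivation for borrowing information across genes.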

Identification of alternative splicing from transcriptomic reads

A widely recognised source of proteome diversity in eukaryotic species is the expression of multiple distinct mRNA transcripts from a single gene locus by alternative transcript initiation, alternative splicing [47] (Table 4), and alternative polyadenylation [48]. If RNA-Seq reads span exon junctions, parts of reads will map to two different exons. This allows inference of alternative splicing. However, such a read structure poses problems for standard aligners that map reads contiguously to the reference. Splice sites can be detected initially by identifying reads that span exon junctions. Split-read aligners such as TopHat, methods that identify a minimal match on either side of an exon junction, and genomic short-read nucleotide alignment are used to identify alternative splicing. Most methods utilise a database of expression and alternative expression sequence ‘features’. These and other strategies that perform de novo assemblies present a number of computational challenges because of computation time and the depth of sequencing, which results in few junction-spanning reads. The quantitation of alternatively spliced transcripts needs to be followed by identification of differential expression of these transcripts between samples.
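Once a split-read aligner has placed a junction-spanning read, the junction can be recovered from the alignment record. The sketch below assumes alignments in SAM format, where a skipped reference region (typically an intron) is encoded as an `N` operation in the CIGAR string; the function names and the tuple-based input are hypothetical, and strand and chromosome handling are omitted.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def junctions_from_cigar(start, cigar):
    """Return (first, last) 1-based reference coordinates of each skipped region (N)."""
    junctions = []
    pos = start  # 1-based leftmost aligned reference position
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            junctions.append((pos, pos + length - 1))
            pos += length
        elif op in "MD=X":  # operations that consume reference bases
            pos += length
        # I, S, H and P do not advance the reference position
    return junctions


def count_junction_support(alignments):
    """Count how many reads support each junction; alignments are (start, cigar) pairs."""
    support = Counter()
    for start, cigar in alignments:
        support.update(junctions_from_cigar(start, cigar))
    return support
```

A read aligned at position 100 with CIGAR `30M500N20M` covers reference positions 100 to 129, skips 130 to 629, and resumes at 630, so it supports the junction (130, 629).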
Table 4

Alternative splicing algorithm

| Method name | Year published | PMID | Data type | Platform | Statistical method | Input requirements |
|---|---|---|---|---|---|---|
| ABMapper | 2011 | 21169377 | RNA | C++, Perl | Fast suffix-array algorithm and a dual-seed strategy for spliced alignment | FASTA, FASTQ |
| ERANGE | 2008 | 18516045 | RNA | Python | Splice junctions identification relies on reference genome exon positions | FASTQ |
| GEM Mapper | 2012 | 23103880 | RNA | C, Objective Caml | Based on Burrows-Wheeler Transform and custom mapping algorithms. Uses custom mappability concept | FASTQ |
| MapSplice | 2010 | 20802226 | RNA | C++, Python | Algorithm not dependent on splice site features or intron length; consequently, it can detect novel canonical as well as non-canonical splices. This method has tag alignment phase and splice inference phase | FASTQ |
| PALMapper | 2010 | 21154708 | RNA | Web/Galaxy | Combines GenomeMapper (based on BWT and k-mer indexes) read mapper with the spliced aligner QPALMA | FASTQ |
| QPALMA | 2008 | 18689821 | RNA | C++, Python | SVM-based splice site predictor with the so-called ‘weighted degree’ kernel. Alignment based on extended BWT | FASTQ |
| SpliceMap | 2010 | 20371516 | RNA | C++, Python | Canonical GT-AG splice sites identification using half-read mapping | FASTQ |
| SplitSeek | 2010 | 20236510 | RNA | | Candidate junction reads generation in intermediate BEDPE format feasible for paired-end sequences | FASTQ |
| Subread | 2013 | 23558742 | RNA | R | Seed-and-vote: new multi-seed alignment strategy for overlapping seeds from each read (subreads) | FASTQ |
| TopHat | 2009 | 19289445 | RNA | C++ | Canonical GT–AG splice sites identification | FASTQ |
Apart from detection and quantitation of expression and alternative splicing, RNA-seq has the capacity to identify RNA editing events [49-51] and allele-specific expression (ASE) [52, 53], quantify noncoding RNAs [54, 55], and detect exogenous RNA [56, 57], single-nucleotide polymorphisms (SNPs), somatic mutations [8, 58], and structural variations.

Detection of somatic copy number alterations and structural variants

Genomic alterations accumulate in tumours during cancer development [59]. In addition to point-mutations, inversions, and translocations, somatic copy-number alterations (SCNAs) are ubiquitous in cancer [60, 61], and several recurrent SCNAs are associated with particular cancer types [62], tumour aggressiveness [63], and patient prognosis [64]. Reliable detection of SCNAs can lead to identification of cancer driver genes [65] and development of new therapeutic approaches [66-68]. Deep sequencing [69] and exome-based sequencing efforts are now replacing microarray-based array CGH (aCGH) [70] and single nucleotide polymorphism arrays (SNP arrays) [23, 71]. Moreover, as in microarray-based studies, inference about cancer-related copy number alterations requires comparison of paired samples: normal and tumour. The general workflow to detect SCNAs from sequencing data consists of three main steps: (i) raw copy-number inference in the local genome region by calculating either read counts or the depth of coverage ratio between tumour and control samples, (ii) segmentation of the raw copy-number profiles to find change-points in the raw copy-number signal and divide the chromosomes accordingly into segments with similar copy-numbers, and (iii) classification of the segments into gains or losses. The first step is essentially based on understanding variations in depth of coverage (DOC) of aligned sequence reads against the reference genome [69, 72, 73]. Deviation from the background in DOC may signify the presence of a copy number variation (CNV). The last two steps for obtaining copy number alterations are not specific to sequencing data. Multiple methods exist for identification of structural variations using whole genome sequencing. Methods include identification of atypical alignment patterns of sequence reads against the reference genome, which reflect gaps in the sequence alignment [74].
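The three-step SCNA workflow can be sketched in Python as follows. The sketch assumes read counts have already been tallied in matching genomic bins for the tumour and the paired normal sample. The pseudocount, the median centring, and the ±0.3 thresholds are illustrative assumptions, and merging adjacent bins with identical calls is a deliberately crude stand-in for the statistical change-point segmentation that real tools perform.

```python
import math
from statistics import median


def centred_log2_ratios(tumour_counts, normal_counts, pseudocount=0.5):
    """Step (i): per-bin log2 tumour/normal count ratio, centred on the median."""
    raw = [
        math.log2((t + pseudocount) / (n + pseudocount))
        for t, n in zip(tumour_counts, normal_counts)
    ]
    centre = median(raw)  # corrects for a global difference in sequencing depth
    return [r - centre for r in raw]


def call_bin(ratio, gain_threshold=0.3, loss_threshold=-0.3):
    """Step (iii): label a log2 ratio as gain, loss, or neutral."""
    if ratio >= gain_threshold:
        return "gain"
    if ratio <= loss_threshold:
        return "loss"
    return "neutral"


def segment(ratios):
    """Step (ii), simplified: merge runs of adjacent bins with the same call.

    Returns a list of (first_bin, last_bin, call) tuples.
    """
    segments = []
    for i, ratio in enumerate(ratios):
        label = call_bin(ratio)
        if segments and segments[-1][2] == label:
            first = segments[-1][0]
            segments[-1] = (first, i, label)
        else:
            segments.append((i, i, label))
    return segments
```

In this simplified version classification is applied per bin before merging; a production pipeline would segment the continuous signal first and then classify each segment.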
In paired-end read mapping, the sequenced ends of a short DNA fragment are aligned against the reference genome. The mean insert size of the fragment is compared with the reference genome distance between aligned fragment ends to deduce the presence of deletions or insertions [74-76]. This detection requires high alignment accuracy and underscores its importance. Although coding regions comprise only ∼1% of the genome, they are enriched for causal variation, making exome-based studies valuable, manageable, and cost-effective. Whole exome sequencing (WES) data have been used effectively for the identification of small INDELs, usually of a size < 50 bp, within exon targets that are typically sized between 200 and 300 bp. The approaches discussed above, while appropriate for (deeply sequenced) DNA-sequencing data, are less effective for exome sequencing and detecting CNV, as the CNV's breakpoints are likely to lie outside the targeted exons [77]. Detecting structural and copy number variations from RNA-Seq data presents similar challenges.
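The insert-size comparison used in paired-end mapping can be expressed as a simple rule. The sketch below assumes that both reads of a pair map to the same chromosome in the expected orientation and that the library's mean fragment size and standard deviation are known; the three-standard-deviation cut-off and the function name are illustrative assumptions rather than the criteria of any specific published method.

```python
def classify_pair(left_start, right_end, mean_insert, sd_insert, n_sd=3.0):
    """Classify a read pair by the reference distance spanned by its two ends.

    A span much larger than the library fragment size suggests that sequence
    present in the reference is missing from the sample (deletion); a span much
    smaller suggests extra sequence in the sample (insertion).
    """
    span = right_end - left_start + 1
    if span > mean_insert + n_sd * sd_insert:
        return "deletion"
    if span < mean_insert - n_sd * sd_insert:
        return "insertion"
    return "concordant"
```

In practice a single discordant pair is weak evidence; callers typically require clusters of pairs supporting the same event, which again depends on accurate alignment.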

Identification of cancer driver mutations and their functional impact

Cancer genomes carry abundant somatic mutations that accumulate over an individual's lifetime, only a fraction of which drive cancer progression. Mutations can be identified from DNA-seq, RNA-seq, and exome sequencing data [78-80] (Table 5). The most basic way of detecting somatic mutations from NGS reads is to identify mismatches/gaps in the alignment of the read with the reference. However, large datasets also contain sequencing errors, random mutations that occur during cell division, and single nucleotide polymorphisms that differ from the reference assembly. This makes identification of cancer driver mutations a challenging issue [81]. Moreover, intra-tumour heterogeneity also hinders the identification of all types of somatic mutations [82]. Several methods for detecting somatic mutations are currently in use, such as MuTect [83], Strelka [84], and VarScan 2 [85] for SNV detection or BIC-Seq [86], APOLLOH [87], CoNIFER [88], BreakDancer [89], and Meerkat [87] for CNA or SV detection. Most methods for somatic mutation detection take into account only some of the possible sources of error; therefore, running different methods simultaneously is advisable.
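The basic idea of separating somatic mismatches from germline polymorphisms and sequencing errors can be illustrated with a naive filter on allele counts at a single position. This sketch assumes a matched normal sample is available and that reference and alternative allele counts have already been extracted from the alignments; the depth and allele-fraction thresholds are arbitrary illustrative values, and the published callers named above use statistical models rather than fixed cut-offs.

```python
def is_candidate_somatic(tumour_alt, tumour_depth, normal_alt, normal_depth,
                         min_depth=10, min_tumour_vaf=0.1, max_normal_vaf=0.02):
    """Flag a position whose alternative allele is seen in tumour but not in normal."""
    # Too few reads in either sample: no reliable call can be made.
    if tumour_depth < min_depth or normal_depth < min_depth:
        return False
    tumour_vaf = tumour_alt / tumour_depth  # variant allele fraction in tumour
    normal_vaf = normal_alt / normal_depth  # variant allele fraction in normal
    # A variant also present in the normal sample is likely germline;
    # a very low tumour fraction is hard to distinguish from sequencing error.
    return tumour_vaf >= min_tumour_vaf and normal_vaf <= max_normal_vaf
```

Fixed thresholds such as these interact badly with intra-tumour heterogeneity, since subclonal mutations may fall below the tumour allele-fraction cut-off.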
Table 5

Methods for finding mutations

| Method name | Year published | PMID | Data type | Platform | Statistical method | Input requirements |
|---|---|---|---|---|---|---|
| ActiveDriver | 2013 | 23340843 | DNA-seq | R | Generalized linear regression | FASTA, TAB |
| CancerMutationAnalysis | 2014 | 24233780 | DNA-seq | R | Empirical Bayes | Non-standard tables |
| CanDrA | 2013 | 24205039 | DNA-seq | Perl | Mann-Whitney U, AUC (area under curve) | Non-standard tables |
| CanPredict | 2007 | 17537827 | DNA-seq | Web site | SIFT, Pfam-based logR.E-value metric, GOSS | FASTA |
| CAROL | 2012 | 22261837 | DNA-seq | R | SIFT, PolyPhen-2 | Tab-delimited, FASTA |
| CHASM/SNV-Box | 2009 | 19654296 | DNA-seq | Standalone | CHASM, SNV-Box | dbSNP rs#, PubMed ID, VCF, BED |
| CRAVAT | 2013 | 23325621 | DNA-seq | Web site | CHASM, SnvGet | Non-standard tables |
| DDIG-in | 2013 | 23497682 | DNA-seq | Web site | Support vector machine-based method | Non-standard tables |
| DMI | 2012 | 23044540 | DNA-seq | Standalone | Machine learning, discrimination index | Text file |
| DrGaP | 2013 | 23954162 | DNA-seq | Standalone | Chi-square distribution | Non-standard tables |
| e-Driver | 2014 | 25064568 | DNA-seq | Perl | Binomial distribution | Non-standard tables |
| eXtasy | 2013 | 24076761 | DNA-seq | Web site | Variant impact prediction, haploinsufficiency prediction, phenotype-specific gene prioritisation | VCF |
| FATHMM | 2013 | 23620363 | DNA-seq | Web site | Hidden Markov Model | Annotated VCF |
| InVEx | 2012 | 22817889 | DNA-seq | Python | Permutation-based | Non-standard tables, power FASTA |
| MuSIC | 2012 | 22759861 | DNA-seq | Standalone | Fisher p-value, likelihood ratio test, convolution test (summarised log statistic of joint binomial point probability) | BAM, SNV, MAF |
| MutSig | 2013 | 23770567 | DNA-seq | Standalone | MutSigCV (Background mutation rate) | |
| nsSNPAnalyzer | 2005 | 15980516 | DNA-seq | Web site | Machine learning (random forest) | FASTA, SNP |
| Oncodrive-fm | 2012 | 22904074 | DNA-seq | Standalone | SIFT, PolyPhen2, MutationAssessor | TDM, TSV |
| OncodriveCLUST | 2013 | 23884480 | DNA-seq | Python | Clustering | Non-standard tables |
| PANTHER | 2013 | 23193289 | DNA-seq | Web site | subPSEC | FASTA |
| PhD-SNP | 2006 | 16895930 | DNA-seq | Web site | Sequence and Profile-Based | FASTA |
| PROVEAN | 2012 | 23056405 | DNA-seq | Standalone | Alignment-Based | Non-standard tables |
| transFIC | 2012 | 23181723 | DNA-seq | Web site | SIFT, PolyPhen2, MutationAssessor | File w/ chromosome/protein coordinates (hg19) |
The most basic task for mutation analysis in cancer is the distinction between driver and passenger mutations. To help filter a subset of driver mutations from the long list of detected somatic mutations, including passengers, three major computational predictive approaches utilising different statistical tests can be applied [90-92].
(1) Identification of recurrent somatic mutations is based on the idea of clonal evolution of tumour cell populations. To predict genes with recurrent single mutations in a cohort of cancer patients, several statistical methodologies including MutSigCV [3], MuSiC [93], and DrGaP [94] are available. These methods compare the observed number of mutations in a gene with the number expected under the background mutation rate (BMR), i.e. the probability of observing a passenger mutation, across a cohort of patients. As opposed to mutations, there is no accurate model established to identify genes with recurrent copy number aberrations (CNAs); therefore, methods are based on a non-parametric approach, e.g. GISTIC2 [95], CMDS [96], and ADMIRE [97].
(2) Prediction of the functional impact of individual mutations is based on the utilisation of additional information about protein sequence and/or structure and evolutionary conservation of the protein encoded by the mutated gene. Methods like SIFT [98], PolyPhen-2 [99], and MutationAssessor [100] predict the functional impact (deleteriousness) of missense mutations. CHASM utilises random forest classification to identify driver and passenger somatic missense mutations, based on a training set of labelled positive (driver) and negative (passenger) examples [101]. Furthermore, to identify clusters of non-synonymous mutations across patients, typically to detect ‘activating’ mutations, NMC [102] and the InVEx [103] method can be applied. Moreover, the iPAC method is able to search for clusters of mutations, but in the context of crystal structures of proteins [104].
(3) Identification of recurrent combinations of mutations is based on assessment of combinations of mutations enriched in known pathways (e.g. GSEA [105], Path-Scan [106], Patient-oriented gene sets [107]), interaction networks (NetBox [108], HotNet [109], MEMo [110]), or de novo defined sets (Dendrix [109], Multi-Dendrix [111] or RME [112]), enabling the discovery of novel combinations of mutated genes in cancer.
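The recurrence test in approach (1) can be illustrated with the simplest possible background model: every base of a gene in every patient mutates independently with the same background mutation rate, so the number of mutations follows a binomial distribution. This uniform-BMR assumption and the function names are choices made for this sketch; MutSigCV, MuSiC, and DrGaP use considerably richer background models.

```python
import math


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p) with 0 < p < 1, summed in log space."""
    if k <= 0:
        return 1.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k, n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1)
                    - math.lgamma(n - i + 1) + i * log_p + (n - i) * log_q)
        term = math.exp(log_term)
        total += term
        # Past the mean the terms only shrink; stop once they no longer matter.
        if i > n * p and term < total * 1e-16:
            break
    return total


def gene_recurrence_pvalue(observed_mutations, gene_length_bp, n_patients, bmr_per_base):
    """Probability of seeing at least the observed mutation count by chance alone."""
    trials = gene_length_bp * n_patients  # every base in every patient is one trial
    return binomial_upper_tail(observed_mutations, trials, bmr_per_base)
```

For a 1500 bp gene in 100 patients with a BMR of 1e-6 per base, about 0.15 passenger mutations are expected, so a single mutation is unremarkable (p ≈ 0.14) while ten mutations would be extremely unlikely under the background model. Across many genes, such p-values would additionally need correction for multiple testing.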

Identification of gene fusions

Gene fusions appear as a result of chromosomal rearrangements, such as deletion, insertion, inversion, or translocation. A fused gene is expressed as a hybrid entity encoding sequences of two distinct genes. Tumorigenic fusions successfully evade the gene regulation to which their constituents are subjected. Multiple cancer-related gene fusions have been identified, including prototypic BCR-ABL [113], EML4-ALK [114], TMPRSS2-ERG [115], KIF5B-RET [116], and others [7, 13, 14, 117]. Such alterations may serve as good cancer biomarkers or therapeutic targets [120, 121]. Whole genome [75, 119] and transcriptome sequencing [120, 121] profiles can be used to identify fusions. Transcriptome sequencing has proven superior to whole genome sequencing (WGS) for this purpose and is therefore more commonly used. This is because RNA-seq covers only transcribed sequences, which constitute a small percentage of the genome, thus reducing the cost, time, and resources needed for full analysis. Furthermore, RNA-seq provides information on the transcriptionally active fusion genes and their splicing variants. However, it also has certain limitations, including a lack of information regarding non-transcribed regions and dependence on the heterogeneity in expression levels between various cell types [122]. Over the last few years several software packages (Table 6) have been developed for the detection of gene fusions and/or structural variants (SV) that cause gene fusions. The majority of the software utilises RNA-seq data as an input. However, other tools use WGS data or both to increase the likelihood of detecting true fusions.
Table 6

Identifying gene fusions

| Method name | Year published | PMID | Data type | Platform | Description | Input requirements |
|---|---|---|---|---|---|---|
| BreakFusion | 2012 | 22563071 | RNA-seq | C++, Perl | A computational pipeline for identifying gene fusions from RNA-seq data | BAM |
| BreakTrans | 2013 | 23972288 | RNA-seq | Perl | Uncovering the genomic architecture of gene fusions | Tab-delimited text files |
| chimerascan | 2011 | 21840877 | RNA-seq | Python | Identifying chimeric transcription in sequencing data | FASTQ |
| comrad | 2011 | 21478487 | RNA-seq, DNA-seq | C++ | Discovery of gene fusions using paired end RNA-Seq and WGSS | FASTQ |
| deFuse | 2011 | 21625565 | RNA-seq | C++ | Detecting gene fusions from paired-end RNA-seq | FASTQ |
| FusionAnalyser | 2012 | 22570408 | RNA-seq | C# | Detecting gene fusions from paired-end RNA-Seq data | SAM/BAM |
| FusionHunter | 2011 | 21546395 | RNA-seq | Perl | Detecting gene fusions from paired-end RNA-Seq data | FASTQ |
| FusionMap | 2011 | 21593131 | RNA-seq, DNA-seq | C# | Detecting gene fusions from single- and paired-end RNA-Seq and DNA-seq data | FASTQ/BAM with unmapped reads |
| FusionSeq | 2010 | 20964841 | RNA-seq | C | A modular framework for finding gene fusions by analysing Paired-End RNA-Sequencing data | MRF, SAM |
| ShortFuse | 2011 | 21330288 | RNA-seq | C++, Python | Detecting gene fusions from paired-end RNA-Seq data | FASTQ |
| SnowShoes-FTD | 2011 | 21622959 | RNA-seq | Perl | Detecting gene fusions from paired-end RNA-Seq data | FASTQ |
| SOAPfusion | 2013 | 24123671 | RNA-seq | Perl | Detecting gene fusions from paired-end RNA-Seq data | FASTQ |
| TopHat-Fusion | 2011 | 21835007 | RNA-seq | C++ | Detecting gene fusions from single- and paired-end RNA-Seq data | FASTQ |

The most common analysis steps for identifying gene fusions are as follows: (i) alignment and filtering, (ii) detection of fusion junctions in candidate genes, and (iii) fragment assembly and selection of putative fusions [122]. In addition to being mapped to the current reference genome, RNA-seq reads are mapped to annotated transcriptome libraries (e.g. RefSeq). The most commonly used mapping tool is Bowtie, owing to its speed and efficiency. Reads that map concordantly to the reference genome are excluded from further analysis. The unmapped or discordantly mapped reads are fusion candidates, which may be passed through additional filters (e.g. ribosomal filter, repetitive region filter, short distance filter) to eliminate potential false positives. Next, the reads remaining after filtration are divided into smaller fragments of equal or pre-defined length (so-called “split reads”), and both terminal parts are aligned independently to the reference genome. If they map to two different genes, they are passed on to fusion junction detection. The sequences of both genes are joined at the fusion boundary, and the whole read is re-aligned to the candidate fusion gene to call the supporting reads that are essential for the final selection of the fusion.

An alternative approach to detecting the fusion junction is to group discordantly mapped read pairs (“spanning reads”) that share the same breakpoints. The junction detected from each group is used to predict the putative fusion transcript. Subsequently, the reads are re-aligned to the predicted sequences, and the predictions with the highest mapping scores are used to identify candidate fusion genes. The final selection of fusion genes depends on several parameters, including the number of supporting reads, the quality of the alignment, and the sequencing coverage [122, 123].
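
The grouping of discordantly mapped read pairs described above can be illustrated with a minimal sketch. It does not reproduce any tool in Table 6: real fusion callers operate on alignment files and apply many additional filters. The gene coordinates, the read-pair tuples, and the `min_support` threshold below are hypothetical values chosen for illustration only.

```python
from collections import Counter

# Hypothetical toy annotation: gene name -> (chromosome, start, end).
GENES = {
    "BCR": ("chr22", 100, 200),
    "ABL1": ("chr9", 500, 700),
    "GAPDH": ("chr12", 50, 90),
}


def gene_at(chrom, pos, genes=GENES):
    """Return the name of the gene overlapping a position, or None."""
    for name, (g_chrom, start, end) in genes.items():
        if chrom == g_chrom and start <= pos <= end:
            return name
    return None


def candidate_fusions(read_pairs, min_support=2):
    """Count read pairs whose mates map to two different annotated genes.

    read_pairs is an iterable of ((chrom1, pos1), (chrom2, pos2)) tuples.
    Gene pairs supported by fewer than min_support read pairs are dropped.
    """
    support = Counter()
    for (chrom1, pos1), (chrom2, pos2) in read_pairs:
        g1, g2 = gene_at(chrom1, pos1), gene_at(chrom2, pos2)
        if g1 is None or g2 is None or g1 == g2:
            continue  # concordant pair or mate outside the annotation
        support[tuple(sorted((g1, g2)))] += 1
    return {pair: n for pair, n in support.items() if n >= min_support}


# Hypothetical read pairs: three span BCR and ABL1, one spans GAPDH and
# ABL1 (likely noise), and two are concordant within a single gene.
pairs = [
    (("chr22", 150), ("chr9", 510)),
    (("chr22", 160), ("chr9", 520)),
    (("chr9", 530), ("chr22", 170)),
    (("chr12", 60), ("chr9", 600)),
    (("chr22", 110), ("chr22", 190)),
    (("chr9", 505), ("chr9", 650)),
]
fusions = candidate_fusions(pairs, min_support=2)
print(fusions)
```

Requiring a minimum number of supporting pairs is the simplest form of the final selection step; alignment quality and coverage would also be weighed in a real pipeline.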

Estimating sample purity

Most genomics and expression-profiling studies, including TCGA, use a mixture of different clonal populations of tumour cells, which is often contaminated with stromal and immune cells. Indeed, many common tumours, such as pancreatic tumours, are intensively infiltrated by stroma [124], making it difficult to obtain homogeneous material for genomic studies. Furthermore, endothelial cells are also often found in tumour samples, as they line the interior surface of the blood vessels that supply nutrients to cancer cells. Methods like laser capture micro-dissection are rarely used in RNA studies, which require stable material [125]. Estimating the purity and clonality of a tumour sample containing a mixed population of cells requires accurate measurement of the proportions of tumour and stromal cells. Over the years several different methods (Table 7) have been developed to deconvolute genomic and transcriptomic data obtained from mixed-cell populations. Software packages based on these methods provide powerful tools for estimating tumour heterogeneity and purity and, in consequence, for identifying likely early driver events during tumorigenesis [126].
Table 7

Available software for purity estimation

| Method name | Year published | PMID | Data type (technique) | Platform | Statistical method | Input requirements |
|---|---|---|---|---|---|---|
| Dsection | 2010 | 20631160 | RNA (Microarray) | Web-based and MATLAB | Bayesian model | Expression and proportion data required |
| csSAM | 2010 | 20208531 | RNA (Microarray) | - | Linear regression-based model | Expression profile of mixed tissue samples |
| mixture_estimation.R | 2010 | 20202973 | RNA (Microarray) | R based | Variation of electronic subtraction method | Expression profile of mixed tissue samples |
| ASCAT | 2010 | 20837533 | DNA (Microarray) | R based | Analytical optimisation method | SNP array data with Log R and B-Allele frequency information |
| PERT | 2012 | 23284283 | RNA (Microarray) | Octave | Perturbation model | Expression data from mixed cell type and expression profile of each homogeneous cell type |
| ABSOLUTE | 2012 | 22544022 | DNA (Microarray and HTS) | R based | Gaussian mixture model | Copy number data in segmentation file |
| JointSNVMix1, JointSNVMix2 | 2012 | 22285562 | DNA (HTS) | Python | Probabilistic graphical model | Sequence data from tumour/normal pairs |
| CNAnorm | 2012 | 22039209 | DNA (HTS) | R based | Analytical optimisation method | Sequencing data of tumour and normal samples in BAM format |
| DeconRNASeq | 2013 | 23428642 | RNA (Microarray and HTS) | R based | Globally optimised nonnegative decomposition algorithm | Expression data from multiple tissues, signature of individual tissues, and proportion data required |
| TEMT | 2013 | 23735186 | RNA (HTS) | Python | Probabilistic model including position and sequence-specific biases | RNA-seq data from pure tissue and mixed tissue |
| ESTIMATE | 2013 | 24113773 | RNA (HTS) | R based | Gene signature (ssGSEA) based model | Expression data in Gene Set Enrichment Analysis (GSEA) gct format |
| THetA | 2013 | 23895164 | DNA (HTS) | Python | Explicit probabilistic model | Copy number data in interval count file format |
| ExPANdS | 2013 | 24177718 | DNA (HTS) | R based and MATLAB | Probability distributions model | Somatic mutations and copy number data required |
| Virmid | 2013 | 23987214 | DNA (HTS) | Java based | Probabilistic model and maximum likelihood estimator | Disease and normal sequencing data in BAM format |
| MuTect | 2013 | 23396013 | DNA (HTS) | Java based | Bayesian model | Tumour and normal sequencing data |
| TrAp | 2013 | 23892400 | DNA (HTS) | Java based | Linear mixture model with evolutionary framework | Tumour karyotypes and somatic hypermutation datasets |
| Seo et al. | 2013 | 23650637 | RNA (Microarray) | | Linear mixture model | Disease-associated variants and expression of heterogeneous normal tissue |
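
Several of the methods in Table 7 rest on some form of mixture model. As a rough illustration of the underlying idea, and not of any specific package, the sketch below assumes a two-component linear mixture in which the observed expression of each gene is purity × tumour reference + (1 − purity) × stroma reference, and that reference profiles for the pure components are available. The function name and all expression values are hypothetical.

```python
def estimate_purity(mixed, tumour_ref, stroma_ref):
    """Least-squares estimate of the tumour fraction in a mixed sample.

    Assumes mixed[g] ~= p * tumour_ref[g] + (1 - p) * stroma_ref[g] for
    every gene g, and solves for the single unknown p in closed form.
    The estimate is clipped to the interval [0, 1].
    """
    if not (len(mixed) == len(tumour_ref) == len(stroma_ref)):
        raise ValueError("all profiles must cover the same genes")
    numerator = sum(
        (m - s) * (t - s) for m, t, s in zip(mixed, tumour_ref, stroma_ref)
    )
    denominator = sum((t - s) ** 2 for t, s in zip(tumour_ref, stroma_ref))
    if denominator == 0:
        raise ValueError("reference profiles are identical")
    return min(1.0, max(0.0, numerator / denominator))


# Hypothetical reference profiles for four genes, and a sample that is
# simulated as 70% tumour and 30% stroma.
tumour_profile = [10.0, 2.0, 8.0, 1.0]
stroma_profile = [1.0, 9.0, 2.0, 7.0]
mixed_profile = [
    0.7 * t + 0.3 * s for t, s in zip(tumour_profile, stroma_profile)
]

purity = estimate_purity(mixed_profile, tumour_profile, stroma_profile)
print(f"estimated purity: {purity:.2f}")
```

Published methods differ mainly in how they relax these assumptions, for example by estimating the reference profiles themselves, by modelling more than two cell populations, or by working with copy number and allele frequencies rather than expression.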

Comparative and integrative analysis of tumour samples

One of the major achievements of the TCGA project is the generation of different types of data from the same sample for a large number of tumours. This data generation is followed by uniform data processing and correlation with clinical information by Firehose and other analysis pipelines at various genome data analysis centres. The availability of such paired data allows detection of the functional impact of genomic lesions (e.g. mutations, copy number changes, and gene fusions) on gene/miRNA expression (using RNA-Seq), protein expression (using RPPA arrays), and pathway activity, while reducing errors due to individual patient variation. Another example is the use of multiple data types to identify integrated subtypes for a given tumour type with the iCLUSTER method [13, 14, 127]. Other approaches include integration of pathway information (e.g. PARADIGM and Paradigm-shift) and regulatory network information (e.g. GEMINI [128]). Comparative analysis of multiple tumour types increases the statistical power to detect common events that drive tumorigenesis and helps to repurpose therapies. For example, ERBB2 (HER2) is mutated and/or amplified in subsets of glioblastoma, gastric, serous endometrial, bladder, and lung cancers. The result, at least in some cases, is responsiveness to HER2-targeted therapy, analogous to that previously observed for HER2-amplified breast cancer. Further examples underscore the importance of such comparative analysis [4].
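
One of the simplest ways to relate paired data types from the same samples is to correlate a genomic measurement with an expression measurement, gene by gene. The sketch below computes a Pearson correlation between copy number and expression for a single gene across samples. It is only an illustration of the principle and is far simpler than iCLUSTER, PARADIGM, or the other integrative methods named above. The function name and the sample values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical values for one gene in five paired tumour samples:
# copy number from DNA sequencing, relative expression from RNA-Seq.
copy_number = [2, 2, 3, 4, 6]
expression = [1.0, 1.1, 1.6, 2.1, 3.0]

r = pearson(copy_number, expression)
print(f"copy number vs expression: r = {r:.3f}")
```

A strong positive correlation for a gene suggests that its copy number changes have a functional effect on expression, which is one criterion for prioritising amplified or deleted genes.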

Future of cancer profile analysis

As we enter the era of the $1000 genome, tumour profiles are being sequenced routinely. Moreover, tumour catalogues and pre-clinical models [129, 130] have similar types of information available, with or without drug treatments. Integration of such datasets can speed up pre-clinical drug development and the repurposing of available drugs. Tumour profiling by sequencing is also expected to enter both the pre-clinical and clinical settings for standardised testing as well as for the personalisation of medicine. However, sequencing data fit the definition of “big data”, and a reliable computational infrastructure for storage, processing, analysis, and visualisation [131, 132] is required to make the most of this avalanche of information [133]. Indeed, ambitious efforts such as the cancer moonshot program and APOLLO, launched by the UT MD Anderson Cancer Center, aim to combine big data warehousing with IBM Watson-based cognitive and adaptive learning to reduce cancer mortality for several tumour types, and are expected to help realise the full power of tumour profiling.