Literature DB >> 32164528

Accurately estimating the length distributions of genomic micro-satellites by tumor purity deconvolution.

Yixuan Wang1,2, Xuanping Zhang1,2, Xiao Xiao3, Fei-Ran Zhang4, Xinxing Yan1,2, Xuan Feng1,2, Zhongmeng Zhao1,2, Yanfang Guan1,2,5, Jiayin Wang6,7.   

Abstract

BACKGROUND: Genomic micro-satellites are the genomic regions that consist of short and repetitive DNA motifs. Estimating the length distribution and state of a micro-satellite region is an important computational step in cancer sequencing data pipelines, which is suggested to facilitate the downstream analysis and clinical decision supporting. Although several state-of-the-art approaches have been proposed to identify micro-satellite instability (MSI) events, they are limited in dealing with regions longer than one read length. Moreover, based on our best knowledge, all of these approaches imply a hypothesis that the tumor purity of the sequenced samples is sufficiently high, which is inconsistent with the reality, leading the inferred length distribution to dilute the data signal and introducing the false positive errors.
RESULTS: In this article, we proposed a computational approach, named ELMSI, which detected MSI events based on the next generation sequencing technology. ELMSI can estimate the specific length distributions and states of micro-satellite regions from a mixed tumor sample paired with a control one. It first estimated the purity of the tumor sample based on the read counts of the filtered SNVs loci. Then, the algorithm identified the length distributions and the states of short micro-satellites by adding the Maximum Likelihood Estimation (MLE) step to the existing algorithm. After that, ELMSI continued to infer the length distributions of long micro-satellites by incorporating a simplified Expectation Maximization (EM) algorithm with central limit theorem, and then used statistical tests to output the states of these micro-satellites. Based on our experimental results, ELMSI was able to handle micro-satellites with lengths ranging from shorter than one read length to 10kbps.
CONCLUSIONS: To verify the reliability of our algorithm, we first compared the ability of classifying the shorter micro-satellites from the mixed samples with the existing algorithm MSIsensor. Meanwhile, we varied the number of micro-satellite regions, the read length and the sequencing coverage to separately test the performance of ELMSI on estimating the longer ones from the mixed samples. ELMSI performed well on mixed samples, and thus ELMSI was of great value for improving the recognition effect of micro-satellite regions and supporting clinical decision supporting. The source codes have been uploaded and maintained at https://github.com/YixuanWang1120/ELMSI for academic use only.

Entities:  

Keywords:  Cancer genomics; Computational pipeline; Genomic micro-satellite; Length distribution estimation; Sequencing data analysis; Tumor purity

Mesh:

Year:  2020        PMID: 32164528      PMCID: PMC7069170          DOI: 10.1186/s12859-020-3349-5

Source DB:  PubMed          Journal:  BMC Bioinformatics        ISSN: 1471-2105            Impact factor:   3.169


Background

Micro-satellites are repetitive DNA sequences that consist of specific oligonucleotide units [1, 2], exposing intrinsic polymorphisms in terms of the length, which are often described as length distributions [3]. A distinct event known as micro-satellite instability (MSI) refers to a pattern of hypermutation caused by defects in the mismatch repair system [4], characterized by widespread length polymorphisms of micro-satellites repeats, as well as by elevated frequency of single-nucleotide variants (SNVs) [3, 5]. MSI happens if the length distributions of the same micro-satellite region differ significantly between different tissue samples, such as a tumor sample and a normal sample, otherwise the micro-satellite stability (MSS) event exists. Up to 15% – 20% of sporadic cases of colorectal cancer exhibit MSI events [6, 7], while 12% of advanced prostate cancer cases have MSI events [8]. Some recent studies have surveyed the MSI landscape across a range of cancer types [9-11], and imply that these regions have important clinical implications for cancer diagnostics and patient prognosis [12, 13]. For example, MSI positive colorectal tumors respond well to PD-1 blocade [14]. Due to these clinical utility, the detection of MSI events has become increasingly important. Owing to the increasing prevalence of the next generation sequencing (NGS) technologies, several computational tools for MSI diagnosis utilizing NGS data were developed, replacing the traditional fluorescent multiplexed PCR-based methods, which are time-consuming and costly. These algorithms includes MSIsensor [15], mSINGS [16], MANTIS [17], MSIseq [18], MSIpred [19], and MIRMMR [20]. Based on our best knowledge, these algorithms may be roughly divided into two categories: the read-count distribution based ones and mutation burden based ones. MSIsensor is among the first algorithms for analyzing cancer sequencing data, calculating the length distributions of each micro-satellite in paired tumor-normal sequence data and implementing a statistical test to identify significantly altered events between these paired distributions. mSINGS works based on target-gene captured sequencing data, allowing for the comparisons among the numbers of signals that reflect the repetitive micro-satellite tracts by differing lengths from tumor and control samples. mSINGS is computationally complex, and is thus only suitable for small panels. MANTIS analyzes MSI of a normal-tumor sample pair as an aggregate of loci instead of analyzing the differences of individual loci. By pooling the scores of all the loci and focusing on the average score, the impacts that sequencing errors or poorly performing loci may have on the results can be reduced. Meanwhile, MSIseq, MIRMMR and MSIpred utilize machine learning algorithms to predict MSI status. MSIseq compares the length distributions using four machine learning frameworks: logistic regression, decision tree, random forest and naive Bayes approach. It is a classifier that only reports MSI-H vs. non-MSI-H, without a score or percentage, or information about the instability of particular loci. MIRMMR builds a logistic regression classifier that considers both the methylation and mutation information of the genes belonging to MMR system. MSIpred adopts a support vector machine (SVM) to compute 22 features characterizing the tumor mutational load from mutation data in mutation annotation format (MAF) generated from paired tumor-normal exome sequencing data, and then use these features to predict tumor MSI status in the SVM. The classifier was trained by the MAF data of 1074 samples belonging to four types. But none of these approaches is able to overcome the one-read-length limitation. Since the detector can no longer squeeze the micro-satellites by partially mapping reads, the algorithms cannot locally anchor the micro-satellite by using paired-end reads. To this end, ELMSI has been proposed to break through this one-read-length limitation. Of note, all of these existing algorithms generally imply a hypothesis that the tumor purity of the input sequenced samples is sufficiently high, where the purity refers to the proportion of tumor cells in the mixed sample, which varies widely among different samples and cancer types. But in practice, the sample purity is not as high as expected. Due to the growth pattern of tumor tissues and clinical sampling method, the tumor sample sequenced is actually a mixture that contains non-cancerous cells [21]. The presence of non-cancerous cells can influence the judgment of micro-satellite state. Ignoring the tumor purity, the micro-satellite length distributions and states may be inaccurate. For a micro-satellite region from a mixed tumor sample, different tissues may carry different length distributions, while the observed “distribution” from the sequencing data is actually a convolution of the distribution in tumor cells with that in normal cells. If we first established an assumption that the input sample is sufficiently pure, which means we have already assumed that there is only one distribution existing in the mathematical model, then we cannot fit the actual two distributions at all (See Fig. 1). Meanwhile, even if we can use a software to estimate the tumor purity p in advance, we cannot directly solve the deconvolution problem. In order to recognize the actual length distribution of the tumor micro-satellite from a given mixed sample, we must calculate the parameter values of the distributions accurately. Furthermore, since the existing algorithms mainly use statistical tests to detect MSI, even if the sample is pure enough, the convolutional distribution inferred based on a set of mixed data containing the normal tissue micro-satellite length data, which will dilute the data signal and may mislead the statistical tests to report a MSS event, introducing type-I error finally. Existing tumor purity estimation algorithms, such as EMpurity [22], can accurately identify the proportion of normal cells and tumor cells in sequencing samples respectively, which is helpful for us to further correct the length distributions according to the estimated purity.
Fig. 1

Micro-satellite length distributions in the mixed sample

Micro-satellite length distributions in the mixed sample Motivated by this, in this article, we proposed a novel algorithm termed ELMSI that offers a new approach to identify the state and length distributions of the microsatellite from a given mixed sample. First, we established a more realistic hypothesis that the sequencing sample is a normal-tumor mixed sample, where the micro-satellite lengths are subject to two different distributions. Secondly, we used the purity estimation algorithms to accelerate the deconvolution process for calculating the respective distribution parameters. Finally, our algorithm was suitable for both short and long MSI detection. To test the performance of ELMSI, a series of simulation experiments were conducted. Because mSINGS is only used for small panels and MSIseq targets the sequencing at smaller regions of interest, while ELMSI instead focuses on longer micro-satellite and larger panel, these algorithms were not selected for comparisons. The experimental results herein were compared with MSIsensor. The results demonstrated that ELMSI can accurately identify the state of micro-satellite and infer the length distributions of it from a given mixed normal-tumor sample. Our algorithm outperformed MSIsensor based on multiple indicators, maintaining satisfactory accuracy even when coverage decreases at the same time.

Methods

Computational pipeline

Suppose that we are given a series of mapped files in BinAry Map (BAM) format generated from a normal-tumor mixed sample, and the outputs of the proposed algorithm include both the length distributions and the state of each micro-satellite. The proposed approach, ELMSI, consists of three components. The first component is estimating the tumor purity of the given sequenced sample by calculating the read counts of the filtered SNVs. Based on the estimated purity, the second component identifies the length distributions and the state of the shorter micro-satellites from the mixed sample by adding the Maximum Likelihood Estimation (MLE) step to the existing algorithm MSIsensor [15]. The third component infers the length distributions of the longer micro-satellites by combining a simplified Expectation Maximization (EM) algorithm with central limit theorem, and then uses statistical tests to output the states of them. Here, a model of micro-satellite evolution which has been well recognized in recent years holds that the distribution of micro-satellite length is a balance between length mutations and point mutations [23, 24]. Length mutations, the rate of which increases with increasing repeat counts, favor loci to attain arbitrarily high values, whereas point mutations break long repeat arrays into smaller units. Therefore, we make the same assumption [25] that the length distribution approximates a normal distribution. We have made two assumptions on the established computational model: The input sequenced sample is not pure, containing micro-satellites of two types (normal cells and tumor cells) represented by two kinds of length distributions. The length distribution of a micro-satellite approximates a normal distribution. Before building the model, we need to process the input data. We have the Binary Map format (BAM) files of whole-exome sequencing (WES) data mapped to reference genome by bwa [26] as our initial input data. Then, we define the following important terms on the aligned reads. MS-pair: Two paired reads, one of which is perfectly mapped while the other spans a breakpoint. SB-read: A read which is across the breakpoint in an MS-pair. PSset: A collection of the binary group consisting of initial positions and sequences of the SB-reads, which is represented by (POS, SEQ). Sk-mer: The sequence consisting of the first k bases. We first find all the micro-satellite candidate regions by scanning the given reference genome, recording micro-satellites of maximum repeat unit length 6bp and saving the location and the corresponding sequences of each site. Then, we use a clustering algorithm to find the remanent micro-satellite candidate regions which may be ignored by the initial scanning. This algorithm clusters are based on the distances among the initial mapping positions of the reads across each breakpoint. The number of clusters represents the number of micro-satellite regions. We set L as the longest length of micro-satellites. The lengths of micro-satellites are generally less than 50kbps [27]. Thus, L is set to be 50kbps. ELMSI estimates the number of micro-satellites using a clustering algorithm according to the distances of the initial positions of the SB-reads. The clustering strategy is as follows: According to the mapping results from the PSset, two SB-reads will belong to the same cluster only if the distance between their initial positions is less than L. Each cluster then represents a candidate micro-satellite region, providing the number of micro-satellites. Once the number of micro-satellites is determined, for each candidate micro-satellite region, ELMSI uses a k-mer based algorithm to split each read. As the repeat units that compose micro-satellites are usually less than 6 bps, we set k=6 as a default. Starting from the first base of the read sequence, the algorithm detects whether two k-mer sequences are identical replicates. This sequence is a candidate repeat unit, and the first base of the sequence is a candidate breakpoint of the micro-satellite. The same operation is conducted for all reads in the micro-satellite region and other candidate areas, taking the mode of the repeat units and breakpoints as the final results.

Estimating the tumor purity of the sample

First, we introduce a tumor purity estimation algorithm. Due to the limitation of current sequencing technologies, the purity problem is almost inevitable during the actual sampling process, so many algorithms are proposed to solve this problem. Among them, EMpurity [22] has established a probability model to accurately estimate the tumor cell proportion in the mixed sample. The observed indicators are the numbers of reads supporting the reference allele and mutation at each site, respectively, while the unknown hidden states include the tumor purity and the joint genotype. EMpurity designs a probabilistic model to describe the emission probabilities from the hidden states to the observed indicators and the transition probabilities among the hidden states. This model is solved by an Expectation Maximization algorithm. EMpurity uses the pair-sampled DNA sequencing data as the model input data, and only considers the heterozygous sites with somatic mutations. For one sample in the pair, the set of possible genotype values at each loci is G={AA,AB,BB}. Let N,T and T represent the normal sample, virtual pure tumor sample and mixed tumor sample, respectively. Here, the virtual pure tumor sample T is actually part of T. Then, for the paired samples, the set of possible combined genotype values is a Cartesian product, which is G×G={(G,G):G,G∈G}. For any site i, let and denote the number of reads supporting the reference allele in the normal sample and mixed tumor sample, respectively, each of which follows a binomial distribution with parameters μ and . There are only 9 possible joint genotypes, which follow a polynomial distribution with parameter μ. Considering the bias on read depth, we assume that tumor purity follows a normal distribution across all of the given sites, whose parameters are μ and λ. Let and . Let be the number of reads supporting the mutation in x. Let represent the read depth in x. For x∈{N,T}, these values are observed. And then, the estimation of tumor purity is . Let denote the random variable representing the joint genotype . Let 𝜗 represent the set of unknown parameters, which is . Suppose that satisfies and . This model is solved by an Expectation Maximization algorithm, where the established likelihood function is: The specific EM iterative process can be referred to EMpurity [22].

Estimating the length distribution parameters of the short micro-satellite

For the shorter (shorter than one-read-length) micro-satellites, the existing algorithms, such as MSIsensor [15], can accurately calculate the specific length data and estimate the state of them. However, when the sequenced sample is a normal-tumor mixture, the calculated micro-satellite lengths actually contain both the normal micro-satellite lengths and the tumor micro-satellite lengths, and the state estimated directly is inaccurate. Thus, given a mixed sample with known proportions (normal cells account for (1−p), tumor cells account for p) and a micro-satellite region belonging to this sample, MSIsensor can detect this micro-satellite region, obtaining a set of the lengths L={l1,l2,...,l} as a result. L is actually a length data set sampled randomly from two samples which are independent of each other and subject to two different normal distribution models. According to the law of large numbers, the data in L have a probability of (1−p) to be the length of a micro-satellite from normal cells, and the probability of p to be that from tumor cells. Given a micro-satellite region, we assume that its length follows a normal distribution when it belongs to normal cells, while the length of it follows a normal distribution when it belongs to tumor cells. Therefore, the length of this micro-satellite in the mixed sample follows a probability distribution with the density function f=(1−p)f1+pf2, where f1 and f2 is the density function of N1 and N2 respectively, while L={l1,l2,...,l} is the set of lengths obtained from this mixed micro-satellite sample independently. We can get the values of μ1,σ1 by separately detecting normal samples (such as blood samples). Under these known conditions, we can use the Maximum Likelihood Estimation (MLE) step to estimate the values of μ2,σ2. From the above, the likelihood function is the joint probability density function of the lengths: The likelihood function actually reflects the probability of generating these length values in L. The parameter values in the likelihood function which can maximize this probability are the estimated values we need to calculate: By this, the estimated values can be obtained. Thus, the length distributions of shorter micro-satellites from a given mixed sample can be recognized, and then we perform a z-test to assess the micro-satellite state.

Estimating the length distribution parameters of the long micro-satellite

On the other hand, for the longer micro-satellites, reads cannot locate them, so we cannot pinpoint their specific lengths. Thus, we use the length distribution to characterize them. Given a mixed sample of normal-tumor cells, we set the proportion of tumor cells as p to facilitate the computation. In this paper, we only consider the following two scenarios (See Fig. 2).
Fig. 2

The patterns of sequencing reads from a micro-satellite region sampled from a mixed sample. a short MS region. b long MS region

The patterns of sequencing reads from a micro-satellite region sampled from a mixed sample. a short MS region. b long MS region Similarly, we have known that the micro-satellite lengths in (1−p) normal cells follow a normal distribution , while the micro-satellite lengths in p pure tumor cells follow an another normal distribution . And, normal distribution parameters of N1 can be estimated by detecting normal tissue cells alone. According to central limit theorem, the average of the samples is roughly equal to the average of the population. Whatever the distribution of the population is (mean is μ, variation is σ2), when the sampling times reach a certain condition (>30), the means of the samples (sample size n) sampled from it will surround the mean of the population and be normally distributed (mean is μ, variation is σ2/n). Due to the specific lengths of longer micro-satellite cannot be assessed by the existing technology, we can use the distribution of the mean length of them to reflect the overall length distribution. Our approach supposes that the length of a micro-satellite is normally distributed. Therefore, ELMSI considers a continuous estimation strategy, whose basic goal is to estimate the micro-satellite average length based on the coverage of the specified area containing this micro-satellite, and then using the updated micro-satellite average length to estimate the coverage of this specified area in turn. This loop is repeated until there are no longer significant changes in micro-satellite average length. Therefore, we can use at least 30 groups of sampling average lengths to assess the distribution of the overall long micro-satellite. The length of the hybrid longer micro-satellites belonging to this mixed sample subject to a normal distribution with . According to the Central Limit Theorem, the sampled average length distribution parameters μ can be obtained to reflect the overall length distribution. However, under the technical restrictions, we can only use the estimated σ2 to represent the overall variance due to the uncountable sample size. By substituting them in the above formula, the length distribution parameters μ2 and σ2 of micro-satellites in the pure tumor sample can be calculated. The specific EM process is as follows: Let WIN−bk be the window on the reference, with the breakpoint of a micro-satellite as the midpoint of it. The default length of WIN−bk is set to be 5000bps. Then, the read pairs can be divided into the following categories. Let C-pair be the paired-reads perfectly mapped to WIN−bk,T-pair be the paired-reads perfectly mapped to the micro-satellite region, O-pair be the paired-reads with one read mapped to WIN−bk and the other mapped to the micro-satellite region, SO-pair be the paired-reads with one read mapped to the micro-satellite region and the other spanning across a breakpoint, S-pair be the paired-reads with one read mapped to WIN−bk while the other spans across a breakpoint, and S-read be the reads which span across the breakpoints in any SO-pair or S-pair. Figure 3 is a graphical representation of the relevant definitions.
Fig. 3

The changes in coverage when a micro-satellite event occurs, and the definitions of different read pairs. C-pairs: The paired-reads in which mapped to WIN-bk; T-pairs: The paired-reads in which both reads mapped from micro-satellite areas; O-pairs: The paired-reads in which one read perfectly matched to WIN-bk and one read mapped from micro-satellites areas; SO-pairs: The paired-reads in which one read is mapped from an micro-satellite area and one read spans across the breakpoints; S-pairs: The paired-reads in which one read is perfectly matched to WIN-bk and one read spans across the breakpoint. S-reads: The reads which span across the breakpoints in SO-pairs and S-pairs

The changes in coverage when a micro-satellite event occurs, and the definitions of different read pairs. C-pairs: The paired-reads in which mapped to WIN-bk; T-pairs: The paired-reads in which both reads mapped from micro-satellite areas; O-pairs: The paired-reads in which one read perfectly matched to WIN-bk and one read mapped from micro-satellites areas; SO-pairs: The paired-reads in which one read is mapped from an micro-satellite area and one read spans across the breakpoints; S-pairs: The paired-reads in which one read is perfectly matched to WIN-bk and one read spans across the breakpoint. S-reads: The reads which span across the breakpoints in SO-pairs and S-pairs The breakpoints and the repeat units of these micro-satellites can be identified by the aforementioned data preprocessing, we set a WIN−bk with the breakpoint as the midpoint. The initial length of WIN−kb is set to be 5000 bps. According to the aligned reads corresponding to WIN−bk, we can obtain the coverage of reference in WIN−bk using the following formulas: where SUM represents the total number of bases in WIN−bk,NUM represents the total number of reads in the target area, L represents the read length, C represents the coverage of the target area, and L represents the length of the target area. When the WIN−bk length is fixed, SUM is a constant. Thus, the lengths of micro-satellites do not affect SUM, but do influence the coverage C. We can therefore calculate the normal distribution parameters of the micro-satellite lengths through the following nine steps. Variable initialization: Let m be the total number of micro-satellites, i be the ith micro-satellites, S be the sampling times, WIN−bk be the sequence of samples with the micro-satellite’s breakpoint as the midpoint, L be the length of WIN−bk,L be the total number of bases belong to the micro-satellites region in all S-reads, L be the set of micro-satellite lengths. Step 1-1: Initializing the number of micro-satellites, the repeating units, breakpoints by the data preprocessing; Step 1-2: Clustering the paired-reads into 5 categories which are: C-pairs, T-pairs, O-pairs, S-pairs and SO-pairs, all the paired-reads are in WIN−bk; Step 1-3: Calculating the number of paired-reads in these categories and let NUM,NUM,NUM,NUM,NUM represent the number of C-pairs, T-pairs, O-pairs, S-pairs and SO-pairs respectively; Step 1-4: Setting m as the number of micro-satellites, i=1,S=1,L=5000bps,L′=0,L=∅. According to the paired-reads clustering results, calculate the average coverage of WIN−bk. The formula is , where SUM=2×(NUM+NUM+NUM+NUM+NUM)×L+L. And L=L′+L. Suppose that the coverage follows a uniform distribution, and then the coverage in Step 2 is equal to the coverage in micro-satellite area. In this step, we use the formula to update the micro-satellite length. Where SUM=(2×NUM+NUM+NUM)×L+L. If |L−L′′|>δ, where , let L′=L′′, and repeat Step 2. The obtained micro-satellite length is incorporated into a set, . In order to assess the normal distribution parameter of a given micro-satellite sequence, we sample 30 times (at least) by changing the size of L. Set S=S+1, if S<30, and let L=L+1000. Then proceed to Step 1. The statistical data regarding micro-satellite lengths obtained from these 30 groups of sampling experiments are tested using a normal test algorithm and the Shapiro-Wilk algorithm. Output the normal distribution parameters of a micro-satellite N(μ,σ2). μ and σ2 are the mean and covariance of lengths. If i The independent z-test is used to compare the state of micro-satellite between tumor cells and normal cells. If p-value <0.05, then the identified micro-satellite is an MSI event, otherwise the identified micro-satellite is an MSS event.

Results and discussion

To test the performance of ELMSI, we first tested its ability of micro-satellite state classification, and also compared the two major indicators - precision rate and recall rate - with those yielded by MSIsensor [15]. And we conducted experiments on a series of simulated datasets with different configurations, which altered the number of micro-satellites, coverage, and read length. In these simulation experiments, the following key indicators were calculated to evaluate ELMSI: true positive (TP), false positive (FP), true negative (TN) and false negative (FN). In addition, five popular indicators were further calculated, which are accuracy, recall, precision, MCC and Gain. Accuracy=(TP+TN)/(TP+TN+FN+FP); Recall=TP/(TP+FN); Precision=TP/(TP+FP); ; Gain=(TP−FP)/(TP+FN).

Simulation dataset generation

To generate the simulation datasets, we first randomly selected a region of 10Mbps on human chromosome 19. To design a complex situation, we randomly chose the micro-satellites length, repeat unit, and the breakpoint. As aforementioned, the micro-satellite length in a given individual is normal distributed. We divided the normal distribution N(μ,σ2) into seven parts which are μ−3σ,μ−2σ,μ−σ,μ,μ+σ,μ+2σ,μ+3σ, and the number of micro-satellites in each part planted into the reference was got through multiplied coverage by corresponding probability 1%, 6%, 24%, 38%, 24%, 6%, and 1% for each part, respectively. Once each micro-satellite was planted, we merged these seven read files. All of the simulated reads were then mapped to the reference sequence. The alignment file was then provided to variant calling tools.

Micro-satellites state classification and comparison experiment

In this part, we first tested the accuracy of ELMSI in classifying the micro-satellite state from the mixed samples. The z-test was used to determine whether the micro-satellite is a MSI event. For the shorter micro-satellites, we compared our algorithm with the proposed approach MSIsensor. Among the proposed micro-satellite state classification algorithms, mSINGS is suitable for small panels and has been reported to be used only for limited exome data, and MSIseq only targets the sequencing at smaller regions. Comparison with these algorithms is meaningless. MSIsensor can accurately identify the micro-satellite state and lengths when the they are shorter than one read length. Thus we chose MSIsensor to do the comparison experiment. The number of micro-satellite was set to be 30, the coverage was set to be 100 × and the read-length was set to be 200bps. The tumor purity was set to be 0.9, 0.7, 0.5, 0.3, 0.1, respectively. Micro-satellite state were subsequently identified by the two classification tools MSIsensor and ELMSI. The results are shown in Table 1.
Table 1

Comparison results of ELMSI and MSIsensor

Tumor proportionMSIsensorELMSI
PrecisionRecallPrecisionRecall
0.910.333311
0.710.133310.6667
0.50010.5667
0.30010.6
0.10010.4667
Comparison results of ELMSI and MSIsensor As can be seen, ELMSI has better performance in hybrid micro-satellite state classification. When the tumor purity of the input sequenced sample is below a certain ratio, the MSS signal in the normal sample will dilute the MSI signal, causing MSIsensor to report a MSS event. Thus, when the input tumor sample is a mixture with high normal cell contamination, MSIsensor cannot distinguish the MSI accurately. However, ELMSI can do the classification even if the tumor purity is less than 10%. On the other hand, for the longer micro-satellites, the paired-reads used to locate the candidate micro-satellite region are invalid, and none of the existing approaches is able to overcome the one-read-length limitation. Thus, we proposed ELMSI, which can identify the longer hybrid micro-satellites, and classify their state. Next, we tested the classification accuracy of it. The number of micro-satellite was set to be 30, the coverage was set to be 100 × and the read-length was set to be 200bps. The tumor purity was set to be 0.9, 0.7, 0.5, 0.3, 0.1, respectively. The detailed results are shown in Table 2.
Table 2

Performance of ELMSI for longer micro-satellites classification

Tumor purity0.90.7
No.BreakpointUnitμNμTBreakpointμTMSIBreakpointμTMSI
134489TCATT8612534491146.84134489163.441
2122387GGCC425525122389685.631122387724.231
3189108GCTAC46120189158105.651189108133.031
4190653CATC43136190655130.521190653170.211
5194236AAC89166194238145.931194236151.431
6251655GCT7111125165491.17125165571.711
7311313ACCA56236311315321.081311313331.601
8356789GCT7625635679051.471356789161.351
9398971TTCG45225398973213.121398971251.421
10412340G10028041250588.05141234070.501
11432344TGA78258432343177.301432344220.541
12473174AAGG221354473176462.751473174403.211
13501994CGCCG78161501996128.261501994329.521
14505733ACAGGG40111505791222.261505733248.281
15526358GTCC58144526360167.581526358152.301
16612344TGC90270612342355.321612344343.931
17622735GGTTC77142622737114.591622735197.751
18677621TCA70200677623163.891677621202.361
19712345GACT89269712337N/A0712345230.621
20731506GA146203731506104.14173150688.481
21776166TAA213324776167359.17431776166564.011
22842735CTC134211842734213.291842735236.041
23866450TG185220866450371.531866450526.35511
24891334TCAGC105285891336234.201891334338.381
25908385AGAAT167229908386194.851908385294.271
26910124C20530191020498.17191012432.501
27929056CCG120210929058199.721929056202.661
28944729GGACT90190944731214.331944729225.511
29964608AGGGGG56156964610305.591964608296.611
30973099GGGCAC355460973101849.181973099N/A0
Accuracy0.9670.967
Tumor purity0.50.30.1
No.UnitBreakpointμTMSIBreakpointμTMSIBreakpointμTMS
1TCATT34491155.09134489165.59134489272.511
2GGCC122389354.081122387710.491122387619.041
3GCTAC189108177.331189108127.27118910894.681
4CATC190655137.45119065384.00119065319.351
5AAC194238172.941194236125.12119423682.571
6GCT25165486.32125165530.281251655N/A0
7ACCA311355240.461311313361.341311313509.081
8GCT356790303.441356789189.281356789584.241
9TTCG398973275.141398971282.361398971295.251
10G41248366.431412340105.091412340N/A0
11TGA432343292.881432344270.841432344345.931
12AAGG473176434.981473174640.551473174779.621
13CGCCG501996278.351501994365.001501994800.131
14ACAGGG505844246.031505733178.621505733339.871
15GTCC526361120.871526358132.081526358112.711
16TGC612419467.031612344570.021612344N/A0
17GGTTC622737215.571622735241.851622735254.751
18TCA677623226.321677621241.831677621220.821
19GACT712347393.391712345327.391712345N/A0
20GA73150677.51173150651.491731506N/A0
21TAA776167405.811776166170.121776166340.201
22CTC842734221.521842735321.371842735204.911
23TG866450485.101866450870.521866450236.771
24TCAGC891336316.811891334112.101891334677.121
25AGAAT908386206.391908385457.041908385152.631
26C910220N/A0910124N/A0910124N/A0
27CCG929058203.18192905698.561929056269.481
28GGACT944731257.571944729250.611944729279.331
29AGGGGG964610416.611964608380.031964608691.831
30GGGCAC9731011182.001973099N/A0973099748.711
Accuracy0.9670.9330.8
Performance of ELMSI for longer micro-satellites classification As is shown in Table 2, the decreasing tumor ratio can influence the accuracy of the ELMSI. However, even with a purity as low as 10%, the results still indicate that ELMSI can provide a reliable MSI classification.

Estimating the distribution of micro-satellite lengths

To separately verify the validity of ELMSI in estimating the length distributions of the longer micro-satellites. We ignored the influence of tumor purity, and tested the performance of ELMSI by changing micro-satellite number, coverage, and read length. A correct call is defined as follows: a micro-satellites is identified with a correct repeat unit, the breakpoint detected belongs to the (b−−10bps,b+10bps) where b is the set breakpoint, and the actual micro-satellites length belongs to the (μ−3σ,μ+3σ), where μ and σ are parameter values which have be estimated. We first changed the number of micro-satellite from 20 to 100. In order to better reflect the influence of micro-satellite number on ELMSI, we also varied the coverage from 30 ×, 60 ×, 100 ×, to 120 ×. The read length was set to be 100bp in this group of experiments. For each different micro-satellite number, we repeated the test five times using the same setting and output the average results, which are summarized in Table 3.
Table 3

Key indicators of ELMSI in different numbers number of mciro-satellites

Number of MSIsCoverageAccuracyRecallPrecisionGainMCC
2030 ×0.53850.70000.70000.4000-0.300
60 ×0.48150.65000.65000.3000-0.350
100 ×0.44190.63330.59380.2000-0.386
120 ×0.57500.76670.69700.4333-0.265
3030 ×0.52500.70000.67740.3667-0.311
60 ×0.52500.70000.67740.3667-0.311
100 ×0.68750.82500.80490.6250-0.184
120 ×0.64000.80000.76190.5500-0.218
4030 ×0.68090.80000.82050.6250-0.189
60 ×0.58820.75000.73170.4750-0.259
100 ×0.56060.74000.69810.4200-0.280
120 ×0.49300.70000.62500.2800-0.335
5030 ×0.69490.82000.82000.6400-0.180
60 ×0.70000.84000.80770.6400-0.175
100 ×0.60000.75000.75000.5000-0.250
120 ×0.53750.71670.68250.3833-0.299
6030 ×0.61640.75000.77590.5333-0.236
60 ×0.60810.75000.76270.5167-0.243
100 ×0.62790.77140.77140.5429-0.228
120 ×0.58620.72860.75000.4857-0.260
7030 ×0.53330.68570.70590.4000-0.304
60 ×0.47870.64290.65220.3000-0.352
100 ×0.56600.75000.69770.4250-0.274
120 ×0.54630.73750.67820.3875-0.290
8030 ×0.61220.75000.76920.5250-0.240
60 ×0.59800.76250.73490.4875-0.250
100 ×0.56640.71110.73560.4556-0.276
120 ×0.55750.70000.73260.4444-0.283
9030 ×0.49570.64440.68240.3444-0.336
60 ×0.50850.66670.68180.3556-0.325
100 ×0.41380.60000.57140.1500-0.414
120 ×0.66670.80000.80000.6000-0.200
10030 ×0.61160.74000.77890.5300-0.239
60 ×0.51910.68000.68690.3700-0.316
100 ×0.62900.78000.76470.5400-0.227
120 ×0.59520.75000.74260.4900-0.253
Key indicators of ELMSI in different numbers number of mciro-satellites The increasing micro-satellite number can influence the robustness of the ELMSI. In practice, since microsatellites are very rare, few micro-satellites will exist in a given 10Mbps chromosomal sequence region. Even so, for testing ELMSI, we intended to increase this density. Based on Table 3, we can see that ELMSI can identify micro-satellites and exclude non micro-satellites interference accurately. The results also show that ELMSI can offer a high reliability. Sequencing coverage affects somatic mutation calling, which in turn would presumably affect the performance of ELMSI. To assess the influence of the different coverage on ELMSI, we further varied the coverage from 10 × to 100 ×. As is shown in Table 4 the coverage changes intuitively affect the changes in key indicators. In this group of experiments, we set the number of microsatellites to be 20, 40, or 60, and set read length to be 100 bps.
Table 4

Comparisons of the performance of ELMSI in different coverages

Number of MSIsCoverageAccuracyRecallPrecisionGainMCC
2010 ×0.36670.55000.52380.05-0.4629
20 ×0.38460.50000.57140.20-0.4330
30 ×0.41670.52630.64290.26-0.3974
40 ×0.48150.65000.65000.30-0.3500
50 ×0.69570.80000.84210.65-0.1777
60 ×0.68180.78950.83330.63-0.1873
70 ×0.64000.76190.80000.57-0.2182
80 ×0.60710.73910.77270.52-0.2435
90 ×0.54830.70830.70830.42-0.3077
100 ×0.52940.69230.69230.38-0.3077
4010 ×0.39280.611100.52380.06-0.4303
20 ×0.46150.60000.66670.30-0.3651
30 ×0.51060.68570.66670.34-0.3237
40 ×0.52080.67560.67440.38-0.3148
50 ×0.51920.67500.69230.38-0.3162
60 ×0.57620.83330.75550.48-0.2670
70 ×0.68090.80000.82050.63-0.1895
80 ×0.66000.82500.76740.58-0.2017
90 ×0.65380.79060.75550.53-0.2262
100 ×0.68000.85000.77270.60-0.1846
6010 ×0.42180.58690.60000.20-0.4065
20 ×0.42470.51670.70450.30-0.3779
30 ×0.48610.62500.68620.34-0.3430
40 ×0.54410.69810.71150.42-0.2951
50 ×0.52320.68180.69230.38-0.3129
60 ×0.50000.65710.67650.34-0.3331
70 ×0.54550.70000.71190.42-0.2940
80 ×0.50000.64700.68750.35-0.3321
90 ×0.62160.76670.76670.53-0.2333
100 ×0.54440.70000.71010.41-0.2949
Comparisons of the performance of ELMSI in different coverages The lower the coverage, the greater the difficulty faced by this computational approach. Consistent with this, Table 4 indicates that the performance of ELMSI increases as coverage increases, with maximal recall rate more than 80%. Thus, the higher the coverage, the higher the accuracy of ELMSI for inferring micro-satellites. ELMSI can also stay valid when the read length is altered. The number of micro-satellites was set to be 20, or 50, coverage was set to be 30 ×, 60 ×, 100 ×, or 120 ×, and the read length was set to be 100bps, 150bps, 200bps, 250bps and 300bps. The results are shown in Table 5.
Table 5

Key indicators of ELMSI corresponding to different read lengths

Number of MSIsRead lengthCoverageAccuracyRecallPrecisionGainMCC
2010030 ×0.310.450.50-0.25
60 ×0.330.50.50-0.5
100 ×0.310.450.50-0.52
120 ×0.360.550.520.05-0.46
15030 ×0.310.450.50-0.25
60 ×0.370.50.590.15-0.45
100 ×0.460.60.60.3-0.36
120 ×0.420.550.650.25-0.40
20030 ×0.50.650.680.35-0.33
60 ×0.490.570.650.1-0.55
100 ×0.480.70.610.25-0.34
120 ×0.520.70.670.35-0.32
25030 ×0.60.750.750.5-0.25
60 ×0.540.70.70.4-0.3
100 ×0.550.750.690.2-0.38
120 ×0.580.750.710.25-0.34
5010030 ×0.430.60.60.2-0.4
60 ×0.370.550.540.08-0.46
100 ×0.450.630.610.23-0.38
120 ×0.360.550.520.05-0.46
15030 ×0.510.680.630.35-0.33
60 ×0.450.630.610.23-0.38
100 ×0.480.580.620.05-0.45
120 ×0.470.730.630.45-0.28
20030 ×0.510.70.650.45-0.32
60 ×0.610.780.640.5-0.24
100 ×0.520.630.670.15-0.40
120 ×0.560.750.680.4-0.28
25030 ×0.530.70.680.38-0.31
60 ×0.570.780.70.23-0.36
100 ×0.610.70.670.35-0.32
120 ×0.590.750.70.25-0.33
Key indicators of ELMSI corresponding to different read lengths The main weakness of this method is the huge amount of splicing required. The longer the read length, the smaller the splicing workload, and the fewer errors will be introduced by splicing. We thus predict that with the increased of read length, ELMSI performance will improve. Table 5 validates this hypothesis, and shows that the longer the read length is, the more accurate estimation result is.

Conclusion

In this article, we focus on the computational problem of inferring the length distributions and states of all kinds of micro-satellites in tumors with normal cell contamination. Existing approaches, such as MSIsensor, mSINGS, MANTIS and MSIseq, perform well in handling the genomic micro-satellite event whose length is shorter than one read length, but often encounter a significant loss of accuracy when the length of micro-satellite becomes longer. Meanwhile, all of these MSI detection algorithms implies a general assumption before establishing a mathematical model that the input sample is a pure tumor sample, which is difficult to achieve under existing sequencing technology. We have therefore proposed an algorithm to break these limitations, handling micro-satellites with a wide range of length from a mixed normal-tumor sample based on NGS data. Our proposed algorithm, termed ELMSI, directly computes on the aligned reads. ELMSI can clearly recognize the length distributions and states of micro-satellites with a wide range of length from mixed sequenced samples. For short microsatellites, it can identify the lengths accurately, while for long micro-satellites, it can estimate the normal distribution parameters. ELMSI is among the first approaches to recognize and identify long micro-satellites. However, due to the nature of sequencing data and the limitation of computing capacity, the estimated mean μ is relatively accurate, while the estimated variance σ has a certain deviation. Thus, for longer MSI detection, our algorithm uses independent z-test mainly. When the sample size can be calculated during the iteration process, we can estimate the variance of the longer micro-satellite more accurately, and thus we recommend to use independent t-test to infer the MSI state. The performance of ELMSI is compared with MSIsensor, and ELMSI is superior for the hybrid shorter micro-satellites classification. For the mixed longer samples, ELMSI can also obtain the satisfactory results. The simulation experimental results demonstrate that ELMSI is robust, with good performance in response to variations in coverage, read length, and the number of micro-satellites. It will be useful for micro-satellites screening and we anticipate a wider usage in cancer clinical sequencing.
  26 in total

Review 1.  Microsatellites: simple sequences with complex evolution.

Authors:  Hans Ellegren
Journal:  Nat Rev Genet       Date:  2004-06       Impact factor: 53.242

2.  A genome-wide study of microsatellite instability in advanced gastric carcinoma.

Authors:  C W Wu; G D Chen; K C Jiang; A F Li; C W Chi; S S Lo; J Y Chen
Journal:  Cancer       Date:  2001-07-01       Impact factor: 6.860

3.  Incidence of microsatellite instability in synchronous tumors of the ovary and endometrium.

Authors:  Catherine Shannon; Judy Kirk; Rebecca Barnetson; Justin Evans; Margaret Schnitzler; Michael Quinn; Neville Hacker; Alex Crandon; Paul Harnett
Journal:  Clin Cancer Res       Date:  2003-04       Impact factor: 12.531

Review 4.  Toward better understanding of artifacts in variant calling from high-coverage samples.

Authors:  Heng Li
Journal:  Bioinformatics       Date:  2014-06-27       Impact factor: 6.937

Review 5.  Evolving approach and clinical significance of detecting DNA mismatch repair deficiency in colorectal carcinoma.

Authors:  Jinru Shia
Journal:  Semin Diagn Pathol       Date:  2015-02-04       Impact factor: 3.464

6.  MSIsensor: microsatellite instability detection using paired tumor-normal sequence data.

Authors:  Beifang Niu; Kai Ye; Qunyuan Zhang; Charles Lu; Mingchao Xie; Michael D McLellan; Michael C Wendl; Li Ding
Journal:  Bioinformatics       Date:  2013-12-25       Impact factor: 6.937

7.  Performance evaluation for rapid detection of pan-cancer microsatellite instability with MANTIS.

Authors:  Esko A Kautto; Russell Bonneville; Jharna Miya; Lianbo Yu; Melanie A Krook; Julie W Reeser; Sameek Roychowdhury
Journal:  Oncotarget       Date:  2017-01-31

8.  MSIpred: a python package for tumor microsatellite instability classification from tumor mutation annotation data using a support vector machine.

Authors:  Chen Wang; Chun Liang
Journal:  Sci Rep       Date:  2018-12-03       Impact factor: 4.379

9.  Patterns of microsatellite distribution across eukaryotic genomes.

Authors:  Surabhi Srivastava; Akshay Kumar Avvaru; Divya Tej Sowpati; Rakesh K Mishra
Journal:  BMC Genomics       Date:  2019-02-22       Impact factor: 3.969

10.  MSIseq: Software for Assessing Microsatellite Instability from Catalogs of Somatic Mutations.

Authors:  Mi Ni Huang; John R McPherson; Ioana Cutcutache; Bin Tean Teh; Patrick Tan; Steven G Rozen
Journal:  Sci Rep       Date:  2015-08-26       Impact factor: 4.379

View more
  1 in total

1.  Main findings and advances in bioinformatics and biomedical engineering- IWBBIO 2018.

Authors:  Olga Valenzuela; Fernando Rojas; Ignacio Rojas; Peter Glosekotter
Journal:  BMC Bioinformatics       Date:  2020-05-05       Impact factor: 3.169

  1 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.