Jian-Feng Yang1, Xiao-Fan Ding1, Lei Chen2, Wai-Kin Mat1, Michelle Zhi Xu3, Jin-Fei Chen3, Jian-Min Wang4, Lin Xu5, Wai-Sang Poon6, Ava Kwong7, Gilberto Ka-Kit Leung7, Tze-Ching Tan8, Chi-Hung Yu8, Yue-Bin Ke9, Xin-Yun Xu9, Xiao-Yan Ke10, Ronald Cw Ma11, Juliana Cn Chan11, Wei-Qing Wan12, Li-Wei Zhang12, Yogesh Kumar1, Shui-Ying Tsang1, Shao Li13, Hong-Yang Wang2,14, Hong Xue1.
Abstract
BACKGROUND: AluScan combines inter-Alu PCR, using multiple Alu-based primers with opposite orientations, with next-generation sequencing to capture a large number of Alu-proximal genomic sequences for investigation. Because it requires only sub-microgram quantities of DNA, it facilitates the examination of large numbers of samples. However, the special features of AluScan data have made it difficult to call copy number variation (CNV) directly using calling algorithms designed for whole genome sequencing (WGS) or exome sequencing.
Keywords: AluScan sequencing; CNV calling; Cancer classification; Machine learning
Year: 2014 PMID: 25558350 PMCID: PMC4273479 DOI: 10.1186/s13336-014-0015-z
Source DB: PubMed Journal: J Clin Bioinforma ISSN: 2043-9113
Figure 1. Schematic diagram of the AluScanCNV calling method. CNV calling is conducted by comparing the test sample either with a reference template constructed from pooled reference samples in (I) unpaired analysis, or with a paired control sample in (II) paired analysis, to yield read-depth ratios. GHT is used to call localized CNVs and, in turn, recurrent CNVs; alternatively, CBS is used to call extended CNVs.
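The read-depth ratio step in this scheme can be illustrated with a minimal Python sketch. It is not the AluScanCNV implementation: the function names, the averaging used to pool reference samples, and the normalization by total reads are assumptions made for illustration, and the GHT and CBS calling steps are not shown.

```python
def pooled_template(reference_samples):
    """Average per-window read depths across reference samples.

    Assumption: the pooled template is a simple per-window mean; the
    caption does not say how the reference samples are actually pooled.
    """
    n = len(reference_samples)
    return [sum(column) / n for column in zip(*reference_samples)]


def depth_ratios(test, control):
    """Per-window ratio of library-size-normalized read depths.

    `control` is either a paired control sample (paired analysis) or a
    pooled template (unpaired analysis). Windows with zero control depth
    yield None because no ratio can be formed there.
    """
    test_total = sum(test)
    control_total = sum(control)
    ratios = []
    for t, c in zip(test, control):
        if c == 0:
            ratios.append(None)
        else:
            ratios.append((t / test_total) / (c / control_total))
    return ratios


# Hypothetical read depths for three windows.
template = pooled_template([[20, 40, 60], [22, 38, 60]])
ratios = depth_ratios([10, 30, 20], template)
```

Dividing each depth by its sample total keeps the ratio near 1 for windows without copy number change even when the two libraries were sequenced to different depths.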
Figure 2. Distribution of transformed t-values. Upper panel: without GC content normalization; lower panel: with GC content normalization. The y-axis shows the frequency and the x-axis shows the t-value from Eqn. 9 or 10. The t-values were estimated from the AluScan of GL2B as test sample compared with the 23-sample reference template, using a window size of 5 kb.
Figure 3. Poisson binomial distribution of CNVs among samples. The frequency for any window is the percentage of total samples that display a CNV at that window, and the density is the fraction of all the windows analyzed that display a given frequency. Accordingly, CNVs that give rise to frequencies to the right of the cut-off frequency (indicated by red line) represent CNVs that occur at an exceptionally high percentage of samples with p < 0.01, and are therefore regarded as recurrent CNVs. The curve shown was calculated using localized CNVs called from the AluScans of the 38 cancer samples in column 2 of Additional file 1: Table S1, in each case employing for comparison the 23-sample reference template.
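As a rough illustration of how such a cut-off could be derived, the sketch below computes a Poisson binomial distribution by dynamic programming and finds the smallest sample count whose upper-tail probability falls below 0.01. Treating each sample's overall fraction of CNV windows as its per-sample probability is an assumption, as are the function names; the caption does not specify how the probabilities were obtained.

```python
def poisson_binomial_pmf(probs):
    """PMF of the number of successes among independent Bernoulli trials
    with possibly different success probabilities."""
    pmf = [1.0]  # with zero trials, P(X = 0) = 1
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)  # this sample shows no CNV
            nxt[k + 1] += mass * p      # this sample shows a CNV
        pmf = nxt
    return pmf


def recurrence_cutoff(probs, alpha=0.01):
    """Smallest count k with P(X >= k) < alpha.

    Returns len(probs) + 1 when no count is that unlikely.
    """
    pmf = poisson_binomial_pmf(probs)
    tail = 0.0
    cutoff = len(pmf)
    for k in range(len(pmf) - 1, -1, -1):
        tail += pmf[k]
        if tail >= alpha:
            break
        cutoff = k
    return cutoff


# Hypothetical per-sample CNV probabilities for 38 samples.
sample_probs = [0.05] * 38
k_min = recurrence_cutoff(sample_probs)
cutoff_frequency = 100.0 * k_min / len(sample_probs)  # percent of samples
```

A window whose CNV is shared by at least `k_min` samples would then lie to the right of the cut-off frequency under these assumptions.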
Figure 4. Q-Q plot of read-depth distributions. In paired analysis (A), test glioma tissue GL2T was compared to its paired non-cancerous control GL2B; Pearson’s correlation coefficient was 0.9986. In unpaired analysis (B), the same test sample was compared to the 23-sample reference template; Pearson’s correlation coefficient was 0.9939. The read-depths of 5 kb windows are represented by densely overlapping solid circles, and the red lines are the linear regression lines.
Figure 5. Chromosomal distribution of localized CNVs called using different window sizes. GHT-based localized CNVs were called from GL2T AluScan compared to the 23-sample reference template using 5 kb (A), 100 kb (B), 300 kb (C) and 500 kb windows (D). CNV frequency on the y-axis represents the fraction of windows on a chromosome showing CNV gain (upward blue bars) or CNV loss (downward red bars).
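The per-chromosome CNV frequency described in this caption amounts to a simple tally, sketched below in Python. The input layout (one `(chromosome, call)` pair per window, with calls labelled `"gain"`, `"loss"` or `"none"`) and the function name are hypothetical choices for illustration only.

```python
from collections import defaultdict


def chromosome_cnv_frequency(window_calls):
    """Fraction of windows per chromosome called as gain or as loss.

    Returns {chromosome: (gain_fraction, loss_fraction)}.
    """
    totals = defaultdict(int)
    gains = defaultdict(int)
    losses = defaultdict(int)
    for chrom, call in window_calls:
        totals[chrom] += 1
        if call == "gain":
            gains[chrom] += 1
        elif call == "loss":
            losses[chrom] += 1
    return {
        chrom: (gains[chrom] / n, losses[chrom] / n)
        for chrom, n in totals.items()
    }


# Hypothetical window calls for two chromosomes.
calls = [
    ("chr1", "gain"), ("chr1", "none"), ("chr1", "loss"), ("chr1", "none"),
    ("chr2", "loss"), ("chr2", "loss"),
]
freq = chromosome_cnv_frequency(calls)
```

Plotting the gain fraction upward and the loss fraction downward for each chromosome would give a bar layout like the one the caption describes.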
Figure 6. Chromosomal locations of localized CNVs in a glioma sample using 500 kb windows. GHT-based localized CNVs were called from AluScan data of glioma tumor tissue GL2T compared to (A) its paired blood control GL2B AluScan, and to (B) the 23-sample reference template using 500 kb windows. Upward blue bars represent copy number gains, and downward red bars copy number losses.
Figure 7. Chromosomal distribution of recurrent CNVs in twenty-one liver cancers. The 52 recurrent copy number gains (red upward bars) and 99 recurrent copy number losses (red downward bars) were called from the AluScans of 21 liver cancers from Additional file 2: Table S2 using the 23-sample reference template for comparison. Blue bars represent CNVs whose frequencies did not exceed the green lines marking significant recurrence (p < 0.01). The orange columns represent CNVs called from WGS data by Kan et al. [36].
Figure 8. Hierarchical clusters of liver and non-liver cancers based on distinguishing CNV-features. (A) Clustering using localized CNV-features and (B) clustering using recurrent CNV-features. The 21 liver and 16 non-liver cancers analyzed are described in Additional file 2: Table S2. The distinguishing localized and recurrent CNV-features selected by machine learning for the purpose of clustering these two classes of cancers are listed in Additional file 6: Table S3A and S3B, respectively. The numbers in orange shown at the nodes for the ‘liver cancer’ (blue solid box) and ‘non-liver cancer’ (green dashed box) clusters indicate the approximate unbiased probabilities, and the three incorrectly clustered samples in Part (B) are shown in red. Clustering of samples was performed as described in Methods.
Figure 9. Chromosomal distribution of extended CNVs in glioma GL2T. GL2T tumor tissue was compared with either (A) the paired control blood sample GL2B from the same patient; or (B) the 23-sample reference template. The Z scores of windows are shown by green and black dots on alternate autosomal chromosomes. Red horizontal bars with Z ≥ 0.2 represent extended copy number gains, and those with Z ≤ −0.2 represent extended copy number losses.
Figure 10. Comparison of CNV callings by AluScanCNV and FREEC. (A) Chromosomal distribution of CNV gains obtained by FREEC based on hg18 [7] (green bands above cytobands) or by the CBS-based extended CNV calling in AluScanCNV (orange bands below cytobands). Correlation between the two sets of results yielded Pearson’s R = 0.776. (B) Chromosomal distribution of CNV losses obtained by FREEC (green bands above cytobands) and by AluScanCNV (orange bands below cytobands). Correlation between the two sets of results yielded Pearson’s R = 0.935. The same dataset on cancer cell line HCC1143 from ref. 27 was employed in all the CNV estimations. Correlation R values were estimated using the human genome graph function in UCSC (http://genome.ucsc.edu/cgi-bin/hgGenome).