Cancer Diagnosis Epigenomics Scientific Workflow Scheduling in the Cloud Computing Environment Using an Improved PSO Algorithm

Abstract

Objective: Epigenetic modifications involving DNA methylation and histone status are responsible for the stable maintenance of cellular phenotypes. Abnormalities may be causally involved in cancer development and therefore could have diagnostic potential. The field of epigenomics refers to all epigenetic modifications implicated in the control of gene expression, with a focus on better understanding of human biology in both normal and pathological states. An epigenomics scientific workflow is essentially a data processing pipeline that automates the execution of various genome sequencing operations or tasks. The cloud is a popular computing platform for deploying large-scale epigenomics scientific workflows. Its dynamic environment provides various resources to scientific users on a pay-per-use billing model. Scheduling epigenomics scientific workflow tasks is a complicated problem on a cloud platform. Here we focus on applying an improved particle swarm optimization (IPSO) algorithm for this purpose.
Methods: The IPSO algorithm was applied to find suitable resources and allocate epigenomics tasks so that the total cost was minimized for detection of epigenetic abnormalities of potential application for cancer diagnosis.
Result: The results showed that IPSO-based task-to-resource mapping reduced total cost by 6.83 percent compared to the traditional PSO algorithm for the largest workflow tested (Epigenomics_997).
Conclusion: The results for various cancer diagnosis tasks showed that IPSO-based task-to-resource mapping achieves lower costs than PSO-based mapping for the epigenomics scientific workflow application.

Creative Commons Attribution License

Keywords: Cancer diagnosis; genomics; gene expression; particle swarm optimization; scientific workflow; scheduling

Year: 2018 PMID: 29374408 PMCID: PMC5844625 DOI: 10.22034/APJCP.2018.19.1.243

Source DB: PubMed Journal: Asian Pac J Cancer Prev ISSN: 1513-7368

Introduction

Genomics is defined as the study of entire genomes of organisms, including extra-chromosomal DNA such as the mitochondrial genetic material. This field includes intensive efforts to determine the entire DNA sequence of organisms, using fine-scale genetic mapping and DNA sequencing with current and emerging technologies. In contrast, investigating the roles of single genes is a primary focus of genetics. Single gene research does not fall into the definition of genomics unless the aim is to verify the effect that a gene may have on the entire genome’s networks and pathways. Genomics has been the main focus in molecular biology, especially after the completion of the sequencing of genomes from several organisms. Genomics tools have already helped in the understanding of several aspects of the genome of cancer cells when compared to normal controls. One important example is the identification of the gene HER2/neu (ErbB-2), which is an oncogene mapped to human chromosome that is over-expressed, or amplified, in ~30% of breast cancer tumors (Slamon et al., 1989). Identification of this molecular characteristic culminated in the development of the drug trastuzumab (Herceptin®) (Paik et al., 2008). Breast cancer patients who are HER2/neu (ErbB-2) positive have increased survival rates when treated with this drug. Recently, cancer genomes were sequenced and compared with normal cells for leukemia, breast, lung, and other tumor types, using second-generation DNA sequencing technologies (Ley et al., 2008; Stephens et al., 2009; Lee et al., 2010; Pleasance et al., 2010). The purpose was to identify mutations that could give rise to new biomarkers and new therapies for these types of cancers. In addition, the 1000 Genomes Project was recently launched (Butler et al., 2010) with the objective of sequencing the genomes of thousands of individuals in a short period of time.
In parallel, companies are starting to provide whole genome sequencing services, with the aim of understanding an individual’s susceptibility to diseases, including different cancers (Kaye, 2008; Pandi and Premalatha, 2015).

Medical science workflow applications consist of a large number of tasks, which may be simple or complex, and a large amount of data is transferred between the tasks. The tasks of a workflow application are interdependent and are normally represented as a directed acyclic graph (DAG). The cloud is currently considered a cost-effective distributed computing platform for scientific workflow applications (Hoffa et al., 2008; Sadhasivam and Thangaraj, 2017). Most cloud service providers offer resources to customers under different billing models and charge for usage on either an hourly or a per-minute basis, so scheduling medical science workflow applications requires the user to select resources carefully (Sahni and Vidyarthi, 2015). Because the resources are numerous and dynamic, scheduling workflow applications onto cloud resources can be addressed efficiently with meta-heuristics. Epigenomics is a medical workflow application used at the USC Epigenome Center for analyzing the genome sequencing of the human body. It contains a sequence of automated operations to collect short DNA segments using high-throughput gene sequencing machines, and it uses the MAQ software and the Illumina Solexa genetic analyzer (Gil et al., 2007; Deelman et al., 2015). The relationship between the tasks of the Epigenomics workflow application is modelled using eight levels, with each level containing tasks that perform certain functions. Figure 1 shows the workflow structure of the Epigenomics workflow application. This application is a data-intensive workflow application and contains both large and small tasks.

Figure 1

Sample Structure for Epigenomics Scientific Workflow
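
The level structure of a workflow DAG can be illustrated with a small sketch. The example below is only an illustration under stated assumptions: the task names and edges are hypothetical and form a toy six-level pipeline, not the actual eight-level Epigenomics workflow. It groups tasks into levels so that every task sits one level below its deepest parent.

```python
from collections import defaultdict, deque


def topological_levels(tasks, edges):
    """Group DAG tasks into levels; a task's level is 1 + the deepest parent level."""
    children = defaultdict(list)
    indegree = {t: 0 for t in tasks}
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1

    # Entry tasks (no parents) start at level 0.
    level = {t: 0 for t in tasks if indegree[t] == 0}
    queue = deque(level)
    processed = 0
    while queue:
        task = queue.popleft()
        processed += 1
        for child in children[task]:
            level[child] = max(level.get(child, 0), level[task] + 1)
            indegree[child] -= 1
            if indegree[child] == 0:  # all parents scheduled
                queue.append(child)

    if processed != len(tasks):
        raise ValueError("graph contains a cycle, so it is not a DAG")

    grouped = defaultdict(list)
    for task, lvl in level.items():
        grouped[lvl].append(task)
    return [sorted(grouped[lvl]) for lvl in sorted(grouped)]


# Hypothetical toy workflow: one split, two parallel branches, then a merge.
tasks = ["split", "filter_1", "filter_2", "map_1", "map_2", "merge", "index", "pileup"]
edges = [
    ("split", "filter_1"), ("split", "filter_2"),
    ("filter_1", "map_1"), ("filter_2", "map_2"),
    ("map_1", "merge"), ("map_2", "merge"),
    ("merge", "index"), ("index", "pileup"),
]
levels = topological_levels(tasks, edges)
for depth, names in enumerate(levels):
    print(depth, names)
```

Tasks within one level have no dependencies on each other, so a scheduler is free to place them on different resources in parallel.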

Materials and Methods

Problem Formulation

The problem is to identify a task-to-resource mapping instance P such that, when the total cost incurred on each compute resource PC is calculated, the highest cost among all the compute resources is minimized. The primary objective of this work is to derive a metaheuristic optimization for mapping the Epigenomics tasks to the compute resources such that the total cost is minimized, where
Cexe = the total execution cost for Epigenomics tasks
Ctx = the communication cost between the resources
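
One way to read this objective is sketched below. It is a minimal illustration, not the authors' exact formulation: it assumes that the total cost on a resource is the sum of its execution cost (Cexe) and its outgoing communication cost (Ctx), and that the fitness of a mapping is the highest such total across resources. The function name, the data layout and the numbers are all hypothetical.

```python
def mapping_cost(mapping, exec_cost, data, link_cost):
    """Return the highest per-resource total cost for a task-to-resource mapping.

    mapping[t]       resource index assigned to task t
    exec_cost[t][r]  cost of running task t on resource r
    data[(t, u)]     data volume sent from task t to its dependent task u
    link_cost[r][s]  cost per unit of data moved from resource r to resource s
    """
    n_resources = len(link_cost)
    c_exe = [0.0] * n_resources  # execution cost accumulated per resource
    c_tx = [0.0] * n_resources   # communication cost accumulated per resource

    for task, res in enumerate(mapping):
        c_exe[res] += exec_cost[task][res]

    for (src, dst), volume in data.items():
        r, s = mapping[src], mapping[dst]
        if r != s:  # assumed: no transfer cost when both tasks share a resource
            c_tx[r] += volume * link_cost[r][s]

    totals = [e + t for e, t in zip(c_exe, c_tx)]
    return max(totals)


# Hypothetical instance: 3 tasks, 2 resources, task 0 feeds tasks 1 and 2.
exec_cost = [[1.0, 2.0], [2.0, 1.0], [3.0, 1.5]]
data = {(0, 1): 10, (0, 2): 5}
link_cost = [[0.0, 0.1], [0.1, 0.0]]

spread = mapping_cost([0, 1, 1], exec_cost, data, link_cost)
packed = mapping_cost([0, 0, 0], exec_cost, data, link_cost)
print(spread, packed)
```

In this toy instance, spreading the tasks over both resources pays for data transfer but still yields a lower maximum cost than packing all tasks onto one resource, which is the kind of trade-off the optimizer has to explore.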

Epigenomics workflow scheduling for Cancer Diagnosis using Particle Swarm Optimization

Particle swarm optimization (PSO) is a population-based optimization technique inspired by the social behavior of bird flocking (Kennedy and Eberhart, 1995; Balamurugan et al., 2017). It combines self-experience with social experience and uses a number of particles, which together represent a swarm, moving around in the search space looking for the best solution. In PSO, each single solution is a “particle” in the search space. Every particle has a fitness value, evaluated by the fitness function to be optimized, and a velocity that directs its movement. PSO is initialized with a group of random particles. In every iteration, each particle is updated using two “best” values: the best position the particle itself has found so far (Pbest) and the best position found by the whole swarm (global best). After finding the two best values, the particle updates its velocity and position using the velocity and position equations.

Particle velocity and position renewal of PSO
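
The update can be sketched in code using the textbook inertia-weight form of PSO, which is assumed here to be what equations (4) and (5) express; the source's own notation may differ. The function name is hypothetical, and the defaults for w, c1 and c2 are taken from Table 1.

```python
import random


def pso_step(x, v, pbest, gbest, w=0.9, c1=2.0, c2=2.0, rng=random):
    """Return (new_position, new_velocity) for one particle.

    Velocity: v' = w*v + c1*r1*(pbest - x) + c2*r2*(gbest - x)
    Position: x' = x + v'
    r1 and r2 are fresh uniform random numbers in [0, 1) per dimension.
    """
    new_x, new_v = [], []
    for xi, vi, pi, gi in zip(x, v, pbest, gbest):
        r1, r2 = rng.random(), rng.random()
        cognitive = c1 * r1 * (pi - xi)  # pull toward the particle's own best
        social = c2 * r2 * (gi - xi)     # pull toward the swarm's best
        vi_next = w * vi + cognitive + social
        new_v.append(vi_next)
        new_x.append(xi + vi_next)
    return new_x, new_v


# A particle already sitting on both best positions only keeps its inertia.
x_next, v_next = pso_step([1.0, 2.0], [0.5, -0.5], [1.0, 2.0], [1.0, 2.0])
print(x_next, v_next)
```

When the particle is already at both Pbest and the global best, the cognitive and social terms vanish and only the inertia term w*v moves it, which is why the inertia weight controls how far particles overshoot.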

Particle Swarm Optimization Steps

1. Initialize the particles with random solutions.
2. Evaluate the fitness value of each particle according to the objective function.
3. If the current fitness value is better than the previous Pbest, set the current fitness value as the new Pbest; otherwise keep the previous Pbest.
4. Choose the global best particle (the best particle of all Pbest particles).
5. Calculate the particles’ velocities according to equation (4).
6. Update the particles’ new positions according to equation (5).
7. Repeat from step 2 until the stopping criteria are satisfied.
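
The steps above can be sketched as a short loop. This is a minimal illustration assuming a continuous search space, the textbook inertia-weight update, and position clamping to fixed bounds (the clamping and the test function are assumptions, not taken from the source). For scheduling, each position would additionally have to be decoded into a task-to-resource mapping before the cost is evaluated.

```python
import random


def pso(fitness, dim, n_particles=50, iterations=50,
        w=0.9, c1=2.0, c2=2.0, bounds=(-5.0, 5.0), seed=0):
    """Minimize `fitness`; returns (best_position, best_fitness, history)."""
    rng = random.Random(seed)
    lo, hi = bounds
    # Step 1: random initial positions, zero initial velocities.
    xs = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vs = [[0.0] * dim for _ in range(n_particles)]
    # Steps 2-4 for the initial swarm: evaluate, set Pbest and the global best.
    pbest = [x[:] for x in xs]
    pbest_fit = [fitness(x) for x in xs]
    g = min(range(n_particles), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    history = [gbest_fit]  # history[0] is the cost at iteration 0

    for _ in range(iterations):  # Step 7: repeat until the iteration budget ends
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Step 5: velocity update.
                vs[i][d] = (w * vs[i][d]
                            + c1 * r1 * (pbest[i][d] - xs[i][d])
                            + c2 * r2 * (gbest[d] - xs[i][d]))
                # Step 6: position update, clamped to the bounds (assumption).
                xs[i][d] = min(hi, max(lo, xs[i][d] + vs[i][d]))
            # Steps 2-3: re-evaluate and keep the better of old and new Pbest.
            f = fitness(xs[i])
            if f < pbest_fit[i]:
                pbest[i], pbest_fit[i] = xs[i][:], f
                # Step 4: update the global best as soon as it improves.
                if f < gbest_fit:
                    gbest, gbest_fit = xs[i][:], f
        history.append(gbest_fit)
    return gbest, gbest_fit, history


def sphere(x):
    """Hypothetical test objective with its minimum at the origin."""
    return sum(v * v for v in x)


best, best_fit, history = pso(sphere, dim=3)
print(best_fit, len(history))
```

Because the global best is only replaced by a strictly better particle, the recorded history never increases, which matches the shape of a best-cost-per-iteration convergence plot.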

Improved PSO for Epigenomics scientific workflow application scheduling in Cancer Diagnosis

In the standard PSO algorithm, particles converge quickly, but the adjustments of the cognition component and the social component make particles search around the entire solution. As a result, the whole swarm can become trapped in a local optimum, and its capacity to jump out of a local optimum is rather weak. To avoid being trapped in a local optimum, the new PSO adopts a new information sharing mechanism. The proposed method records not only the best positions that an individual particle and the whole swarm have experienced, but also the worst positions.

Particle velocity and position renewal of IPSO
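
The exact IPSO update equations are not reproduced in the text, so the sketch below is only one plausible reading of "recording the worst positions", not the authors' method. It assumes that the standard velocity update gains two extra terms that push a particle away from its own worst position and from the swarm's worst position. The coefficients c3 and c4, the function names and the memory layout are all hypothetical.

```python
import random


def ipso_velocity(x, v, pbest, gbest, pworst, gworst,
                  w=0.9, c1=2.0, c2=2.0, c3=0.5, c4=0.5, rng=random):
    """Velocity update with attraction to best and repulsion from worst positions.

    c3 and c4 are hypothetical repulsion coefficients (not given in the source).
    """
    new_v = []
    for d in range(len(x)):
        attract = (c1 * rng.random() * (pbest[d] - x[d])
                   + c2 * rng.random() * (gbest[d] - x[d]))
        repel = (c3 * rng.random() * (x[d] - pworst[d])
                 + c4 * rng.random() * (x[d] - gworst[d]))
        new_v.append(w * v[d] + attract + repel)
    return new_v


def update_memory(x, fit, best, worst):
    """Track best and worst (position, fitness) pairs; lower fitness is better."""
    if fit < best[1]:
        best = (x[:], fit)
    if fit > worst[1]:
        worst = (x[:], fit)
    return best, worst


# A particle at its best position, with the worst positions one unit below it,
# is pushed further away from the worst positions.
v_new = ipso_velocity([1.0], [0.2], [1.0], [1.0], [0.0], [0.0])
best, worst = update_memory([1.0], 3.0, ([0.0], 5.0), ([0.0], 5.0))
print(v_new, best, worst)
```

The same `update_memory` helper can be applied once per particle for the personal records and once across the swarm for the global records.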

Results

The experiments were run and analyzed with CloudSim, using 10 resources with different processing speeds from Amazon EC2 services. The test was conducted for the task scheduling problem with 10 processors and 50 tasks. The experimental parameter settings of the PSO and IPSO algorithms are shown in Table 1.

Table 1

Parameters and Their Values for PSO and IPSO

Parameter description	Parameter value
Size of swarm	50
Self-recognition coefficient c1	2
Social coefficient c2	2
Inertia weight (w)	0.9
Iterations	50

Figure 3 plots the computation cost obtained by PSO and IPSO over 50 iterations for different sizes of the cancer diagnosis Epigenomics workflow application, namely Epigenomics_24, Epigenomics_46, Epigenomics_100 and Epigenomics_997. Initially, the particles are randomly initialized, so the initial total cost is always high. This initial cost corresponds to the 0th iteration. As the algorithm progresses, the convergence is drastic and the algorithm settles on a minimum very quickly. The average number of iterations needed for convergence is seen to be 30-35 for this application environment. The figure shows that IPSO usually had better average completion time values than PSO.

Pseudocode of IPSO Algorithm

Figure 3

Cancer Diagnosis Epigenomics Workflow Applications Computation Cost of PSO and IPSO

Comparative Analysis of PSO and IPSO

Table 2 compares the optimal total cost of PSO-based resource selection and the IPSO algorithm when varying the total data size of a workflow. For the Epigenomics_24 application with 24 tasks, IPSO achieves a 10.19 percent improvement over the PSO algorithm. For the Epigenomics_46 application with 46 tasks and the Epigenomics_100 application with 100 tasks, the proposed IPSO method attains improvements of 9.89 and 6.26 percent, respectively. For the Epigenomics_997 application with 997 tasks, the proposed IPSO method returns a 6.83 percent improvement in optimal total computation cost. In all four cases, IPSO-based mapping has a lower cost than the existing PSO-based mapping, by roughly 6 to 10 percent.

Table 2

Comparison of Computation Cost with Various Data Size for PSO and IPSO

Size of Data	PSO	IPSO	Percentage of Improvement
Epigenomics_24	31.19	28.01	10.19%
Epigenomics_46	31.32	28.22	9.89%
Epigenomics_100	33.03	30.96	6.26%
Epigenomics_997	58.7	54.69	6.83%
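
The percentage column can be recomputed from the PSO and IPSO costs in Table 2. The short check below assumes that improvement is defined as the cost reduction relative to the PSO cost; the variable and function names are illustrative.

```python
# Costs copied from Table 2: workflow -> (PSO cost, IPSO cost).
table2 = {
    "Epigenomics_24": (31.19, 28.01),
    "Epigenomics_46": (31.32, 28.22),
    "Epigenomics_100": (33.03, 30.96),
    "Epigenomics_997": (58.7, 54.69),
}


def improvement_percent(pso_cost, ipso_cost):
    """Relative cost reduction of IPSO over PSO, in percent (assumed definition)."""
    return 100.0 * (pso_cost - ipso_cost) / pso_cost


improvements = {name: improvement_percent(p, i) for name, (p, i) in table2.items()}
for name, pct in improvements.items():
    print(f"{name}: {pct:.2f}%")
```

Under this definition the recomputed values agree with the reported 10.19%, 9.89%, 6.26% and 6.83% to within 0.01 percentage points, which suggests the reported figures were truncated rather than rounded.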

Discussion

The burgeoning fields of genomics and epigenomics comprise essential facets of modern cancer research. Single genes and groups of genes from the same pathway have been identified as differentially methylated in cancers, and some have been used as molecular biomarkers to identify patients with a better or a worse prognosis. Growing evidence indicates that new epigenomic tools will increasingly affect the way cancer is monitored and managed in the future. The Epigenomics scientific workflow application is essentially a data processing pipeline that automates the execution of the various genome sequencing operations or tasks. The cloud is a popular distributed computing platform for deploying large-scale Epigenomics scientific workflow applications. Scheduling Epigenomics scientific workflow tasks is a complicated problem on a cloud platform. The proposed IPSO method achieved lower total cost than the standard PSO algorithm. The performance of IPSO was analyzed with different cancer diagnosis tasks.

Statement of Conflict of Interest

The authors whose names are listed immediately below certify that they have no affiliations with or involvement in any organization or entity with any financial interest (such as honoraria; educational grants; participation in speakers’ bureaus; membership, employment, consultancies, stock ownership, or other equity interest; and expert testimony or patent-licensing arrangements), or non-financial interest (such as personal or professional relationships, affiliations, knowledge or beliefs) in the subject matter or materials discussed in this manuscript. Author names: Sadhasivam N, Balamurugan R, Pandi M

11 in total

1. The mutation spectrum revealed by paired genome sequences from a lung cancer patient.

Authors: William Lee; Zhaoshi Jiang; Jinfeng Liu; Peter M Haverty; Yinghui Guan; Jeremy Stinson; Peng Yue; Yan Zhang; Krishna P Pant; Deepali Bhatt; Connie Ha; Stephanie Johnson; Michael I Kennemer; Sankar Mohan; Igor Nazarenko; Colin Watanabe; Andrew B Sparks; David S Shames; Robert Gentleman; Frederic J de Sauvage; Howard Stern; Ajay Pandita; Dennis G Ballinger; Radoje Drmanac; Zora Modrusan; Somasekar Seshagiri; Zemin Zhang
Journal: Nature Date: 2010-05-27 Impact factor: 49.962

2. HER2 status and benefit from adjuvant trastuzumab in breast cancer.

Authors: Soonmyung Paik; Chungyeul Kim; Norman Wolmark
Journal: N Engl J Med Date: 2008-03-27 Impact factor: 91.245

3. Studies of the HER-2/neu proto-oncogene in human breast and ovarian cancer.

Authors: D J Slamon; W Godolphin; L A Jones; J A Holt; S G Wong; D E Keith; W J Levin; S G Stuart; J Udove; A Ullrich
Journal: Science Date: 1989-05-12 Impact factor: 47.728

4. A comprehensive catalogue of somatic mutations from a human cancer genome.

Authors: Erin D Pleasance; R Keira Cheetham; Philip J Stephens; David J McBride; Sean J Humphray; Chris D Greenman; Ignacio Varela; Meng-Lay Lin; Gonzalo R Ordóñez; Graham R Bignell; Kai Ye; Julie Alipaz; Markus J Bauer; David Beare; Adam Butler; Richard J Carter; Lina Chen; Anthony J Cox; Sarah Edkins; Paula I Kokko-Gonzales; Niall A Gormley; Russell J Grocock; Christian D Haudenschild; Matthew M Hims; Terena James; Mingming Jia; Zoya Kingsbury; Catherine Leroy; John Marshall; Andrew Menzies; Laura J Mudie; Zemin Ning; Tom Royce; Ole B Schulz-Trieglaff; Anastassia Spiridou; Lucy A Stebbings; Lukasz Szajkowski; Jon Teague; David Williamson; Lynda Chin; Mark T Ross; Peter J Campbell; David R Bentley; P Andrew Futreal; Michael R Stratton
Journal: Nature Date: 2009-12-16 Impact factor: 49.962

5. The regulation of direct-to-consumer genetic tests.

Authors: Jane Kaye
Journal: Hum Mol Genet Date: 2008-10-15 Impact factor: 6.150

6. A small-cell lung cancer genome with complex signatures of tobacco exposure.

Authors: Erin D Pleasance; Philip J Stephens; Sarah O'Meara; David J McBride; Alison Meynert; David Jones; Meng-Lay Lin; David Beare; King Wai Lau; Chris Greenman; Ignacio Varela; Serena Nik-Zainal; Helen R Davies; Gonzalo R Ordoñez; Laura J Mudie; Calli Latimer; Sarah Edkins; Lucy Stebbings; Lina Chen; Mingming Jia; Catherine Leroy; John Marshall; Andrew Menzies; Adam Butler; Jon W Teague; Jonathon Mangion; Yongming A Sun; Stephen F McLaughlin; Heather E Peckham; Eric F Tsung; Gina L Costa; Clarence C Lee; John D Minna; Adi Gazdar; Ewan Birney; Michael D Rhodes; Kevin J McKernan; Michael R Stratton; P Andrew Futreal; Peter J Campbell
Journal: Nature Date: 2009-12-16 Impact factor: 49.962

7. The landscape of somatic copy-number alteration across human cancers.

Authors: Rameen Beroukhim; Craig H Mermel; Dale Porter; Guo Wei; Soumya Raychaudhuri; Jerry Donovan; Jordi Barretina; Jesse S Boehm; Jennifer Dobson; Mitsuyoshi Urashima; Kevin T Mc Henry; Reid M Pinchback; Azra H Ligon; Yoon-Jae Cho; Leila Haery; Heidi Greulich; Michael Reich; Wendy Winckler; Michael S Lawrence; Barbara A Weir; Kumiko E Tanaka; Derek Y Chiang; Adam J Bass; Alice Loo; Carter Hoffman; John Prensner; Ted Liefeld; Qing Gao; Derek Yecies; Sabina Signoretti; Elizabeth Maher; Frederic J Kaye; Hidefumi Sasaki; Joel E Tepper; Jonathan A Fletcher; Josep Tabernero; José Baselga; Ming-Sound Tsao; Francesca Demichelis; Mark A Rubin; Pasi A Janne; Mark J Daly; Carmelo Nucera; Ross L Levine; Benjamin L Ebert; Stacey Gabriel; Anil K Rustgi; Cristina R Antonescu; Marc Ladanyi; Anthony Letai; Levi A Garraway; Massimo Loda; David G Beer; Lawrence D True; Aikou Okamoto; Scott L Pomeroy; Samuel Singer; Todd R Golub; Eric S Lander; Gad Getz; William R Sellers; Matthew Meyerson
Journal: Nature Date: 2010-02-18 Impact factor: 49.962

8. Systematic sequencing of renal carcinoma reveals inactivation of histone modifying genes.

Authors: Gillian L Dalgliesh; Kyle Furge; Chris Greenman; Lina Chen; Graham Bignell; Adam Butler; Helen Davies; Sarah Edkins; Claire Hardy; Calli Latimer; Jon Teague; Jenny Andrews; Syd Barthorpe; Dave Beare; Gemma Buck; Peter J Campbell; Simon Forbes; Mingming Jia; David Jones; Henry Knott; Chai Yin Kok; King Wai Lau; Catherine Leroy; Meng-Lay Lin; David J McBride; Mark Maddison; Simon Maguire; Kirsten McLay; Andrew Menzies; Tatiana Mironenko; Lee Mulderrig; Laura Mudie; Sarah O'Meara; Erin Pleasance; Arjunan Rajasingham; Rebecca Shepherd; Raffaella Smith; Lucy Stebbings; Philip Stephens; Gurpreet Tang; Patrick S Tarpey; Kelly Turrell; Karl J Dykema; Sok Kean Khoo; David Petillo; Bill Wondergem; John Anema; Richard J Kahnoski; Bin Tean Teh; Michael R Stratton; P Andrew Futreal
Journal: Nature Date: 2010-01-06 Impact factor: 49.962

9. DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome.

Authors: Timothy J Ley; Elaine R Mardis; Li Ding; Bob Fulton; Michael D McLellan; Ken Chen; David Dooling; Brian H Dunford-Shore; Sean McGrath; Matthew Hickenbotham; Lisa Cook; Rachel Abbott; David E Larson; Dan C Koboldt; Craig Pohl; Scott Smith; Amy Hawkins; Scott Abbott; Devin Locke; Ladeana W Hillier; Tracie Miner; Lucinda Fulton; Vincent Magrini; Todd Wylie; Jarret Glasscock; Joshua Conyers; Nathan Sander; Xiaoqi Shi; John R Osborne; Patrick Minx; David Gordon; Asif Chinwalla; Yu Zhao; Rhonda E Ries; Jacqueline E Payton; Peter Westervelt; Michael H Tomasson; Mark Watson; Jack Baty; Jennifer Ivanovich; Sharon Heath; William D Shannon; Rakesh Nagarajan; Matthew J Walter; Daniel C Link; Timothy A Graubert; John F DiPersio; Richard K Wilson
Journal: Nature Date: 2008-11-06 Impact factor: 49.962

10. Complex landscapes of somatic rearrangement in human breast cancer genomes.

Authors: Philip J Stephens; David J McBride; Meng-Lay Lin; Ignacio Varela; Erin D Pleasance; Jared T Simpson; Lucy A Stebbings; Catherine Leroy; Sarah Edkins; Laura J Mudie; Chris D Greenman; Mingming Jia; Calli Latimer; Jon W Teague; King Wai Lau; John Burton; Michael A Quail; Harold Swerdlow; Carol Churcher; Rachael Natrajan; Anieta M Sieuwerts; John W M Martens; Daniel P Silver; Anita Langerød; Hege E G Russnes; John A Foekens; Jorge S Reis-Filho; Laura van 't Veer; Andrea L Richardson; Anne-Lise Børresen-Dale; Peter J Campbell; P Andrew Futreal; Michael R Stratton
Journal: Nature Date: 2009-12-24 Impact factor: 49.962