Shuyun Ye, John A Dawson, Christina Kendziorski.
Abstract
Genomic-based studies of disease now involve diverse types of data collected on large groups of patients. A major challenge facing statistical scientists is how best to combine the data, extract important features, and comprehensively characterize the ways in which they affect an individual's disease course and likelihood of response to treatment. We have developed a survival-supervised latent Dirichlet allocation (survLDA) modeling framework to address these challenges. Latent Dirichlet allocation (LDA) models have proven extremely effective at identifying themes common across large collections of text, but applications to genomics have been limited. Our framework extends LDA to the genome by considering each patient as a "document" with "text" detailing his/her clinical events and genomic state. We then further extend the framework to allow for supervision by a time-to-event response. The model enables the efficient identification of collections of clinical and genomic features that co-occur within patient subgroups, and then characterizes each patient by those features. An application of survLDA to The Cancer Genome Atlas ovarian project identifies informative patient subgroups showing differential response to treatment, and validation in an independent cohort demonstrates the potential for patient-specific inference.
Keywords: cancer; genomics; latent Dirichlet allocation; survival; time-to-event
Year: 2015 PMID: 25733795 PMCID: PMC4332045 DOI: 10.4137/CIN.S16354
Source DB: PubMed Journal: Cancer Inform ISSN: 1176-9351
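The patient-as-document representation described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the patient records, the z-score cutoff, the gene name GENE_B, and the word naming scheme (`_mRNA_low`, `_mRNA_high`) are hypothetical and are not taken from the survLDA implementation.

```python
from collections import Counter

# Hypothetical per-patient records (made-up values, for illustration only).
patients = {
    "P1": {"mRNA": {"CYP19A1": -1.8, "GENE_B": 0.2}, "clinical": ["stage_III"]},
    "P2": {"mRNA": {"CYP19A1": 0.1, "GENE_B": 2.1}, "clinical": ["stage_IV"]},
}

Z_CUTOFF = 1.5  # assumed threshold for calling under- or overexpression


def build_document(record, cutoff=Z_CUTOFF):
    """Turn one patient's clinical events and genomic state into a list of words."""
    words = list(record["clinical"])
    for gene, z in sorted(record["mRNA"].items()):
        if z <= -cutoff:
            words.append(f"{gene}_mRNA_low")   # word entered for underexpression
        elif z >= cutoff:
            words.append(f"{gene}_mRNA_high")  # word entered for overexpression
    return words


documents = {pid: build_document(rec) for pid, rec in patients.items()}
vocabulary = sorted({word for doc in documents.values() for word in doc})
# Bag-of-words counts per patient, the usual input to an LDA-style model.
counts = {pid: Counter(doc) for pid, doc in documents.items()}
```

The resulting word counts are the kind of input a standard LDA fit expects; the supervision by a time-to-event response that distinguishes survLDA is not sketched here.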
Figure 1. The left panel shows a heat map of the estimated patient-specific distributions over topics (θ) for each of 511 patients (the background topic is not shown). Topics are given in the rows; patients are clustered along the columns. Colors range from deep blue (topic underrepresented in the patient’s document) to red (topic overrepresented). The right panel shows Kaplan–Meier survival curves for patients classified into one of the six nonbackground topics. Each patient was assigned to the topic having highest weight in his/her document, as estimated by θ1:.
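The assignment rule in the Figure 1 caption (each patient goes to the nonbackground topic with the highest weight in his/her document) can be sketched as follows. The θ values and the position of the background topic (column 0) are assumptions chosen for illustration, not estimates from the paper.

```python
import numpy as np

# Hypothetical theta matrix: rows are patients, columns are topics.
# Column 0 is assumed to be the background topic.
theta = np.array([
    [0.50, 0.30, 0.15, 0.05],
    [0.40, 0.10, 0.10, 0.40],
    [0.20, 0.25, 0.45, 0.10],
])
BACKGROUND = 0


def assign_topics(theta, background=BACKGROUND):
    """Return, per patient, the index of the highest-weight nonbackground topic."""
    masked = theta.astype(float).copy()
    masked[:, background] = -np.inf  # exclude the background topic from the argmax
    return masked.argmax(axis=1)


assignments = assign_topics(theta)
```

Grouping patients by `assignments` gives the strata that a Kaplan–Meier comparison like the one in the right panel would then be computed on.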
Figure 2. The left panel shows a heat map of the topics derived from survLDA. Topics are shown in the columns; words are clustered along the rows. The colors range from blue (word underrepresented in the topic) to red (word overrepresented), with white in the middle (average representation). To aid in interpretation, we add the risk direction and data source from which each word was derived. For example, CYP19A1 – mRNA indicates that underexpression of CYP19A1 is associated with increased risk and that CYP19A1 words were entered into a document for patients with underexpression of CYP19A1. As the heat map shows, there are many words that distinguish topics 1 and 2, having high weight in one topic but not the other. The insets highlight 40 such words; those having high weight in topic 1 (topic 2) are shown in the upper (lower) right.
Figure 3. Heat maps showing co-occurrence of the 40 high-weight topic 1 and topic 2 words shown in Figure 2. The left heat map considers the 25 patients having documents with highest weight on topic 1. Shown are the percentages of those patients having both words in their document, ranging from 0 (blue) to 100% (red). The black line separates topic 1 and topic 2 words. The right panel is similar, showing percentages of co-occurrence in documents of the 85 patients best described by topic 2 words.
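The co-occurrence percentages described in the Figure 3 caption can be computed along these lines. This is a minimal sketch: the presence matrix, the topic weights, and the helper names `top_patients` and `cooccurrence_percent` are hypothetical, and the paper's own procedure may differ in detail.

```python
import numpy as np


def top_patients(topic_weight, n):
    """Indices of the n patients with the highest weight on one topic."""
    return np.argsort(topic_weight)[::-1][:n]


def cooccurrence_percent(presence):
    """Percentage of patients whose document contains both words, per word pair.

    presence is a (patients x words) 0/1 matrix.
    """
    presence = np.asarray(presence, dtype=float)
    return 100.0 * (presence.T @ presence) / presence.shape[0]


# Toy data: 4 patients, 3 words (1 means the word appears in the document).
presence = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 1, 1],
])
topic_weight = np.array([0.7, 0.6, 0.1, 0.2])  # assumed weights on one topic

rows = top_patients(topic_weight, 2)
pct = cooccurrence_percent(presence[rows])
```

Each entry `pct[i, j]` is the share of the selected patients whose documents contain both word `i` and word `j`; the diagonal gives the share of patients having each single word.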
Figure 4. Topic-based prediction of overall survival in an independent patient cohort.