Aubin Thomas, Sylvain Barriere, Lucile Broseus, Julie Brooke, Claudio Lorenzi, Jean-Philippe Villemin, Gregory Beurier, Robert Sabatier, Christelle Reynes, Alban Mancheron, William Ritchie.
Abstract
Comparative analysis of high-throughput sequencing data between multiple conditions often involves mapping sequencing reads to a reference, followed by downstream bioinformatics analyses. Both of these steps may introduce heavy bias and potential data loss. This is especially true in studies where patient transcriptomes or genomes may vary from their references, such as in cancer. Here we describe a novel approach and associated software that use advances in genetic algorithms and feature selection to comprehensively explore massive volumes of sequencing data, classifying samples and discovering new sequences of interest without a mapping step and without intensive use of specialized bioinformatics pipelines. We demonstrate that our approach, called GECKO (GEnetic Classification using k-mer Optimization), is effective at classifying and extracting meaningful sequences from multiple types of sequencing approaches, including mRNA, microRNA, and DNA methylome data.
Keywords: Machine learning; Predictive medicine
Year: 2019 PMID: 31240260 PMCID: PMC6586863 DOI: 10.1038/s42003-019-0456-9
Source DB: PubMed Journal: Commun Biol ISSN: 2399-3642
Fig. 1 Overview of the GECKO algorithm. Input fastq or bam files from two or more conditions are transformed into a matrix of k-mer counts across all samples. The k-mers for which the counts are below a noise threshold or that do not vary across samples are removed (red dots on the right of the k-mer matrix). The adaptive genetic algorithm randomly selects groups of k-mers from the k-mer matrix to form individuals. These individuals go through rounds of mutation, crossing-over and selection to discover individuals capable of classifying the input samples with high accuracy
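The first two steps in the Fig. 1 caption (building a k-mer count matrix, then removing noisy or non-varying k-mers) can be illustrated with a minimal Python sketch. This is not GECKO's implementation: the function names, the `min_count` noise threshold and the use of population variance as the "does not vary" test are all assumptions made for illustration, and reads are taken as plain strings rather than parsed from fastq or bam files.

```python
from collections import Counter
from statistics import pvariance


def count_kmers(reads, k):
    """Count every overlapping k-mer in a list of read sequences."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_matrix(samples, k):
    """Return (kmers, matrix); matrix[j][s] is the count of kmers[j] in sample s."""
    per_sample = [count_kmers(reads, k) for reads in samples]
    kmers = sorted(set().union(*per_sample))
    matrix = [[counts[kmer] for counts in per_sample] for kmer in kmers]
    return kmers, matrix


def filter_kmers(kmers, matrix, min_count=2, min_variance=0.0):
    """Drop k-mers that never reach min_count (hypothetical noise threshold)
    or whose counts do not vary across samples."""
    kept = [
        (kmer, row)
        for kmer, row in zip(kmers, matrix)
        if max(row) >= min_count and pvariance(row) > min_variance
    ]
    return [kmer for kmer, _ in kept], [row for _, row in kept]


# Two toy samples, one read each (assumed input, not data from the paper).
samples = [["ACGTACGT"], ["ACGTTTTT"]]
kmers, matrix = kmer_matrix(samples, k=4)
kept, rows = filter_kmers(kmers, matrix, min_count=2)
print(kept)  # ['ACGT', 'TTTT']
```

In this toy input, the five k-mers seen only once fall below the assumed noise threshold, leaving the two k-mers whose counts differ between the samples.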
Fig. 2 GECKO can accurately classify miRNA data from seven types of blood cells using three k-mers. a GECKO output showing the separation of the seven blood-cell types at each generation (G) of GECKO analysis using t-SNE visualization applied to k-mer counts. b GECKO output showing the accuracy of separation for the training and test sets across 6000 generations. c Variance-stabilized counts of the three miRNAs that correspond to the three k-mers discovered by GECKO across the seven blood-cell types (n = 43 biologically independent donors)
Fig. 3 GECKO discovers ten 30-mers that classify breast cancer subtypes. Comparison of breast cancer subtype classification using the frequency of k-mers discovered by GECKO and the transcript per million values of the PAM50 gene set. Panels show the t-SNE separation of the four classes
Confusion matrices of breast cancer subtype classification using the frequency of k-mers discovered by GECKO and the transcript per million values of the PAM50 gene set
| Predicted class | GECKO, true Basal | GECKO, true Her2 | GECKO, true LumA | GECKO, true LumB | PAM50 TPM, true Basal | PAM50 TPM, true Her2 | PAM50 TPM, true LumA | PAM50 TPM, true LumB |
|---|---|---|---|---|---|---|---|---|
| Basal | 97.7 | 2.2 | 0 | 0 | 86 | 5.2 | 5.5 | 3.3 |
| Her2 | 2 | 87.5 | 6.2 | 4.2 | 15.3 | 60.6 | 3.6 | 20.6 |
| LumA | 1.5 | 1.5 | 92.3 | 4.6 | 15.3 | 2.2 | 88.1 | 8.6 |
| LumB | 0 | 3.4 | 18.8 | 77.8 | 5.9 | 15.4 | 36.5 | 42.2 |
Fig. 4 GECKO voting mode for small sample sizes. a GECKO’s voting mode runs 10 separate genetic algorithms with added Gaussian noise. The best solutions of these runs are fed into a final genetic algorithm to produce a final solution. b GECKO output showing the t-SNE separation of patients with a complete response to chemotherapy from those without one, using five k-mers from the winning individual. Triangles correspond to the test dataset, which was excluded from GECKO training and can thus be used to estimate overfitting
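The voting mode described in the Fig. 4 caption can be sketched as a toy Python example. It is a simplified stand-in, not GECKO's adaptive genetic algorithm, and rests on several assumptions: the noise is added to the k-mer counts, the "best solutions" are pooled as the candidate k-mers for the final run, fitness is leave-one-out nearest-centroid accuracy, and every name and default (`run_ga`, `voting_mode`, `noise_sd`, population size) is hypothetical.

```python
import random


def fitness(individual, matrix, labels):
    """Leave-one-out nearest-centroid accuracy on the selected k-mer rows."""
    n = len(labels)
    correct = 0
    for held_out in range(n):
        centroids = {}
        for label in sorted(set(labels)):
            members = [s for s in range(n) if s != held_out and labels[s] == label]
            if members:
                centroids[label] = [
                    sum(matrix[j][s] for s in members) / len(members)
                    for j in individual
                ]
        point = [matrix[j][held_out] for j in individual]
        predicted = min(
            centroids,
            key=lambda lab: sum((p - c) ** 2 for p, c in zip(point, centroids[lab])),
        )
        correct += predicted == labels[held_out]
    return correct / n


def make_child(a, b, pool, n_kmers, rng):
    """Crossing-over of two parents followed by a single mutation."""
    cut = rng.randrange(n_kmers + 1)
    child = list(dict.fromkeys(a[:cut] + b[cut:]))  # drop duplicated k-mers
    unused = [j for j in pool if j not in child]
    rng.shuffle(unused)
    while len(child) < n_kmers:  # refill after duplicates were dropped
        child.append(unused.pop())
    if unused:  # mutation: swap one k-mer for an unused one
        child[rng.randrange(n_kmers)] = unused.pop()
    return child


def run_ga(matrix, labels, pool, n_kmers, rng, generations=30, pop_size=20):
    """Individuals are groups of n_kmers distinct row indices drawn from pool."""
    population = [rng.sample(pool, n_kmers) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda ind: fitness(ind, matrix, labels), reverse=True)
        survivors = population[: pop_size // 2]  # selection
        children = [
            make_child(*rng.sample(survivors, 2), pool, n_kmers, rng)
            for _ in range(pop_size - len(survivors))
        ]
        population = survivors + children
    return max(population, key=lambda ind: fitness(ind, matrix, labels))


def voting_mode(matrix, labels, n_kmers, n_runs=10, noise_sd=0.5, seed=0):
    """Run n_runs noisy GAs, then a final GA restricted to their winning k-mers."""
    rng = random.Random(seed)
    all_rows = list(range(len(matrix)))
    finalists = set()
    for _ in range(n_runs):
        noisy = [[v + rng.gauss(0, noise_sd) for v in row] for row in matrix]
        finalists.update(run_ga(noisy, labels, all_rows, n_kmers, rng))
    return run_ga(matrix, labels, sorted(finalists), n_kmers, rng)
```

Calling `voting_mode(matrix, labels, n_kmers=5)` on a filtered k-mer count matrix returns the row indices of the winning individual. Scoring the final run on the noise-free counts is a design choice of this sketch, made so that the reported fitness reflects the observed data.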
Fig. 5 GECKO can accurately classify normal and CLL patients using k-mers from bisulfite sequencing data. a GECKO output showing the t-SNE separation of CLL and normal samples using 20 k-mers from the winning individual. b GECKO output of k-mer exploration across 20,000 generations; k-mers that are frequently found in winning organisms are displayed as horizontal lines across generations; dots represent k-mers that were selected in one generation but eliminated in the following generation, often due to a decrease in fitness of the model. c IGV screenshots showing the methylation status, as determined by the Bismark software, of normal and CLL samples in the regions corresponding to the three k-mers most frequently used in winning organisms