Anthony M Musolf, Emily R Holzinger, James D Malley, Joan E Bailey-Wilson.
Abstract
Genetic data have become increasingly complex within the past decade, leading researchers to pursue increasingly complex questions, such as those involving epistatic interactions and protein prediction. Traditional methods are ill-suited to answer these questions, but machine learning (ML) techniques offer an alternative. ML algorithms are commonly used in genetics to predict or classify subjects, but some methods also evaluate which features (variables) are responsible for a good prediction; this is called feature importance. Feature importance is critical in genetics, as researchers are often interested in which features (e.g., SNP genotype or environmental exposure) drive a prediction. It allows for deeper analysis beyond simple prediction, including the determination of risk factors associated with a given phenotype. Feature importance further permits the researcher to peer inside the black box of many ML algorithms to see how they work and which features are critical in informing a good prediction. This review focuses on ML methods that provide feature importance metrics for the analysis of genetic data. Five major categories of ML algorithms are described: k-nearest neighbors, artificial neural networks, deep learning, support vector machines, and random forests. The review ends with a discussion of how to choose the best machine for a data set. This review will be particularly useful for genetic researchers looking to use ML methods to answer questions beyond basic prediction and classification.
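To make the idea of feature importance concrete, here is a minimal sketch of one common metric, permutation importance: shuffle one feature at a time and record how much prediction accuracy drops. It is an illustration only, not the metric used by any particular program in this review; the toy genotype data, the `predict` rule, and all function names are assumptions made for the example.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(predict(row) == label for row, label in zip(rows, labels))
    return hits / len(rows)

def permutation_importance(predict, rows, labels, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in rows]
            rng.shuffle(column)  # break the link between feature j and the label
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical toy data: feature 0 is a genotype coded 0/1/2 that fully
# determines the phenotype; feature 1 is unrelated noise.
data_rng = random.Random(1)
rows = [[data_rng.choice([0, 1, 2]), data_rng.choice([0, 1, 2])] for _ in range(200)]
labels = [int(row[0] >= 1) for row in rows]

def predict(row):
    """Stand-in for a trained model; it only looks at feature 0."""
    return int(row[0] >= 1)

importances = permutation_importance(predict, rows, labels)
print(importances)  # feature 0 shows a clear drop; feature 1 shows none
```

Because the stand-in model ignores feature 1, shuffling that column never changes a prediction, so its importance is exactly zero, while shuffling feature 0 degrades accuracy substantially.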
Year: 2021 PMID: 34862561 PMCID: PMC9360120 DOI: 10.1007/s00439-021-02402-z
Source DB: PubMed Journal: Hum Genet ISSN: 0340-6717 Impact factor: 5.881
List of machine learning software
| Software name | Machine type | Software type | Application | Data type | Website |
|---|---|---|---|---|---|
| SURF | | Part of open-source package | Epistatic interactions | SNP | |
| STIR | | Standalone program | Epistatic interactions | SNP | |
| ReliefSeq | | Standalone program | Epistatic interactions | RNA-seq | |
| KNN-MDR | | Standalone program | Epistatic interactions | SNP | n/a |
| GANN | ANN | Standalone program | Gene-based expression association | RNA-seq | n/a |
| ANNI | ANN | Standalone program | Epistatic interactions | SNP | n/a |
| ATHENA | ANN | Standalone program | Epistatic interactions | SNP | |
| Basset | Deep learning | Standalone program | Noncoding annotation | DNA-seq | |
| DeepSEA | Deep learning | Standalone program | Noncoding annotation | DNA-seq | |
| DeepWAS | Deep learning | Standalone program | GWAS/annotation integration | GWAS | |
| DFIM | Deep learning | Standalone program | Epistatic interactions | DNA-seq | |
| PrimateAI | Deep learning | Standalone program | Variant pathogenicity | DNA-seq | |
| CADD | SVM | Standalone program | Variant pathogenicity | DNA-seq | |
| MSIpred | SVM | Python package | Microsatellite instability prediction | WES | |
| REVEL | RF | Standalone program | Variant pathogenicity | DNA-seq | |
| Random jungle | RF | R package | GWAS | SNP | |
| Ranger | RF | R package | GWAS | SNP | |
| Open target genetics | RF | Standalone program | SNP/gene prioritization | GWAS Results | |
| Permuted RF | RF | Standalone program | Epistatic interactions | SNP | n/a |
| RF fishing | RF | Standalone program | Epistatic interactions | SNP | n/a |
| SWSFS | RF | Standalone program | Epistatic interactions | SNP | n/a |
| r2VIM | RF | Standalone program | Epistatic interactions | SNP | |
| Boruta | RF | R package | Epistatic interactions | SNP | |
| Vita | RF | R package | Epistatic interactions | SNP | |
A list of the software referenced in this review. The columns give the software name, the type of machine used in the software, the type of software (i.e., whether it is a standalone program or a package), the application of the software, the type of data the software analyzes (note that programs that use SNP data can also use DNA-seq), and the link to download the software (if available).
Fig. 1 k-nearest neighbors. A diagram showing an example of the k-nearest neighbor machine. Subjects are plotted based on feature values, and an individual's classification is determined by a majority vote in the subject's neighborhood (k). The choice of k is crucial to classification. For instance, if we wished to classify the green individual based on k = 4, the individual would be classified as blue. If we extended this to k = 9, the individual would be classified as red.
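The majority vote described in Fig. 1 can be sketched in a few lines. The coordinates below are invented so that the query point flips from blue at k = 4 to red at k = 9, mirroring the caption; they are not taken from the figure, and `knn_classify` is a hypothetical helper name.

```python
import math
from collections import Counter

def knn_classify(train, query, k):
    """Label `query` by majority vote among its k nearest training points."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical subjects plotted on two feature axes.
train = [
    ((0.5, 0.0), "blue"), ((0.0, 0.6), "blue"), ((-0.7, 0.0), "blue"),
    ((0.0, -0.8), "red"), ((1.5, 0.0), "red"), ((0.0, 1.6), "red"),
    ((-1.7, 0.0), "red"), ((0.0, -1.8), "red"), ((1.4, 1.4), "red"),
    ((3.0, 3.0), "blue"), ((-3.0, 3.0), "blue"),
]
green = (0.0, 0.0)  # the individual to classify

print(knn_classify(train, green, k=4))  # three blue neighbors outvote one red
print(knn_classify(train, green, k=9))  # six red neighbors outvote three blue
```

Ties are not handled specially in this sketch; an odd k or a distance-weighted vote is a common way to avoid them in two-class problems.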
Fig. 2 Classification and Regression Trees (CART) and Random Forest. (a) Diagram showing a single CART. CARTs take a heterogeneous group of data and repeatedly split on feature values to create more homogeneous groups. (b) Diagram showing a random forest. A random forest is a collection of CARTs, each running on a slightly different subset of the same data set.
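As a rough illustration of Fig. 2, the sketch below finds the single split that makes the resulting groups most homogeneous (measured here by Gini impurity, one common choice) and then repeats the search on bootstrap resamples, as the trees of a forest would. It is a simplification under stated assumptions: real CARTs split recursively, real random forests also subsample features at each split, and the toy data and function names are hypothetical.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure group, larger for a more mixed group."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature index, threshold) with the lowest weighted impurity,
    or None if no feature can separate the rows."""
    best = None
    for j in range(len(rows[0])):
        for t in sorted({row[j] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[j] <= t]
            right = [y for row, y in zip(rows, labels) if row[j] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, j, t)
    return None if best is None else (best[1], best[2])

def bootstrap(rows, labels, rng):
    """Resample subjects with replacement, giving each tree a different subset."""
    idx = [rng.randrange(len(rows)) for _ in rows]
    return [rows[i] for i in idx], [labels[i] for i in idx]

# Hypothetical genotypes (coded 0/1/2) at two SNPs; SNP 0 separates the groups.
rows = [[0, 1], [0, 2], [1, 0], [2, 1], [1, 2], [2, 0]]
labels = ["control", "control", "case", "case", "case", "case"]

print(best_split(rows, labels))  # split chosen on the full data set

# A tiny "forest" of one-split trees: count which feature each tree picks.
rng = random.Random(0)
chosen = Counter()
for _ in range(25):
    split = best_split(*bootstrap(rows, labels, rng))
    if split is not None:
        chosen[split[0]] += 1
print(chosen)
```

Counting how often each feature is selected across trees is a crude stand-in for the importance scores that random forest software reports.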
Fig. 3 Artificial neural networks. A schematic of an artificial neural network. Data are analyzed by different models, the results of which are passed on to a new set of models. In this example, data are first analyzed in the input layer (blue). The results are then passed on to an intermediate layer, called a hidden layer (green). Finally, the results of the hidden layer are passed on to and analyzed by the models of the output layer (red).
Fig. 4 Deep learning. A schematic of a deep learning machine. Deep learning is a specialized version of artificial neural networks that contains many additional hidden layers.
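The layered flow in Figs. 3 and 4 can be sketched as a forward pass in which each layer's outputs become the next layer's inputs. The weights below are arbitrary placeholders rather than trained values, the sigmoid activation is just one possible choice, and the names `dense` and `forward` are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases):
    """Each node takes a weighted sum of all inputs, then applies a sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]

def forward(features, layers):
    """Pass the features through each (weights, biases) layer in order."""
    activations = features
    for weights, biases in layers:
        activations = dense(activations, weights, biases)
    return activations

# Shallow network (Fig. 3): three inputs -> two hidden nodes -> one output.
hidden = ([[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]], [0.0, 0.1])
output = ([[1.2, -0.7]], [0.05])
shallow = [hidden, output]

# Deeper network (Fig. 4): the same idea with extra hidden layers in between.
extra_hidden = ([[0.6, -0.4], [0.3, 0.9]], [0.0, 0.0])
deep = [hidden, extra_hidden, extra_hidden, output]

genotypes = [0, 1, 2]  # hypothetical SNP genotypes for one subject
print(forward(genotypes, shallow))
print(forward(genotypes, deep))
```

The only structural difference between the two networks is the length of the `layers` list, which is the sense in which deep learning extends the basic neural network.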
Fig. 5 Support vector machines. A diagram showing an example of a support vector machine. Subjects are plotted based on feature values, and a special boundary called the hyperplane is formed to classify individuals. The hyperplane is oriented as far as possible from the two closest individuals in each class (in this example, the orange and purple individuals).
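The geometry in Fig. 5 can be illustrated without training a full support vector machine. The sketch below assumes a hyperplane of the form w·x + b = 0, classifies points by which side they fall on, and compares two hand-picked candidate hyperplanes by their margin (the distance to the closest correctly classified individual). The points, labels, and candidates are invented for the example; an actual SVM solver would search over all hyperplanes rather than two.

```python
import math

def signed_distance(point, w, b):
    """Signed distance from `point` to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, point)) + b) / norm

def classify(point, w, b):
    """Assign class +1 or -1 according to the side of the hyperplane."""
    return 1 if signed_distance(point, w, b) >= 0 else -1

def margin(points, labels, w, b):
    """Smallest label-signed distance; positive only if every point
    lies on the correct side of the hyperplane."""
    return min(y * signed_distance(p, w, b) for p, y in zip(points, labels))

# Hypothetical subjects in two classes (-1 and +1).
points = [(1.0, 1.0), (2.0, 2.0), (4.0, 4.0), (5.0, 5.0)]
labels = [-1, -1, 1, 1]

# Two candidate boundaries, each given as (w, b).
candidates = {
    "centered": ((1.0, 1.0), -6.0),    # the line x + y = 6
    "off_center": ((1.0, 1.0), -5.0),  # the line x + y = 5
}

margins = {name: margin(points, labels, w, b) for name, (w, b) in candidates.items()}
best = max(margins, key=margins.get)
print(best, margins[best])

w, b = candidates[best]
print(classify((4.5, 4.0), w, b))  # a new individual on the +1 side
```

Both candidates separate the classes, but the centered line sits equally far from the closest individual on each side, which is the property the caption describes.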