Robert Eugene Hoyt, Dallas Snider, Carla Thompson, Sarita Mantravadi.
Abstract
BACKGROUND: We live in an era of explosive data generation that will continue to grow and involve all industries. One of the results of this explosion is the need for newer and more efficient data analytics procedures. Traditionally, data analytics required a substantial background in statistics and computer science. In 2015, International Business Machines Corporation (IBM) released the IBM Watson Analytics (IBMWA) software that delivered advanced statistical procedures based on the Statistical Package for the Social Sciences (SPSS). The latest entry of Watson Analytics into the field of analytical software products provides users with enhanced functions that are not available in many existing programs. For example, Watson Analytics automatically analyzes datasets, examines data quality, and determines the optimal statistical approach. Users can request exploratory, predictive, and visual analytics. Using natural language processing (NLP), users are able to submit additional questions for analyses in a quick response format. This analytical package is available free of charge to academic institutions (faculty and students) that plan to use the tools for noncommercial purposes.
Keywords: data analysis; data mining; machine learning; natural language processing; statistical data analysis
Year: 2016 PMID: 27729304 PMCID: PMC5080525 DOI: 10.2196/publichealth.5810
Source DB: PubMed Journal: JMIR Public Health Surveill ISSN: 2369-2960
IBM Watson statistical tests.
| Statistical test | Indication |
| Analysis of variance (ANOVA) | Tests mean differences among 2 or more groups and whether the mean target value varies across combinations of categories of 2 inputs; if the variation is significant, there is an interaction effect |
| Asymmetry index | Ratio of skewness to the standard error |
| Chi-square automatic interaction detector classification tree | Decision tree using chi-square for prediction |
| Chi-square automatic interaction detector regression tree | Decision tree using chi-square and regression for prediction |
| Chi-square tests | Using chi-square to compare frequencies in groups, independence, and marginal distributions |
| D’Agostino’s K-squared test of normality | Determines if normal distribution is present |
| Distribution test | Chi-square test compares conditional distributions with overall distribution |
| Fisher r-to-t test | Transforms Pearson’s |
| High low analysis | Partitions categories into high or low groups for analysis |
| Influence test | Chi-square test determines whether the number of records in a group is significantly different from the expected frequency |
| Model comparison test | Tests whether the key driver has an effect on the logistic regression |
| Paired samples t test | Dependent t test |
| Unusually high or low analysis | Determines which categories or combinations of categories across categorical fields have unusually high or low target mean values |
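
The asymmetry index in the table above is described only as the ratio of skewness to its standard error. A minimal Python sketch of that ratio follows. It assumes the adjusted Fisher-Pearson skewness and the usual large-sample standard error formula; whether Watson Analytics uses exactly these formulas is not stated in the source, and the function names are illustrative.

```python
import math


def sample_skewness(values):
    """Adjusted Fisher-Pearson sample skewness (assumed definition)."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 observations")
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    if m2 == 0:
        raise ValueError("skewness is undefined for constant data")
    g1 = m3 / m2 ** 1.5
    # Small-sample adjustment applied to the moment-based skewness.
    return g1 * math.sqrt(n * (n - 1)) / (n - 2)


def skewness_standard_error(n):
    """Standard error of skewness for a sample of size n (assumed formula)."""
    return math.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))


def asymmetry_index(values):
    """Ratio of skewness to its standard error."""
    return sample_skewness(values) / skewness_standard_error(len(values))


if __name__ == "__main__":
    # Hypothetical right-skewed data, for illustration only.
    print(asymmetry_index([1, 1, 1, 2, 10]))
```

A value far from zero suggests a skewed distribution; the cutoff used to flag a field as skewed is a design choice that the source does not specify.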
Features of IBM Watson Analytics Professional.
| Features | IBM Watson Analytics Professional |
| Maximum number of rows per dataset | 10,000,000 |
| Maximum number of columns per dataset | 500 |
| Input in .csv, .xls, or .xlsx formats | Uploaded from PC, Dropbox, IBM Cognos, Box, and Microsoft OneDrive |
| Data connections | IBM Cognos BI server, IBM dashDB, IBM DB2, IBM SQL, Microsoft SQL Server, MySQL, Oracle, and PostgreSQL |
| Storage | 100 GB; can be increased in increments of 50 GB |
Features of SPSS.
| Tool | Function |
| Core stats and graphics | Standard statistical tests for nominal, ordinal, interval, and ratio data |
| Integration with R and Python languages | Expands programmability involving additional languages |
| Multiple linear and mixed modeling | Analyze complex relationships |
| Nonlinear regression | Predictions on nonlinear data |
| Simulation modeling | Build risk models when inputs are uncertain |
| Geospatial analytics | Integrate and analyze time and location data |
| Customized tables | Analyze and report on numerical and categorical data |
| Charts, graphs, and mapping | Assist reporting capabilities |
| Missing value analysis | Address missing data, imputation, etc |
| Advanced data preparation | Identify data anomalies |
| Decision trees | Identify group relationships to predict future events |
| Forecasting techniques | Predict trends with time-series data |
| SPSS Text Analytics | This add-on complementary software package accompanies SPSS to provide qualitative data analyses and visuals for quantitative data simultaneously analyzed with SPSS |
Features of Microsoft Excel Analysis ToolPak.
| Tool | Function |
| Analysis of variance (ANOVA) | Determines variance on single or multiple factors and mean differences among 2 or more groups |
| Correlation | Determines if a pair of variables are related |
| Covariance | Determines if a pair of variables move together and mean differences in 2 or more groups when controlling for initial group differences |
| Descriptive statistics | Determines central tendency and variability in the data |
| Exponential smoothing | Predicts a value based on prior forecast |
| F test | Performs a 2-sample F test |
| Fourier analysis | Transforms time-based patterns into cyclical components |
| Histogram | Calculates frequencies of values in dataset |
| Moving average | Forecasts values based on prior averages |
| Random number generation | Fills a range with independent random numbers |
| Rank and percentile | Creates a table with ordinal and percentile ranks; used with chi-square analyses |
| Regression | Linear regression based on “least squares” method |
| Sampling | Creates a sample from a population |
| t test | Tests for equality of population means, with equal and unequal variances based on 1-group or 2-group datasets |
| z test | Performs a 1-sample z test |
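
The two forecasting tools in the table above, moving average and exponential smoothing, can be sketched in a few lines of Python. This is an illustration of the general techniques, not of Excel's exact implementation: the smoothing weight `alpha`, the choice to seed the first forecast with the first observation, and the function names are all assumptions.

```python
def moving_average(values, window):
    """Average of each run of `window` consecutive observations."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i - window + 1:i + 1]) / window
        for i in range(window - 1, len(values))
    ]


def exponential_smoothing(values, alpha):
    """One-step-ahead forecasts built from the prior forecast.

    forecast[t] = alpha * values[t - 1] + (1 - alpha) * forecast[t - 1]
    The first forecast is seeded with the first observation (an assumption).
    """
    if not values:
        raise ValueError("values must not be empty")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    forecasts = [values[0]]
    for actual in values[:-1]:
        forecasts.append(alpha * actual + (1 - alpha) * forecasts[-1])
    return forecasts


if __name__ == "__main__":
    # Hypothetical series, for illustration only.
    series = [10, 20, 30, 40, 50]
    print(moving_average(series, 3))
    print(exponential_smoothing(series, 0.5))
```

A larger `alpha` weights the most recent observation more heavily, while a smaller one keeps the forecast closer to the prior forecast.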
Features of Microsoft SQL Server Analysis Services.
| Tool | Function |
| Multiple data inputs | Use tabular data, spreadsheets, and text files |
| Data management | Data cleaning; management; and extract, transform, and load |
| Model testing | Use cross-validation, lift, and scatter charts |
| Data mining algorithms | Clustering, Naïve Bayes, decision trees, neural networks, regression, and association rules |
| Scripting language support | Mining objects are programmable |
Features of Waikato Environment for Knowledge Analysis.
| Tool | Function |
| Preprocess | Descriptive statistics and ability to preprocess data; accepts data from .csv and .arff files, web data, and database data, and can generate artificial data |
| Classify | Classify data from Bayes, neural networks, regression, decision trees, production rules, and other algorithms |
| Cluster | 12 clustering algorithms, including the common simple k-means |
| Associate | Association rules for pattern recognition in data |
| Select attributes | Searches for best set of attributes in dataset |
| Visualize | Visualization of data into graphs, etc |
Figure 1. The % obese by Florida county.
Figure 2. The relationship between % physically inactive and % obese by county.
Figure 3. Predictors for factors related to % obese.
Figure 4. Predict option for nominal category of less than or greater than 30% obese by county.
Figure 5. Dashboard of Florida County Health Rankings.
Results of the comparison of different analytical packages.
| Software package | Results |
| IBM Watson Analytics | Using logistic regression, the thallium test had a predictive strength of 76% (percent correct classification); chi-square test revealed |
| Statistical Package for the Social Sciences | Using logistic regression, a full model with thallium, number of vessels calcified on fluoroscopy, and interaction test increased the predictive strength to 78%; however, a statistically nonsignificant chi-square test indicated that the single model using thallium had the better model fit |
| SQL Server Analysis Services | Decision tree analysis yielded a sensitivity of 0.80 and specificity of 0.78, while neural networks yielded a sensitivity of 0.77 and a specificity of 0.92 |
| Waikato Environment for Knowledge Analysis | Decision tree precision for presence of heart disease was 0.93 and recall (sensitivity) was 0.63; precision for absence of heart disease was 0.69 and recall was 0.95 |
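
The sensitivity, specificity, precision, and recall values reported above are all derived from a 2x2 confusion matrix. The Python sketch below shows the standard definitions. The source does not give the underlying counts, so the numbers in the usage example are hypothetical and the function name is illustrative.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion-matrix counts.

    tp: diseased cases predicted as diseased
    fp: healthy cases predicted as diseased
    tn: healthy cases predicted as healthy
    fn: diseased cases predicted as healthy
    """
    if tp + fn == 0 or tn + fp == 0 or tp + fp == 0:
        raise ValueError("counts leave a metric undefined")
    return {
        "sensitivity": tp / (tp + fn),  # also called recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


if __name__ == "__main__":
    # Hypothetical counts, not taken from the study.
    print(classification_metrics(tp=8, fp=1, tn=9, fn=2))
```

Precision and recall are computed per class, which is why the Waikato Environment for Knowledge Analysis row reports separate values for the presence and absence of heart disease.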
Figure 6. Overlap between statistics and machine learning.