Longbing Cao, Chengzhang Zhu.
Abstract
Enterprise data typically involves multiple heterogeneous data sources and external data that respectively record business activities, transactions, customer demographics, status, behaviors, interactions and communications with the enterprise, and the consumption and feedback of its products, services, production, marketing, operations, and management. These sources involve enterprise DNA associated with domain-oriented transactions and master data, informational and operational metadata, and relevant external data. A critical challenge in enterprise data science is to enable an effective 'whole-of-enterprise' data understanding and data-driven discovery and decision-making on all-round enterprise DNA. Accordingly, here we introduce a neural encoder, Table2Vec, for automated universal representation learning of entities such as customers from all-round enterprise DNA with automated data characteristics analysis and data quality augmentation. The learned universal representations serve as representative and benchmarkable enterprise data genomes (similar to biological genomes and DNA in organisms) and can be used for enterprise-wide and domain-specific learning tasks. Table2Vec integrates automated universal representation learning on low-quality enterprise data and downstream learning tasks. Such automated universal enterprise representation and learning cannot be addressed by existing enterprise data warehouses (EDWs), business intelligence and corporate analytics systems, where 'enterprise big tables' are constructed with reporting and analytics conducted by specific analysts on respective domain subjects and goals. This approach addresses critical limitations and gaps of existing representation learning, enterprise analytics and cloud analytics, which are analytical subject, task and data-specific, creating analytical silos in an enterprise.
We illustrate Table2Vec in characterizing all-round customer data DNA in an enterprise on complex heterogeneous multi-relational big tables to build universal customer vector representations. The learned universal representation of each customer is all-round, representative and benchmarkable to support both enterprise-wide and domain-specific learning goals and tasks in enterprise data science. Table2Vec significantly outperforms the existing shallow, boosting and deep learning methods typically used for enterprise analytics. We further discuss the research opportunities, directions and applications of automated universal enterprise representation and learning and the learned enterprise data DNA for automated, all-purpose, whole-of-enterprise and ethical machine learning and data science.
Year: 2021 PMID: 34907319 PMCID: PMC8671530 DOI: 10.1038/s41598-021-03443-0
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Subject-oriented enterprise data science tailored for domain-specific subjects, problems and tasks by individual analysts in an enterprise. Driven by analytical subjects, analysts build workflows for each subject to implement data ETL, data preparation, feature engineering, and modeling.
Fitness of self-contained classifiers to address characteristics and issues in enterprise data.
| Classifiers | Imbalance | Mixed features | Heterogeneity | Sparsity | Inconsistency | Dynamics | Data quality issues |
|---|---|---|---|---|---|---|---|
| KNN | | | | | | | |
| Naive Bayes | | | | | | | |
| SVM | | | | | | | |
| Decision Tree | | | | | | | |
| Random Forest | | | | | | | |
| XGBoost | | | | | | | |
| DNN | | | | | | | |
| Table2Vec | | | | | | | |
Typical classifiers only focus on classification, requiring heavy and duplicated commitments by each analyst to address similar data quality issues. Table2Vec instead addresses both data quality issues and representation learning in one go, enabling end-to-end and automated enterprise data science.
Figure 2. Learning enterprise representations to capture the universal, representative and benchmarkable data genomes of an enterprise and their customers: enterprise data DNA, and customer data DNA.
Figure 3. Automated universal representation-based enterprise data science for general-purpose and specific tasks in an enterprise. Data-augmented universal representations are learned on integrated enterprise data and enhanced to address business and data characteristics and dynamics, which can then be used for enterprise-wide and domain-specific learning tasks or applications in the enterprise.
Figure 4. The neural framework of Table2Vec with automated data characteristics analysis and augmentation. NC-Recognizer automatically recognizes and distinguishes numerical and categorical features; SD-Recognizer automatically recognizes and distinguishes static and dynamic features; ND-Embedding vectorizes dynamic numerical feature values; NS-Embedding vectorizes static numerical feature values; CS-Embedding vectorizes static categorical feature values; CD-Embedding vectorizes dynamic categorical feature values.
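The two recognizers named in the Figure 4 caption can be illustrated with a small sketch. This is not the authors' implementation: the function `recognize_feature_types`, the toy records and both heuristics are assumptions made for illustration only. Here a feature counts as numerical when every value parses as a number, and as dynamic when any single customer shows more than one distinct value for it.

```python
from collections import defaultdict


def is_number(value):
    """Return True if the value can be parsed as a float."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def recognize_feature_types(records, entity_key):
    """Tag each feature as SN, SC, DN or DC using simple heuristics.

    Assumptions (not taken from the paper): a feature is numerical if all
    of its values parse as numbers, and dynamic if any entity has more
    than one distinct value for it across its records.
    """
    values = defaultdict(list)                          # feature -> all values
    per_entity = defaultdict(lambda: defaultdict(set))  # feature -> entity -> values
    for row in records:
        entity = row[entity_key]
        for name, value in row.items():
            if name == entity_key:
                continue
            values[name].append(value)
            per_entity[name][entity].add(value)

    tags = {}
    for name, vals in values.items():
        numerical = all(is_number(v) for v in vals)
        dynamic = any(len(seen) > 1 for seen in per_entity[name].values())
        tags[name] = ("D" if dynamic else "S") + ("N" if numerical else "C")
    return tags


# Hypothetical customer records; the column names are illustrative only.
records = [
    {"customer": "a", "age": 41, "segment": "retail", "balance": 120.5, "channel": "app"},
    {"customer": "a", "age": 41, "segment": "retail", "balance": 90.0, "channel": "branch"},
    {"customer": "b", "age": 29, "segment": "business", "balance": 300.0, "channel": "app"},
]
tags = recognize_feature_types(records, "customer")
print(tags)
```

Each tag would then select one of the four embedding paths in the caption (NS, CS, ND or CD). The embeddings themselves are learned by the network and are not sketched here.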
Figure 5. The information flow of Table2Vec. Customer-related records are extracted from the enterprise big data table, on which Table2Vec learns a representation of each customer; the learned customer representation can be used for business problem-specific applications or be further transformed to support learning tasks.
Figure 6. The interpretation module of Table2Vec. The features that discriminate a customer’s matter of concern are traced from the original data, forming the customer’s pattern corresponding to the learned discriminative representations.
Figure 7. Data characteristics of the enterprise data from a bank. SC: static categorical data; SN: static numerical data; DC: dynamic categorical data; DN: dynamic numerical data.
Figure 8. The ROC results of shallow and deep models versus Table2Vec-based classification on the banking enterprise data.
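For readers unfamiliar with ROC results such as those in Figure 8, the following minimal sketch shows how an ROC curve and its area (AUC) are computed from classifier scores. The labels and scores are made-up toy values, not data from the paper, and the helper names `roc_curve_points` and `auc` are illustrative.

```python
def roc_curve_points(labels, scores):
    """Return (false positive rate, true positive rate) points.

    The threshold sweeps from the highest score to the lowest; tied scores
    are consumed together so each threshold yields one point.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Toy example: 1 = positive class (e.g. churn), 0 = negative class.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_curve_points(labels, scores)
print(points)
print(round(auc(points), 4))
```

On this toy input, eight of the nine positive-negative pairs are ranked correctly, so the AUC is 8/9. A curve closer to the top-left corner, and hence a larger AUC, indicates a better ranking of positives over negatives.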
Figure 9. The weighted accuracy of classification by baselines versus Table2Vec-based classification on the banking enterprise data.
Figure 10. Data characteristics of a Kaggle data set. SC: static categorical data; SN: static numerical data; DC: dynamic categorical data; DN: dynamic numerical data.
Figure 11. The ROC results of shallow and deep models versus Table2Vec-based classification on the Kaggle data set.
Figure 12. The weighted accuracy of classification by baselines versus Table2Vec-based classification on the Kaggle data set.
Figure 13. The weighted accuracy of classification on the original table versus the Table2Vec-enabled representation.
Figure 14. The representative ‘data genomes’ discriminating customer churn vs. retention in the Kaggle data set. The most representative genomic factors of a customer and their intrinsic contributions to customer churn or retention are discovered by the Table2Vec interpretation module.
Figure 15. The representative ‘data genomes’ identified from the representations learned by Table2Vec.
Figure 16. The comparison of Table2Vec and XGBoost in terms of their learned customer data representations.