Vladimer B Kobayashi, Stefan T Mol, Hannah A Berkers, Gábor Kismihók, Deanne N Den Hartog.
Abstract
Despite the ubiquity of textual data, so far few researchers have applied text mining to answer organizational research questions. Text mining, which essentially entails a quantitative approach to the analysis of (usually) voluminous textual data, helps accelerate knowledge discovery by radically increasing the amount of data that can be analyzed. This article aims to acquaint organizational researchers with the fundamental logic underpinning text mining, the analytical stages involved, and contemporary techniques that may be used to achieve different types of objectives. The specific analytical techniques reviewed are (a) dimensionality reduction, (b) distance and similarity computing, (c) clustering, (d) topic modeling, and (e) classification. We describe how text mining may extend contemporary organizational research by allowing the testing of existing or new research questions with data that are likely to be rich, contextualized, and ecologically valid. After an exploration of how evidence for the validity of text mining output may be generated, we conclude the article by illustrating the text mining process in a job analysis setting using a dataset composed of job vacancies.
Keywords: classification; clustering; dimensionality reduction; job analysis; text mining; topic modeling; validation
Year: 2017 PMID: 29881248 PMCID: PMC5975701 DOI: 10.1177/1094428117722619
Source DB: PubMed Journal: Organ Res Methods ISSN: 1094-4281
Summary of Questions That Text Mining Can Address.
| Question | Name | Definition | Specific Techniques | Example | Text Representation | Examples of Potential Applications in Organizational Research |
|---|---|---|---|---|---|---|
| How do I assign text to predefined categories? | Text classification | Using an initial set of labeled text, train a classifier that can automatically sort text into existing categories. | Classification algorithms from data mining such as naive Bayes, support vector machines, neural networks, nearest neighbors, random forest, and boosting | Distinguishing between positive and negative product reviews; subjective genre classification of product reviews; assigning semantic attributes to product descriptions; annotating clinical documents with semantic tags | Vector space model (i.e., individual terms are used as features); kernel-based methods such as support vector machines deal with text treated as strings; can use other types of features but text is still represented as vectors | Predicting performance and charisma using leaders’ collected speeches and biographies |
| How do I extract topics from a corpus of documents? | Topic modeling | Identify patterns in word frequencies and use the patterns as a basis to define “topics.” For each document possible topics are determined. | Latent Dirichlet allocation model and probabilistic latent semantic analysis | Topic modeling to extract latent evidence during the analysis phase of digital forensic investigations; topic models to enhance the feature set for scientific titles classification | Vector space model where the words are weighted by their frequencies | Analyzing underlying motives or leadership themes from coded interview data, formal vision statements, and company mission or vision statements |
| How can I form groups of text? | Text clustering | Define a concept of text similarity. Use the concept to group documents together. Each group is called a cluster. Documents in the same cluster are more similar than documents in different clusters. | K-means, hierarchical clustering, biclustering, and nonnegative matrix factorization | Clustering clinical trial records to narrow down search results about existing protocols; organizing collections of legal documents and assisting automatic generation of legal taxonomies | Vector space model; can use other types of features but text is still represented as vectors | Investigating patterns of communication between different parties through the analysis of emails or other virtual communication between employees within firms |
| How can I summarize text and extract keywords and key sentences? | Text summarization | Measuring the importance of each sentence (word) in a document and using a threshold to determine which sentences (words) to retain and which to delete | Content selection using pattern matching, hidden Markov models, and keyword or key phrase extraction | Biographical summarization; summarizing web page content for display on small screens of handheld devices; keyword extraction in publications to narrow down and organize query results to support systematic reviews | Vector space model; text is treated in terms of strings | Automatic summarization of companies’ code of conduct to gain greater understanding of what is or is not currently included in organizational policy on ethical behavior in the workplace |
| How can I analyze trends in text? | Keyword extraction over time, dynamic topic modeling, and clustering with temporal information | Find interesting terms or topics and analyze changes of usage or prevalence in documents indexed by time. | Most frequent term extraction, and dynamic topic modeling | Analyzing trends in SMS messages by tracking the use of specific keywords; analyzing information in software repositories to model software project progress; tracing significant historical trends in the field of cognition | Text is treated in terms of strings; vector space model | Analyzing changes in emphasis in companies’ code of conduct in reaction to specific events; identifying emergent skills in a corpus of job vacancies |
| How can I find other documents that are similar to the one I have? | Distance and similarity | Given a document, find other similar documents | Distance metrics and similarity measures | Information retrieval; input to clustering | Vector space model | Analyzing conversation between people to facilitate exchange of rewarding information or detection of dangerous activities |
| After transforming documents using the vector space model, how can I cope with many variables? | Dimensionality reduction techniques | Reduce the number of variables while preserving relative similarity among documents | Feature selection techniques based on thresholding (e.g., information gain) and feature transformation techniques such as principal component analysis, latent semantic analysis, and random projection | Most dimensionality reduction techniques promote computational efficiency and a more compact representation of text data | Vector space model | The output from this can be used in other techniques such as in classification or clustering |
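
The "Distance and similarity" row can be made concrete with cosine similarity, a common measure for documents in the vector space model. The sketch below is illustrative only: the function name and the three binary term vectors are assumptions for this example (the vectors mirror short rows of a document-by-term matrix), not the authors' implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_u * norm_v)

# Toy binary term vectors (assumed data): 1 means the term occurs in the text.
doc_a = [0, 0, 0, 0, 1, 0, 0, 0, 1, 1]
doc_b = [1, 0, 0, 0, 0, 0, 0, 1, 1, 1]
doc_c = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]

sim_ab = cosine_similarity(doc_a, doc_b)  # two shared terms
sim_ac = cosine_similarity(doc_a, doc_c)  # no shared terms
```

Documents that share no terms get a similarity of 0, which is why the measure is often used as input to clustering.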
Figure 1. Flowchart of the text mining process.
An Illustration of Text Preprocessing Applied to Original Text.
| | Original text | Processed text |
|---|---|---|
| D1 | Ability or experience in reviewing and authoring aircraft flight manuals, apps spec, and pilot’s guides | abil experi review author aircraft flight manual app spec pilot guid |
| D2 | Work with KEMP Management to gain approval for new product concepts/ideas | work kemp manag gain approv product concept idea |
| D3 | Handle client queries and/or requests | handl client queri request |
| D4 | 3-5 years of supervisory or product management experience required | 3-5 year supervisori product manag experi requir |
| D5 | Understanding of XML, parsing, send/receive, and experience with web services | understand xml pars send receiv experi web servic |
| D6 | Responsible for developing and maintaining quality management procedures and systems | respons develop maintain quality manag procedur system |
| Additional text | ||
| D7 | Experience with J2EE technology components (e.g., JSP, Servlets, XML, and web services) is a requirement | |
| D8 | Minimum 5 years of experience in marketing or product management roles | |
| D9 | Handling consultant and client queries | |
| D10 | Define customer applications for the product and design product positioning to support these applications | |
| D11 | 3-5 years of experience in the engineering and/or maintenance field strongly preferred | |
Note: The six preprocessed texts were obtained by removing stop words, stemming, removing punctuation (except intraword dashes), and stripping extra whitespace.
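
The preprocessing steps listed in the note can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the stop word list and the function name are assumptions, and stemming is deliberately left out because it normally relies on an external stemmer, so the output keeps full words such as "years" rather than stems such as "year".

```python
import re

# Hypothetical, very small stop word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "or", "the", "to", "with"}

# A token is a run of letters or digits, optionally joined by intraword dashes.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")

def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, drop punctuation (keeping intraword dashes), remove stop words."""
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in stop_words]

d3 = preprocess("Handle client queries and/or requests")
d4 = preprocess("3-5 years of supervisory or product management experience required")
```

Joining the returned tokens with single spaces also takes care of stripping extra whitespace.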
Lower Rank Approximation of the Term-by-Document Matrix Obtained From Table 5 Using LSA by Retaining 2 Dimensions.
| Term | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 | D9 | D10 | D11 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3-5 | 0.23 | 0.16 | 0.00 | 0.35 | 0.26 | 0.07 | 0.34 | 0.31 | 0.00 | 0.22 | 0.25 |
| abil | 0.15 | –0.01 | 0.00 | 0.11 | 0.18 | 0.01 | 0.23 | 0.08 | 0.00 | –0.10 | 0.12 |
| aircraft | 0.15 | –0.01 | 0.00 | 0.11 | 0.18 | 0.01 | 0.23 | 0.08 | 0.00 | –0.10 | 0.12 |
| experi | 0.88 | 0.22 | 0.00 | 0.94 | 1.02 | 0.15 | 1.30 | 0.78 | 0.00 | 0.05 | 0.81 |
| product | –0.03 | 0.95 | 0.00 | 0.94 | –0.08 | 0.28 | –0.07 | 0.95 | 0.00 | 1.97 | 0.32 |
| web | 0.41 | –0.05 | 0.00 | 0.28 | 0.49 | 0.02 | 0.62 | 0.21 | 0.00 | –0.30 | 0.32 |
| work | –0.01 | 0.13 | 0.00 | 0.12 | –0.03 | 0.04 | –0.03 | 0.13 | 0.00 | 0.28 | 0.04 |
| xml | 0.41 | –0.05 | 0.00 | 0.28 | 0.49 | 0.02 | 0.62 | 0.21 | 0.00 | –0.30 | 0.32 |
| year | 0.31 | 0.29 | 0.00 | 0.55 | 0.35 | 0.11 | 0.46 | 0.49 | 0.00 | 0.45 | 0.37 |
Note: This table is truncated.
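
A lower rank approximation of this kind is typically obtained from a truncated singular value decomposition: keep the k largest singular values and rebuild the matrix from them. The sketch below assumes NumPy is available and uses a small made-up term-by-document matrix, so its numbers do not reproduce the table above.

```python
import numpy as np

def low_rank_approximation(X, k):
    """Rebuild X from its k largest singular values and vectors."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Toy term-by-document matrix (assumed data): rows are terms, columns are documents.
X = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

X2 = low_rank_approximation(X, k=2)  # same shape as X, but rank 2
```

Unlike the original counts, the entries of the approximation can be fractional or negative, as in the table.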
Parameters and Performance Metrics for the Three Classifiers.
| Classifier | Parameter | Accuracy (%) | F-measure for Job Activity | F-measure for Job Attribute |
|---|---|---|---|---|
| Support vector machine | Dot product kernel; cost of misclassification = 1 | 97.30 | .9703 | .9751 |
| Random forest | Number of trees grown = 500; number of variables sampled at each split = 4 | 97.31 | .9700 | .9750 |
| Naive Bayes | Laplace smoothing = 0.01 | 96.60 | .9463 | .9554 |
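
For readers unfamiliar with the metrics in this table, the sketch below shows how accuracy and the F-measure (assumed here to be the usual F1, the harmonic mean of precision and recall) are computed from confusion-matrix counts. The function names and the counts are hypothetical and do not correspond to the results reported above.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 for one class from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def accuracy(tp, tn, fp, fn):
    """Share of all sentences that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical counts for one class (e.g., sentences labeled as job activity).
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
acc = accuracy(tp=8, tn=8, fp=2, fn=2)
```

The F-measure is reported per class because precision and recall depend on which class is treated as the positive one.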
Figure 2. Cluster dendrogram of 11 texts.
The 168 Variables for the Vacancy Mining Task.
| Feature Type | Number of Derived Features | Variable Type |
|---|---|---|
| Part of speech (POS) tag of the first word | 1 | Categorical (actual POS) |
| Is the first word in this sentence unique in work activity sentences (based on the labeled data)? | 1 | Numeric |
| Is the first word in this sentence unique in worker attribute sentences (based on the labeled data)? | 1 | Numeric |
| Is the last word in this sentence unique in work activity sentences (based on the labeled data)? | 1 | Numeric |
| Is the last word in this sentence unique in worker attribute sentences (based on the labeled data)? | 1 | Numeric |
| Proportion of adjectives | 1 | Numeric |
| Proportion of verbs | 1 | Numeric |
| Proportion of the word “to” | 1 | Numeric |
| Proportion of modal verbs | 1 | Numeric |
| Proportion of numbers | 1 | Numeric |
| Proportion of adverbs | 1 | Numeric |
| Proportion of nouns | 1 | Numeric |
| Proportion of nouns, verbs, adjectives, adverbs, and other part of speech tags followed by another verb | 5 | |
| Proportion of unique words found only in work activity sentences (based on the labeled data) | 1 | Numeric |
| Proportion of unique words found only in worker attributes sentences (based on the labeled data) | 1 | Numeric |
| Frequency of keywords for work activity and worker attributes sentences | 149 | Numeric |
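
Two of the simpler features in this table, the proportion of the word "to" and the proportion of numbers, can be derived without a part-of-speech tagger. The sketch below is one possible reading of those definitions: the tokenization rule, the treatment of ranges such as "3-5" as numbers, and all names are assumptions.

```python
import re

TOKEN_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")
NUMBER_RE = re.compile(r"\d+(?:-\d+)?")  # "5" or a range such as "3-5"

def simple_sentence_features(sentence):
    """Proportion of 'to' tokens and of numeric tokens in one sentence."""
    tokens = TOKEN_RE.findall(sentence.lower())
    n = len(tokens)
    if n == 0:
        return {"prop_to": 0.0, "prop_numbers": 0.0}
    n_to = sum(1 for t in tokens if t == "to")
    n_num = sum(1 for t in tokens if NUMBER_RE.fullmatch(t))
    return {"prop_to": n_to / n, "prop_numbers": n_num / n}

f_activity = simple_sentence_features("Work with KEMP Management to gain approval")
f_attribute = simple_sentence_features(
    "Minimum 5 years of experience in marketing or product management roles"
)
```

The part-of-speech features in the table would additionally require a tagger, which is outside the scope of this sketch.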
Figure 3. Word correlation networks for (a) Topic 86, (b) Topic 105, and (c) Topic 15.
Figure 4. (a) Intertopic distance map and (b) cluster dendrogram of medically related jobs.
Document-by-Term Matrix Constructed From the First Six Texts of Table 2.
| | 3-5 | abil | aircraft | app | approv | author | client | experi | manag | product |
|---|---|---|---|---|---|---|---|---|---|---|
| D1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| D2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
| D3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| D4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| D5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| D6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
Note: This table is truncated due to space limitations.
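
A document-by-term matrix like this one can be built by counting how often each term occurs in each processed text. The sketch below uses the six processed texts from the preprocessing table; the function and variable names are assumptions for illustration.

```python
from collections import Counter

# Processed texts D1 to D6, copied from the preprocessing table.
processed = {
    "D1": "abil experi review author aircraft flight manual app spec pilot guid",
    "D2": "work kemp manag gain approv product concept idea",
    "D3": "handl client queri request",
    "D4": "3-5 year supervisori product manag experi requir",
    "D5": "understand xml pars send receiv experi web servic",
    "D6": "respons develop maintain quality manag procedur system",
}

def document_term_matrix(docs):
    """Return the sorted vocabulary and one row of term counts per document."""
    counts = {name: Counter(text.split()) for name, text in docs.items()}
    vocab = sorted({term for c in counts.values() for term in c})
    matrix = [[counts[name][term] for term in vocab] for name in docs]
    return vocab, matrix

vocab, matrix = document_term_matrix(processed)
d4_row = dict(zip(vocab, matrix[3]))  # term counts for D4
```

The full vocabulary has more columns than the truncated table shows, and transposing the result gives the term-by-document layout used for LSA.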
Sample Topics Extracted From the 11 Texts for LDA and CTM.
| | LDA Topic 1 | LDA Topic 2 | LDA Topic 3 | CTM Topic 1 | CTM Topic 2 | CTM Topic 3 |
|---|---|---|---|---|---|---|
| Terms | product | abil | experi | experi | abil | product |
| | manag | aircraft | client | manag | aircraft | applic |
| | applic | app | handl | year | app | handl |
| | experi | approv | queri | product | author | queri |
| | year | author | servic | 3-5 | develop | client |
| Documents | 4, 6, 8, 10 | 1, 2 | 3, 5, 7, 9, 11 | 2, 4, 7, 8, 11 | 1, 6 | 3, 5, 9, 10 |

Note: LDA = latent Dirichlet allocation; CTM = correlated topic model.
Some of the Topics Obtained From Applying LDA to Worker Attribute Sentences.
| Topic 100 development software agile methodologies application scrum design life | Topic 86 new learn quickly willingness adapt technologies internet desire | Topic 132 travel willingness willing work time needed internationally international | Topic 75 communication written oral verbal interpersonal presentation effective listening |
| Topic 18 highly motivated oriented self driven organized starter selfstarter | Topic 45 detail attention oriented organizational accuracy multitask follow details | Topic 20 sales selling salesforcecom outside crm success account inside | Topic 105 results leadership others goals achieve influence motivate deliver |
| Topic 60 scripting python linux programming java perl languages unix | Topic 15 attitude positive can energetic team flexible enthusiastic professional | Topic 55 design adobe creative photoshop user illustrator graphic production | Topic 108 problem solving analytical solver troubleshooting approach abilities capabilities |
| Topic 61 license valid drivers driving record transportation reliable vehicle | Topic 129 work team independently part environment pressure members member | Topic 81 management time project organizational change people planning pm | Topic 16 data analysis quantitative research statistics economics statistical modeling |