Vladimer B Kobayashi, Stefan T Mol, Hannah A Berkers, Gábor Kismihók, Deanne N Den Hartog.
Abstract
Organizations are increasingly interested in classifying texts or parts thereof into categories, as this enables more effective use of their information. Manual procedures for text classification work well for up to a few hundred documents. However, when the number of documents is larger, manual procedures become laborious, time-consuming, and potentially unreliable. Techniques from text mining facilitate the automatic assignment of text strings to categories, making classification expedient, fast, and reliable, which creates potential for its application in organizational research. The purpose of this article is to familiarize organizational researchers with text mining techniques from machine learning and statistics. We describe the text classification process in several roughly sequential steps, namely training data preparation, preprocessing, transformation, application of classification techniques, and validation, and provide concrete recommendations at each step. To help researchers develop their own text classifiers, the R code associated with each step is presented in a tutorial. The tutorial draws from our own work on job vacancy mining. We end the article by discussing how researchers can validate a text classification model and the associated output.
Keywords: naive Bayes; random forest; support vector machines; text classification; text mining
Year: 2017 PMID: 29881249 PMCID: PMC5975702 DOI: 10.1177/1094428117719322
Source DB: PubMed Journal: Organ Res Methods ISSN: 1094-4281
Figure 1. Confusion matrix as a reference to compute the evaluation measures. Note: FN = false negative; FP = false positive; TN = true negative; TP = true positive.
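The four cells of the confusion matrix are enough to compute the evaluation measures that recur in the studies listed further down (accuracy, precision, recall, F-measure). The following is a minimal Python sketch, not the article's R tutorial code; the class and function names are illustrative assumptions, and the F-measure shown is the balanced (F1) variant.

```python
from dataclasses import dataclass


@dataclass
class ConfusionMatrix:
    """Binary confusion-matrix counts, named as in the figure note."""
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives

    def accuracy(self) -> float:
        total = self.tp + self.fp + self.tn + self.fn
        return (self.tp + self.tn) / total if total else 0.0

    def precision(self) -> float:
        denom = self.tp + self.fp
        return self.tp / denom if denom else 0.0

    def recall(self) -> float:
        denom = self.tp + self.fn
        return self.tp / denom if denom else 0.0

    def f_measure(self) -> float:
        # Balanced F-measure (F1): harmonic mean of precision and recall.
        p, r = self.precision(), self.recall()
        return 2 * p * r / (p + r) if (p + r) else 0.0


def confusion_from_labels(actual, predicted, positive):
    """Tally the four cells by comparing predicted labels with actual labels."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return ConfusionMatrix(tp=tp, fp=fp, tn=tn, fn=fn)
```

Accuracy can look favorable when one category dominates, which is one reason several of the studies below report the F-measure alongside it or instead of it.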
Figure 2. Diagrammatic depiction of the text classification process.
Training Sizes, Number of Categories, Evaluation Measures, and Evaluation Procedures Used in Various Text Classification Studies.
| Citations | Subject Matter | Training Size | Number of Categories | Evaluation Measure | Evaluation Procedure |
|---|---|---|---|---|---|
| | Domain disambiguation for web search results | 12,340 | 8 | Accuracy | Fivefold cross-validation (CV) |
| | Disease classification for medical abstracts | 28,145 | 5 | Accuracy | Fivefold CV |
| | Titles of scientific documents | 8,100 | 6 | Accuracy & F-measure | Fivefold CV |
| | Response emails of operators to customers | 1,486 | 14 | Accuracy & F-measure | Tenfold CV |
| | News items | 764-19,997 | 4-93 | Accuracy, micro-averaged breakeven points, F-measure, recall, precision, breakeven point | 1 training and 2 test, single train-test, 20 splits with intercorpus evaluation, fourfold CV, tenfold CV, 20 splits |
| | Sentiment analysis of actor-issue relationship | 5,348 | 2 | F-measure | Not reported |
| | Product reviews sentiments | 31,574 | 7 | Accuracy | 2 test |
| | Song lyrics | 6,499 | 33 | Micro-averaged breakeven points | Single train-test |
| | Chinese news texts | 2,816 | 10 | Accuracy & F-measure | Single train-test |
| | Arabic news documents | 1,445 | 9 | F-measure | Fourfold CV |
| | Dutch documents | 1,436 | 4 | Accuracy | Single train-test |
| | Emails | 400-9,332 | 2 | F-measure | Single train-test |
| | Turkish news items | 1,150-99,021 | 5-10 | F-measure | |
| | Italian news items | 16,000 | 8 | F-measure & breakeven point | 20 splits with intercorpus validation |
| | Czech news items | 8,000 | 5 | F-measure | Fourfold CV |
| | Cyberspace comment sentiment analysis | 1,041 | 5 | Accuracy | Tenfold CV |
| | Disgruntled employee communications | 80 | 2 | Accuracy | Single train-test, varying proportion |
| | Personality from emails (this is a multilabel classification problem) | 114,907 | 3 categories per personality | Accuracy | Tenfold CV and single train-test |
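Most of the evaluation procedures in the table above are k-fold cross-validation (fourfold, fivefold, tenfold). The following is a minimal Python sketch of that procedure, assuming accuracy as the evaluation measure; the function names, the fixed random seed, and the majority-class baseline are illustrative assumptions rather than anything taken from the listed studies or from the article's R tutorial.

```python
import random
from collections import Counter


def k_fold_indices(n_docs, k, seed=0):
    """Yield (train_idx, test_idx) pairs; every document is tested exactly once."""
    if not 2 <= k <= n_docs:
        raise ValueError("k must be between 2 and the number of documents")
    indices = list(range(n_docs))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # k near-equal folds
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_validate(docs, labels, k, train_fn, predict_fn):
    """Return the mean accuracy over k folds for any train/predict pair."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(docs), k):
        model = train_fn([docs[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        hits = sum(predict_fn(model, docs[i]) == labels[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)


# Hypothetical baseline: always predict the most frequent training label.
def train_majority(train_docs, train_labels):
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(model, doc):
    return model
```

Calling `cross_validate(docs, labels, 10, train_majority, predict_majority)` gives the accuracy a classifier must beat to be more useful than guessing the largest category; a real classifier would be plugged in through the same two functions.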
Text Classification Based on the Input-Process-Output Approach.
| | Text Preprocessing: Text Preparation | Text Preprocessing: Text Cleaning | Text Transformation | Dimensionality Reduction | Classification | Evaluation | Validation |
|---|---|---|---|---|---|---|---|
| Input | Raw HTML files | Output from text preparation | Output from text cleaning | Document-by-term matrix | Output from dimensionality reduction | Classification model, test data, and an evaluation measure | Classification from the model |
| Process | Parsing, sentence segmentation | Punctuation, number, and stopword removal; lowercase transformation | Word tokenization; constructing the document-by-term matrix, in which the words are the features and the entries are the raw frequencies of the words in each document | Latent semantic analysis and/or supervised scoring methods for feature selection | Apply classification algorithms such as naive Bayes, support vector machines, or random forest | Classify the documents in the test data and compare with the actual labels; calculate the value of the evaluation measure | Compute classification performance using an independent validation data set, or compare the classification to that of domain experts |
| Output | Raw text file (one sentence per line) | Raw text file of sentences in lowercase, without punctuation, numbers, or stopwords | Document-by-term matrix | Matrix whose columns are the new set of features, or the reduced document-by-term matrix | Classification model | Value of the evaluation measure | Measure of agreement (agreement can be quantified using existing evaluation measures) |
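The text cleaning and text transformation columns of the table above can be sketched in a few lines. This is a minimal Python illustration, not the article's R tutorial code: it assumes sentence segmentation has already been done, and the short stopword list and the function names are assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption); real lists are much longer
# and language-specific.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "for", "in", "is", "are", "with"}


def clean(sentence):
    """Lowercase, replace punctuation and digits with spaces, drop stopwords."""
    lowered = sentence.lower()
    # \W matches non-word characters; \d and _ are removed explicitly.
    # Accented letters (e.g. German umlauts) are kept.
    letters_only = re.sub(r"[\W\d_]+", " ", lowered)
    return [tok for tok in letters_only.split() if tok not in STOPWORDS]


def document_term_matrix(sentences):
    """Return (vocabulary, matrix) where entries are raw term frequencies."""
    tokenized = [clean(s) for s in sentences]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)  # missing terms count as 0
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


vocab, dtm = document_term_matrix([
    "Monitoring the therapy of 3 patients.",
    "Caring for patients, and monitoring!",
])
```

Each row of `dtm` is one sentence and each column one term in `vocab`; this matrix is what the dimensionality reduction and classification steps take as input.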
Figure 3. Illustration of text preprocessing from raw HTML file to document-by-term matrix.
Figure 4. Loadings of the terms on the first 6 LSA dimensions using 422 sentences from 11 vacancies.
Figure 5. Comparison of classification performance among three classifiers and between the term-based and LSA-based features.
Basic Care and Medical Care Core Nursing Tasks Extracted From Nursing Vacancies by Applying Text Classification.
| Task | German Translation | Task Cluster |
|---|---|---|
| Monitoring the patients’ therapy | Überwachung der Therapie des Patienten | Basic care |
| Caring for the elderly | Pflege von älteren Menschen | Basic care |
| Providing basic or general care | Durchführung der Allgemeinen Pflege | Basic care |
| Providing palliative care | Durchführung von Palliativpflege | Basic care |
| Caring for mentally ill patients | Pflege von psychisch kranken Menschen | Basic care |
| Caring for children | Pflege von Kindern | Basic care |
| Assisting with food intake | Hilfe bei der Nahrungsaufnahme | Basic care |
| Supporting rehabilitation | Unterstützung der Rehabilitation | Basic care |
| Providing holistic care | Durchführung ganzheitlicher Pflege | Basic care |
| Accompanying patients | Begleitung von Patienten | Basic care |
| Assisting at surgical interventions | Assistenz bei operativen Eingriffen | Medical care |
| Doing laboratory tests | Durchführung von Labortests | Medical care |
| Participating in resuscitations | Beteiligung an Reanimationsmaßnahmen | Medical care |
| Conducting ECG | Durchführung von EKG | Medical care |
| Collecting blood | Durchführung der Blutabnahme | Medical care |
| Preparing and administering intravenous drugs | Vorbereitung und Verabreichung von intravenösen Medikamenten | Medical care |
| Assisting at diagnostic interventions | Assistenz bei diagnostischen Maßnahmen | Medical care |
| Operating the technical equipment | Bedienung der technischen Geräteschaften | Medical care |
| Assisting at endoscopic tests | Assistenz bei endoskopischen Maßnahmen | Medical care |
| Assisting at examination | Assistenz bei Untersuchungen | Medical care |