Varun Dogra, Sahil Verma, Pushpita Chatterjee, Jana Shafi, Jaeyoung Choi, Muhammad Fazal Ijaz.
Abstract
With the rapid advancement of information technology, online information has been growing exponentially day by day, especially in the form of text documents such as news events, company reports, product reviews, stock-related reports, medical reports, tweets, and so on. As a result, online monitoring and text mining have become prominent tasks. During the past decade, significant efforts have been made to mine text documents using machine and deep learning models in supervised, semisupervised, and unsupervised settings. Our discussion covers state-of-the-art learning models for text mining and for solving various challenging NLP (natural language processing) problems through text classification. This paper summarizes several machine learning and deep learning algorithms used in text classification, together with their advantages and shortcomings. It also helps readers understand the various subtasks, along with older and recent literature, required during the process of text classification. We believe that readers will be able to find scope for further improvement in text classification or to propose new text classification techniques applicable in any domain of their interest.
Year: 2022 PMID: 35720939 PMCID: PMC9203176 DOI: 10.1155/2022/1883698
Source DB: PubMed Journal: Comput Intell Neurosci
Figure 1. Monitoring and downloading relevant text documents for subject groups.
Figure 2. Labeling text documents with appropriate predefined classes or labels during the process of text classification.
Figure 3. Subtasks of the text classification process, covering state-of-the-art data collection, text representation, dimensionality reduction, and machine learning models for classifying text documents into an associated predefined class/label.
Figure 4. A text classification framework. Note: black connecting lines represent the training phase and blue connecting lines represent the testing phase.
Figure 5. One-hot representation: a tensor used to represent each document. Each document tensor is made up of a potentially lengthy sequence of 0/1 vectors, resulting in a massive and sparse representation of the document corpus [43].
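As a minimal sketch of this representation (the toy corpus and variable names below are illustrative, not from the paper), each document becomes a (sequence length × vocabulary size) matrix of 0/1 rows:

```python
import numpy as np

# Toy tokenized corpus (illustrative only).
docs = [["the", "cat", "sat"], ["the", "dog", "ran", "fast"]]

vocab = sorted({word for doc in docs for word in doc})
index = {word: i for i, word in enumerate(vocab)}

# One tensor per document: one 0/1 row of length |vocab| per token,
# so the corpus representation is large and sparse.
tensors = []
for doc in docs:
    one_hot = np.zeros((len(doc), len(vocab)), dtype=np.int8)
    for position, word in enumerate(doc):
        one_hot[position, index[word]] = 1
    tensors.append(one_hot)

print(tensors[0].shape)  # (3, 6): 3 tokens, 6 vocabulary entries
```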
Figure 6. The word2vec algorithm uses two alternative methods: (a) continuous bag of words (CBoW) and (b) skip-gram (SG) [50].
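A minimal sketch of training both variants with gensim (assuming gensim 4.x; the toy sentences are ours): the `sg` flag switches between the two methods shown in the figure.

```python
from gensim.models import Word2Vec

# Toy tokenized corpus (illustrative only; real training needs far more text).
sentences = [
    ["stocks", "rose", "after", "the", "earnings", "report"],
    ["the", "company", "report", "moved", "the", "stocks"],
]

# sg=0 trains CBoW (predict a word from its context);
# sg=1 trains skip-gram (predict the context from a word).
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

vector = skipgram.wv["stocks"]   # 50-dimensional embedding for one word
print(vector.shape)              # (50,)
print(skipgram.wv.most_similar("stocks", topn=3))
```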
Figure 7. The transformer architecture.
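The core computation in the transformer is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. A minimal NumPy sketch (shapes and names are illustrative, not the paper's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (seq_len, d_k); V: (seq_len, d_v).
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # pairwise query-key similarities
    weights = softmax(scores, axis=-1)  # attention distribution per query
    return weights @ V                  # context-weighted values

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))  # 4 tokens, d_k = 8
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```

Because every token attends to every other token in one matrix product, this computation parallelizes across the sequence, which is the source of the training-time advantage over RNNs noted in the table below.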
Benefits and limitations of text representation or feature extraction methods.
| Method | Benefits | Limitations |
|---|---|---|
| Bag-of-words | Works well with unseen words and is easy to implement, as it is based on the most frequent terms in a document | Does not cover the syntactic and semantic relations of words; common words affect classification |
| TF-IDF | Unlike the bag-of-words approach, common words are down-weighted by IDF and so do not affect the result | Does not cover the syntactic and semantic relations of words |
| Word2vec | Covers the syntactic and semantic relations of words in the text | Does not cover polysemy |
| GloVe | Similar to word2vec but performs better; eliminates common words; trained on a large corpus | Does not cover polysemy and does not work well for unseen words |
| Context-aware representation | Covers the context or meaning of the words in the text | Huge memory is required for storage, and it does not work well for unseen words |
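To make the first two rows concrete, a minimal scikit-learn sketch (assuming scikit-learn is installed; the toy corpus is ours) contrasting raw term counts with TF-IDF weighting:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the market rose today",
    "the market fell today",
    "earnings report released",
]

# Bag-of-words: raw term counts; frequent words like "the" dominate.
bow = CountVectorizer()
X_bow = bow.fit_transform(corpus)

# TF-IDF: terms appearing in many documents receive a low IDF weight.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)

print(bow.get_feature_names_out())
print(X_bow.toarray())              # integer counts per document
print(X_tfidf.toarray().round(2))   # common terms are down-weighted
```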
Benefits and limitations of feature selection methods.
| Category | Method | Benefits | Limitations |
|---|---|---|---|
| Univariate filter methods | Information gain | Measures the relevance of an attribute or feature | Biased towards multi-valued attributes and prone to overfitting |
| | Chi-square | Reduces training time and avoids overfitting | Highly sensitive to sample size |
| | Fisher's score | Evaluates features individually to reduce the feature set | Does not handle feature redundancy |
| | Pearson's correlation coefficient | Simple and fast; measures the linear correlation between features | Sensitive only to linear relationships |
| | Variance threshold | Removes features with variance below a certain cutoff | Does not consider the relationship with the target variable |
| Multivariate filter methods | mRMR (minimal redundancy maximum relevance) | Measures the nonlinear relationship between feature and target variable and yields low error rates | Selected features may be mutually as dissimilar to each other as possible |
| | Multivariate relative discriminative criterion | Best determines the contribution of individual features to the underlying dimensions | Not suitable for small sample sizes |
| Linear multivariate wrapper methods | Recursive feature elimination | Considers high-quality top-ranked features | Computationally expensive; correlation of features not considered |
| | Forward/backward stepwise selection | Computationally efficient greedy optimization | Sometimes impossible to find features with no correlation between them |
| | Genetic algorithm | Accommodates data sets with a large number of features; no knowledge about the problem is required | Stochastic nature and computationally expensive |
| Nonlinear multivariate wrapper methods | Nonlinear kernel multiplicative | De-emphasizes the least useful features by multiplying features with a scaling factor | Complexity of kernel computation and multiplication |
| | Relief | Feasible for binary classification, based on nearest-neighbor instance pairs, and noise-tolerant | Does not evaluate boundaries between redundant features; not suitable for small training sets |
| Embedded methods | LASSO | L1 regularization reduces overfitting and can be applied even when there are more features than data points | Random selection when features are highly correlated |
| | Ridge regression | L2 regularization is preferred over L1 when features are highly correlated | Reduction of features is a challenge |
| | Elastic net | Better than L1 and L2 alone for dealing with highly correlated features; flexible; solves the optimization problem | High computational cost |
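As an illustration of one filter method and one wrapper method from the table, a scikit-learn sketch (the synthetic data and dimensions are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, chi2
from sklearn.linear_model import LogisticRegression

# Synthetic data; chi2 requires non-negative features
# (term counts in text classification satisfy this naturally).
X, y = make_classification(n_samples=200, n_features=30,
                           n_informative=5, random_state=0)
X = np.abs(X)

# Univariate filter: keep the 5 features with the highest chi-square score.
X_filtered = SelectKBest(chi2, k=5).fit_transform(X, y)

# Wrapper: recursive feature elimination around a linear classifier.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
X_wrapped = rfe.fit_transform(X, y)

print(X_filtered.shape, X_wrapped.shape)  # (200, 5) (200, 5)
```

The filter scores each feature once and is cheap; the wrapper repeatedly retrains the model, which reflects the computational expense noted for recursive feature elimination in the table.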
Benefits and limitations of machine and deep learning models.
| Model | Benefits | Limitations |
|---|---|---|
| Naïve Bayes | Needs less training data; the probabilistic approach handles continuous and discrete data; not sensitive to irrelevant features; easily updatable | Data scarcity can lead to loss of accuracy because the model assumes that any two features are independent given the output class |
| SVM | Can also be applied to unstructured data such as text, images, and so on; the kernel gives the algorithm its strength and allows it to work on high-dimensional data | Needs a long training time on large data sets; choosing a good kernel function is difficult, and choosing key parameters varies from problem to problem |
| KNN | Can be implemented for classification and regression problems; produces good results when large (or even noisy) training data is available; preferred for multi-class problems | Computing the distance for each instance is costly; finding attributes for distance-based learning is quite difficult; imbalanced data causes problems; missing values require treatment |
| Decision tree | Reduces ambiguity in decision-making; implicitly performs feature selection; easy to represent and interpret; requires less effort for data preparation | Unstable, since changes in the data require changes in the whole structure; not suitable for continuous values; prone to overfitting |
| Boosted decision tree | Highly interpretable, with improved prediction accuracy; can model feature interactions and perform feature selection on its own; gradient-boosted trees are less prone to overfitting since they are trained on randomly selected subsets of the training data | Computationally expensive and frequently needs a large number of trees (>1,000), which can take a long time and consume a lot of memory |
| Random forest | In contrast to other methods, ensembles of decision trees are very easy to train, and the input data require little preparation and preprocessing | More trees increase the time complexity of the prediction stage, and there are high chances of overfitting |
| CNN | Provides fast predictions; best suited for large volumes of data; requires no human effort for feature design | Computationally expensive; requires a large data set for training |
| RNN | Implements a feedback model, so it is considered best for time-series problems and makes more accurate predictions than other ANN models | Training the model is difficult and takes a long time to capture nonlinearity in the data; the vanishing gradient problem occurs |
| LSTM, Bi-LSTM | Adds short- and long-term memory components to the RNN, so it is considered best for applications involving sequences and is used for NLP problems such as text classification and text generation; computing speed is high; Bi-LSTM solves the issue of predicting fixed sequence-to-sequence outputs | Expensive and complex due to the backpropagation model; increases the dimensionality of the problem and makes it harder to find the optimal solution; since Bi-LSTM has double LSTM cells, it is expensive to implement |
| Gated RNN (GRU) | In natural language processing, GRUs learn quicker and perform better than LSTMs on less training data, as they require fewer training parameters; GRUs are simpler, do not need memory units, and are hence easier to modify, for example by adding extra gates if the network requires more input | Slow convergence and limited learning efficiency are still issues with GRUs |
| Transformer with an attention mechanism | RNNs and CNNs cannot keep up with context and content when sentences are too long; the attention strategy resolves this limitation by selectively concentrating on a few important items (the word currently being operated on) while avoiding the rest, enabling much more parallelization than RNNs and thus reducing training times | Strongly compute-intensive at inference time |
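Tying the representation and model tables together, a minimal end-to-end classification sketch with scikit-learn (the toy documents and labels are ours; Naïve Bayes is chosen as the simplest baseline from the table):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

train_docs = [
    "shares surged after strong quarterly earnings",
    "the striker scored twice in the final",
    "central bank raised interest rates again",
    "the team won the championship match",
]
train_labels = ["finance", "sports", "finance", "sports"]

# TF-IDF features feeding a Naive Bayes classifier, mirroring the
# representation -> dimensionality reduction -> model pipeline of Figure 3.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])
model.fit(train_docs, train_labels)

print(model.predict(["earnings rose this quarter"]))         # likely ['finance']
print(model.predict(["the team scored in the final match"])) # likely ['sports']
```

Any classifier from the table with a scikit-learn implementation (e.g., LinearSVC or RandomForestClassifier) could be swapped into the `clf` step; deep models such as CNNs, LSTMs, and transformers would instead replace the whole pipeline with learned representations.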