SeungHun Lee1, Wafa Shafqat2, Hyun-Chul Kim1.
Abstract
Crowdfunding has grown enormously in recent years, becoming an alternative funding source for emerging companies and new startups. As crowdfunding becomes more prevalent, it also faces a substantial risk of fraud. Although a growing number of articles indicate that crowdfunding scams are a new and imminent threat to investors, little is known about them, primarily because of the lack of measurement data collected from real scam cases. This paper fills the gap by collecting, labeling, and analyzing publicly available data on a hundred fraudulent campaigns on a crowdfunding platform. To find and understand the distinguishing characteristics of crowdfunding scams, we propose using a broad range of traits, including project-based traits, project creator-based traits, and content-based traits such as linguistic cues and Named Entity Recognition features. We then propose using a feature selection method, Forward Stepwise Logistic Regression, through which 17 key discriminating features of scam campaigns (including six original and hitherto unused ones) are discovered. Based on the selected 17 key features, we present and discuss our findings and insights on the distinguishing characteristics of crowdfunding scams, and build a scam detection model with 87.3% accuracy. We also explore the feasibility of early scam detection, building a model with 70.2% classification accuracy right at the time of project launch. We discuss which features from which sections are more helpful for early scam detection on day 0 and thereafter.
Keywords: crowdfunding; deception detection; feature selection; linguistic cues; natural language processing; scam
Year: 2022 PMID: 36236775 PMCID: PMC9573152 DOI: 10.3390/s22197677
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.847
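The abstract names Forward Stepwise Logistic Regression as the feature selection method but does not state the entry criterion. The following is a minimal sketch of the general technique, assuming a likelihood-ratio entry test at p = 0.05 (chi-square threshold 3.84 with one degree of freedom) and plain gradient ascent for fitting. The function names, the threshold, and the synthetic data are illustrative assumptions, not the authors' implementation.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_log_likelihood(rows, y, cols, steps=300, lr=0.5):
    """Fit logistic regression on the chosen columns by gradient ascent and
    return the (approximately) maximized log-likelihood. Intercept included."""
    w = [0.0] * (len(cols) + 1)  # w[0] is the intercept
    n = len(rows)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for row, label in zip(rows, y):
            x = [1.0] + [row[c] for c in cols]
            err = label - sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for j, xj in enumerate(x):
                grad[j] += err * xj
        w = [wi + lr * g / n for wi, g in zip(w, grad)]
    ll = 0.0
    for row, label in zip(rows, y):
        x = [1.0] + [row[c] for c in cols]
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        ll += label * math.log(p) + (1 - label) * math.log(1.0 - p)
    return ll

def forward_stepwise(rows, y, names, threshold=3.84):
    """Greedily add the feature with the largest likelihood-ratio gain and
    stop when the best gain falls below `threshold` (assumed criterion)."""
    selected = []
    remaining = list(range(len(names)))
    best_ll = fit_log_likelihood(rows, y, selected)
    while remaining:
        scored = [(fit_log_likelihood(rows, y, selected + [c]), c) for c in remaining]
        ll, col = max(scored)
        if 2.0 * (ll - best_ll) < threshold:
            break
        selected.append(col)
        remaining.remove(col)
        best_ll = ll
    return [names[c] for c in selected]

# Synthetic, hypothetical data: one informative feature and one pure-noise feature.
rng = random.Random(0)
rows, y = [], []
for _ in range(200):
    signal, noise = rng.gauss(0, 1), rng.gauss(0, 1)
    rows.append([noise, signal])
    y.append(1 if signal + 0.5 * rng.gauss(0, 1) > 0 else 0)

selected = forward_stepwise(rows, y, ["noise", "signal"])
print(selected)
```

The sketch assumes roughly standardized features so that a fixed learning rate is stable. A production analysis would more likely rely on a statistics package with a proper optimizer.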
Figure 1. A well-known crowdfunding scam.
Figure 2. An example screenshot of a crowdfunding project.
Linguistic cues and their descriptions.
| Quantity |
|---|
| 1. (Total # of) words, adverbs, clauses, verbs, phrases, characters, punctuation, nouns, sentences, adjectives, noun phrases (a noun phrase is a phrase consisting of a noun, its modifiers and determinants) |
Key features selected by Logistic Regression for scam detection and their descriptive statistics. Nagelkerke R² = 0.706. Significance: *** p < 0.001, ** p < 0.01, * p < 0.05. β is the estimated coefficient for each feature in the model's equation, and SE is its standard error. A feature whose p-value is below 0.05 is considered statistically significant in the model.
| Category | Feature | β | SE | Sig. | Scam mean | Scam SD | Non-scam mean | Non-scam SD |
|---|---|---|---|---|---|---|---|---|
| Creator-related | Existence of a link to a Facebook ID | −1.326 | 0.446 | ** | 0.350 | 0.480 | 0.550 | 0.499 |
| Creator-related | Num. external links & websites | −0.665 | 0.169 | *** | 1.570 | 1.570 | 2.510 | 1.570 |
| Creator-related | Num. backed projects | −0.042 | 0.015 | ** | 8.740 | 15.327 | 22.550 | 34.270 |
| Creator-related | Num. created projects | −0.320 | 0.150 | * | 1.730 | 1.536 | 2.380 | 2.656 |
| Campaign | Redundancy | 0.206 | 0.128 | 0.108 | 5.367 | 2.819 | 4.887 | 1.699 |
| Campaign | Num. images | 0.060 | 0.021 | ** | 17.090 | 16.690 | 13.470 | 11.226 |
| Updates | Num. third person pronouns/Num. updates | 0.285 | 0.101 | ** | 3.653 | 2.943 | 4.353 | 5.091 |
| Updates | Num. images/Num. updates | −0.488 | 0.222 | ** | 0.777 | 1.087 | 1.017 | 1.248 |
| Updates | Num. emails/Num. updates | −4.551 | 1.978 | * | 0.046 | 0.095 | 0.159 | 0.260 |
| Updates | Num. location/Num. updates | −1.585 | 0.402 | *** | 0.544 | 0.670 | 1.086 | 1.286 |
| Updates | Num. past tense verbs/Total words | −0.686 | 0.272 | * | 0.025 | 0.010 | 0.028 | 0.007 |
| Comments | Num. verbs/Num. creator comments | 0.835 | 0.140 | *** | 13.906 | 10.562 | 9.916 | 5.407 |
| Comments | Num. sentences/Num. creator comments | −0.539 | 0.214 | * | 3.819 | 2.621 | 3.374 | 1.944 |
| Comments | Num. first person plural pronouns/Num. creator comments | −1.070 | 0.276 | *** | 1.799 | 1.791 | 1.726 | 1.179 |
| Comments | Num. second person pronouns/Num. creator comments | −1.068 | 0.339 | ** | 1.756 | 1.561 | 1.660 | 1.014 |
| Comments | Num. third person pronouns/Num. creator comments | −1.971 | 0.542 | *** | 1.310 | 1.056 | 1.071 | 0.831 |
| Comments | Num. present tense verbs/Total words | 0.151 | 0.076 | * | 0.119 | 0.028 | 0.115 | 0.024 |
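To make the coefficients in the table above easier to read, the sketch below converts a coefficient into an odds ratio and derives a two-sided p-value from the coefficient and its standard error. It assumes the significance markers come from a Wald z-test; the source does not name the test, so this is an assumption. The helper names `wald_summary` and `stars` are hypothetical.

```python
import math

def wald_summary(beta, se):
    """Return (odds ratio, z statistic, two-sided p-value) for one coefficient,
    assuming a Wald z-test (an assumption, not stated in the source)."""
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided standard normal tail
    return math.exp(beta), z, p

def stars(p):
    # Significance markers as defined in the table caption.
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""

# (coefficient, standard error) pairs copied from three rows of the table.
rows = {
    "Existence of a link to a Facebook ID": (-1.326, 0.446),
    "Redundancy": (0.206, 0.128),
    "Num. verbs/Num. creator comments": (0.835, 0.140),
}

results = {name: wald_summary(b, se) for name, (b, se) in rows.items()}
for name, (odds, z, p) in results.items():
    print(f"{name}: odds ratio={odds:.3f}, z={z:.2f}, p={p:.3g} {stars(p)}")
```

Under this assumption, an odds ratio below 1 (for example about 0.27 for the Facebook link) means the feature is associated with lower odds of a campaign being a scam, holding the other features fixed. The Redundancy row yields a p-value of about 0.108, which agrees with the value printed in its significance column.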
Performance of our model built with each category of features using Logistic Regression (Precision and Recall on Scams).
| Feature | Precision | Recall | Accuracy | AUC |
|---|---|---|---|---|
| Creator-related | 65.3% | 62.7% | 71.4% | 0.758 |
| Campaign | 55.5% | 14.7% | 60.7% | 0.593 |
| Updates | 62.5% | | 69.8% | 0.752 |
| Comments | | 55.8% | | |
| Full model | 84.3% | 84.3% | 87.3% | 0.939 |
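The metrics in the table above follow from a confusion matrix in which "scam" is the positive class. The sketch below shows the arithmetic. The counts are hypothetical: they were chosen only because they reproduce the reported full-model percentages, and the source does not report the actual confusion matrix.

```python
def scam_metrics(tp, fp, fn, tn):
    """Precision and recall on the scam (positive) class, plus overall accuracy."""
    precision = tp / (tp + fp)           # flagged campaigns that really are scams
    recall = tp / (tp + fn)              # real scams that were flagged
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy

# Hypothetical counts (NOT from the source), picked to be consistent with
# the full-model row: 84.3% precision, 84.3% recall, 87.3% accuracy.
precision, recall, accuracy = scam_metrics(tp=86, fp=16, fn=16, tn=134)
print(f"{precision:.1%} {recall:.1%} {accuracy:.1%}")  # prints "84.3% 84.3% 87.3%"
```

The same formulas explain how the Campaign row can combine 60.7% accuracy with only 14.7% recall: when non-scam campaigns are the majority, a model that rarely predicts "scam" can still classify most campaigns correctly.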
Figure 3. Cumulative distribution function (CDF) of key creator-related features.
Scammers’ comments using modal verbs: examples.
Performance comparisons of different classification algorithms (10-fold cross validation).
| Algorithm | Precision | Recall | Accuracy | AUC |
|---|---|---|---|---|
| Logistic regression | | | | |
| Random Forest | 77.5% | 67.6% | 79.0% | 0.851 |
| SVM | 68.4% | 63.7% | 73.4% | 0.719 |
| Naive Bayes | 61.8% | 71.5% | 70.6% | 0.734 |
| KNN (k = 9) | 66.6% | 50.9% | 69.8% | 0.757 |
| J48 Decision Tree | 58.4% | 57.8% | 66.3% | 0.660 |
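The comparison above is reported under 10-fold cross validation. As a rough illustration of how such folds can be formed, the sketch below builds stratified folds so that each fold keeps approximately the same scam to non-scam ratio. Whether the authors stratified their folds is not stated, and the function name, seed, and label counts here are hypothetical.

```python
import random

def stratified_folds(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation,
    dealing each class round-robin so class proportions stay similar per fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test

# Hypothetical labels (1 = scam, 0 = non-scam); the counts are illustrative only.
labels = [1] * 40 + [0] * 60
folds = list(stratified_folds(labels, k=10))
for train, test in folds:
    scams_in_test = sum(labels[i] for i in test)
    print(len(train), len(test), scams_in_test)
```

Each classifier would then be trained on `train` and scored on `test` for every fold, with the per-fold metrics averaged to produce figures like those in the table.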
Figure 4. Estimated Average Classification Performance vs. Elapsed time (days). (a) Accuracy. (b) AUC. (c) Precision on Scams. (d) Recall on Scams.