Anupama Namburu, Prabha Selvaraj, M Varsha.
Abstract
E-commerce platforms have been around for over two decades, and their popularity among buyers and sellers alike keeps increasing. The COVID-19 pandemic triggered a boom in online shopping, with many sellers moving their businesses to e-commerce platforms. At this scale, product pricing is difficult, given the number of products being sold online. Clothing, for instance, shows strong seasonal pricing trends, with brand names swaying prices heavily; electronics, on the other hand, have specification-based pricing that keeps fluctuating. This work aims to help business owners price their products competitively against similar products sold on e-commerce platforms, using review, statistical and categorical features. A hybrid algorithm, X-NGBoost, combining extreme gradient boosting (XGBoost) with natural gradient boosting (NGBoost), is proposed to predict the price. The proposed model is compared with ensemble models such as XGBoost, LightGBM and CatBoost, and outperforms these existing ensemble boosting algorithms.
Keywords: CatBoost; Ensemble algorithms; Product pricing; X-NGBoost; XGBoost
Year: 2022 PMID: 35910813 PMCID: PMC9309595 DOI: 10.1007/s11334-022-00465-3
Source DB: PubMed Journal: Innov Syst Softw Eng ISSN: 1614-5046
Training data preview
| S. no. | Product | Product_Brand | Item_Category | Subcategory_1 | Subcategory_2 | Item_Rating | Date | Selling_Price |
|---|---|---|---|---|---|---|---|---|
| 0 | P-2610 | B-659 | Bags wallets belts | Bags | Handbags | 4.3 | 2/3/2017 | 291.0 |
| 1 | P-2453 | B-3078 | Clothing | Women’s clothing | Western wear | 3.1 | 7/1/2015 | 897.0 |
| 2 | P-6802 | B-1810 | Home décor festive needs | Show pieces | ethnic | 3.5 | 1/12/2019 | 792.0 |
| 3 | P-4452 | B-3078 | Beauty and personal care | Eye care | h2opluseyecare | 4.0 | 12/12/2014 | 837.0 |
| 4 | P-8454 | B-3078 | Clothing | Men’s clothing | Tshirts | 4.3 | 12/12/2013 | 470.0 |
Testing dataset preview
| S. no. | Product | Product_Brand | Item_Category | Subcategory_1 | Subcategory_2 | Item_Rating | Date |
|---|---|---|---|---|---|---|---|
| 0 | P-11284 | B-2984 | Computers | Network components | Routers | 4.3 | 1/12/2018 |
| 1 | P-6580 | B-1732 | Jewellery | Bangles Bracelets armlets | Bracelets | 3.0 | 20/12/2012 |
| 2 | P-5843 | B-3078 | Clothing | Women’s clothing | Western wear | 1.5 | 1/12/2014 |
| 3 | P-5334 | B-1421 | Jewellery | Necklaces chains | Necklaces | 3.9 | 1/12/2019 |
| 4 | P-5586 | B-3078 | Clothing | Women’s clothing | Western wear | 1.4 | 1/12/2017 |
Fig. 1 Target variable before preprocessing
Fig. 2 Target variable after preprocessing
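The target preprocessing is not spelled out in this excerpt, but the Selling_Price values in the later table (e.g. 5.676754 for a raw price of 291.0, 6.800170 for 897.0) are consistent with a log1p transform, a common way to reduce the right skew of price data. A minimal sketch under that assumption:

```python
import numpy as np

# Raw selling prices from the training preview
prices = np.array([291.0, 897.0, 792.0, 837.0, 470.0])

log_prices = np.log1p(prices)    # log(1 + x) compresses the heavy right tail
restored = np.expm1(log_prices)  # inverse transform for reporting predictions

print(round(float(log_prices[0]), 6))  # 5.676754, matching the preprocessed table
```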
Dataset after the addition of statistical features
| S. no. | Product | Product_Brand | Item_Category | Subcategory_1 | Subcategory_2 | Item_Rating | Month | Day | DayofYear | Week | Quarter |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | P-2610 | B-659 | Computers | Network components | Routers | 4.3 | 2 | 3 | 34 | 5 | 1 |
| 1 | P-2453 | B-3078 | Jewellery | Bangles bracelets armlets | Bracelets | 3.0 | 7 | 1 | 182 | 27 | 3 |
| 2 | P-6802 | B-1810 | Clothing | Women’s clothing | Western wear | 1.5 | 1 | 12 | 12 | 1 | 1 |
| 3 | P-4452 | B-3078 | Jewellery | Necklaces chains | Necklaces | 3.9 | 12 | 12 | 346 | 50 | 4 |
| 4 | P-8454 | B-3078 | Clothing | Women’s clothing | Western wear | 1.4 | 12 | 12 | 246 | 50 | 4 |
Dataset after the addition of statistical features (continued)
| Is_month_start | Is_month_end | Unique_Item_category_per_ product_brand | Unique_Subcategory_1_ product_brand | Unique_Subcategory_2_ product_brand |
|---|---|---|---|---|
| False | False | 1 | 1 | 1 |
| True | False | 13 | 29 | 100 |
| False | False | 1 | 1 | 1 |
| False | False | 13 | 29 | 100 |
| False | False | 13 | 29 | 100 |
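The Month, Day, DayofYear, Week, Quarter, Is_month_start and Is_month_end columns in the tables above can all be derived from the Date field. A minimal pandas sketch; the month-first date format is inferred from the previews (e.g. 2/3/2017 yields Month 2, Day 3, DayofYear 34):

```python
import pandas as pd

df = pd.DataFrame({"Date": ["2/3/2017", "7/1/2015", "12/12/2014"]})
dt = pd.to_datetime(df["Date"], format="%m/%d/%Y")  # assumed month-first format

df["Month"] = dt.dt.month
df["Day"] = dt.dt.day
df["DayofYear"] = dt.dt.dayofyear
df["Week"] = dt.dt.isocalendar().week  # ISO week number, as in the table
df["Quarter"] = dt.dt.quarter
df["Is_month_start"] = dt.dt.is_month_start
df["Is_month_end"] = dt.dt.is_month_end
```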
Dataset after the addition of categorical features
| S. no. | Product | Product_Brand | Item_Category | Subcategory_1 | Subcategory_2 | Item_Rating | Selling_Price | Month | Day | DayofYear | Week | Quarter |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | P-2610 | B-659 | 9 | 11 | 159 | 4.3 | 5.676754 | 2 | 3 | 34 | 5 | 1 |
| 1 | P-2453 | B-3078 | 17 | 139 | 387 | 3.0 | 6.800170 | 7 | 1 | 182 | 27 | 3 |
| 2 | P-6802 | B-1810 | 38 | 119 | 118 | 1.5 | 6.675823 | 1 | 12 | 12 | 1 | 1 |
| 3 | P-4452 | B-3078 | 12 | 40 | 155 | 3.9 | 6.731018 | 12 | 12 | 346 | 50 | 4 |
| 4 | P-8454 | B-3078 | 17 | 86 | 344 | 1.4 | 6.154858 | 12 | 12 | 246 | 50 | 4 |
Dataset after the addition of categorical features (continued)
| Is_month_start | Is_month_end | Unique_Item_category_per_ product_brand | Unique_Subcategory_1_ product_brand | Unique_Subcategory_2_ product_brand |
|---|---|---|---|---|
| False | False | 1 | 1 | 1 |
| True | False | 13 | 29 | 100 |
| False | False | 1 | 1 | 1 |
| False | False | 13 | 29 | 100 |
| False | False | 13 | 29 | 100 |
Dataset after the addition of categorical features (continued)
| Std_rating_per_product_brand | Std_rating_Item_category | Std_rating_Subcategory_1 | Std_rating_Subcategory_2 |
|---|---|---|---|
| NaN | 1.105474 | 1.094790 | 1.018614 |
| 1.197104 | 1.205383 | 1.227087 | 1.186726 |
| 1.202082 | 1.192523 | 1.094301 | 1.028398 |
| 1.197104 | 1.215595 | 0.636396 | NaN |
| 1.197104 | 1.205383 | 1.146774 | 1.123001 |
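The Std_rating_* columns are per-group standard deviations of Item_Rating, computed at the brand, category and subcategory levels; the NaN in the first row arises because a group containing a single product has no sample standard deviation. A sketch using a pandas groupby/transform (toy ratings from the previews, not the paper's full data):

```python
import pandas as pd

df = pd.DataFrame({
    "Product_Brand": ["B-659", "B-3078", "B-1810", "B-3078", "B-3078"],
    "Item_Rating":   [4.3, 3.0, 1.5, 3.9, 1.4],
})

# Sample standard deviation within each brand; singleton groups yield NaN,
# as in the first row of the table above
df["Std_rating_per_product_brand"] = (
    df.groupby("Product_Brand")["Item_Rating"].transform("std")
)
```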
Fig. 3 Heat map of correlation between the attributes
Fig. 4 The flow chart of the proposed model
Fig. 5 X-NGBoost algorithm
Root-mean-square error of the algorithms
| Model | RMSE training | RMSE testing |
|---|---|---|
| XGBoost | 6.62 | 7.48 |
| LightGBM | 6.62 | 7.44 |
| CatBoost | 5.91 | 6.96 |
| X-NGBoost | 4.23 | 5.34 |
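RMSE, the comparison metric in the table above, penalises large errors quadratically. A minimal self-contained implementation (the paper's models and predictions are not reproduced here; the arrays below are toy values):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error, the metric used to compare the boosting models."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy illustration, not the paper's predictions
print(rmse([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0]))  # 0.6123724356957945
```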
Attributes of the data set
| Name | Description |
|---|---|
| Product | Name of the product |
| Product_Brand | Brand the product belongs to |
| Item_Category | The wider category of items the product belongs to |
| Subcategory_1 | The subcategory the product belongs to—one level deep |
| Subcategory_2 | Specific category the item belongs to—two levels deep |
| Item_Rating | The reviewed rating left behind by buyers of the products |
| Date | Date at which the product was sold at the specific price |
| Selling_Price | Price of the product sold on the specified date |
Comparison table of boosting algorithms
| Function | XGBoost | LightGBM | CatBoost |
|---|---|---|---|
| Splits | It does not use any weighted sampling technique, which makes its split-finding slower than GOSS and MVS | It provides gradient-based one-side sampling (GOSS), which keeps instances with large gradients and randomly samples instances with small gradients when selecting splits | It provides a technique called minimal variance sampling (MVS), where weighted sampling happens at the tree level rather than at the split level |
| Missing values | Missing values are allocated to the side that reduces the loss in each split | Missing values are assigned to the side that reduces the loss in each split | It has “Min” and “Max” modes for processing missing values |
| Leaf growth | It splits up to the specified max_depth and then prunes the tree backwards, deleting splits with no positive gain; a split without loss reduction is kept during growth because it can be followed by a split that does reduce the loss | It uses best-first (leaf-wise) growth, expanding the leaf that most reduces the loss; this allows unbalanced trees and can overfit when data are small | It grows a balanced tree: at each level, the feature–split pair producing the least loss is selected and used for all nodes at that level |
| Training speed | Slower than CatBoost and LightGBM | Faster than CatBoost and XGBoost | Faster than XGBoost |
| Categorical feature handling | It has no built-in handling of categorical features; the user has to encode them | It sorts the categories according to the training objective; categorical_feature specifies which columns to treat as categorical | It combines one-hot encoding with advanced mean (target) encoding; one_hot_max_size one-hot encodes every feature with at most that many distinct values |
| Parameters to control overfitting | Learning rate, maximum depth, minimum child weight | Learning rate, maximum depth, number of leaves, minimum data in leaf | Learning rate, depth, L2 leaf regularisation |
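The overfitting-control knobs in the last row map onto the following parameter names in the three libraries; the values below are hypothetical starting points, not the paper's settings:

```python
# Hypothetical starting values -- tune per dataset, e.g. via cross-validation
xgboost_params = {
    "learning_rate": 0.1,
    "max_depth": 6,
    "min_child_weight": 1,
}
lightgbm_params = {
    "learning_rate": 0.1,
    "max_depth": 6,
    "num_leaves": 31,        # leaf-wise growth: keep well below 2**max_depth
    "min_data_in_leaf": 20,
}
catboost_params = {
    "learning_rate": 0.1,
    "depth": 6,
    "l2_leaf_reg": 3.0,
}
```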