Pengyi Yang, Liang Xu, Bing B Zhou, Zili Zhang, Albert Y Zomaya.
Abstract
BACKGROUND: Medical and biological data commonly have small sample sizes, missing values, and, most importantly, imbalanced class distributions. In this study we propose a particle swarm based hybrid system for remedying the class imbalance problem in medical and biological data mining. This hybrid system combines the particle swarm optimization (PSO) algorithm with multiple classifiers and evaluation metrics for evaluation fusion. Samples from the majority class are ranked using multiple objectives according to their merit in compensating for the class imbalance, and then combined with the minority class to form a balanced dataset.
Year: 2009 PMID: 19958499 PMCID: PMC2788388 DOI: 10.1186/1471-2164-10-S3-S34
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. Schematic flow chart of the sampling and evaluation processes. The original imbalanced dataset is split into training and test sets with an external stratified cross validation. The sampling process is then conducted on an internal stratified cross validation to create a balanced training set. The classification models are built on the balanced training set, and the test set from the external cross validation is classified using the obtained classification models.
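The nested splitting described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the fold counts (`outer_k=5`, `inner_k=3`), and the seeds are assumptions, and the labels are synthetic, built only to match the class counts reported for the Blood dataset.

```python
import random
from collections import defaultdict

def stratified_folds(indices, labels, k, seed=0):
    """Split `indices` into k folds with roughly equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in indices:
        by_class[labels[idx]].append(idx)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        rng.shuffle(members)
        # Deal each class round-robin so every fold gets its share.
        for pos, idx in enumerate(members):
            folds[pos % k].append(idx)
    return folds

def nested_splits(labels, outer_k=5, inner_k=3, seed=0):
    """Yield (inner_folds, test) pairs. `test` is held out by the external
    split; `inner_folds` partition the remaining training samples so the
    sampling step never sees the external test set."""
    outer = stratified_folds(range(len(labels)), labels, outer_k, seed)
    for i, test in enumerate(outer):
        train = [idx for j, fold in enumerate(outer) if j != i for idx in fold]
        yield stratified_folds(train, labels, inner_k, seed + 1), test

# Synthetic labels: 568 negative and 180 positive samples (assumed fold counts).
labels = [0] * 568 + [1] * 180
splits = list(nested_splits(labels))
```

Keeping the internal folds inside the external training portion is what prevents the balanced training set from being tuned on samples that are later used for testing.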
Figure 2. Particle swarm based hybrid module for data sampling. Multiple classification algorithms are used to guide the sampling process. Within each classification algorithm, three evaluation metrics are employed to evaluate the goodness of the sample subsets. The PSO algorithm is used to optimize the sample subsets according to the evaluation results of each classification component.
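The caption does not name the three evaluation metrics, but the result tables report AUC, FMeasure, and GMean. The sketch below computes each one using the usual textbook definitions, which are assumed here rather than taken from the paper; the function names and the toy scores are hypothetical.

```python
import math

def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall on the positive class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def g_mean(tp, fp, tn, fn):
    """Geometric mean of sensitivity and specificity."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)

def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: five samples scored by some classifier, thresholded at 0.5.
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 0]
toy_auc = auc(scores, labels)            # 5 of 6 pairs ranked correctly
toy_f = f_measure(tp=1, fp=1, fn=1)      # precision = recall = 0.5
toy_g = g_mean(tp=1, fp=1, tn=2, fn=1)   # sqrt(0.5 * 2/3)
```

All three metrics depend on how well the minority class is recognised, which plain accuracy can hide when the classes are imbalanced.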
Figure 3. The main loop of the BPSO based hybrid algorithm.
Summary of the medical and biological datasets used in the experiments.
| Dataset | # Features | # Negative | # Positive | Prevalence |
|---|---|---|---|---|
| Blood | 4 | 568 | 180 | 24.1% |
| Survival | 3 | 225 | 81 | 26.5% |
| Diabetes | 8 | 500 | 268 | 34.9% |
| Breast | 32 | 151 | 47 | 23.7% |
| AMD-CGA | 25 | 100 | 46 | 31.5% |
| AMD-Neov | 25 | 96 | 50 | 34.2% |
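The Prevalence column is consistent with the share of positive samples in each dataset, that is, positives divided by the total sample count. A short check, with the counts copied from the table and the helper name `prevalence` chosen only for illustration:

```python
# (negative, positive) counts copied from the dataset summary table.
datasets = {
    "Blood": (568, 180),
    "Survival": (225, 81),
    "Diabetes": (500, 268),
    "Breast": (151, 47),
    "AMD-CGA": (100, 46),
    "AMD-Neov": (96, 50),
}

def prevalence(negative, positive):
    """Fraction of samples that belong to the positive (minority) class."""
    return positive / (negative + positive)

# Format as percentages with one decimal place, as in the table.
formatted = {name: f"{100 * prevalence(neg, pos):.1f}%"
             for name, (neg, pos) in datasets.items()}
for name, value in formatted.items():
    print(name, value)
```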
Parameter settings of the particle swarm based hybrid system.
| Parameter | Value |
|---|---|
| Size of Classification Committee | 5 |
| Number of Evaluation Metrics | 3 |
| Size of Particle Population | 100 |
| Iteration | 150 |
| Update Rule | Sigmoid Function |
| Cognitive Constant | 1.43 |
| Social Acceleration Constant | 1.43 |
| Inertia Weight | 0.689 |
| Velocity Bound | 0.018-0.982 |
| Fitness Weight |
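To show how these parameters might enter a binary PSO update, here is a minimal sketch of one particle step using the standard binary PSO formulation with a sigmoid update rule. Several points are assumptions rather than details taken from the paper: each bit is read as "include this majority-class sample", the 0.018-0.982 bound is interpreted as a limit on the sigmoid output (sigmoid(-4) ≈ 0.018 and sigmoid(4) ≈ 0.982), and the function names are hypothetical. The fitness weight is not used, since its value is not given in the table.

```python
import math
import random

W, C1, C2 = 0.689, 1.43, 1.43  # inertia weight, cognitive and social constants (from the table)
P_MIN, P_MAX = 0.018, 0.982    # table bound, assumed here to apply to sigmoid(velocity)

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def update_particle(position, velocity, personal_best, global_best, rng):
    """One standard binary-PSO step for a bit-vector particle."""
    new_position, new_velocity = [], []
    for x, v, p, g in zip(position, velocity, personal_best, global_best):
        # Velocity is pulled toward the particle's own best and the swarm's best.
        v = W * v + C1 * rng.random() * (p - x) + C2 * rng.random() * (g - x)
        # The sigmoid turns velocity into the probability that the bit is 1.
        prob = min(P_MAX, max(P_MIN, sigmoid(v)))
        new_velocity.append(v)
        new_position.append(1 if rng.random() < prob else 0)
    return new_position, new_velocity

rng = random.Random(0)
position = [rng.randint(0, 1) for _ in range(8)]
velocity = [0.0] * 8
new_position, new_velocity = update_particle(
    position, velocity, personal_best=position, global_best=[1] * 8, rng=rng
)
```

Bounding the probability away from 0 and 1 keeps every bit able to flip, so the swarm does not freeze on a single subset early in the run.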
Evaluation results of Blood dataset using different sampling strategies with three metrics across ten classification algorithms.
| Method | Metric | J48 | 3NN | NB | RF5 | LOG | 1NN | 7NN | SMO | RF10 | RBFNet | R. Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PSO | AUC | 0.693 | 0.656 | 0.706 | 0.656 | 0.736 | 0.612 | 0.696 | 0.667 | 0.660 | 0.720 | 0.680 |
| PSO | FMeasure | 0.495 | 0.446 | 0.458 | 0.430 | 0.494 | 0.409 | 0.485 | 0.486 | 0.434 | 0.487 | 0.462 |
| PSO | GMean | 0.671 | 0.622 | 0.634 | 0.605 | 0.668 | 0.590 | 0.662 | 0.655 | 0.614 | 0.663 | 0.638 |
| PSO | C. Avg. | 0.620 | 0.575 | 0.599 | 0.564 | 0.633 | 0.537 | 0.614 | 0.603 | 0.569 | 0.623 | 0.593 |
| RU | AUC | 0.663 | 0.647 | 0.713 | 0.632 | 0.745 | 0.597 | 0.689 | 0.666 | 0.638 | 0.710 | 0.669 |
| RU | FMeasure | 0.474 | 0.425 | 0.417 | 0.419 | 0.511 | 0.393 | 0.461 | 0.486 | 0.424 | 0.462 | 0.447 |
| RU | GMean | 0.643 | 0.609 | 0.586 | 0.600 | 0.686 | 0.577 | 0.641 | 0.655 | 0.605 | 0.639 | 0.624 |
| RU | C. Avg. | 0.593 | 0.560 | 0.572 | 0.550 | 0.647 | 0.522 | 0.597 | 0.602 | 0.556 | 0.604 | 0.580 |
| RO | AUC | 0.657 | 0.635 | 0.710 | 0.618 | 0.749 | 0.573 | 0.652 | 0.671 | 0.629 | 0.715 | 0.661 |
| RO | FMeasure | 0.460 | 0.422 | 0.375 | 0.387 | 0.514 | 0.339 | 0.432 | 0.491 | 0.380 | 0.474 | 0.428 |
| RO | GMean | 0.635 | 0.607 | 0.538 | 0.568 | 0.689 | 0.522 | 0.615 | 0.663 | 0.561 | 0.651 | 0.605 |
| RO | C. Avg. | 0.584 | 0.555 | 0.541 | 0.524 | 0.651 | 0.478 | 0.566 | 0.608 | 0.523 | 0.613 | 0.565 |
| Cluster | AUC | 0.616 | 0.660 | 0.677 | 0.614 | 0.651 | 0.571 | 0.661 | 0.658 | 0.629 | 0.711 | 0.645 |
| Cluster | FMeasure | 0.449 | 0.420 | 0.382 | 0.405 | 0.429 | 0.318 | 0.414 | 0.419 | 0.348 | 0.454 | 0.404 |
| Cluster | GMean | 0.587 | 0.583 | 0.556 | 0.559 | 0.560 | 0.534 | 0.616 | 0.635 | 0.608 | 0.658 | 0.590 |
| Cluster | C. Avg. | 0.551 | 0.554 | 0.538 | 0.526 | 0.547 | 0.474 | 0.564 | 0.571 | 0.528 | 0.608 | 0.546 |
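The "R. Avg." and "C. Avg." entries appear to be plain arithmetic means: "R. Avg." averages one metric across the ten classifiers (a row), and "C. Avg." averages the three metrics for one classifier (a column). The check below reproduces the PSO figures for the Blood dataset from the values in the table; the variable names are illustrative only.

```python
from statistics import mean

classifiers = ["J48", "3NN", "NB", "RF5", "LOG", "1NN", "7NN", "SMO", "RF10", "RBFNet"]

# PSO results on the Blood dataset, copied from the table above.
pso_blood = {
    "AUC":      [0.693, 0.656, 0.706, 0.656, 0.736, 0.612, 0.696, 0.667, 0.660, 0.720],
    "FMeasure": [0.495, 0.446, 0.458, 0.430, 0.494, 0.409, 0.485, 0.486, 0.434, 0.487],
    "GMean":    [0.671, 0.622, 0.634, 0.605, 0.668, 0.590, 0.662, 0.655, 0.614, 0.663],
}

# Row average: one metric averaged over all classifiers.
row_avg = {metric: round(mean(values), 3) for metric, values in pso_blood.items()}

# Column average: the three metrics averaged for each classifier.
col_avg = {name: round(mean(column), 3)
           for name, column in zip(classifiers, zip(*pso_blood.values()))}

print(row_avg)
print(col_avg)
```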
Evaluation results of Survival dataset using different sampling strategies with three metrics across ten classification algorithms.
| Method | Metric | J48 | 3NN | NB | RF5 | LOG | 1NN | 7NN | SMO | RF10 | RBFNet | R. Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PSO | AUC | 0.660 | 0.626 | 0.660 | 0.668 | 0.698 | 0.614 | 0.618 | 0.617 | 0.680 | 0.700 | 0.654 |
| PSO | FMeasure | 0.495 | 0.422 | 0.495 | 0.469 | 0.542 | 0.459 | 0.421 | 0.447 | 0.464 | 0.492 | 0.471 |
| PSO | GMean | 0.641 | 0.580 | 0.635 | 0.618 | 0.687 | 0.612 | 0.581 | 0.587 | 0.614 | 0.643 | 0.620 |
| PSO | C. Avg. | 0.599 | 0.543 | 0.597 | 0.585 | 0.642 | 0.562 | 0.540 | 0.550 | 0.586 | 0.612 | 0.582 |
| RU | AUC | 0.636 | 0.565 | 0.633 | 0.628 | 0.668 | 0.589 | 0.619 | 0.586 | 0.659 | 0.651 | 0.623 |
| RU | FMeasure | 0.482 | 0.406 | 0.459 | 0.455 | 0.486 | 0.436 | 0.429 | 0.399 | 0.460 | 0.477 | 0.449 |
| RU | GMean | 0.626 | 0.562 | 0.598 | 0.599 | 0.637 | 0.589 | 0.587 | 0.554 | 0.608 | 0.619 | 0.598 |
| RU | C. Avg. | 0.581 | 0.511 | 0.563 | 0.561 | 0.597 | 0.538 | 0.545 | 0.513 | 0.576 | 0.582 | 0.557 |
| RO | AUC | 0.619 | 0.617 | 0.641 | 0.631 | 0.684 | 0.588 | 0.602 | 0.615 | 0.639 | 0.663 | 0.631 |
| RO | FMeasure | 0.465 | 0.433 | 0.427 | 0.389 | 0.487 | 0.354 | 0.413 | 0.411 | 0.368 | 0.459 | 0.422 |
| RO | GMean | 0.617 | 0.579 | 0.561 | 0.553 | 0.632 | 0.514 | 0.573 | 0.547 | 0.534 | 0.608 | 0.574 |
| RO | C. Avg. | 0.567 | 0.543 | 0.543 | 0.524 | 0.601 | 0.485 | 0.529 | 0.524 | 0.514 | 0.577 | 0.542 |
| Cluster | AUC | 0.623 | 0.564 | 0.642 | 0.601 | 0.664 | 0.546 | 0.570 | 0.595 | 0.616 | 0.634 | 0.606 |
| Cluster | FMeasure | 0.451 | 0.376 | 0.443 | 0.366 | 0.460 | 0.325 | 0.397 | 0.380 | 0.389 | 0.436 | 0.402 |
| Cluster | GMean | 0.602 | 0.538 | 0.559 | 0.539 | 0.636 | 0.497 | 0.546 | 0.512 | 0.570 | 0.601 | 0.560 |
| Cluster | C. Avg. | 0.559 | 0.493 | 0.548 | 0.502 | 0.587 | 0.456 | 0.504 | 0.496 | 0.525 | 0.557 | 0.523 |
Evaluation results of Diabetes dataset using different sampling strategies with three metrics across ten classification algorithms.
| Method | Metric | J48 | 3NN | NB | RF5 | LOG | 1NN | 7NN | SMO | RF10 | RBFNet | R. Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PSO | AUC | 0.746 | 0.761 | 0.808 | 0.801 | 0.827 | 0.693 | 0.793 | 0.740 | 0.817 | 0.786 | 0.777 |
| PSO | FMeasure | 0.660 | 0.618 | 0.638 | 0.662 | 0.661 | 0.612 | 0.651 | 0.662 | 0.671 | 0.639 | 0.647 |
| PSO | GMean | 0.734 | 0.698 | 0.717 | 0.736 | 0.734 | 0.691 | 0.727 | 0.738 | 0.745 | 0.719 | 0.724 |
| PSO | C. Avg. | 0.713 | 0.692 | 0.721 | 0.733 | 0.741 | 0.665 | 0.724 | 0.713 | 0.744 | 0.715 | 0.716 |
| RU | AUC | 0.697 | 0.739 | 0.801 | 0.765 | 0.829 | 0.665 | 0.773 | 0.737 | 0.791 | 0.759 | 0.756 |
| RU | FMeasure | 0.635 | 0.603 | 0.635 | 0.628 | 0.665 | 0.581 | 0.636 | 0.657 | 0.659 | 0.609 | 0.631 |
| RU | GMean | 0.707 | 0.686 | 0.714 | 0.705 | 0.740 | 0.660 | 0.714 | 0.734 | 0.734 | 0.694 | 0.709 |
| RU | C. Avg. | 0.680 | 0.676 | 0.717 | 0.699 | 0.745 | 0.635 | 0.708 | 0.709 | 0.728 | 0.687 | 0.699 |
| RO | AUC | 0.709 | 0.722 | 0.799 | 0.774 | 0.831 | 0.653 | 0.774 | 0.735 | 0.796 | 0.797 | 0.760 |
| RO | FMeasure | 0.634 | 0.592 | 0.628 | 0.612 | 0.665 | 0.549 | 0.621 | 0.656 | 0.616 | 0.636 | 0.622 |
| RO | GMean | 0.713 | 0.675 | 0.705 | 0.696 | 0.739 | 0.643 | 0.699 | 0.733 | 0.696 | 0.716 | 0.702 |
| RO | C. Avg. | 0.685 | 0.663 | 0.711 | 0.694 | 0.745 | 0.615 | 0.698 | 0.708 | 0.703 | 0.716 | 0.695 |
| Cluster | AUC | 0.701 | 0.729 | 0.801 | 0.759 | 0.813 | 0.608 | 0.769 | 0.753 | 0.788 | 0.784 | 0.751 |
| Cluster | FMeasure | 0.624 | 0.603 | 0.635 | 0.598 | 0.684 | 0.513 | 0.626 | 0.680 | 0.643 | 0.629 | 0.624 |
| Cluster | GMean | 0.702 | 0.678 | 0.711 | 0.686 | 0.723 | 0.615 | 0.698 | 0.752 | 0.721 | 0.711 | 0.700 |
| Cluster | C. Avg. | 0.676 | 0.670 | 0.716 | 0.681 | 0.740 | 0.579 | 0.698 | 0.728 | 0.717 | 0.708 | 0.692 |
Evaluation results of Breast dataset using different sampling strategies with three metrics across ten classification algorithms.
| Method | Metric | J48 | 3NN | NB | RF5 | LOG | 1NN | 7NN | SMO | RF10 | RBFNet | R. Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PSO | AUC | 0.580 | 0.610 | 0.602 | 0.603 | 0.736 | 0.570 | 0.625 | 0.670 | 0.637 | 0.549 | 0.618 |
| PSO | FMeasure | 0.418 | 0.423 | 0.393 | 0.369 | 0.489 | 0.394 | 0.425 | 0.487 | 0.392 | 0.359 | 0.415 |
| PSO | GMean | 0.593 | 0.599 | 0.577 | 0.550 | 0.661 | 0.553 | 0.595 | 0.660 | 0.577 | 0.544 | 0.591 |
| PSO | C. Avg. | 0.530 | 0.544 | 0.524 | 0.507 | 0.629 | 0.506 | 0.548 | 0.606 | 0.535 | 0.484 | 0.541 |
| RU | AUC | 0.562 | 0.597 | 0.587 | 0.604 | 0.722 | 0.515 | 0.612 | 0.650 | 0.639 | 0.568 | 0.606 |
| RU | FMeasure | 0.393 | 0.407 | 0.378 | 0.403 | 0.479 | 0.345 | 0.422 | 0.466 | 0.399 | 0.388 | 0.408 |
| RU | GMean | 0.552 | 0.583 | 0.558 | 0.571 | 0.656 | 0.502 | 0.597 | 0.637 | 0.581 | 0.568 | 0.581 |
| RU | C. Avg. | 0.502 | 0.529 | 0.508 | 0.526 | 0.619 | 0.454 | 0.544 | 0.584 | 0.540 | 0.508 | 0.532 |
| RO | AUC | 0.569 | 0.564 | 0.596 | 0.593 | 0.789 | 0.544 | 0.582 | 0.688 | 0.639 | 0.480 | 0.604 |
| RO | FMeasure | 0.327 | 0.391 | 0.384 | 0.354 | 0.560 | 0.325 | 0.382 | 0.509 | 0.297 | 0.286 | 0.382 |
| RO | GMean | 0.508 | 0.565 | 0.569 | 0.522 | 0.701 | 0.512 | 0.560 | 0.683 | 0.452 | 0.475 | 0.555 |
| RO | C. Avg. | 0.468 | 0.507 | 0.516 | 0.490 | 0.683 | 0.460 | 0.508 | 0.627 | 0.463 | 0.414 | 0.514 |
| Cluster | AUC | 0.543 | 0.616 | 0.538 | 0.554 | 0.711 | 0.547 | 0.603 | 0.641 | 0.586 | 0.528 | 0.587 |
| Cluster | FMeasure | 0.384 | 0.402 | 0.356 | 0.325 | 0.466 | 0.371 | 0.417 | 0.450 | 0.342 | 0.332 | 0.385 |
| Cluster | GMean | 0.579 | 0.585 | 0.550 | 0.508 | 0.648 | 0.527 | 0.583 | 0.638 | 0.540 | 0.529 | 0.569 |
| Cluster | C. Avg. | 0.502 | 0.534 | 0.481 | 0.462 | 0.608 | 0.482 | 0.534 | 0.576 | 0.489 | 0.463 | 0.514 |
Evaluation results of AMD-CGA dataset using different sampling strategies with three metrics across ten classification algorithms.
| Method | Metric | J48 | 3NN | NB | RF5 | LOG | 1NN | 7NN | SMO | RF10 | RBFNet | R. Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PSO | AUC | 0.609 | 0.559 | 0.606 | 0.591 | 0.566 | 0.599 | 0.590 | 0.572 | 0.573 | 0.540 | 0.581 |
| PSO | FMeasure | 0.481 | 0.462 | 0.489 | 0.464 | 0.448 | 0.485 | 0.468 | 0.460 | 0.465 | 0.455 | 0.468 |
| PSO | GMean | 0.580 | 0.557 | 0.572 | 0.551 | 0.538 | 0.576 | 0.573 | 0.550 | 0.545 | 0.539 | 0.558 |
| PSO | C. Avg. | 0.557 | 0.526 | 0.556 | 0.535 | 0.517 | 0.553 | 0.544 | 0.527 | 0.528 | 0.511 | 0.536 |
| RU | AUC | 0.569 | 0.547 | 0.549 | 0.594 | 0.567 | 0.569 | 0.579 | 0.556 | 0.604 | 0.570 | 0.570 |
| RU | FMeasure | 0.434 | 0.457 | 0.439 | 0.476 | 0.453 | 0.439 | 0.446 | 0.417 | 0.456 | 0.446 | 0.446 |
| RU | GMean | 0.538 | 0.556 | 0.538 | 0.580 | 0.551 | 0.547 | 0.553 | 0.523 | 0.550 | 0.549 | 0.549 |
| RU | C. Avg. | 0.514 | 0.520 | 0.509 | 0.550 | 0.524 | 0.518 | 0.526 | 0.499 | 0.537 | 0.522 | 0.522 |
| RO | AUC | 0.566 | 0.568 | 0.565 | 0.597 | 0.581 | 0.569 | 0.586 | 0.558 | 0.576 | 0.586 | 0.575 |
| RO | FMeasure | 0.394 | 0.405 | 0.375 | 0.442 | 0.420 | 0.415 | 0.407 | 0.358 | 0.421 | 0.428 | 0.407 |
| RO | GMean | 0.523 | 0.530 | 0.505 | 0.564 | 0.544 | 0.539 | 0.537 | 0.490 | 0.547 | 0.555 | 0.533 |
| RO | C. Avg. | 0.494 | 0.501 | 0.482 | 0.534 | 0.515 | 0.508 | 0.510 | 0.469 | 0.515 | 0.523 | 0.505 |
| Cluster | AUC | 0.519 | 0.595 | 0.560 | 0.601 | 0.580 | 0.580 | 0.566 | 0.545 | 0.581 | 0.576 | 0.570 |
| Cluster | FMeasure | 0.306 | 0.343 | 0.358 | 0.357 | 0.332 | 0.270 | 0.301 | 0.368 | 0.387 | 0.339 | 0.445 |
| Cluster | GMean | 0.499 | 0.558 | 0.532 | 0.566 | 0.548 | 0.569 | 0.546 | 0.511 | 0.554 | 0.548 | 0.543 |
| Cluster | C. Avg. | 0.441 | 0.499 | 0.483 | 0.508 | 0.487 | 0.473 | 0.471 | 0.475 | 0.507 | 0.488 | 0.519 |
Evaluation results of AMD-Neov dataset using different sampling strategies with three metrics across ten classification algorithms.
| Method | Metric | J48 | 3NN | NB | RF5 | LOG | 1NN | 7NN | SMO | RF10 | RBFNet | R. Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PSO | AUC | 0.681 | 0.659 | 0.661 | 0.662 | 0.678 | 0.656 | 0.694 | 0.628 | 0.686 | 0.672 | 0.668 |
| PSO | FMeasure | 0.549 | 0.557 | 0.537 | 0.566 | 0.545 | 0.556 | 0.572 | 0.559 | 0.552 | 0.559 | 0.555 |
| PSO | GMean | 0.622 | 0.628 | 0.619 | 0.643 | 0.626 | 0.630 | 0.648 | 0.637 | 0.631 | 0.631 | 0.632 |
| PSO | C. Avg. | 0.617 | 0.615 | 0.605 | 0.624 | 0.616 | 0.614 | 0.638 | 0.608 | 0.623 | 0.621 | 0.618 |
| RU | AUC | 0.652 | 0.627 | 0.625 | 0.622 | 0.635 | 0.649 | 0.622 | 0.619 | 0.663 | 0.631 | 0.635 |
| RU | FMeasure | 0.549 | 0.526 | 0.524 | 0.534 | 0.519 | 0.531 | 0.543 | 0.529 | 0.561 | 0.539 | 0.536 |
| RU | GMean | 0.637 | 0.602 | 0.601 | 0.609 | 0.596 | 0.615 | 0.615 | 0.604 | 0.636 | 0.612 | 0.613 |
| RU | C. Avg. | 0.613 | 0.585 | 0.583 | 0.588 | 0.583 | 0.598 | 0.593 | 0.584 | 0.620 | 0.594 | 0.595 |
| RO | AUC | 0.643 | 0.643 | 0.646 | 0.659 | 0.635 | 0.655 | 0.638 | 0.632 | 0.660 | 0.657 | 0.647 |
| RO | FMeasure | 0.507 | 0.542 | 0.491 | 0.516 | 0.498 | 0.516 | 0.521 | 0.506 | 0.534 | 0.531 | 0.516 |
| RO | GMean | 0.602 | 0.629 | 0.589 | 0.610 | 0.599 | 0.612 | 0.612 | 0.598 | 0.624 | 0.623 | 0.610 |
| RO | C. Avg. | 0.584 | 0.605 | 0.575 | 0.595 | 0.577 | 0.594 | 0.590 | 0.579 | 0.606 | 0.603 | 0.591 |
| Cluster | AUC | 0.656 | 0.624 | 0.627 | 0.629 | 0.625 | 0.652 | 0.644 | 0.594 | 0.642 | 0.638 | 0.633 |
| Cluster | FMeasure | 0.551 | 0.524 | 0.502 | 0.538 | 0.506 | 0.521 | 0.546 | 0.504 | 0.536 | 0.537 | 0.527 |
| Cluster | GMean | 0.641 | 0.605 | 0.587 | 0.624 | 0.591 | 0.610 | 0.630 | 0.585 | 0.620 | 0.621 | 0.611 |
| Cluster | C. Avg. | 0.616 | 0.584 | 0.572 | 0.597 | 0.574 | 0.594 | 0.607 | 0.561 | 0.599 | 0.599 | 0.590 |
Figure 4. Comparison of each sampling method with respect to different evaluation metrics.
Figure 5. Comparison of each sampling method with respect to different classification algorithms.