Kitsuchart Pasupa, Wanthanee Rathasamuth, Sissades Tongsima.
Abstract
BACKGROUND: The number of porcine single nucleotide polymorphisms (SNPs) used in genetic association studies is very large, which suits statistical testing. For breed classification, however, one needs a much smaller set of porcine-classifying SNPs (PCSNPs) that can still classify pigs into their breeds accurately. This study attempted to find such PCSNPs by combining several feature selection methods (information gain, conventional and modified genetic algorithms, and our own frequency feature selection method) with a common classifier, the support vector machine, and evaluating the classification performance of each combination. Experiments were conducted on a comprehensive data set containing SNPs from native pigs from America, Europe, Africa, and Asia, including Chinese breeds, Vietnamese breeds, and hybrid breeds from Thailand.
Keywords: Feature selection; Genetic algorithm; Information gain; Single nucleotide polymorphisms; Support vector machine
Year: 2020 PMID: 32456608 PMCID: PMC7251909 DOI: 10.1186/s12859-020-3471-4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
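The abstract names information gain (IG) as one of the feature selection steps. As a rough illustration of the general idea only, and not the authors' implementation, the sketch below scores a single SNP by how much knowing its genotype reduces the entropy of the breed labels. The 0/1/2 genotype coding, the toy data, and the function names `entropy` and `information_gain` are all assumptions made for this example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature, labels):
    """IG = H(labels) - H(labels | feature) for one discrete feature."""
    total = len(labels)
    # Group the class labels by the value the feature takes.
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average entropy of the labels inside each group.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Toy data (hypothetical): four pigs, genotypes coded 0/1/2.
breeds = ["Duroc", "Duroc", "Landrace", "Landrace"]
snp_informative = [0, 0, 2, 2]      # genotype separates the two breeds
snp_uninformative = [0, 2, 0, 2]    # genotype is unrelated to breed

ig_informative = information_gain(snp_informative, breeds)
ig_uninformative = information_gain(snp_uninformative, breeds)
print(ig_informative, ig_uninformative)
```

In a filter-style selection, SNPs would then be ranked by this score and only the highest-scoring ones passed on to later steps; the cut-off used in the study is not stated in this record.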
Fig. 1 Examples of four modified GA individuals and their connection with matrix S
Fig. 2 The framework of experiments on the IG+FFS and IG+modified GA+FFS in combination with SVM methods
Fig. 3 Classification accuracy versus number of generations of population in modified GA
Fig. 4 Effects of the assigned percentage of selected features by GA in the initialization step on the final number of selected PCSNPs by IG+GA+FFS method (b) and their classification accuracy (a) that vary with FFS selection threshold
The mean number of finally selected PCSNPs by each of the five feature selection methods and their resulting accuracy as well as the accuracy provided by using the entire swine SNPs in the data set
| Method | PCSNPs, linear | PCSNPs, RBF | Accuracy (%), linear | Accuracy (%), RBF |
|---|---|---|---|---|
| Whole SNPs | 10,210 (100%) | 10,210 (100%) | | |
| IG+GA | 207.70 ±42.71 (2.03%) | 209.90 ±37.57 (2.06%) | 95.27 ±1.57 | 94.88 ±1.11 |
| IG+modified GA | 319.10 ±104.02 (3.13%) | 310.70 ±84.89 (3.04%) | 95.74 ±1.47 | 95.35 ±1.42 |
| IG | 413.30 ±88.22 (4.05%) | 410.30 ±88.22 (4.05%) | 95.43 ±1.53 | 95.58 ±1.51 |
| IG+FFS | 240.80 ±15.33 (2.36%) | 240.80 ±15.33 (2.36%) | 95.58 ±1.27 | |
| IG+modified GA+FFS | | | 94.81 ±1.46 | 95.12 ±1.55 |
One-way ANOVA results for the significance of the difference between the mean accuracy values obtained from using the whole features in the data set and from using only the features selected by various selection methods
| Source | Sum of squares, linear | Sum of squares, RBF | Degrees of freedom, linear | Degrees of freedom, RBF | Mean square, linear | Mean square, RBF | F, linear | F, RBF | p-value, linear | p-value, RBF |
|---|---|---|---|---|---|---|---|---|---|---|
| Methods | 5.82 | 5.13 | 5 | 5 | 1.16 | 1.03 | 0.57 | 0.56 | 0.73 | 0.73 |
| Error | 111.11 | 99.39 | 54 | 54 | 2.06 | 1.84 | - | - | - | - |
| Total | 116.93 | 104.52 | 59 | 59 | - | - | - | - | - | - |
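The derived columns of the ANOVA table can be checked from its sums of squares and degrees of freedom using the standard one-way ANOVA relations: each mean square is a sum of squares divided by its degrees of freedom, and the F statistic is the ratio of the two mean squares. The short sketch below does this arithmetic with the values copied from the table; the function name `f_statistic` is introduced here only for illustration.

```python
# Sums of squares copied from the ANOVA table, per SVM kernel.
anova = {
    "Linear": {"ss_methods": 5.82, "ss_error": 111.11},
    "RBF": {"ss_methods": 5.13, "ss_error": 99.39},
}
DF_METHODS, DF_ERROR = 5, 54  # degrees of freedom from the table


def f_statistic(ss_methods, ss_error, df_methods, df_error):
    """Return (MS_methods, MS_error, F) for a one-way ANOVA."""
    ms_methods = ss_methods / df_methods  # mean square between methods
    ms_error = ss_error / df_error        # mean square within methods
    return ms_methods, ms_error, ms_methods / ms_error


results = {
    kernel: f_statistic(v["ss_methods"], v["ss_error"], DF_METHODS, DF_ERROR)
    for kernel, v in anova.items()
}

for kernel, (ms_methods, ms_error, f_value) in results.items():
    print(f"{kernel}: MS_methods={ms_methods:.2f}, "
          f"MS_error={ms_error:.2f}, F={f_value:.2f}")
```

Rounded to two decimals, the computed mean squares agree with the table (1.16 and 2.06 for the linear kernel, 1.03 and 1.84 for RBF), and the computed ratios reproduce the 0.57 and 0.56 listed in the fourth pair of value columns.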
Fig. 5 (a) The numbers of selected PCSNPs obtained from 10 randomly-seeded data sets; (b) the classification accuracy values obtained from using the selected PCSNPs from those data sets
Fig. 6 Final mean number of selected PCSNPs (a and c) and the classification accuracy (b and d) obtained from the selected PCSNPs from 10 runs of different randomly-seeded data sets; (a) and (b) are from IG+GA while (c) and (d) are from IG+modified GA
Fig. 7 Conventional PCA projection from using the entire SNPs in the data set
Fig. 8 Conventional PCA projection from using only 183 PCSNPs in the data set. (a) PC1 vs PC2 and (b) PC1 vs PC3
Details of swine samples in the data set used in this study
| Breed | Location | Number of samples |
|---|---|---|
| Creole | Alto Baudo-Colombia, Baja Verapaz-Guatemala, Granma-Cuba, Guanacaste, Alajuela-Costa Rica, Loja-Ecuador, Misiones-Argentina, Pinar del Rio-Cuba, Titicaca area-Peru | 90 |
| Monteiro | Pocone-Brazil | 10 |
| Zungo | Cerete-Colombia | 10 |
| Jiangquhai | China | 11 |
| Jinhua | China | 16 |
| Meishan | China | 16 |
| Xiang pig | China | 11 |
| Iberian | Spain | 15 |
| Duroc | Denmark, Holland, USA, Thailand* | 44 |
| Landrace | Denmark, Holland, USA, Thailand*, Hanoi-Vietnam** | 146 |
| Large white | Denmark, Holland, USA, Thailand* | 149 |
| Semi-feral | Formosa-Argentina | 10 |
| Wild boar | Hungary, Poland, Tunisia | 13 |
| Yucatan | Indiana-USA | 10 |
| Hampshire | UK, USA | 14 |
| Guinea hog | USA | 15 |
| Bisaro | Portugal | 14 |
| Mixed-breed pig | Thailand* | 48 |
| HU-TN | Vietnam** | 11 |
| BA-ME | Vietnam** | 11 |
| CP-SO | Vietnam** | 12 |
Note: * indicates that the swine samples are from Thailand Pig data set [28]; ** indicates that the samples are from [22]; the rest of the samples are from [21]
Discovered gene families from the final selected PCSNPs
| No. | Gene symbol | Chr | MAPINFO | No. | Gene symbol | Chr | MAPINFO | No. | Gene symbol | Chr | MAPINFO |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | PTPRK | 1 | 38598147 | 33 | GALNT12 | 1 | 268890145 | 65 | HES1 | 13 | 140688388 |
| 2 | ABCA5 | 12 | 11388469 | 34 | FIGN | 15 | 75340693 | 66 | CCDC13 | 13 | 28893435 |
| 3 | SEMA3E | 9 | 106779607 | 35 | BMPR1B | 8 | 133950496 | 67 | KIT | 8 | 43651639 |
| 4 | KCNU1 | 15 | 14889528 | 36 | PPEF2 | 8 | 75662581 | 68 | GRK5 | 14 | 140846738 |
| 5 | SLC28A3 | 10 | 34818681 | 37 | GNAT3 | 9 | 110416160 | 69 | PIK3C3 | 6 | 118157644 |
| 6 | RTN3 | 2 | 7728326 | 38 | CD3E | 9 | 50672784 | 70 | PDE4B | 6 | 134910792 |
| 7 | SORCS3 | 14 | 125978735 | 39 | PSAP | 14 | 80626307 | 71 | LOC100153360 | 1 | 313111618 |
| 8 | DBH | 1 | 307192626 | 40 | SNCA | 8 | 138635995 | 72 | LOC100739240 | 3 | 74337201 |
| 9 | SNTB1 | 4 | 19453169 | 41 | HACE1 | 1 | 80302938 | 73 | NCR2 | 7 | 41832022 |
| 10 | VAT1L | 6 | 10041550 | 42 | TRHR | 4 | 30842196 | 74 | CDK8 | 11 | 3651117 |
| 11 | LOC100622482 | 6 | 82916168 | 43 | NTS | 5 | 101073253 | 75 | SATB1 | 13 | 6000732 |
| 12 | CUEDC1 | 12 | 35017301 | 44 | ADRA1B | 16 | 69129145 | 76 | ROR2 | 14 | 3557219 |
| 13 | CALB2 | 6 | 13899346 | 45 | RXRG | 4 | 93070713 | 77 | TRPM2 | 2 | 143991472 |
| 14 | MACROD1 | 2 | 7103886 | 46 | AAAS | 5 | 18962460 | 78 | CAPZB | 6 | 71859152 |
| 15 | KLHL25 | 7 | 93415034 | 47 | NEK2 | 9 | 144617825 | 79 | ANKRD35 | 4 | 109093503 |
| 16 | GRK5 | 14 | 140846738 | 48 | RNF180 | 16 | 45700636 | 80 | SECISBP2L | 1 | 136455430 |
| 17 | DPEP1 | 6 | 504970 | 49 | EML5 | 7 | 117152298 | 81 | LMX1B | 1 | 301126002 |
| 18 | LOC100155953 | 7 | 122798672 | 50 | ABLIM1 | 14 | 135899761 | 82 | DTL | 9 | 144214338 |
| 19 | ZMIZ1 | 14 | 88275273 | 51 | RBM19 | 14 | 40500953 | 83 | PPP2R5A | 9 | 144185861 |
| 20 | LOC100156904 | 1 | 296533542 | 52 | PRUNE2 | 1 | 256372239 | 84 | RCAN1 | 13 | 208012602 |
| 21 | PCDH15 | 14 | 104808991 | 53 | PDZK1IP1 | 6 | 119087839 | 85 | RAPGEF4 | 15 | 24972365 |
| 22 | SLC22A5 | 2 | 140066357 | 54 | GAD2 | 10 | 54668661 | 86 | LHX2 | 1 | 298735016 |
| 23 | LNX1 | 8 | 42621415 | 55 | CP | 13 | 97407074 | 87 | IQSEC3 | 5 | 69759629 |
| 24 | DNAJB12 | 14 | 81222592 | 56 | SAMD3 | 1 | 36604527 | 88 | LY96 | 4 | 67548067 |
| 25 | CDKAL1 | 7 | 17100569 | 57 | SLC35F4 | 1 | 207232466 | 89 | WHAMM | 7 | 57639263 |
| 26 | CRB2 | 1 | 297932234 | 58 | FCRLB | 4 | 96854257 | 90 | CHD1L | 4 | 110076256 |
| 27 | SPOCK2 | 14 | 80904334 | 59 | ENPP5 | 7 | 47241389 | 91 | ADAMTS16 | 16 | 82812184 |
| 28 | CCND2 | 5 | 68326348 | 60 | CYP7B1 | 4 | 75934281 | 92 | TBC1D14 | 8 | 3354915 |
| 29 | TXNDC15 | 2 | 142718262 | 61 | AGRP | 6 | 25411042 | 93 | PARM1 | 8 | 74934147 |
| 30 | FRAS1 | 8 | 77822157 | 62 | NOX4 | 9 | 25460973 | 94 | FGFR1 | 15 | 55262655 |
| 31 | A2M | 5 | 65318067 | 63 | LOC100511652 | 9 | 12772773 | | | | |
| 32 | STAT3 | 12 | 20767800 | 64 | ARHGAP26 | 2 | 150907623 | | | | |
Functions of gene products
| No. | Gene symbol | Gene ontology biological process complete |
|---|---|---|
| 1 | PTPRK | transforming growth factor beta receptor signaling pathway (GO:0007179); negative regulation of keratinocyte proliferation (GO:0010839); cell migration (GO:0016477); negative regulation of cell migration (GO:0030336); protein localization to cell surface (GO:0034394); cellular response to reactive oxygen species (GO:0034614); cellular response to UV (GO:0034644); peptidyl-tyrosine dephosphorylation (GO:0035335); negative regulation of cell cycle (GO:0045786); negative regulation of transcription, DNA-templated (GO:0045892); focal adhesion assembly (GO:0048041) |
| 2 | ABCA5 | negative regulation of macrophage derived foam cell differentiation (GO:0010745); cholesterol transport (GO:0030301); cholesterol efflux (GO:0033344); high-density lipoprotein particle remodeling (GO:0034375); transmembrane transport (GO:0055085) |
| 3 | SEMA3E | branching involved in blood vessel morphogenesis (GO:0001569); negative regulation of cell-matrix adhesion (GO:0001953); sprouting angiogenesis (GO:0002040); regulation of cell shape (GO:0008360); negative regulation of angiogenesis (GO:0016525); synapse organization (GO:0050808); negative chemotaxis (GO:0050919); semaphorin-plexin signaling pathway (GO:0071526); regulation of actin cytoskeleton reorganization (GO:2000249) |
| 4 | KCNU1 | potassium ion transport (GO:0006813); ion transmembrane transport (GO:0034220); potassium ion transmembrane transport (GO:0071805) |
| 5 | SLC28A3 | pyrimidine nucleobase transport (GO:0015855); purine nucleoside transmembrane transport (GO:0015860); pyrimidine nucleoside transport (GO:0015864); sodium ion transmembrane transport (GO:0035725); pyrimidine-containing compound transmembrane transport (GO:0072531); purine nucleobase transmembrane transport (GO:1904823) |
Discovered genes that did not match any genes in the PANTHER database
| No. | Gene symbol | No. | Gene symbol | No. | Gene symbol |
|---|---|---|---|---|---|
| 1 | LOC100511786 | 14 | DLK1 | 27 | LOC100516653 |
| 2 | LOC100625374 | 15 | LOC100738463 | 28 | LOC100628179 |
| 3 | LOC100515404 | 16 | LOC100157816 | 29 | AGMO |
| 4 | PTPN3 | 17 | ITGB5 | 30 | LOC100127144 |
| 5 | LOC100737182 | 18 | LOC100512373 | 31 | LOC100154421 |
| 6 | LOC100153068 | 19 | LOC100511786 | 32 | LOC100525245 |
| 7 | DLK1 | 20 | LOC100513826 | 33 | LOC100624347 |
| 8 | TLL1 | 21 | LOC100625374 | 34 | LOC100515332 |
| 9 | LOC100738463 | 22 | LOC100628176 | 35 | LOC100622308 |
| 10 | ITGB5 | 23 | LOC100622588 | 36 | BRP44L |
| 11 | LOC100156777 | 24 | LOC100627046 | 37 | TCF4 |
| 12 | LOC100512373 | 25 | C13H21orf63 | 38 | LOC100736576 |
| 13 | CTNNA2 | 26 | LOC100519752 | | |
Fig. 9 Effects of FFS threshold on classification accuracy and the final number of PCSNPs selected by IG+modified GA+FFS
Fig. 10 Relative differences of classification accuracy and number of selected SNPs by IG+modified GA+FFS with different values of FFS threshold with respect to FFS threshold equal to 1
Fig. 11 Plot of information gain value versus number of selected PCSNPs by IG (blue bars), IG+modified GA+FFS (red bars) and the genes found in the PANTHER pathway (green bars)
Fig. 12 Performance of the model versus their MAF value when each PCSNP is left out
Fig. 13 Frequency of occurrences of PCSNPs in all subsets of selected features from IG and IG+modified GA versus the number of PCSNPs that occurred with that frequency (a); and a schematic diagram of FFS operation (b)
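The captions describe frequency feature selection (FFS) only in outline: PCSNPs are counted across the subsets produced by earlier selection steps, and a threshold decides which ones are kept. The sketch below is one plausible reading of that idea and should not be taken as the authors' algorithm. It assumes that the threshold is the minimum fraction of subsets in which a SNP must appear, so that a threshold of 1 keeps only SNPs present in every subset. The function name `frequency_feature_selection` and the SNP identifiers are hypothetical.

```python
from collections import Counter


def frequency_feature_selection(subsets, threshold):
    """Keep features occurring in at least `threshold` (a fraction in
    (0, 1]) of the given feature subsets. Assumed semantics, see text."""
    if not subsets:
        raise ValueError("at least one feature subset is required")
    if not 0 < threshold <= 1:
        raise ValueError("threshold must lie in (0, 1]")
    # Count each feature once per subset, even if a subset repeats it.
    counts = Counter(f for subset in subsets for f in set(subset))
    n_subsets = len(subsets)
    return sorted(f for f, c in counts.items() if c / n_subsets >= threshold)


# Hypothetical subsets, e.g. from repeated runs of an earlier selector.
runs = [
    ["snp_a", "snp_b", "snp_c"],
    ["snp_a", "snp_c"],
    ["snp_a", "snp_d"],
]

strict = frequency_feature_selection(runs, threshold=1.0)
relaxed = frequency_feature_selection(runs, threshold=0.6)
print(strict, relaxed)
```

Under this reading, lowering the threshold can only enlarge the selected set, which matches the trade-off between the number of PCSNPs and accuracy that the threshold figures explore.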