Yonghui Xu, Huaqing Min, Qingyao Wu, Hengjie Song, Bicui Ye.
Abstract
Multi-Instance (MI) learning has proven effective for genome-wide protein function prediction, where each training example is associated with multiple instances. Many studies in the literature have sought an appropriate Multi-Instance Learning (MIL) method for this task under the usual assumption that the testing data (target domain, TD) follow the same underlying distribution as the training data (source domain, SD). However, this assumption may be violated in practice. To tackle this problem, we propose a Multi-Instance Metric Transfer Learning (MIMTL) approach for genome-wide protein function prediction. In MIMTL, we first use bag weights to transfer the source domain distribution to the target domain distribution. We then construct a distance metric learning method with the reweighted bags. Finally, we develop an alternating optimization scheme for MIMTL. Comprehensive experiments on seven real-world organisms verify the effectiveness and efficiency of the proposed MIMTL approach compared with several state-of-the-art methods.
Year: 2017 PMID: 28165495 PMCID: PMC5292966 DOI: 10.1038/srep41831
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Important Definitions.
| Symbols | Definitions |
|---|---|
| SD | The source domain dataset. |
| TD | The target domain dataset. |
|  | The average of all the instances in bag |
|  | The Gene Ontology terms assigned to |
|  | The center of |
|  | The square of the Mahalanobis distance between instance |
|  | The square of the Mahalanobis distance between bags |
|  | The learned Mahalanobis distance metric. |
|  | The loss corresponding to traditional Multi-Instance Metric Learning. |
|  | The expected loss. |
|  | A constant to limit the minimum distance between the center of the bag and the instance in the bag. |
|  | A constant to limit the maximum distance between bags from different classes. |
|  | Two slack vectors to improve the robustness of the algorithm. |
|  | The weight vector of bags. |
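The definitions above refer to squared Mahalanobis distances between instances and between bags. The following is a minimal sketch of those two quantities, not the paper's implementation. It assumes that the bag-level distance is taken between the bag averages, which this section does not state explicitly. All function names are hypothetical, and `M` stands for a positive semi-definite metric matrix.

```python
from typing import List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def bag_mean(bag: Sequence[Vector]) -> List[float]:
    """Average of all the instances in a bag (component-wise mean)."""
    n = len(bag)
    return [sum(column) / n for column in zip(*bag)]


def sq_mahalanobis(x: Vector, y: Vector, M: Matrix) -> float:
    """Squared Mahalanobis distance (x - y)^T M (x - y)."""
    diff = [a - b for a, b in zip(x, y)]
    # M_diff[i] = sum_j M[i][j] * diff[j]
    M_diff = [sum(m * d for m, d in zip(row, diff)) for row in M]
    return sum(d * md for d, md in zip(diff, M_diff))


def sq_bag_distance(bag_a: Sequence[Vector], bag_b: Sequence[Vector], M: Matrix) -> float:
    """ASSUMPTION: bags are compared through their averages."""
    return sq_mahalanobis(bag_mean(bag_a), bag_mean(bag_b), M)


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    bag_a = [[0.0, 0.0], [2.0, 0.0]]   # average is [1, 0]
    bag_b = [[1.0, 2.0], [1.0, 4.0]]   # average is [1, 3]
    print(sq_mahalanobis([0.0, 0.0], [3.0, 4.0], identity))
    print(sq_bag_distance(bag_a, bag_b, identity))
```

With the identity matrix as `M`, both functions reduce to squared Euclidean distance, which makes the sketch easy to sanity-check before substituting a learned metric.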
Figure 1. An example showing the different distributions of the source domain and the target domain.
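To illustrate why reweighting source bags can compensate for a distribution mismatch of this kind, here is a toy sketch of importance weighting. It is a generic covariate-shift illustration, not the paper's weight estimator. The two discrete distributions, the per-bag losses, and the function names are all made-up assumptions chosen so the arithmetic is easy to follow.

```python
from typing import Dict, List, Sequence


def density_ratio_weights(
    groups: Sequence[str], p_source: Dict[str, float], p_target: Dict[str, float]
) -> List[float]:
    """Weight of each source bag: target probability over source probability."""
    return [p_target[g] / p_source[g] for g in groups]


def weighted_mean_loss(losses: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average of per-bag losses (weights normalised to sum to 1)."""
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)


if __name__ == "__main__":
    # Hypothetical setting: bags fall into two groups, "a" and "b".
    p_source = {"a": 0.8, "b": 0.2}
    p_target = {"a": 0.5, "b": 0.5}
    # Ten source bags sampled in proportion to p_source, with toy losses.
    groups = ["a"] * 8 + ["b"] * 2
    losses = [1.0] * 8 + [3.0] * 2

    uniform = weighted_mean_loss(losses, [1.0] * len(losses))
    reweighted = weighted_mean_loss(
        losses, density_ratio_weights(groups, p_source, p_target)
    )
    print("unweighted source loss:", uniform)
    print("reweighted source loss:", reweighted)
```

In this toy case the expected loss under the target distribution is 0.5 × 1.0 + 0.5 × 3.0 = 2.0. The reweighted source average recovers that value, whereas the unweighted source average does not.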
Characteristics of the seven datasets.
| Genome | Bags | Classes | Instances | Dimensions | Bags in Source Domain | Bags in Target Domain |
|---|---|---|---|---|---|---|
| HM | 304 | 234 | 950 | 216 | 152 | 152 |
| PF | 425 | 321 | 1317 | 216 | 213 | 212 |
| AV | 407 | 340 | 1251 | 216 | 204 | 203 |
| GS | 379 | 320 | 1214 | 216 | 190 | 189 |
| CE | 2512 | 940 | 8509 | 216 | 1256 | 1256 |
| DM | 2605 | 1035 | 9146 | 216 | 1303 | 1302 |
| SC | 3509 | 1566 | 6533 | 216 | 1755 | 1754 |
Detailed information about positive and negative instances of the seven datasets.
|  | HM | PF | AV | GS | CE | DM | SC |
|---|---|---|---|---|---|---|---|
| Instances per bag | 3.13 | 3.1 | 3.07 | 3.2 | 3.39 | 3.51 | 1.86 |
| Labels per instance | 3.25 | 4.48 | 4 | 3.14 | 6.07 | 6.02 | 5.89 |
| Positive instances per class | 4.22 | 5.93 | 4.79 | 3.72 | 16.22 | 15.15 | 13.19 |
| Positive instances / negative instances | 1.41% | 1.42% | 1.19% | 0.99% | 0.65% | 0.59% | 0.38% |
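As a quick arithmetic check on the two dataset tables, the sketch below recomputes the source/target split and the instances-per-bag ratio from the reported counts. The counts are copied from the table of dataset characteristics. The variable and function names are illustrative and do not come from the paper.

```python
# genome -> (bags, instances, bags in source domain, bags in target domain),
# copied from the table of dataset characteristics.
DATASETS = {
    "HM": (304, 950, 152, 152),
    "PF": (425, 1317, 213, 212),
    "AV": (407, 1251, 204, 203),
    "GS": (379, 1214, 190, 189),
    "CE": (2512, 8509, 1256, 1256),
    "DM": (2605, 9146, 1303, 1302),
    "SC": (3509, 6533, 1755, 1754),
}


def instances_per_bag(bags: int, instances: int) -> float:
    """Average number of instances contained in one bag."""
    return instances / bags


if __name__ == "__main__":
    for name, (bags, instances, source, target) in DATASETS.items():
        # The source and target domains together should cover every bag.
        assert source + target == bags, name
        print(f"{name}: {instances_per_bag(bags, instances):.3f} instances per bag")
```

The ratios agree with the reported "Instances per bag" row up to rounding, and each genome's bags are split roughly evenly between the source and target domains.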
Figure 2. Comparison results with MIMTL and MIMTL on seven real-world organisms.
Comparison results using the data setting of MICS (ref. 31).
| Genome | MIMLNN | MIMLSVM | EnMIMLmetric | MIMTL |
|---|---|---|---|---|
| Ranking Loss ↓ | | | | |
| HM | 0.3411 ± 0.0229 (4) | 0.3315 ± 0.0161 (3) | 0.3248 ± 0.0092 (2) | 0.2904 ± 0.0348 (1) |
| PF | 0.3153 ± 0.0172 (3) | 0.3443 ± 0.0149 (4) | 0.3136 ± 0.0132 (2) | 0.2784 ± 0.0197 (1) |
| AV | 0.3887 ± 0.0167 (4) | 0.3629 ± 0.0139 (2) | 0.3882 ± 0.0049 (3) | 0.3340 ± 0.0136 (1) |
| GS | 0.4338 ± 0.0116 (4) | 0.3979 ± 0.0153 (2) | 0.4197 ± 0.0084 (3) | 0.3376 ± 0.0224 (1) |
| CE | 0.3929 ± 0.0138 (4) | 0.2759 ± 0.0056 (1) | 0.3643 ± 0.0103 (3) | 0.3139 ± 0.0176 (2) |
| DM | 0.3440 ± 0.0077 (3) | 0.2797 ± 0.0056 (1) | 0.3548 ± 0.0091 (4) | 0.3002 ± 0.0145 (2) |
| SC | 0.4081 ± 0.0071 (2) | 0.2884 ± 0.0022 (1) | 0.4083 ± 0.0067 (3) | 0.4238 ± 0.0174 (4) |
| Average Rank | 3.4286 | 2 | 2.8571 | 1.7143 |
| Coverage ↓ | | | | |
| HM | 104.9102 ± 6.1158 (4) | 96.5986 ± 4.5074 (2) | 99.1306 ± 3.6104 (3) | 82.9551 ± 8.3830 (1) |
| PF | 147.5855 ± 6.2129 (4) | 147.3493 ± 7.0038 (3) | 142.1638 ± 5.3214 (2) | 124.2589 ± 7.3765 (1) |
| AV | 168.0298 ± 5.7280 (4) | 147.0354 ± 5.5012 (2) | 160.9854 ± 2.6453 (3) | 139.0677 ± 5.6253 (1) |
| GS | 159.9543 ± 4.0170 (4) | 139.6457 ± 4.5796 (2) | 151.9766 ± 3.9327 (3) | 120.0859 ± 8.4693 (1) |
| CE | 453.8295 ± 13.6709 (4) | 315.3995 ± 6.1653 (1) | 421.4290 ± 9.1993 (3) | 360.8714 ± 17.8819 (2) |
| DM | 479.2888 ± 8.2933 (3) | 376.5261 ± 7.0924 (1) | 480.6691 ± 8.4298 (4) | 407.3001 ± 18.8377 (2) |
| SC | 834.1562 ± 11.0925 (3) | 568.0499 ± 4.9514 (1) | 834.9128 ± 9.2612 (4) | 820.2915 ± 21.4581 (2) |
| Average Rank | 3.7143 | 1.7143 | 3.1429 | 1.4286 |
| Average-Recall ↑ | | | | |
| HM | 0.0020 ± 0.0028 (4) | 0.0951 ± 0.0185 (3) | 0.1451 ± 0.0191 (2) | 0.2840 ± 0.0377 (1) |
| PF | 0.0039 ± 0.0031 (4) | 0.0748 ± 0.0312 (3) | 0.0859 ± 0.0086 (2) | 0.2910 ± 0.0379 (1) |
| AV | 0.0074 ± 0.0084 (4) | 0.0548 ± 0.0132 (2) | 0.0470 ± 0.0063 (3) | 0.1819 ± 0.0221 (1) |
| GS | 0.0055 ± 0.0053 (4) | 0.0830 ± 0.0204 (3) | 0.1405 ± 0.0343 (2) | 0.2244 ± 0.0266 (1) |
| CE | 0.0756 ± 0.0042 (4) | 0.1022 ± 0.0045 (3) | 0.1310 ± 0.0055 (2) | 0.1779 ± 0.0163 (1) |
| DM | 0.0499 ± 0.0058 (4) | 0.0713 ± 0.0050 (3) | 0.1104 ± 0.0067 (2) | 0.1808 ± 0.0124 (1) |
| SC | 0.0062 ± 0.0023 (4) | 0.0289 ± 0.0010 (2) | 0.0269 ± 0.0026 (3) | 0.0455 ± 0.0235 (1) |
| Average Rank | 4 | 2.7143 | 2.2857 | 1 |
| Average-F1 ↑ | | | | |
| HM | 0.0040 ± 0.0054 (4) | 0.1315 ± 0.0220 (3) | 0.2042 ± 0.0191 (2) | 0.2459 ± 0.0433 (1) |
| PF | 0.0076 ± 0.0060 (4) | 0.1072 ± 0.0374 (3) | 0.1342 ± 0.0103 (2) | 0.2496 ± 0.0411 (1) |
| AV | 0.0136 ± 0.0151 (4) | 0.0800 ± 0.0173 (2) | 0.0777 ± 0.0092 (3) | 0.1659 ± 0.0245 (1) |
| GS | 0.0107 ± 0.0099 (4) | 0.1130 ± 0.0242 (3) | 0.1775 ± 0.0325 (2) | 0.1910 ± 0.0304 (1) |
| CE | 0.1086 ± 0.0047 (4) | 0.1418 ± 0.0046 (3) | 0.1695 ± 0.0052 (2) | 0.1698 ± 0.0129 (1) |
| DM | 0.0781 ± 0.0076 (4) | 0.1076 ± 0.0062 (3) | 0.1458 ± 0.0064 (2) | 0.1698 ± 0.0096 (1) |
| SC | 0.0116 ± 0.0042 (4) | 0.0473 ± 0.0014 (2) | 0.0427 ± 0.0033 (3) | 0.0502 ± 0.0194 (1) |
| Average Rank | 4 | 2.7143 | 2.2857 | 1 |
↓ (↑) indicates that smaller (larger) values mean better performance.
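The parenthesised numbers in the table are per-genome ranks, and each "Average Rank" row is their mean over the seven genomes. The sketch below reproduces the Ranking Loss average ranks from the mean values in the table. The helper names are illustrative. Ties are not handled, which is sufficient here because no row contains tied values.

```python
from typing import Dict, List, Sequence

METHODS = ["MIMLNN", "MIMLSVM", "EnMIMLmetric", "MIMTL"]

# Mean Ranking Loss per genome, copied from the table (smaller is better).
RANKING_LOSS: Dict[str, List[float]] = {
    "HM": [0.3411, 0.3315, 0.3248, 0.2904],
    "PF": [0.3153, 0.3443, 0.3136, 0.2784],
    "AV": [0.3887, 0.3629, 0.3882, 0.3340],
    "GS": [0.4338, 0.3979, 0.4197, 0.3376],
    "CE": [0.3929, 0.2759, 0.3643, 0.3139],
    "DM": [0.3440, 0.2797, 0.3548, 0.3002],
    "SC": [0.4081, 0.2884, 0.4083, 0.4238],
}


def ranks(values: Sequence[float], smaller_is_better: bool = True) -> List[int]:
    """Rank 1 goes to the best value; ties are not handled."""
    order = sorted(
        range(len(values)), key=lambda i: values[i], reverse=not smaller_is_better
    )
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def average_ranks(
    table: Dict[str, Sequence[float]], smaller_is_better: bool = True
) -> List[float]:
    """Mean rank of each method over all genomes (rows) of the table."""
    per_genome = [ranks(row, smaller_is_better) for row in table.values()]
    return [sum(column) / len(per_genome) for column in zip(*per_genome)]


if __name__ == "__main__":
    for method, rank in zip(METHODS, average_ranks(RANKING_LOSS)):
        print(f"{method}: {rank:.4f}")
```

For measures marked ↑ (Average-Recall and Average-F1), the same helper applies with `smaller_is_better=False`.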
Comparison results using the evaluation protocol of EnMIMLNN (ref. 4), in which the source domain and target domain are drawn from the same distribution.
| Genome | MIMLNN | MIMLSVM | EnMIMLNNmetric | MIMTL |
|---|---|---|---|---|
| Ranking Loss ↓ | | | | |
| HM | 0.3146 ± 0.0218 (3) | 0.3461 ± 0.0132 (4) | 0.3096 ± 0.0236 (2) | 0.2666 ± 0.0177 (1) |
| PF | 0.3168 ± 0.0178 (2) | 0.3557 ± 0.0138 (4) | 0.3230 ± 0.0170 (3) | 0.2859 ± 0.0159 (1) |
| AV | 0.3721 ± 0.0159 (3) | 0.3804 ± 0.0189 (4) | 0.3707 ± 0.0127 (2) | 0.3212 ± 0.0155 (1) |
| GS | 0.3693 ± 0.0199 (2) | 0.3813 ± 0.0250 (3) | 0.3928 ± 0.0136 (4) | 0.3194 ± 0.0192 (1) |
| CE | 0.2307 ± 0.0033 (4) | 0.1931 ± 0.0098 (1) | 0.2097 ± 0.0061 (2) | 0.2157 ± 0.0065 (3) |
| DM | 0.2317 ± 0.0083 (4) | 0.1893 ± 0.0049 (1) | 0.2143 ± 0.0081 (3) | 0.2126 ± 0.0098 (2) |
| SC | 0.3090 ± 0.0066 (3) | 0.2496 ± 0.0057 (1) | 0.3352 ± 0.0073 (4) | 0.2872 ± 0.0171 (2) |
| Average Rank | 3 | 2.5714 | 2.8571 | 1.5714 |
| Coverage ↓ | | | | |
| HM | 102.5066 ± 5.1192 (3) | 106.1454 ± 4.2328 (4) | 99.5941 ± 4.6906 (2) | 84.3914 ± 5.1049 (1) |
| PF | 153.5061 ± 6.7742 (2) | 158.7094 ± 5.0619 (4) | 156.6540 ± 7.1480 (3) | 137.2249 ± 7.1226 (1) |
| AV | 168.5917 ± 6.1400 (4) | 157.8088 ± 5.8091 (2) | 157.9515 ± 5.5142 (3) | 137.6652 ± 4.8167 (1) |
| GS | 161.7774 ± 7.1352 (4) | 149.8095 ± 7.5255 (2) | 160.1811 ± 4.5556 (3) | 127.2642 ± 6.4359 (1) |
| CE | 317.1824 ± 4.2358 (4) | 265.7309 ± 12.6447 (1) | 287.8608 ± 6.8164 (3) | 277.9165 ± 7.5019 (2) |
| DM | 371.9936 ± 15.5735 (4) | 307.6074 ± 8.2128 (1) | 348.6276 ± 13.8078 (3) | 318.3221 ± 19.4774 (2) |
| SC | 726.7027 ± 13.0010 (3) | 564.7905 ± 7.7727 (1) | 754.1319 ± 11.8070 (4) | 625.5630 ± 35.9015 (2) |
| Average Rank | 3.4286 | 2.1429 | 3 | 1.4286 |
| Average-Recall ↑ | | | | |
| HM | 0.0633 ± 0.0081 (4) | 0.1678 ± 0.0094 (3) | 0.1803 ± 0.0174 (2) | 0.3934 ± 0.0156 (1) |
| PF | 0.0533 ± 0.0100 (4) | 0.1264 ± 0.0136 (3) | 0.1416 ± 0.0150 (2) | 0.3632 ± 0.0259 (1) |
| AV | 0.0546 ± 0.0116 (4) | 0.1150 ± 0.0088 (3) | 0.1279 ± 0.0134 (2) | 0.2739 ± 0.0240 (1) |
| GS | 0.0511 ± 0.0117 (4) | 0.1286 ± 0.0129 (2) | 0.1272 ± 0.0181 (3) | 0.3092 ± 0.0180 (1) |
| CE | 0.1671 ± 0.0079 (4) | 0.2184 ± 0.0076 (3) | 0.3170 ± 0.0132 (2) | 0.5681 ± 0.0327 (1) |
| DM | 0.1562 ± 0.0102 (4) | 0.1926 ± 0.0088 (3) | 0.2998 ± 0.0080 (2) | 0.5710 ± 0.0289 (1) |
| SC | 0.0350 ± 0.0028 (4) | 0.0613 ± 0.0045 (3) | 0.0739 ± 0.0055 (2) | 0.4590 ± 0.0515 (1) |
| Average Rank | 4 | 2.8571 | 2.1429 | 1 |
| Average-F1 ↑ | | | | |
| HM | 0.1073 ± 0.0114 (4) | 0.2160 ± 0.0111 (3) | 0.2465 ± 0.0166 (2) | 0.3269 ± 0.0192 (1) |
| PF | 0.0902 ± 0.0146 (4) | 0.1684 ± 0.0138 (3) | 0.1991 ± 0.0154 (2) | 0.2965 ± 0.0212 (1) |
| AV | 0.0898 ± 0.0151 (4) | 0.1514 ± 0.0100 (3) | 0.1770 ± 0.0159 (2) | 0.2373 ± 0.0174 (1) |
| GS | 0.0860 ± 0.0166 (4) | 0.1647 ± 0.0151 (3) | 0.1760 ± 0.0197 (2) | 0.2549 ± 0.0130 (1) |
| CE | 0.2306 ± 0.0083 (3) | 0.2842 ± 0.0085 (2) | 0.3808 ± 0.0113 (1) | 0.1710 ± 0.0278 (4) |
| DM | 0.2186 ± 0.0116 (3) | 0.2577 ± 0.0098 (2) | 0.3608 ± 0.0079 (1) | 0.2013 ± 0.0255 (4) |
| SC | 0.0587 ± 0.0040 (4) | 0.0927 ± 0.0055 (3) | 0.1065 ± 0.0061 (1) | 0.1049 ± 0.0069 (2) |
| Average Rank | 3.7143 | 2.7143 | 1.5714 | 2 |
↓ (↑) indicates that smaller (larger) values mean better performance.
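For readers unfamiliar with the first two measures, the sketch below computes Ranking Loss and Coverage for a single example, following the definitions commonly used in the multi-label literature. This section does not reproduce the paper's exact formulas, so treat the definitions as an assumption. The function names and toy scores are illustrative.

```python
from typing import Sequence, Set


def ranking_loss(scores: Sequence[float], relevant: Set[int]) -> float:
    """Fraction of (relevant, irrelevant) label pairs that are ordered wrongly,
    i.e. the relevant label does not score strictly higher."""
    rel = [i for i in range(len(scores)) if i in relevant]
    irr = [i for i in range(len(scores)) if i not in relevant]
    if not rel or not irr:
        return 0.0
    wrong = sum(1 for r in rel for k in irr if scores[r] <= scores[k])
    return wrong / (len(rel) * len(irr))


def coverage(scores: Sequence[float], relevant: Set[int]) -> int:
    """How far down the score-sorted label list one must go, minus one,
    to cover every relevant label (rank 1 is the highest score)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    rank = {label: position for position, label in enumerate(order, start=1)}
    return max(rank[r] for r in relevant) - 1


if __name__ == "__main__":
    scores = [0.9, 0.2, 0.6, 0.4]   # predicted score for each of four labels
    relevant = {0, 3}               # indices of the true labels
    print("ranking loss:", ranking_loss(scores, relevant))
    print("coverage:", coverage(scores, relevant))
```

A dataset-level value is the mean of these per-example quantities. Both measures reward placing true labels near the top of the ranking, which is why smaller values are better.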
Figure 3. Comparison results with MIMTL and MICS on seven real-world organisms.
Comparison results on the dataset where the source and target domains are drawn from different clusters.
| Genome | MIMLNN | MIMLSVM | EnMIMLmetric | MIMTL |
|---|---|---|---|---|
| Ranking Loss ↓ | | | | |
| HM | 0.3281 ± 0.0279 (2) | 0.3494 ± 0.0119 (4) | 0.3333 ± 0.0276 (3) | 0.3033 ± 0.0162 (1) |
| PF | 0.3279 ± 0.0103 (2) | 0.3524 ± 0.0039 (4) | 0.3316 ± 0.0195 (3) | 0.3035 ± 0.0231 (1) |
| AV | 0.3786 ± 0.0110 (3) | 0.3912 ± 0.0134 (4) | 0.3772 ± 0.0155 (2) | 0.3511 ± 0.0162 (1) |
| GS | 0.3628 ± 0.0190 (2) | 0.3722 ± 0.0098 (3) | 0.3833 ± 0.0078 (4) | 0.3353 ± 0.0226 (1) |
| CE | 0.2304 ± 0.0029 (4) | 0.1910 ± 0.0076 (1) | 0.2099 ± 0.0044 (2) | 0.2221 ± 0.0037 (3) |
| DM | 0.2344 ± 0.0012 (4) | 0.1892 ± 0.0012 (1) | 0.2144 ± 0.0016 (3) | 0.2057 ± 0.0082 (2) |
| SC | 0.3062 ± 0.0056 (3) | 0.2502 ± 0.0018 (1) | 0.3371 ± 0.0071 (4) | 0.2829 ± 0.0152 (2) |
| Average Rank | 2.8571 | 2.5714 | 3 | 1.5714 |
| Coverage ↓ | | | | |
| HM | 105.4079 ± 2.7543 (4) | 105.1294 ± 5.2652 (3) | 103.8311 ± 5.4997 (2) | 92.6952 ± 4.3571 (1) |
| PF | 156.9030 ± 4.5401 (2) | 157.1299 ± 3.8358 (3) | 160.0532 ± 9.8530 (4) | 135.1252 ± 10.3370 (1) |
| AV | 167.2059 ± 9.4550 (4) | 160.3971 ± 8.4329 (3) | 156.1176 ± 6.9683 (2) | 144.0719 ± 7.0694 (1) |
| GS | 161.1491 ± 4.3131 (4) | 147.8368 ± 5.6174 (2) | 158.7298 ± 8.2310 (3) | 134.2649 ± 10.7539 (1) |
| CE | 316.9668 ± 6.1003 (4) | 262.3989 ± 4.3975 (1) | 286.4719 ± 2.7980 (3) | 285.3007 ± 11.3299 (2) |
| DM | 378.4963 ± 3.4863 (4) | 311.0507 ± 7.4777 (1) | 350.9775 ± 7.2447 (3) | 312.6278 ± 12.7775 (2) |
| SC | 718.7297 ± 8.7041 (3) | 565.9772 ± 4.0773 (1) | 754.0876 ± 17.5188 (4) | 616.2332 ± 37.8974 (2) |
| Average Rank | 3.5714 | 2 | 3 | 1.4286 |
| Average-Recall ↑ | | | | |
| HM | 0.0664 ± 0.0080 (4) | 0.1713 ± 0.0102 (3) | 0.1751 ± 0.0070 (2) | 0.5435 ± 0.0456 (1) |
| PF | 0.0525 ± 0.0054 (4) | 0.1187 ± 0.0190 (3) | 0.1423 ± 0.0124 (2) | 0.6026 ± 0.0284 (1) |
| AV | 0.0533 ± 0.0056 (4) | 0.1088 ± 0.0072 (3) | 0.1237 ± 0.0057 (2) | 0.4727 ± 0.0176 (1) |
| GS | 0.0455 ± 0.0060 (4) | 0.1313 ± 0.0077 (2) | 0.1158 ± 0.0192 (3) | 0.4979 ± 0.0074 (1) |
| CE | 0.1647 ± 0.0050 (4) | 0.2205 ± 0.0094 (3) | 0.3090 ± 0.0092 (2) | 0.6066 ± 0.0186 (1) |
| DM | 0.1590 ± 0.0119 (4) | 0.1920 ± 0.0085 (3) | 0.3036 ± 0.0081 (2) | 0.5785 ± 0.0279 (1) |
| SC | 0.0319 ± 0.0020 (4) | 0.0610 ± 0.0038 (3) | 0.0769 ± 0.0043 (2) | 0.4752 ± 0.0483 (1) |
| Average Rank | 4 | 2.8571 | 2.1429 | 1 |
| Average-F1 ↑ | | | | |
| HM | 0.1121 ± 0.0116 (4) | 0.2209 ± 0.0125 (3) | 0.2387 ± 0.0060 (2) | 0.2814 ± 0.0528 (1) |
| PF | 0.0895 ± 0.0079 (4) | 0.1600 ± 0.0202 (2) | 0.1994 ± 0.0146 (1) | 0.1595 ± 0.0137 (3) |
| AV | 0.0877 ± 0.0074 (4) | 0.1414 ± 0.0069 (3) | 0.1699 ± 0.0037 (1) | 0.1599 ± 0.0111 (2) |
| GS | 0.0786 ± 0.0089 (4) | 0.1700 ± 0.0092 (2) | 0.1660 ± 0.0226 (3) | 0.1846 ± 0.0481 (1) |
| CE | 0.2280 ± 0.0046 (3) | 0.2871 ± 0.0102 (2) | 0.3739 ± 0.0057 (1) | 0.1584 ± 0.0150 (4) |
| DM | 0.2226 ± 0.0124 (3) | 0.2570 ± 0.0095 (2) | 0.3629 ± 0.0075 (1) | 0.2037 ± 0.0320 (4) |
| SC | 0.0543 ± 0.0031 (4) | 0.0922 ± 0.0047 (3) | 0.1097 ± 0.0054 (1) | 0.1001 ± 0.0142 (2) |
| Average Rank | 3.7143 | 2.4286 | 1.4286 | 2.4286 |
↓ (↑) indicates that smaller (larger) values mean better performance.
Figure 4. Average-rank diagrams (ref. 45) for the ranking-based measures: Ranking Loss (a), Coverage (b), Average-Recall (c), and Average-F1 (d). The data setting used in this figure follows the protocol of MICS (ref. 31).
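Average-rank diagrams of this kind are often read together with a Friedman test and a Nemenyi critical difference. This section does not state which test the figure uses, so the sketch below is an assumption about a common convention rather than a description of the authors' procedure. The critical values are the standard two-tailed Nemenyi values at α = 0.05, and the function names are illustrative.

```python
import math
from typing import Sequence

# Two-tailed Nemenyi critical values q_alpha at alpha = 0.05,
# indexed by the number of compared methods k.
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728}


def friedman_statistic(avg_ranks: Sequence[float], n_datasets: int) -> float:
    """Friedman chi-square statistic computed from average ranks."""
    k = len(avg_ranks)
    sum_sq = sum(r * r for r in avg_ranks)
    return 12.0 * n_datasets / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4.0)


def critical_difference(k: int, n_datasets: int, q_alpha: float) -> float:
    """Nemenyi critical difference between two average ranks."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n_datasets))


if __name__ == "__main__":
    # Ranking Loss average ranks under the MICS data setting, from the table:
    # MIMLNN, MIMLSVM, EnMIMLmetric, MIMTL.
    avg_ranks = [3.4286, 2.0, 2.8571, 1.7143]
    n_datasets = 7
    k = len(avg_ranks)
    print("Friedman statistic:", round(friedman_statistic(avg_ranks, n_datasets), 3))
    print("critical difference:", round(critical_difference(k, n_datasets, Q_ALPHA_005[k]), 3))
```

Under these assumptions, two methods whose average ranks differ by less than the critical difference would not be separated by the Nemenyi test. With only seven genomes that threshold is fairly wide.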