Abstract
Transfer learning builds a learning task for a weakly labeled or unlabeled target domain by drawing on knowledge from a source domain, which can effectively improve performance on the target task. Growing awareness of privacy protection now restricts access to data sources and poses new challenges for transfer learning, yet research on privacy protection in transfer learning remains rare. Existing work relies mainly on differential privacy and ignores either the distribution differences between data sources or the conditional probability distribution of the data, which leads to negative transfer and degrades the algorithm's performance. This paper therefore proposes MultiSTLP, a multi-source selection transfer learning algorithm with privacy preservation, for scenarios in which the target domain contains unlabeled data sets with only a small amount of group probability information and multiple source domains contain large numbers of labeled data sets. Group probability means that the class label of each sample in the target data set is unknown, but the probability of each class within a given group of data is available. Multiple source domains means that there are more than two source domains, so the setting involves more than two source data sets and one target data set. The algorithm adapts to differences in both the marginal and the conditional probability distributions between domains, and it protects the privacy of the target data while improving classification accuracy by fusing multi-source transfer learning and group probability into a support vector machine. It also selects representative data sets from the source domains, which speeds up training and improves efficiency. Experimental results on several real data sets show the effectiveness of MultiSTLP, which also has some advantages over state-of-the-art transfer learning algorithms.
Keywords: Group probabilities; Multi-source transfer learning; Privacy-preserving
Year: 2022 PMID: 35573261 PMCID: PMC9077647 DOI: 10.1007/s11063-022-10841-6
Source DB: PubMed Journal: Neural Process Lett ISSN: 1370-4621 Impact factor: 2.565
Fig. 1 Learning from group probability
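To make the notion of group probability concrete, the following minimal Python sketch computes per-group class proportions from labels that stay on the data owner's side. The function name `group_probabilities`, the toy labels, and the group assignment are illustrative assumptions and are not taken from the MultiSTLP paper; the point is only that the learner would see the proportions, not the individual labels.

```python
from collections import Counter


def group_probabilities(labels, group_ids):
    """Return {group: {class: fraction}} for the given samples.

    Hypothetical helper: individual labels are used only here, and only
    the per-group class fractions would be released to the learner.
    """
    counts = {}
    for label, group in zip(labels, group_ids):
        counts.setdefault(group, Counter())[label] += 1
    return {
        group: {cls: n / sum(counter.values()) for cls, n in counter.items()}
        for group, counter in counts.items()
    }


# Toy data (assumed): eight samples with binary labels, split into two groups.
labels = [1, 1, -1, 1, -1, -1, -1, 1]
group_ids = [0, 0, 0, 0, 1, 1, 1, 1]

probs = group_probabilities(labels, group_ids)
for group, dist in sorted(probs.items()):
    print(group, dist)
```

In this toy case, group 0 is 75% positive and group 1 is 25% positive, and those proportions are all that a learner working from group probabilities would receive.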
Notations and descriptions
| Notation | Description |
|---|---|
| Source/target domain | |
| S | |
| Source/target sample set | |
| Source/target class label set | |
| Number of labeled | |
| Number of source domain | |
| Number of target domain samples | |
| Parameters of source linear classifier | |
| Parameters of target linear classifier | |
| Group probability | |
| k-th Groups | |
| Number of group | |
| Weight of source domains | |
| Weight of representative data set in | |
| Weight of samples | |
Fig. 2 Framework of MultiSTLP
Fig. 3 Flowchart of MultiSTLP
MultiSTLP algorithm training
| Steps of MultiSTLP Algorithm | |
|---|---|
| Input: Labeled source domains | |
| the number of sample in | |
| An unlabeled training dataset | |
| the group probability is | |
The statistics of 20-Newsgroups
| Domains | Comments | Training | Testing | Positive (%) | Feature |
|---|---|---|---|---|---|
| Books (B) | 6465 | 2000 | 4465 | 50 | 30,000 |
| DVDs (D) | 5586 | 2000 | 3586 | 50 | 30,000 |
| Electronics (E) | 7681 | 2000 | 5681 | 50 | 30,000 |
| Kitchen (K) | 7945 | 2000 | 5945 | 50 | 30,000 |
The statistics of sentiment analysis dataset
| Domains | Comments | Training | Testing | Positive (%) | Feature |
|---|---|---|---|---|---|
| Books (B) | 6465 | 2000 | 4465 | 50 | 30,000 |
| DVDs (D) | 5586 | 2000 | 3586 | 50 | 30,000 |
| Electronics (E) | 7681 | 2000 | 5681 | 50 | 30,000 |
| Kitchen (K) | 7945 | 2000 | 5945 | 50 | 30,000 |
The statistics of spam dataset
| Domain | Number | Positive | Negative | Feature |
|---|---|---|---|---|
| U1 | 4000 | 2000 | 2000 | 206,908 |
| U2 | 2500 | 1250 | 1250 | 206,908 |
| U3 | 2500 | 1250 | 1250 | 206,908 |
| U4 | 2500 | 1250 | 1250 | 206,908 |
Description of source and target domains on TRECVID dataset
| Domain | Source domains | | | | | Target domain |
|---|---|---|---|---|---|---|
| Channel | CNN_ENG | MSNBC_ENG | NBC_ENG | CCTV_CHN | NTDTV_CHN | LBC_ARB |
| # Keyframes | 11,025 | 8905 | 9322 | 10,896 | 6481 | 15,272 |
Description of source and target domains on 20-Newsgroups dataset
| Domain | Source domains | Target domain |
|---|---|---|
| rec vs sci (r vs s) | rec.autos & sci.crypt | rec.sport.hockey & sci.space |
| | rec.motorcycles & sci.electronics | |
| | rec.sport.baseball & sci.med | |
| com vs sci (c vs s) | comp.graphics & rec.autos | comp.sys.mac.hardware & rec.sport.hockey |
| | comp.os.ms-windows.misc & rec.motorcycles | |
| | comp.sys.ibm.pc.hardware & rec.sport.baseball | |
| sci vs com (s vs c) | sci.crypt & comp.graphics | sci.space & comp.sys.mac.hardware |
| | sci.electronics & comp.os.ms-windows.misc | |
| | sci.med & comp.sys.ibm.pc.hardware | |
Description of source and target domains on sentiment analysis dataset
| Domain | Source domains | | | Target domain |
|---|---|---|---|---|
| Sentiment dataset | Books (B) | DVDs (D) | Electronics (E) | Kitchen (K) |
| # Sentiment | 6465 | 5586 | 7681 | 7945 |
Description of source and target domains on email spam dataset
| Domain | Source domains | | | Target domain |
|---|---|---|---|---|
| Emails dataset | U1 | U2 | U3 | U4 |
| # Emails | 2500 | 2500 | 2500 | 2500 |
Comparison of average classification accuracy (%) with standard deviation on four real-world transfer datasets
| Datasets | Task | SVM | IC-SVM | TrGNB | ARTL | STL-SVM | TSVM-GP | MultiSTLP |
|---|---|---|---|---|---|---|---|---|
| 20-Newsgroups | r vs s | 80.14(2.81) | 81.66(2.82) | 91.45(1.32) | 86.35(1.55) | 88.87(1.47) | 90.67(1.56) | |
| | c vs s | 74.24(1.98) | 79.24(2.01) | 92.28(1.11) | 82.65(1.32) | 81.86(1.36) | 88.82(1.45) | |
| | s vs c | 75.35(2.48) | 77.73(2.25) | 90.11(1.56) | 85.38(1.48) | 86.15(1.55) | 87.97(1.58) | |
| | Average | 76.58(2.42) | 79.54(2.36) | 91.48(1.33) | 84.79(1.45) | 85.63(1.46) | 89.15(1.53) | |
| TRECVID 2005 | CN vs L | 79.16(2.87) | 80.29(2.65) | 91.53(1.98) | 86.91(2.17) | 87.32(2.51) | 93.15(1.69) | |
| | MS vs L | 77.87(2.13) | 78.26(2.27) | 89.35(1.89) | 83.88(2.12) | 84.62(2.42) | 90.55(1.76) | |
| | NB vs L | 73.97(2.49) | 74.65(2.59) | 86.53(1.85) | 80.27(1.96) | 82.74(2.01) | 88.71(1.81) | |
| | CC vs L | 74.53(2.01) | 75.72(1.96) | 87.62(1.76) | 81.32(2.78) | 83.79(1.81) | 90.95(1.53) | |
| | NT vs L | 69.69(1.93) | 70.66(1.87) | 84.81(1.65) | 78.58(1.76) | 79.85(1.73) | 86.98(1.58) | |
| | Average | 75.04(2.29) | 75.91(2.27) | 87.97(1.83) | 82.19(2.16) | 83.66(2.10) | 90.27(1.68) | |
| Sentiment analysis | B vs K | 78.46(1.77) | 79.37(1.83) | 90.21(1.87) | 88.75(2.01) | 91.85(1.93) | 89.87(1.71) | |
| | D vs K | 75.18(1.69) | 77.25(1.72) | 86.37(1.75) | 84.66(2.46) | 87.95(1.98) | 84.54(1.68) | |
| | E vs K | 77.59(2.11) | 78.27(2.05) | 84.56(1.98) | 80.93(2.15) | 85.11(2.26) | 81.75(1.85) | |
| | Average | 77.08(1.86) | 72.30(1.87) | 87.05(1.87) | 84.78(2.21) | 88.30(2.06) | 85.39(1.75) | |
Bold represents the results of algorithm proposed in this paper
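Each cell in the table above has the form mean(standard deviation). As a rough illustration of how such a cell could be produced from repeated runs, here is a minimal Python sketch. The run values are hypothetical and do not come from the paper, and the use of the sample standard deviation is an assumption, since the source does not state which estimator was used.

```python
import statistics


def summarize(accuracies):
    """Format repeated-run accuracies (in percent) as 'mean(std)'.

    Assumption: sample standard deviation (n - 1 denominator); the
    source does not say whether sample or population std was reported.
    """
    mean = statistics.mean(accuracies)
    std = statistics.stdev(accuracies)
    return f"{mean:.2f}({std:.2f})"


# Hypothetical accuracies from five repeated runs, not taken from the paper.
runs = [80.0, 82.0, 78.0, 81.0, 79.0]
cell = summarize(runs)
print(cell)
```

Switching `statistics.stdev` to `statistics.pstdev` would give the population estimate instead, which is slightly smaller for a small number of runs.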
Comparison of average recall (%) with standard deviation on four real-world transfer datasets
| Datasets | Task | SVM | IC-SVM | TrGNB | ARTL | STL-SVM | TSVM-GP | MultiSTLP |
|---|---|---|---|---|---|---|---|---|
| 20-Newsgroups | r vs s | 70.87(2.73) | 71.22(2.61) | 75.12(2.42) | 74.24(2.54) | 71.66(3.43) | 76.01(2.31) | |
| | c vs s | 62.52(3.32) | 63.68(3.26) | 70.43(2.65) | 69.12(2.87) | 68.12(3.54) | 71.11(2.45) | |
| | s vs c | 74.24(4.05) | 75.78(3.97) | 72.81(2.73) | 71.67(2.95) | 72.33(3.65) | 73.64(2.65) | |
| | Average | 69.21(3.37) | 70.23(3.28) | 72.79(2.60) | 71.68(2.79) | 70.70(3.54) | 73.57(2.47) | |
| TRECVID 2005 | CN vs L | 60.34(3.65) | 61.32(3.32) | 71.33(2.85) | 66.15(3.25) | 70.26(3.17) | 71.45(2.73) | |
| | MS vs L | 70.27(2.64) | 71.12(2.34) | 75.45(2.42) | 74.23(2.53) | 73.52(2.92) | 75.46(2.42) | |
| | NB vs L | 75.68(3.86) | 75.97(3.75) | 73.57(2.53) | 70.57(2.76) | 72.62(2.58) | 73.73(2.51) | |
| | CC vs L | 73.53(3.19) | 74.15(3.04) | 78.23(2.35) | 74.15(3.05) | 75.36(2.44) | 77.28(2.47) | |
| | NT vs L | 78.65(2.48) | 79.26(2.35) | 81.89(2.12) | 79.85(2.24) | 80.14(2.23) | 82.57(2.11) | |
| | Average | 71.69(3.16) | 72.36(2.96) | 76.09(2.45) | 72.99(2.77) | 74.38(3.25) | 76.10(2.45) | |
| Sentiment analysis | B vs K | 62.45(3.45) | 63.98(3.23) | 73.23(2.43) | 71.65(2.66) | 72.49(2.79) | 72.98(2.47) | |
| | D vs K | 61.98(3.11) | 62.76(3.08) | 70.36(2.03) | 70.25(2.19) | 68.17(3.11) | 70.01(2.12) | |
| | E vs K | 50.76(3.85) | 51.52(3.63) | 68.65(2.42) | 67.37(2.76) | 64.56(2.92) | 67.95(2.53) | |
| | Average | 58.40(3.47) | 59.42(3.31) | 70.75(2.29) | 69.76(2.54) | 68.41(2.94) | 70.31(2.37) | |
Bold represents the results of algorithm proposed in this paper
Comparison of average precision (%) with standard deviation on four real-world transfer datasets
| Datasets | Task | SVM | IC-SVM | TrGNB | ARTL | STL-SVM | TSVM-GP | MultiSTLP |
|---|---|---|---|---|---|---|---|---|
| 20-Newsgroups | r vs s | 76.87(2.71) | 77.18(2.59) | 82.66(2.12) | 80.12(2.33) | 81.61(2.43) | 81.96(2.11) | |
| | c vs s | 69.25(2.26) | 70.12(2.31) | 83.45(2.25) | 78.84(1.78) | 77.66(2.85) | 82.84(2.65) | |
| | s vs c | 72.64(3.23) | 73.76(3.35) | 82.46(2.32) | 77.75(2.05) | 80.51(2.74) | 80.14(2.13) | |
| | Average | 72.92(2.40) | 73.69(2.75) | 82.86(2.23) | 78.90(2.05) | 79.93(2.67) | 81.65(2.30) | |
| TRECVID 2005 | CN vs L | 65.32(3.12) | 66.21(2.92) | 77.75(1.45) | 73.65(2.02) | 75.84(1.96) | 76.23(1.53) | |
| | MS vs L | 74.64(2.55) | 75.36(2.53) | 83.42(1.31) | 80.82(2.11) | 83.41(2.13) | 82.56(1.36) | |
| | NB vs L | 71.25(2.68) | 72.52(2.43) | 79.65(1.26) | 77.57(1.99) | 78.87(1.97) | 80.67(1.37) | |
| | CC vs L | 78.38(2.56) | 79.73(2.45) | 86.43(1.42) | 83.76(2.15) | 83.86(1.92) | 85.32(1.55) | |
| | NT vs L | 84.26(2.44) | 85.32(2.36) | 89.65(1.15) | 85.95(1.93) | 85.49(1.81) | 86.46(1.25) | |
| | Average | 74.77(2.67) | 75.83(2.54) | 83.38(1.32) | 80.35(2.04) | 81.49(1.96) | 82.25(1.41) | |
| Sentiment analysis | B vs K | 65.65(2.21) | 66.36(2.32) | 76.56(1.32) | 73.75(1.68) | 74.29(2.11) | 75.35(1.38) | |
| | D vs K | 64.52(2.38) | 65.55(2.23) | 72.45(1.28) | 70.72(2.11) | 69.25(1.87) | 71.84(1.42) | |
| | E vs K | 54.88(2.55) | 55.48(2.45) | 68.72(1.24) | 69.18(1.87) | 67.68(2.32) | 67.27(1.51) | |
| | Average | 61.68(2.38) | 62.46(2.33) | 72.58(1.28) | 71.22(1.89) | 70.41(2.10) | 71.49(1.47) | |
Bold represents the results of algorithm proposed in this paper
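The recall and precision tables report standard binary classification metrics. The source does not spell out the definitions, so the following minimal Python sketch assumes the usual ones: precision = TP / (TP + FP) and recall = TP / (TP + FN), with +1 as the positive class. The helper name `precision_recall` and the toy labels are illustrative only.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute (precision, recall) for one positive class.

    Assumes the standard definitions; returns 0.0 for a metric whose
    denominator is zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels (assumed): 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, -1, -1, -1]
y_pred = [1, 1, -1, 1, -1, -1]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Both metrics equal 2/3 on this toy input because the single false positive and the single false negative balance each other.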
Comparison of average training time (s) with standard deviation on four real-world transfer datasets
| Datasets | Task | SVM | IC-SVM | TrGNB | ARTL | STL-SVM | TSVM-GP | MultiSTLP |
|---|---|---|---|---|---|---|---|---|
| 20-Newsgroups | r vs s | 1.22(0.14) | 0.04(0.07) | 3.29(1.13) | 8.75(1.32) | 9.25(1.28) | 3.26(1.14) | |
| | c vs s | 1.18(0.15) | 0.06(0.09) | 3.16(1.14) | 8.65(1.33) | 9.11(1.22) | 3.02(1.03) | |
| | s vs c | 1.46(0.15) | 0.08(0.11) | 3.38(1.12) | 8.84(1.34) | 9.36(1.25) | 3.47(1.13) | |
| TRECVID 2005 | CN vs L | 24.67(1.22) | 1.16(0.82) | 86.57(1.35) | 97.46(1.98) | 100.57(2.11) | 86.58(1.43) | |
| | MS vs L | 20.86(1.19) | 1.12(0.78) | 82.84(1.33) | 93.25(1.89) | 94.23(1.96) | 80.47(1.39) | |
| | NB vs L | 21.35(1.21) | 1.13(0.81) | 84.45(1.33) | 94.33(1.92) | 96.54(1.98) | 82.36(1.41) | |
| | CC vs L | 23.47(1.25) | 1.14(0.81) | 85.86(1.34) | 95.45(1.97) | 98.76(2.08) | 84.54(1.42) | |
| | NT vs L | 19.43(1.17) | 1.09(0.77) | 81.53(1.32) | 91.86(1.88) | 92.52(1.94) | 79.55(1.38) | |
| Sentiment analysis | B vs K | 13.34(0.88) | 0.43(0.12) | 18.57(1.15) | 21.37(1.25) | 23.46(1.34) | 18.44(1.18) | |
| | D vs K | 13.16(0.92) | 0.41(0.13) | 18.36(1.13) | 20.87(1.28) | 21.75(1.32) | 18.27(1.16) | |
| | E vs K | 13.53(1.02) | 0.45(0.15) | 18.63(1.16) | 21.65(1.31) | 23.66(1.33) | 18.68(1.15) | |
| Email spam | U1 vs U4 | 1.73(0.12) | 0.06(0.02) | 4.25(1.23) | 6.25(1.67) | 7.87(1.86) | 3.86(1.12) | |
| | U2 vs U4 | 1.75(0.13) | 0.05(0.02) | 4.37(1.25) | 6.56(1.75) | 8.11(2.15) | 3.87(1.15) | |
| | U3 vs U4 | 1.72(0.11) | 0.06(0.03) | 4.36(1.24) | 6.47(1.73) | 7.94(2.01) | 3.65(1.03) | |
Bold represents the results of algorithm proposed in this paper
Fig. 4 Sensitivity of parameter for MultiSTLP
Fig. 5 Sensitivity of parameter for MultiSTLP
Fig. 6 Sensitivity of parameter for MultiSTLP