Xinyu Yang, Guoai Xu, Qi Li, Yanhui Guo, Miao Zhang.
Abstract
Authorship attribution is to identify the most likely author of a given sample among a set of candidate known authors. It can be not only applied to discover the original author of plain text, such as novels, blogs, emails, posts etc., but also used to identify source code programmers. Authorship attribution of source code is required in diverse applications, ranging from malicious code tracking to solving authorship dispute or software plagiarism detection. This paper aims to propose a new method to identify the programmer of Java source code samples with a higher accuracy. To this end, it first introduces back propagation (BP) neural network based on particle swarm optimization (PSO) into authorship attribution of source code. It begins by computing a set of defined feature metrics, including lexical and layout metrics, structure and syntax metrics, totally 19 dimensions. Then these metrics are input to neural network for supervised learning, the weights of which are output by PSO and BP hybrid algorithm. The effectiveness of the proposed method is evaluated on a collected dataset with 3,022 Java files belong to 40 authors. Experiment results show that the proposed method achieves 91.060% accuracy. And a comparison with previous work on authorship attribution of source code for Java language illustrates that this proposed method outperforms others overall, also with an acceptable overhead.Entities:
Year: 2017 PMID: 29095934 PMCID: PMC5667828 DOI: 10.1371/journal.pone.0187204
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
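The abstract describes a two-stage hybrid: PSO searches for good initial network weights, and gradient-based back propagation then refines them. A minimal sketch of that two-stage idea on a toy least-squares problem (the data, swarm size, learning rate, and variable names here are illustrative choices, not the paper's; only the inertia-weight schedule and acceleration constants follow the paper's parameter table):

```python
import random

random.seed(0)

# Toy objective: fit scalar w to minimize MSE of y = w * x (true w = 2).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def mse(w):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Stage 1: bare-bones PSO over the single weight w.
particles = [random.uniform(-10, 10) for _ in range(20)]
velocities = [0.0] * 20
pbest = particles[:]                     # each particle's personal best
gbest = min(particles, key=mse)          # swarm's global best
for it in range(50):
    w_inertia = 1.5 - (1.5 - 0.5) * it / 49   # linear decrease, 1.5 -> 0.5
    for i in range(20):
        r1, r2 = random.random(), random.random()
        velocities[i] = (w_inertia * velocities[i]
                         + 2.0 * r1 * (pbest[i] - particles[i])
                         + 2.0 * r2 * (gbest - particles[i]))
        velocities[i] = max(-1.0, min(1.0, velocities[i]))  # velocity clamp
        particles[i] += velocities[i]
        if mse(particles[i]) < mse(pbest[i]):
            pbest[i] = particles[i]
    gbest = min(pbest, key=mse)

# Stage 2: BP-style gradient descent, started from the PSO result.
w = gbest
for _ in range(100):
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= 0.01 * grad

print(round(w, 3))  # converges near the true weight 2.0
```

In the paper's setting each particle encodes a full weight vector of the network rather than a single scalar, but the update equations have the same shape.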
Fig 1. Framework overview.
Programming metrics extracted from Java source code files.
| Metric |
|---|
| Ratio of blank lines to code lines |
| Ratio of comment lines to code lines |
| Percentage of block comments to all comment lines |
| Percentage of open braces ({) alone in a line |
| Percentage of close braces (}) alone in a line |
| Percentage of variable names without uppercase letters |
| Percentage of variable names starting with lowercase letters |
| Average variable name length |
| Ratio of macro variables |
| Percentage of “for” statements to all loop statements |
| Preference for cyclic variables |
| Percentage of “if” statements to all conditional statements |
| Ratio of branch statements |
| Average number of methods per class |
| Ratio of “try” structure |
| Percentage of “catch” statements when dealing with exceptions |
| Average number of interfaces per class |
| Average character number per Java file |
| Maximum depth of an AST |
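Several of the lexical and layout metrics above can be computed by simple line counting over raw source text. The paper's exact counting rules are not reproduced in this record, so the extractor below is a hypothetical sketch of four of them on a small Java snippet:

```python
# Hypothetical extractor for a few of the layout metrics; the paper's
# precise counting rules (e.g. handling of block comments) are not
# given in this record, so only line comments are treated here.
JAVA_SRC = """\
public class Demo {
    // a line comment
    public static void main(String[] args) {
        int count = 0;

        System.out.println(count);
    }
}
"""

lines = JAVA_SRC.splitlines()
blank = sum(1 for l in lines if not l.strip())
comment = sum(1 for l in lines if l.strip().startswith("//"))
code = len(lines) - blank - comment
open_alone = sum(1 for l in lines if l.strip() == "{")
close_alone = sum(1 for l in lines if l.strip() == "}")

metrics = {
    "blank_to_code": blank / code,
    "comment_to_code": comment / code,
    "open_brace_alone_pct": 100 * open_alone / JAVA_SRC.count("{"),
    "close_brace_alone_pct": 100 * close_alone / JAVA_SRC.count("}"),
}
```

Style-sensitive ratios like these capture formatting habits (brace placement, commenting density) that tend to stay stable across one programmer's files.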
Fig 2. The flowchart of PSOBP.
Fig 3. The frequency distribution histogram of Java files.
Key parameters of the PSO algorithm.
| Name | Definition | Note | Value |
|---|---|---|---|
| Population size | | Usually 20~40 | 100 |
| Particle length | Determined by the optimization problem | Design formula as above | |
| Maximum velocity | Maximum velocity limit in each dimension | | 1 |
| Inertia weight | Linear decreasing weight | Generally from 1.5 to 0.5 | Eq ( |
| Acceleration constant | | Usually both 2.0 | |
| Random number | | Between 0 and 1 | Random number |
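The particle length is "determined by the optimization problem": when each particle encodes every weight and bias of a single-hidden-layer network, its dimension follows directly from the layer sizes. The record does not reproduce the paper's design formula or hidden-layer size, so the formula and the 30-unit hidden layer below are assumptions for illustration:

```python
def particle_length(n_in, n_hidden, n_out):
    """Dimension of one PSO particle when it encodes all weights and
    biases of a single-hidden-layer network (a common encoding; the
    paper's exact design formula is not reproduced in this record)."""
    return n_in * n_hidden + n_hidden * n_out + n_hidden + n_out

# 19 input metrics, 40 candidate authors, hypothetical 30 hidden units:
print(particle_length(19, 30, 40))  # 19*30 + 30*40 + 30 + 40 = 1840
```

Particle length grows quickly with network size, which is why PSO is usually paired with fairly compact networks in this kind of hybrid.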
The effect of different parameter configurations.
| Single variable | Classification accuracy |
|---|---|
| | 89.073% |
| | 88.571% |
| | 88.711% |
| | 87.215% |
Fig 4. The structure of the neural network.
Cross validation accuracy of the BP and PSOBP neural networks (in %).
| Counter | Mean value (BP) | Standard deviation (BP) | Mean value (PSOBP) | Standard deviation (PSOBP) |
|---|---|---|---|---|
| 1 | 75.913 | 2.477 | 91.218 | 4.493 |
| 2 | 76.246 | 3.402 | 91.342 | 4.060 |
| 3 | 75.944 | 2.940 | 90.567 | 6.067 |
| 4 | 75.969 | 4.156 | 91.001 | 4.394 |
| 5 | 76.050 | 3.197 | 91.008 | 4.682 |
| 6 | 75.945 | 3.027 | 91.093 | 6.046 |
| 7 | 76.439 | 4.606 | 91.106 | 5.018 |
| 8 | 76.507 | 2.476 | 91.080 | 4.444 |
| 9 | 75.785 | 2.056 | 91.013 | 5.331 |
| 10 | 76.132 | 3.420 | 91.172 | 4.152 |
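Each row above summarizes one run of 10-fold cross-validation by the mean and standard deviation of the per-fold accuracies. Those two statistics can be reproduced as follows; the fold-level accuracies here are made up for illustration, since the record only gives the aggregates:

```python
import statistics

# Hypothetical per-fold accuracies (%) from one 10-fold run; the
# actual fold-level numbers are not given in this record.
fold_acc = [91.2, 88.4, 95.0, 86.7, 93.1, 90.5, 96.2, 84.9, 92.8, 91.4]

mean = statistics.mean(fold_acc)
std = statistics.stdev(fold_acc)   # sample standard deviation (n - 1)
print(f"{mean:.3f} ± {std:.3f}")
```

Whether the table reports the sample or population standard deviation is not stated in this record; `statistics.pstdev` would give the population variant.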
Fig 5. The classification results of the PSOBP and BP neural networks in one 10-fold cross-validation experiment.
Comparison to other classifiers.
| Classifier | Accuracy | Running time (s) |
|---|---|---|
| Random Forest | 79.735% | 9.679 |
| Support Vector Machine | 73.642% | 201.220 |
| Naïve Bayes | 49.007% | 11.974 |
| BP | 75.107% | 48.200 |
a Including the time spent in the optimization procedure.
Comparison to previous work.
| Related work | # of Programmers | Results |
|---|---|---|
| Ding and Samadzadeh | 46 | 67.2% |
| Lange and Mancoridis | 20 | 75% |
| Shevertalov | 20 | 75% |
| Frantzeskou | 30 | 96.9% |