University of Konstanz
Graduiertenkolleg / PhD Program
Computer and Information Science

Colloquium of the Department and the PhD Program


Code plagiarism vs. evolutionary process of program source


Prof. Cho Hwan-Gue, Pusan National University, South Korea
Pusan, South Korea

date & place

Wednesday, 13.12.2006, 16:15 h
Room C252


As intelligent software becomes more and more pervasive, illegal code theft, copying, and software plagiarism are widespread. It is, however, very hard to detect plagiarism manually by comparing all pairs of submitted programs.

In this talk, we propose a code clustering algorithm to be used as a plagiarism detection tool. First, we propose an asymmetric measure $pdist(a,b)$ that computes the "evolutionary (plagiarism) distance" from program $a$ to program $b$.
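
To illustrate what an asymmetric distance between two programs can look like, here is a minimal sketch. It is not the $pdist(a,b)$ of the talk, whose definition is not given in this abstract. The functions `shingles` and `toy_pdist`, the token regex, and the shingle length `k` are all hypothetical choices: the distance from $a$ to $b$ is simply taken to be the share of $a$'s token triples that do not occur in $b$.

```python
import re

def shingles(source, k=3):
    """Return the set of k-token shingles of a source string (hypothetical tokenizer)."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", source)
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def toy_pdist(a, b, k=3):
    """Toy asymmetric distance: fraction of a's shingles that are missing from b.

    This is an illustrative stand-in, not the measure proposed in the talk.
    """
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa)

# A short program and a version that extends it with extra statements.
original = "int s = 0; for (i = 0; i < n; i++) s += x[i];"
extended = original + ' printf("%d", s); return s;'

d_forward = toy_pdist(original, extended)   # everything in original reappears in extended
d_backward = toy_pdist(extended, original)  # extended contains code that original lacks
```

Because the extended program contains the original one, the two directions give different values, which is the property that lets such a measure suggest which program was derived from which.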

Then we construct the Plagiarism Direction Graph (PDG), using $pdist(a,b)$ as the edge weight function. Next we transform the PDG into a Gumbel Distance Graph (GDG) model, since we believe that $pdist(a,b)$ scores follow the well-known Gumbel distribution. Second, we introduce the notion of pseudo plagiarism, a kind of virtual plagiarism forced by very strong functional requirements. Separating pseudo plagiarism from real plagiarism is very important for evaluating assignment programs fairly and correctly. The problem of plagiarism detection can therefore be reduced to finding a highly condensed subgraph cluster in the GDG.
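
The pipeline described above can be sketched as follows, under several assumptions that are not part of the abstract: the Gumbel parameters are estimated by the method of moments for the maximum-type Gumbel distribution, an edge is kept when its distance falls in the lower tail below a hypothetical level `alpha`, and weakly connected components serve as a simple stand-in for the search for a highly condensed subgraph. All function names and the toy distances are illustrative.

```python
import math
from statistics import mean, stdev

EULER_GAMMA = 0.5772156649015329

def fit_gumbel(scores):
    """Method-of-moments estimate (mu, beta) of a maximum-type Gumbel distribution."""
    beta = stdev(scores) * math.sqrt(6) / math.pi
    mu = mean(scores) - EULER_GAMMA * beta
    return mu, beta

def gumbel_cdf(x, mu, beta):
    """Gumbel cumulative distribution function F(x) = exp(-exp(-(x - mu) / beta))."""
    return math.exp(-math.exp(-(x - mu) / beta))

def suspicious_clusters(pdist, alpha=0.05):
    """Group programs linked by unusually small directed distances.

    pdist maps an ordered pair (a, b) to a distance. Edges whose fitted
    lower-tail probability is below alpha are kept (an assumed rule), and
    the groups are the weakly connected components of the kept edges.
    """
    mu, beta = fit_gumbel(list(pdist.values()))
    kept = [pair for pair, d in pdist.items() if gumbel_cdf(d, mu, beta) < alpha]

    # Undirected adjacency of the kept edges.
    adjacency = {}
    for a, b in kept:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    # Depth-first search to collect connected components.
    seen, clusters = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        clusters.append(component)
    return clusters

# Toy data: six programs with large pairwise distances, except one close pair.
names = [f"p{i}" for i in range(6)]
pdist = {(a, b): 0.80 + 0.01 * ((7 * i + 3 * j) % 10)
         for i, a in enumerate(names) for j, b in enumerate(names) if i != j}
pdist[("p0", "p1")] = 0.05
pdist[("p1", "p0")] = 0.08

clusters = suspicious_clusters(pdist)
```

Scoring each edge against a fitted distribution, rather than against a fixed distance cutoff, lets the threshold adapt to each assignment: where strong functional requirements make all programs look alike, only pairs that are unusually close relative to the rest stand out.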

We conducted experiments with 18 groups of programs (more than 900 source codes) collected from the ICPC (International Collegiate Programming Contest) and KOI (Korean Olympiad for Informatics) programming contests. The experiments showed that most of the plagiarized codes were detected successfully, and that real plagiarism can be separated from pseudo plagiarism. Interestingly, $pdist()$ also enables us to reconstruct the phylogenetic tree of a program's improvement process. Finally, some open problems will be discussed.