File cloning in open source: the good, the bad and the ugly

How much copying is there in open source projects? According to our recent study, soon to be presented at ICSM 2011, more than 10% of files found in open source Java projects are clones of other files. That is a lot. But those clones are concentrated in only about 15% of projects, meaning that about 85% of projects don’t have clones. And, it turns out, some of the cloning out there is relatively harmless, but we found some uglies too. Here’s a summary of our analysis of what’s going on with these clones. For the complete details, read the paper.
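To make the two headline numbers concrete, here is a minimal sketch of how a share of cloned files and a share of projects with clones can be computed side by side. It assumes the simplest possible definition of a clone (a file whose exact content also appears in a different project); that definition, the `clone_stats` function, and the toy data are all illustrative assumptions, not the detection method used in the study, which is described in the paper.

```python
import hashlib
from collections import defaultdict


def digest(text):
    """Return a content hash used to compare files for exact equality."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def clone_stats(projects):
    """projects: dict mapping a project name to a list of file contents.

    Returns (share of files that are clones, share of projects with a clone),
    where a clone is a file whose content also appears in another project.
    """
    # Record, for every distinct content, which projects contain it.
    owners = defaultdict(set)
    for name, files in projects.items():
        for text in files:
            owners[digest(text)].add(name)

    total_files = 0
    cloned_files = 0
    projects_with_clones = set()
    for name, files in projects.items():
        for text in files:
            total_files += 1
            if len(owners[digest(text)]) > 1:
                cloned_files += 1
                projects_with_clones.add(name)

    return cloned_files / total_files, len(projects_with_clones) / len(projects)


# Hypothetical toy data: only projects "a" and "b" share a file.
toy = {
    "a": ["class X {}", "class Y {}"],
    "b": ["class X {}", "class Z {}"],
    "c": ["class Q {}"],
    "d": ["class R {}", "class S {}", "class T {}"],
}
file_share, project_share = clone_stats(toy)
print(file_share, project_share)
```

In this toy data a quarter of the files are clones while half of the projects contain one. The two ratios are measured over different populations, so they need not match.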

The clones

To understand the nature of these clones, we looked into a large number of projects that contained them and tried to identify the circumstances under which the files ended up being clones. Here’s a categorization of what we found:

Demos/tutorials: these are files that were specifically intended to be used as examples of functionality, usually for a specific library or framework. These files usually originate from example or demo programs, or are working code fragments included with tutorials. We found files for SWT, JBoss, Java Servlets, Swing, and a number of other well-known libraries.
We think this category is a perfectly legitimate case for file cloning. These are files designed to be copied and executed in new projects, and they do not make up any integral part of the system.
Small library / utility files: these are clones that appeared to come from small third-party libraries or self-contained utility files. Some examples include a single-file Java port of GNU Getopt, a tool for encoding PNG files, a Java connector for Spidermonkey, and a file for converting CVS date strings.
It is difficult to classify this category as strictly good or bad. On the one hand, libraries are being reused through copying, which eliminates the connection to their original source and carries with it a whole host of maintenance issues. On the other hand, these libraries are all rather small and their functionality not overly complicated. Furthermore, especially in the case of copied utility files, there may not have been any reasonable way to reuse the functionality without copying the files. And developers might be hesitant to introduce an external dependency for the use of a handful of files.
Library: this category differs from the previous one only in that the copied libraries are larger and more well known. Examples of this category are split between those projects that copied significant portions of common libraries, and those that copied only a handful of files. The copied libraries include Jython, Apache Beanutils, SWT, JUnit, and the Java Excel API.
The most interesting example we found in this category was this: the version of Apache Lenya in our dataset contained a complete copy of Apache Cocoon, a Spring-based framework that Lenya uses. In trying to discover why this copy existed, we looked at the most recent version of Lenya, which no longer contains a copy of Cocoon. Instead, the copy has been replaced by a script that checks out Cocoon and then automatically applies a number of patches. It appears the developers of Lenya needed to modify portions of Cocoon, and originally did this by copying the entire library. Only later did they settle upon the patching mechanism to achieve the same result.
Clones in this category are a bit uglier than those in the previous category. These libraries are larger and more complex, and so are more likely to contain bugs. As seen with Lenya, one possible motivation behind this copying is developers wishing to modify portions of the library. They might also wish to remove aspects that they don’t need. While understandable, we would hesitate to recommend such action except when absolutely necessary. Lenya’s current solution is clearly preferable.
Related Projects: this category includes those clones that occurred between two projects that are related in some way. This includes project forks, sub-projects, and renamed projects. Roughly 1/3 of the cases in this category were due to a developer beginning a new project by copying the entirety of an older project.
This type of cloning can clearly impact the maintainability of a system, but, if handled properly, forms a reasonable part of an open source project’s lifecycle.
Duplicated Project: This category is similar to the previous category, except instead of the projects simply being related, they are exact copies. The most common cause of this is that the project has been simultaneously placed in multiple open source repositories. Usually the actual version control is handled by one repository, while the other contains a package distribution. While projects in this category are not clones in the same sense as the other categories, their duplication can be a source of confusion to those looking to find a project’s real home.
Java Standard Library: this category was quite unexpected. We were surprised to find so many projects containing Java itself. On further investigation, we discovered a large number of applications designed to transform Java code that contained their own, often modified, versions. We found tools for converting Java to Javascript, multiple implementations of Java Virtual Machines, and a few Java compilers. Cloning of this type is necessary, but also very limited in the scope of projects that require it.
We also found a large number of projects including files from org.xml.sax and org.w3c.dom, despite their being included with the JDK. These libraries are something of a special case: they are on a more frequent release schedule than the JDK itself, and were not included in older versions.
Java Extensions: this category contains copies of files from common Java extensions, such as JAXP or JMX. While these extensions are now packaged with the JDK, this has not historically been the case. The sheer number of times that slightly different versions of these extensions appear as source in different projects suggests the importance of a better mechanism for handling extensions before their inclusion in the JDK.
Other: the common theme of this category is that these are extremely ugly examples of file cloning. There were two related projects that copied an Apache library, yet renamed every package to something slightly different. In an extreme case, a developer copied someone else’s Java application for animating images and uploaded it to a new repository under his name.
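The renamed-package case shows why comparing files byte for byte would miss some clones. Here is a rough sketch of one way to compare files while ignoring package renames: drop the `package` and `import` lines and collapse whitespace before hashing. The `fingerprint` function and the sample sources are hypothetical, and this is an illustration only, not the technique used in the study.

```python
import hashlib
import re

# Matches whole "package ...;" and "import ...;" lines in Java source.
PACKAGE_OR_IMPORT = re.compile(
    r"^\s*(package|import)\s+[^;]+;\s*$", re.MULTILINE
)


def fingerprint(java_source):
    """Hash a Java file while ignoring package/import lines and whitespace."""
    body = PACKAGE_OR_IMPORT.sub("", java_source)
    body = re.sub(r"\s+", " ", body).strip()
    return hashlib.sha1(body.encode("utf-8")).hexdigest()


# Hypothetical sources: same class, package renamed to something slightly different.
original = (
    "package org.example.lib;\n\n"
    "import java.util.List;\n\n"
    "public class Names { List<String> names; }\n"
)
renamed = (
    "package org.examplex.lib;\n\n"
    "import java.util.List;\n\n"
    "public class Names { List<String> names; }\n"
)

print(fingerprint(original) == fingerprint(renamed))
```

A sketch like this only catches renames that are confined to the `package` and `import` lines. Fully qualified names used inside the class body would still make the two files look different.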

The data

The data consisted of a universe of 13,241 Java projects collected from repositories such as Apache, Java.net, Google Code, and Sourceforge, totaling 3,237,910 files. The picture on the right shows the distribution of project size in this collection. Read it like this: there are about 5,000 projects with 1 to 50 files (left-most column), and so on, all the way up to about 90 projects with 2,001 to 17,893 files (right-most column). Indeed, there are a few massive projects out there!

I’d like to hear people’s personal experiences with file cloning in their projects. Do you copy-and-paste code a lot? At the file level, or in smaller or bigger chunks?

This study was led by Joel Ossher, with help from Hitesh Sajnani and with me supervising. The work was supported by the National Science Foundation under Grant No. CCF-1018374.
