Code Golf Style
I have been meaning to post a few of the chapters of my book here. The Code Golf style is one of my favorites; it's fun, and a crowd pleaser! So I'll start with it. Here is the complete Code Golf chapter.
Before I do that, here is the post I wrote some time ago explaining what the book is about; you may want to read it first. TL;DR: 33 different styles of writing a term-frequency program: count the words in a text file, excluding stop words, and print the 25 most frequently occurring words in decreasing order. The emphasis of the book is on the constraints behind each style; the task itself is deliberately trivial. All the code is on GitHub. In my courses, my students go through these exercises one by one, rewriting them in their favorite programming language. It's… a challenge!
Code Golf
Constraints
- As few lines of code as possible
A Program in This Style
1 #!/usr/bin/env python
2 import re, sys, collections
3
4 stops = open('../stop_words.txt').read().split(',')
5 words = re.findall('[a-z]{2,}', open(sys.argv[1]).read().lower())
6 counts = collections.Counter(w for w in words if w not in stops)
7 for (w, c) in counts.most_common(25):
8     print w, '-', c
Commentary
THE MAIN CONCERN of this style is brevity. The goal is to implement the program's functionality in as few lines of code as possible. This is usually achieved by using advanced features of the programming language and its libraries. When brevity is the only goal, it is not unusual for this style to result in lines of code that are very long, with instruction sequences that are hard to understand. Often, too, textual brevity results in programs that perform poorly or that have bugs, some of which only manifest themselves when the code is used in larger or different contexts. When used appropriately, however, brevity may result in programs that are quite elegant and easy to read, simply because they are small.
The example program is possibly one of the shortest programs, in terms of lines of code, that can be written for implementing term frequency in Python. (This program is a slight improvement over the one contributed by Peter Norvig that is available in the GitHub repository for this book.) Line #4 loads the list of stop words in just one line. It does so by chaining several operations together: it opens the stop words file, reads the entire contents into memory, and then splits the text around commas, obtaining a list of stop words bound to the variable stops. Line #5 loads the list of words from the input file into memory in just one line. It does so by opening the file, reading its entire contents, and normalizing all characters to lower case; after that, it applies a regular expression to find all sequences of letters (a to z) of length two or more, so as to automatically eliminate single letters from the input file. The resulting words are placed in a list bound to the variable words. Line #6 uses Python's powerful collections library to obtain pairs of word-counts for all words that are not stop words. Finally, lines #7 and #8 print the 25 most frequent words and their counts. Line #7 uses, again, the powerful collections API, which provides a most_common method.
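To make the heavy lifting in lines #6 and #7 concrete, here is a minimal sketch of what collections.Counter and most_common do on a toy list of words (the words are made up for illustration):

import collections

words = ['live', 'live', 'mostly', 'live']
counts = collections.Counter(words)   # counts each distinct word in a single pass
print(counts.most_common(1))          # prints [('live', 3)]

Counter does in one constructor call what would otherwise take an explicit loop over a dictionary of counts.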
Brevity in terms of lines of code is related to the use of powerful abstractions that have already been created by someone else. Some languages' core libraries include a very large collection of utilities that come in handy for writing short programs; other languages' core libraries are small, and utility libraries are expected to be provided by third parties. Python's standard library is relatively large and varied. However, we could probably write an even shorter program by using a third-party library for natural language text processing (e.g., TextBlob). Indeed, if there is a utility out there for computing the term frequency of a file, we could simply call one function like this:
term_frequency(file, order='desc', limit=25)
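No such ready-made term_frequency function ships with TextBlob as far as I know, but a hedged sketch of the idea, built on TextBlob's word_counts property (an assumption worth checking against its documentation), could look like this:

import sys
from textblob import TextBlob

stops = set(open('../stop_words.txt').read().split(','))
counts = TextBlob(open(sys.argv[1]).read()).word_counts  # dict-like: lowercased word -> count
pairs = [(w, c) for (w, c) in counts.items() if w not in stops and len(w) >= 2]
for (w, c) in sorted(pairs, key=lambda p: p[1], reverse=True)[:25]:
    print('%s - %s' % (w, c))

The library only does the tokenizing and counting; the stop-word filtering and the minimum-length test are still ours to write.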
While core libraries are usually trusted, one needs to exercise some caution when it comes to third-party libraries. By using an external library, we add a dependency between our code and someone else's project. It is not unusual for library developers to stop maintaining their code at some point, leaving the users of those libraries in limbo, especially when the source code is not available. Another issue is the stability, or lack thereof, of third-party code, which may introduce failures into our own.
This Style in Systems Design
One of the most popular metrics used by the software industry is Source Lines of Code (SLOC). For better or for worse, SLOC is used pervasively as a proxy for estimating cost, developer productivity, maintainability and many other management concerns. Many other metrics have come and gone over the years, but SLOC has survived them all. The Constructive Cost Model (COCOMO) is an example of a software cost estimation model based on SLOC. Developed in the 1970s, and updated a couple of times since then, this model is still widely used today. (See, for example, Ohloh's cost estimates for each project.)
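To give a feel for how such a model ties cost to SLOC, here is a small sketch of the Basic COCOMO effort equation for an "organic" (small, in-house) project; the coefficients are the published Basic COCOMO values, the 10 KLOC input is just an illustrative figure, and this is a sketch of the model's flavor, not an endorsement of its accuracy:

def cocomo_organic_effort(kloc):
    # Basic COCOMO, organic mode: effort in person-months = 2.4 * KLOC^1.05
    return 2.4 * kloc ** 1.05

print('%.1f person-months' % cocomo_organic_effort(10))  # prints 26.9 person-months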
Clearly, on close inspection, SLOC is a poor estimate for some of those management concerns, especially programmer productivity, for which, unfortunately, it is still used as a proxy (yes, there are still companies that assert that more SLOC/day = more programmer productivity!). The example program shown above is an extreme case: no one would say that the person who wrote the program in the monolithic style is more productive than the person who wrote this beautiful small program. In general, correlations between SLOC and higher-level management concerns such as cost and productivity have never been established empirically; SLOC-based models are simply rough heuristics for making software project plans that may be useful at the beginning of a project.
On the popular culture front, brevity is seen as a sign of programming prowess, and the art of creating the shortest possible programs in various programming languages even has a name: Code Golf. However, shortening programs for the sake of brevity alone is usually not a good idea. Oftentimes, the result is a small program that is very hard to read and that may also have serious performance issues. Take, for example, the following term frequency program:
1 #!/usr/bin/env python
2 import re, string, sys
3
4 stops = set(open("../stop_words.txt").read().split(",") + list(string.ascii_lowercase))
5 words = [x.lower() for x in re.split("[^a-zA-Z]+", open(sys.argv[1]).read()) if len(x) > 0 and x.lower() not in stops]
6 unique_words = list(set(words))
7 unique_words.sort(lambda x, y: cmp(words.count(y), words.count(x)))
8 print "\n".join(["%s - %s" % (x, words.count(x)) for x in unique_words[:25]])
This program has exactly the same number of lines of code as the first one in this chapter. However, each of these lines is doing a lot more, and is expressed in a manner that is somewhat more difficult to understand. Let's look at line #5. This line is doing almost the same thing as line #5 of the first program: it is loading the words of the input file into memory while filtering out the stop words. This is done using a list comprehension over a regular expression split, after a couple of file operations and a couple of tests! Lines #6 and #7 are even harder to understand: their goal is to produce a sorted list of unique words and their counts. First, line #6 computes the unique words by using the set data structure, which removes duplicates; then line #7 sorts those unique words using an anonymous function (a lambda in Python) that compares the counts of words pairwise.
This second program, though correct, performs very poorly. The poor performance does not show on small text files, but it shows quite dramatically on texts the size of Pride and Prejudice. The reason is that the program counts the words over and over again whenever it needs those counts: words.count is called inside the sort comparator (line #7) and again when printing (line #8), and each call scans the entire word list.
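A back-of-the-envelope way to see this, not from the book, is to time repeated list.count calls against a single Counter pass over some synthetic words; the sizes here are made up for illustration:

import time, collections

words = ['word%d' % (i % 1000) for i in range(200000)]  # synthetic input: 1,000 unique words
unique_words = list(set(words))

start = time.time()
slow = dict((w, words.count(w)) for w in unique_words)  # re-scans all 200,000 words per unique word
print('repeated count: %.2fs' % (time.time() - start))

start = time.time()
fast = collections.Counter(words)                       # counts everything in a single pass
print('single pass:    %.2fs' % (time.time() - start))

On my reading of the two programs, this is precisely the difference between the first program's line #6 and the second program's lines #7 and #8.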
Even though brevity is a goal that many programmers strive for, and usually a good one, optimizing for LOC alone is misguided, and may create problems down the line that are very difficult to diagnose.
Historical Notes
Code golf first emerged within APL (A Programming Language), a language developed in the 1960s by Ken Iverson, which included a large collection of non-standard symbols used as mathematical notation for manipulating arrays. By the early 1970s, a popular game had emerged among APL programmers: coding useful functions in a single line (the famous APL one-liners). Those one-liners tended to be relatively incomprehensible.
Code golf can also be played with the fewest keystrokes, rather than the fewest lines of code, as the metric.
Further Reading
Boehm, B. (1981). Software Engineering Economics. Englewood Cliffs, NJ: Prentice-Hall.
Synopsis: The dark art of software cost estimation, featuring the COCOMO model.
Glossary
LOC: Lines of Code is one of the most widely used software metrics. Several variations of LOC exist. The most commonly used is Source LOC (SLOC), which counts only the lines with program instructions and ignores empty lines and lines with comments.
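For illustration only, here is a naive SLOC counter in the spirit of that definition, assuming '#'-style comments as in Python; real tools handle multi-line comments, strings, and many languages:

import sys

sloc = 0
for line in open(sys.argv[1]):
    stripped = line.strip()
    if stripped and not stripped.startswith('#'):  # skip empty and comment-only lines
        sloc += 1
print('%d SLOC' % sloc)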