Distributed Systems Testing: The Lost World

After failing to find good papers about distributed systems testing for many months, yesterday I asked a question on Twitter:

This got many retweets and some interesting replies, so I’m going to summarize them here. Then I’ll explain a bit more about why I’m interested in this topic.

Interesting Reading Materials

This is a slide deck giving a great overview of how to test distributed systems, specifically micro-service applications. It presents all levels of testing, from unit to end-to-end testing. Needless to say, my original tweet was about end-to-end-ish techniques. As this slide deck says, “Due to the difficulty inherent in writing tests in this style, some teams opt to avoid end-to-end testing completely, in favour of thorough production monitoring and testing directly against the production environment.” OK, things haven’t gotten any better since 1980. But this pointed to some interesting projects I didn’t know about related to specifying test cases, namely Concordion and Gauge. Neat!

Next is a link to an article that seems quite interesting but that is completely locked up, even for me, and I’m inside a university network. See if you have better luck!

In googling for a copy of this article, which I didn’t find anywhere, I came across this other relevant article also published in the ACM Queue. TL;DR: end-to-end is hard, the article gives some tips. I also came across this other slide deck by Ines Sombra. Similar message: it’s hard, not much out there.

There are a few frameworks based on trace logs, a relatively old technique that glorifies printf as First-Class Citizen Of The Testing Guild:

Links: Dapper and Zipkin.

I found a great paper about the tracing-based technique of testing distributed systems. (Thank god for Academics and our obsession with writing stuff down!)
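To make the tracing idea concrete, here is a minimal sketch of what a trace-based check might look like. It does not use the Dapper or Zipkin APIs; the record layout, the service names and the `failing_traces` helper are all made up for illustration. The assumption is only that every log record carries a trace id, so records from different machines can be stitched back into one request path and compared against the expected one.

```python
from collections import defaultdict

# Hypothetical trace records: (trace_id, service, event, timestamp_ms).
# In a real system these would be collected from the logs of many machines.
TRACE_LOG = [
    ("t1", "frontend", "request_received", 0),
    ("t1", "auth", "token_checked", 4),
    ("t1", "storage", "row_written", 9),
    ("t1", "frontend", "response_sent", 12),
    ("t2", "frontend", "request_received", 1),
    ("t2", "auth", "token_checked", 6),
    ("t2", "frontend", "response_sent", 8),  # this request never reached storage
]

# The path a healthy request is assumed to take (illustrative only).
EXPECTED_PATH = ["frontend", "auth", "storage", "frontend"]


def group_by_trace(records):
    """Group records by trace id and order each group by timestamp."""
    traces = defaultdict(list)
    for trace_id, service, event, ts in records:
        traces[trace_id].append((ts, service, event))
    return {tid: sorted(events) for tid, events in traces.items()}


def failing_traces(records, expected_path):
    """Return the sorted ids of traces whose service path differs from expected_path."""
    bad = []
    for trace_id, events in group_by_trace(records).items():
        path = [service for _, service, _ in events]
        if path != expected_path:
            bad.append(trace_id)
    return sorted(bad)


print(failing_traces(TRACE_LOG, EXPECTED_PATH))  # prints ['t2']
```

The test oracle here is the shape of the trace rather than a return value, which is what makes this style usable on a system you can only observe through its logs.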

Further in the academic world,

Shadow seems to be a network simulator that takes plugins connecting to assorted distributed applications like Tor and Bitcoin. It seems it runs the actual code of those applications as black boxes, which is pretty neat. Here’s the paper about it. It’s black-box, end-to-end testing of the behavior of the application nodes in the presence of assorted network failures.

Also from the academic world,

The GitHub repo has a link to the SIGMOD paper about Molly. Molly implements “lineage-driven fault injection” which “uses data lineage to reason backwards (from effects to causes) about whether a given correct outcome could have failed to occur due to some combination of faults.” So, it’s the well-known fault injection testing technique adapted with specific ideas coming from the database world related to data lineage. This sounds a bit too database-y, and not generally applicable. In fact,

But definitely worth a read. And even more interesting:

Follow the link to the Netflix blog post, it’s pretty cool.
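For readers who haven’t seen fault injection before, here is a minimal sketch of the plain, brute-force version of the technique. This is not Molly and it does no lineage reasoning: it simply tries every combination of dropped messages up to a bound and reports the ones that break a property. The toy protocol (a sender that tries each of two replicas twice), the names `run_write` and `find_violations`, and the property itself are assumptions made up for illustration.

```python
from itertools import combinations

REPLICAS = ["r1", "r2"]
ATTEMPTS = 2  # the sender tries each replica this many times


def run_write(dropped):
    """Simulate one write; `dropped` is a set of (attempt, replica) messages to lose.

    Returns the set of replicas that ended up storing the value.
    """
    stored = set()
    for attempt in range(ATTEMPTS):
        for replica in REPLICAS:
            if (attempt, replica) not in dropped:
                stored.add(replica)
    return stored


def find_violations(max_faults):
    """Try every combination of at most `max_faults` dropped messages.

    Returns the fault sets under which no replica stored the value.
    """
    all_messages = [(a, r) for a in range(ATTEMPTS) for r in REPLICAS]
    violations = []
    for k in range(max_faults + 1):
        for faults in combinations(all_messages, k):
            if not run_write(set(faults)):
                violations.append(set(faults))
    return violations


print(find_violations(3))       # prints []
print(len(find_violations(4)))  # prints 1
```

Even in this toy the number of fault combinations grows quickly with the number of messages, which is the cost that reasoning backwards from a correct outcome is meant to cut down.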

From the upper levels of the Ivory Tower, but with a gentle introduction by Adrian Colyer,

I must say that, unlike Adrian, I’m as skeptical as ever about upfront verification of complex programs being the silver bullet for bug-free software, much less for complex distributed systems. It’s interesting work in verification, but it is definitely not testing, which is what I’m looking at.

And that’s about it for reading materials. Not a lot, unfortunately. I was hoping there would be some papers from the testing conferences, but they seem to be completely radio silent on distributed systems testing. (Please prove me wrong!)

Interesting Frameworks-Without-Papers

Jepsen takes the lead:

I had heard about Jepsen before, and even saw @Aphyr’s talk at StrangeLoop 2013. I had forgotten all about it, so yes, super neat! Like Shadow, mentioned above, Jepsen is black-box end-to-end testing. I don’t know how flexible it is, as I couldn’t find a white paper about it, and all of Aphyr’s [great] talks are about him finding all sorts of bugs in all sorts of popular databases without explaining his tool very well. It may be similar to Shadow, but I can’t really tell. It needs more digging to see if it can be used by non-Aphyr mortals, to test concrete usage scenarios of non-database-y applications, and under operational goals other than the effect of network failures.

From the Erlang world,

Here is a link to eqc_temporal. I guess we have to wait for Tom to dig in and tell us how to do testing of Erlang systems with it.


Also, a really neat idea — start with a simulation:

This may be feasible and desirable for big infrastructure-y systems (I’m a big fan of simulations, and I do them too, related to my original question), but it may be overkill, or even infeasible, for many distributed applications.
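As a rough illustration of why starting with a simulation is attractive, here is a minimal sketch of a simulated network whose misbehavior is fully replayable. Everything in it is made up for illustration (the `SimulatedNetwork` class, the drop rate, the `run_scenario` helper); the only point is that when all randomness flows from one seed, a failing run can be reproduced exactly.

```python
import random


class SimulatedNetwork:
    """A toy in-memory network that drops and reorders messages.

    All randomness comes from one seeded generator, so a run can be replayed.
    """

    def __init__(self, seed, drop_rate=0.2):
        self.rng = random.Random(seed)
        self.drop_rate = drop_rate
        self.in_flight = []

    def send(self, message):
        """Queue a message unless the simulated network decides to drop it."""
        if self.rng.random() >= self.drop_rate:
            self.in_flight.append(message)

    def deliver_all(self):
        """Deliver every in-flight message in a shuffled order."""
        self.rng.shuffle(self.in_flight)
        delivered, self.in_flight = self.in_flight, []
        return delivered


def run_scenario(seed):
    """Send ten numbered messages and return the order in which they arrive."""
    net = SimulatedNetwork(seed)
    for i in range(10):
        net.send(i)
    return net.deliver_all()


# The same seed always yields the same drops and the same ordering.
print(run_scenario(42) == run_scenario(42))  # prints True
```

The catch is that the application has to talk to the network through an interface that can be swapped for the simulated one, which is easy to arrange at the start of a project and much harder to retrofit.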

And… that’s all that came through my Twitter feed.


A few months ago, based on our own experiences with developing and testing OpenSimulator, my students and I dug deep into this topic in the research literature, and came out pretty much empty-handed. We found a few great quotes, one of which is the title of a paper written in 1980, “the lost world of software debugging and testing”. Yes, 1980, 36 years ago! Many of you weren’t even born, and people had lost hope already! In spite of unit testing being a standard practice everywhere, things don’t seem to have gotten any better for testing distributed systems end-to-end. Why? Here are some possibilities:

  • Maybe it’s not a problem.
  • Maybe people are so used to the abuse that comes with it, that they don’t even recognize it as a problem.
  • There are too many things packed into the concept of testing distributed systems, and that is pretty clear in what came into my Twitter feed: avoiding old bugs as the code evolves, finding previously unknown bugs, poking the production system for how it’s doing, monitoring the production system, stress testing, finding out the fault-tolerance behavior, verifying liveness properties… Maybe we need to unpack all these things in some sort of taxonomy (know of one? pointers appreciated) and solve each one separately.

Anyway, we have some ideas for the concrete development problems we have experienced (and continue to experience) with OpenSimulator. Most of them are related to making sure old bugs don’t come back as the code evolves, so along the lines of end-to-end regression testing. As is usually the case in distributed systems, the worst bugs pop up non-deterministically, and are not functional but operational in nature: performance drops inexplicably, things work 8 out of 10 times, etc. Sometimes they are easy to find and fix, other times they’re hard. We eventually fix them, so that’s not the problem. (Give me a description of a bug that can be reproduced once in a while, and chances are I can fix it very quickly.) The problem is that we have no way right now of writing a regression test for them, so it’s not uncommon for old bugs to show up a year later when we’re not paying attention. That’s the software engineering problem my students and I are trying to solve.
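To illustrate what a regression test for a “works 8 out of 10 times” bug has to deal with, here is a naive sketch. It is not what we do with OpenSimulator, and every name in it is hypothetical: the flaky operation is a stand-in built from a seeded random generator, and the run count and threshold are arbitrary. The idea is simply to repeat the scenario many times and assert on the success rate instead of on a single outcome.

```python
import random


def success_rate(operation, runs):
    """Run `operation` `runs` times and return the fraction of runs that returned True."""
    successes = sum(1 for _ in range(runs) if operation())
    return successes / runs


def make_flaky_operation(failure_probability, seed):
    """Build a stand-in for a real end-to-end scenario that fails at random."""
    rng = random.Random(seed)
    return lambda: rng.random() >= failure_probability


def check_regression(operation, runs=200, min_rate=0.95):
    """Return True if the observed success rate meets the threshold."""
    return success_rate(operation, runs) >= min_rate


fixed = make_flaky_operation(failure_probability=0.0, seed=1)
regressed = make_flaky_operation(failure_probability=0.2, seed=1)
print(check_regression(fixed))      # prints True
print(check_regression(regressed))  # prints False
```

A check like this is slow and only as good as its threshold, and it says nothing about performance drops, which is part of why the problem is still open.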

Update [2016-06-07]

This post got another spike of attention, so I thought I’d add more links to interesting work that people sent me.

  • An awesome followup by Colin Scott
  • Rodrigo Fonseca’s Pivot Tracing: It uses elements of aspect-oriented programming to dynamically instrument distributed systems (good old AOP :-)
  • Raja Sambasivan’s work on request-flow comparison and visualizations (here and here)
  • Fay: execution traces
  • Use of formal methods at Amazon
  • JTorx, model-based testing