Distributed Systems Testing: The Lost World

After many months of failing to find good papers about distributed systems testing, yesterday I asked a question on Twitter:

This got many retweets and some interesting replies, so I’m going to summarize them here. Then I’ll explain a bit more about why I’m interested in this topic.

Interesting Reading Materials

This is a slide deck giving a great overview of how to test distributed systems, specifically micro-service applications. It presents all levels of testing, from unit to end-to-end. Needless to say, my original tweet was about end-to-end-ish techniques. As this slide deck says, “Due to the difficulty inherent in writing tests in this style, some teams opt to avoid end-to-end testing completely, in favour of thorough production monitoring and testing directly against the production environment.” OK, things haven’t gotten any better since 1980. But this pointed to some interesting projects I didn’t know about related to specifying test cases, namely Concordion and Gauge. Neat!
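
To give a flavor of what those look like: Gauge specs are executable markdown, with the steps implemented in a regular programming language. Here is a minimal sketch using Gauge’s Python plugin (getgauge); the spec, the service, and its URL are all made up for illustration, so treat it as a shape, not a recipe.

    # Hypothetical Gauge spec (plain markdown in a real project; shown here
    # as a comment). The service and its endpoints are invented:
    #
    #   # User accounts
    #   ## Creating a user
    #   * Create a user named "alice"
    #   * Fetching user "alice" must succeed
    #
    # Step implementations, using the getgauge Python plugin:

    import urllib.request

    from getgauge.python import step

    BASE_URL = "http://localhost:8080"  # assumed test deployment of the service

    @step("Create a user named <name>")
    def create_user(name):
        req = urllib.request.Request(f"{BASE_URL}/users", data=name.encode(),
                                     method="POST")
        with urllib.request.urlopen(req) as resp:
            assert resp.status == 201  # service assumed to return 201 Created

    @step("Fetching user <name> must succeed")
    def fetch_user(name):
        with urllib.request.urlopen(f"{BASE_URL}/users/{name}") as resp:
            assert resp.status == 200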

Next is a link to an article that seems quite interesting but that is completely locked up, even for me — and I’m inside a University network. See if you have better luck!

In googling for a copy of this article, which I didn’t find anywhere, I came across this other relevant article also published in the ACM Queue. TL;DR: end-to-end is hard, the article gives some tips. I also came across this other slide deck by Ines Sombra. Similar message: it’s hard, not much out there.

There are a few frameworks based on trace logs, a relatively old technique that glorifies printf as First-Class Citizen Of The Testing Guild:

Links: Dapper and Zipkin.

I found a great paper about tracing-based testing of distributed systems. (Thank god for Academics and our obsession with writing stuff down!)
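
To make the technique concrete, here is a toy sketch of the core idea — not Dapper’s or Zipkin’s actual API: every operation records a span tagged with a trace ID, and a test asserts structural properties over the collected trace. All names below are invented.

    # Toy sketch of trace-based testing: every operation records a span
    # tagged with a trace ID, and the test asserts structural properties
    # over the collected trace. Real systems would use Zipkin-style
    # collectors instead of an in-process dict.
    import time
    import uuid
    from collections import defaultdict

    SPANS = defaultdict(list)  # trace_id -> [(service, operation, start, end)]

    def traced(service, operation, trace_id, work):
        start = time.time()
        result = work()
        SPANS[trace_id].append((service, operation, start, time.time()))
        return result

    def handle_checkout(trace_id):
        # "checkout" fans out to "inventory" and "payment"; the lambdas stand
        # in for real RPCs that would propagate the trace ID over the wire
        traced("inventory", "reserve", trace_id, lambda: None)
        traced("payment", "charge", trace_id, lambda: None)

    def test_checkout_charges_payment_exactly_once():
        trace_id = str(uuid.uuid4())
        traced("frontend", "checkout", trace_id, lambda: handle_checkout(trace_id))
        services = [s for s, _, _, _ in SPANS[trace_id]]
        assert services.count("payment") == 1

    test_checkout_charges_payment_exactly_once()
    print("trace assertion passed")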

Further in the academic world,

Shadow seems to be a network simulator that takes plugins connecting to assorted distributed applications like Tor and Bitcoin. Apparently it runs the actual code of those applications as black boxes, which is pretty neat. Here’s the paper about it. It’s black-box, end-to-end testing of the behavior of the application nodes in the presence of assorted network failures.

Also from the academic world,

The Github repo has a link to the SIGMOD paper about Molly. Molly implements “lineage-driven fault injection,” which “uses data lineage to reason backwards (from effects to causes) about whether a given correct outcome could have failed to occur due to some combination of faults.” So it’s the well-known fault-injection testing technique adapted with specific ideas coming from the database world related to data lineage. This sounds a bit too database-y, and perhaps not generally applicable.

But definitely worth a read. And even more interesting:

Follow the link to the Netflix blog post, it’s pretty cool.
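
For intuition, here is the brute-force cousin of what Molly does, sketched in Python: enumerate combinations of dropped messages in a toy replication protocol and check whether the invariant survives each one. Molly’s contribution is using lineage to prune this exhaustive search; the protocol and all names below are invented.

    # Brute-force cousin of lineage-driven fault injection: run a toy
    # replication protocol under every combination of dropped messages and
    # check whether the invariant still holds.
    from itertools import chain, combinations

    MESSAGES = ["A->B", "A->C"]  # primary A replicates a write to B and C

    def run_protocol(dropped):
        """Return the set of backups that end up holding the write."""
        backups = set()
        if "A->B" not in dropped:
            backups.add("B")
        if "A->C" not in dropped:
            backups.add("C")
        return backups

    def powerset(items):
        return chain.from_iterable(
            combinations(items, r) for r in range(len(items) + 1))

    # Invariant: at least one backup holds the write.
    for faults in powerset(MESSAGES):
        if not run_protocol(set(faults)):
            print(f"counterexample: dropping {list(faults)} loses the write")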

From the upper levels of the Ivory Tower, but with a gentle introduction by Adrian Colyer,

I must say that, unlike Adrian, I remain as skeptical as ever that upfront verification of complex programs is the silver bullet for bug-free software, much less for complex distributed systems. It’s interesting work in verification, but it’s definitely not testing, which is what I’m looking at.

And that’s about it for reading materials. Not a lot, unfortunately. I was hoping there would be some papers from the testing conferences, but they seem to be completely radio silent on distributed systems testing. (Please prove me wrong!)

Interesting Frameworks-Without-Papers

Jepsen takes the lead:

I had heard about Jepsen before, and even saw @Aphyr’s talk at StrangeLoop 2013. I had forgotten all about it, so yes, super neat! Like Shadow, mentioned above, Jepsen is black-box, end-to-end testing. I don’t know how flexible it is, as I couldn’t find a white paper about it, and all of Aphyr’s [great] talks are about him finding all sorts of bugs in all sorts of popular databases without explaining his tool very well. It may be similar to Shadow, but I can’t really tell. It needs more digging to see whether it can be used by non-Aphyr mortals, to test concrete usage scenarios of non-database-y applications, and under operational goals other than the effect of network failures.
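
Since I can’t show Jepsen itself (it’s Clojure, and drives real clusters over SSH), here is a toy Python sketch of the shape of such a test: concurrent clients record a history of operations while a “nemesis” injects a fault, and a checker validates the history afterwards. The store is an in-memory stand-in for a real cluster, and the checker is far weaker than Jepsen’s.

    # Toy sketch of a Jepsen-shaped test: clients + nemesis + history checker.
    import random
    import threading
    import time

    class FlakyStore:
        """Stand-in for a distributed register; times out while partitioned."""
        def __init__(self):
            self.value = None
            self.partitioned = False
            self.lock = threading.Lock()

        def write(self, v):
            with self.lock:
                if self.partitioned:
                    raise TimeoutError
                self.value = v

        def read(self):
            with self.lock:
                if self.partitioned:
                    raise TimeoutError
                return self.value

    history, store = [], FlakyStore()

    def client(cid):
        for i in range(50):
            v = cid * 1000 + i
            try:
                store.write(v)
                history.append(("write", v, True))
            except TimeoutError:
                history.append(("write", v, False))
            try:
                history.append(("read", store.read(), True))
            except TimeoutError:
                pass
            time.sleep(random.random() / 200)

    def nemesis():  # the fault injector: flips a "network partition" on and off
        for _ in range(5):
            time.sleep(0.05)
            store.partitioned = not store.partitioned
        store.partitioned = False

    threads = [threading.Thread(target=client, args=(c,)) for c in (1, 2)]
    threads.append(threading.Thread(target=nemesis))
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Checker: every successful read returns a value some client wrote.
    written = {v for op, v, ok in history if op == "write" and ok}
    for op, v, ok in history:
        if op == "read" and ok and v is not None:
            assert v in written, f"read {v}, which nobody successfully wrote"
    print(f"checked {len(history)} operations")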

From the Erlang world,

Here is a link to eqc_temporal. I guess we have to wait for Tom to dig in and tell us how to do testing of Erlang systems with it.
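
In the meantime, the underlying idea — asserting temporal relations over a timestamped trace of events — carries over to any language. A hand-rolled Python flavor of an “eventually” check might look like this (the trace and the property are invented; eqc_temporal itself does this properly, with QuickCheck, in Erlang):

    # Hand-rolled temporal assertion over a timestamped event trace.
    # Events are (timestamp, node, event_name); everything is illustrative.
    trace = [
        (0.0, "n1", "leader_elected"),
        (0.1, "n2", "follower"),
        (0.5, "n1", "crashed"),
        (0.9, "n2", "leader_elected"),
    ]

    def eventually(trace, after_ts, pred, within):
        """True if an event satisfying pred occurs in (after_ts, after_ts + within]."""
        return any(after_ts < ts <= after_ts + within and pred(node, ev)
                   for ts, node, ev in trace)

    # Property: after any crash, a new leader is elected within one second.
    for ts, node, ev in trace:
        if ev == "crashed":
            assert eventually(trace, ts, lambda n, e: e == "leader_elected", 1.0), \
                f"no new leader within 1s of {node} crashing at {ts}"
    print("temporal property holds on this trace")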

Preach!

Also, a really neat idea — start with a simulation:

This may be feasible and desirable for big infrastructure-y systems (I’m a big fan of simulations, and I do them too, related to my original question), but it may be overkill, or even infeasible, for many distributed applications.
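
For what it’s worth, the kernel of such a simulation is tiny: a priority queue of timestamped events plus a seeded RNG, so that runs with random delays and losses replay deterministically. A minimal sketch — the ping/ack “protocol” and all numbers are placeholders:

    # Minimal discrete-event simulation kernel with deterministic replay.
    import heapq
    import random

    def simulate(seed, n_pings=10, loss_rate=0.2):
        rng = random.Random(seed)  # same seed => same run, bug repros included
        events, acks = [], 0
        for i in range(n_pings):
            heapq.heappush(events, (rng.uniform(0, 1), "ping", i))
        while events:
            now, kind, i = heapq.heappop(events)
            if kind == "ping" and rng.random() >= loss_rate:  # survived the network
                heapq.heappush(events, (now + rng.uniform(0.01, 0.1), "ack", i))
            elif kind == "ack":
                acks += 1
        return acks

    assert simulate(seed=42) == simulate(seed=42)  # reproducible
    print("acks received:", simulate(seed=42))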

And… that’s all that came through my Twitter feed.

Why

A few months ago, based on our own experiences with developing and testing OpenSimulator, my students and I dug quite deep into the research literature on this topic, and came out pretty much empty-handed. We found a few great quotes, one of which is the title of a paper written in 1980, “the lost world of software debugging and testing”. Yes, 1980: 36 years ago! Many of you weren’t even born, and people had lost hope already! In spite of unit testing being standard practice everywhere, things don’t seem to have gotten any better for testing distributed systems end-to-end. Why? Here are some possibilities:

  • Maybe it’s not a problem.
  • Maybe people are so used to the abuse that comes with it that they don’t even recognize it as a problem.
  • There are too many things packed into the concept of testing distributed systems, and that was pretty clear in what came through my Twitter feed: avoiding old bugs as the code evolves, finding previously unknown bugs, poking the production system to see how it’s doing, monitoring the production system, stress testing, finding out the fault-tolerance behavior, verifying liveness properties… Maybe we need to unpack all these things into some sort of taxonomy (know of one? pointers appreciated) and solve each one separately.

Anyway, we have some ideas for the concrete development problems we have experienced (and continue to experience) with OpenSimulator. Most of them are related to making sure old bugs don’t come back as the code evolves, so along the lines of end-to-end regression testing. As is usually the case in distributed systems, the worst bugs usually pop up non-deterministically, and are not functional but operational in nature: performance drops inexplicably, things work 8 out of 10 times, etc. Sometimes they are easy to find and fix, other times they’re hard. We eventually fix them, so that’s not the problem. (Give me a description of a bug that can be reproduced once in a while, and chances are I can fix it very quickly.) The problem is that we have no way right now of writing a regression test for them, so it’s not uncommon for old bugs to show up a year later when we’re not paying attention. That’s the software engineering problem my students and I are trying to solve.
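
One direction we’re exploring — a sketch of the general idea, not a finished tool: turn “works 8 out of 10 times” into an executable regression test by running the scenario many times and asserting on aggregate statistics (success rate, tail latency) rather than on a single run. Everything below, including run_scenario() and the thresholds, is a placeholder.

    # Sketch of a statistical regression test for non-deterministic,
    # operational bugs: assert on aggregate behavior over N runs.
    import random

    def run_scenario():
        """Placeholder for a real end-to-end scenario; returns (ok, latency_s)."""
        latency = max(random.gauss(0.2, 0.05), 0.0)
        return random.random() < 0.95, latency

    def test_scenario_stays_healthy(runs=100):
        results = [run_scenario() for _ in range(runs)]
        success_rate = sum(ok for ok, _ in results) / runs
        # Thresholds come from a known-good baseline; a regression trips them.
        assert success_rate >= 0.9, f"success rate regressed: {success_rate:.2f}"
        latencies = sorted(lat for ok, lat in results if ok)
        p95 = latencies[int(0.95 * len(latencies)) - 1]
        assert p95 < 0.5, f"p95 latency regressed: {p95:.3f}s"

    test_scenario_stays_healthy()
    print("statistical regression test passed")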

Update [2016-06-07]

This post got another spike of attention, so I thought I’d add more links to interesting work that people sent me.

  • An awesome followup by Colin Scott
  • Rodrigo Fonseca’s Pivot Tracing: it uses elements of aspect-oriented programming to dynamically instrument distributed systems (good old AOP :-)
  • Raja Sambasivan’s work on request-flow comparison and visualizations (here and here)
  • Fay: execution traces
  • Use of formal methods at Amazon
  • JTorX, model-based testing