Lab Cloudet: IT at Lab Scale
I’ve been a Professor for almost 13 years. During these 13 years, I have seen investment in routine academic IT infrastructure decrease steadily, while at the same time we have been encouraged, even pushed, to use commercial infrastructure. Academic administrators seem to be mostly preoccupied with the 1% of academic work that requires high-performance computing, large clusters and super-high-speed networks, and don’t pay much attention to the everyday needs of the other 99% of academic work. This puts research integrity at risk, but hey! It’s hard to attract money for a vanilla storage system that is used for everyday computing, and for the sys admin that goes with it, especially when so many commercial cloud providers give things for free to Universities; you need to add bells and whistles and throw in a few buzzwords like “cloud” and “big data” to get people impressed. Fair enough.
I finally got fed up with commercial solutions, for reasons that I explain later, and decided to set up my own IT infrastructure that fits exactly my needs and those of my group. I thought I’d write down my take on this. I suspect lots of colleagues have the same struggles.
First, let me write down a laundry list of pain points that I’ve suffered over the years:
- File sharing among my several computers. Before Dropbox existed, this was a MAJOR pain. My School provided (and still provides) storage space for everyone on a mountable network drive, but this was always clunky. If you’re offline, you can’t access the drive. That’s a show stopper! Working while traveling (and I travel a fair amount) becomes an error-prone exercise in copying files from the network drive to local disks by hand, and then copying them back, if you remember to do it… Plus you always have to use the VPN in order to mount that drive off campus. Plus, it’s slow when off campus (try to mount the drive from Europe!)
- File sharing with my students and between my students. Again, before Dropbox existed, this was a MAJOR pain. Basically, email was the only solution. Version control servers also tackle this problem, as, along with version control itself, they serve as centralized repositories that we can access for pulling copies and pushing changes. So some time in 2007, I set up an SVN server, but that didn’t last long — see pain point #5.
- File sharing with ad-hoc collaborators. My service to the community and inter-institutional research collaborations often require me to share files with arbitrary people. One more time, before Dropbox and Google Docs existed, email was all there was. Shared folders in Dropbox made ad-hoc sharing possible and painless.
- File sharing with the public. A lot of my projects end up producing datasets that we think are useful for the research community at large. Indeed, some of them are quite popular. Some of them have reasonable sizes, but others are extremely large (>100 GB) and don’t fit on the School’s network drive. I have been serving one of those datasets to other researchers by asking them to send me a hard drive by mail, copying the data onto it, and mailing the drive back! (As a Facebook friend said: “never underestimate the capacity of an A380 full of hard drives!”)
- Backups. Because the School’s network drive was not good enough for my needs, I soon started buying my own servers and storage. Unfortunately that didn’t come with a sys admin, so for a while I ran things without proper backups. Then one day, one of my main drives stopped working and I lost important files. I learned my lesson. Since then, I have at least set up disks in a RAID configuration, but I have no idea if things are properly set up, for the simple reason that I haven’t had any new failures yet (knock on wood!). Frankly, I’d rather not spend my time and my students’ time doing systems administration. Backups of assorted drives are still a MAJOR pain, and without proper backups, everything else is at risk.
- Version control. As I said, I had an SVN server at some point. Then the hard drive crashed, and I lost the data. I didn’t set up the SVN server again, because I did not want to come to rely on something so important while it sat on a drive that didn’t have professional backup procedures. Luckily for everyone, Github came to be, and I moved all my projects’ code to Github. Github is good for code, but it’s not good for papers or grant proposals. I need version control on these too, and it needs to be private and secure. Github has a limited number of private repos. I’m not up to placing budget sheets on some 3rd party version control system somewhere on the Interwebz.
- Students’ files and data when they graduate. This is a MAJOR pain, and it was even worse before Dropbox existed, because I didn’t always end up with the latest versions of their papers, much less their data. I have no idea where the students store their dissertations. They keep their working data God knows where! Sometimes I have panic attacks about this, and bombard my students with requests for them to send me the data/papers after they’re done with the projects, or place them in some folder that I can access, but this is always a fragile situation.
In short, an Academic Lab, like mine, has specific needs in terms of code, data and documents that, if not taken care of carefully, may jeopardize the integrity of the research itself. With my own students, and for the first few years, I rarely got to see their working code and data. I got increasingly nervous about this situation, and that’s when I started buying my own hardware and experimenting with taking the IT of my group into our own hands, with mixed results. When commercial clouds came around, I quickly moved some of my group’s files to them, but they have all sorts of problems too. I’ve had collaborations with other faculty where I never got to see any data at all, just the summaries of data analysis done by their students. While I trust my colleagues, this always makes me very nervous.
I’m not talking about “open science.” There’s a difference between placing the final outputs of projects out in the open and supporting groups of researchers with their workflow as they work on projects. Most projects go through a phase where it would be unwise to place things out in the open, for a variety of reasons. I’m also not talking about synchronous, real-time collaborations, which have their own problems and needs (I’ve solved that by setting up my own virtual lab). This post is about asynchronous collaboration.
This is one elephant in the University room that administrators don’t really know what to do about. Each one of these pain points sounds like something each faculty ought to figure out by themselves and, for the most part, that is correct: each group has their own dynamics, and solutions that work for one group may not work for everyone. With the emergence of commercial clouds that give lots of free stuff to faculty and students, the trend in University administration has been to let Academics use those services as they need, and to stop investing in internal systems.
Commercial Solutions, and Why They Fail at Academic Support
Some of the pain points I list above were greatly eased when commercial clouds entered the scene about 5 years ago. Dropbox was a godsend! They had a “space race” to give large amounts of storage to students and faculty, and that felt like the solution to my file sharing issues. For the past 2 years or so, I have been using Dropbox as the main infrastructure to share files with myself, my students and my collaborators. All the papers we work on are there. All my grant proposals are there, mostly. Unfortunately, I have been hitting the storage limit for a while, and this has become an issue. When someone wants to share something new with me, I have to remove things from there. Each of my students only shares things with me, and maybe another student, but I share with all of them. Dropbox is mostly an O(1) solution; I have an O(N) situation! It doesn’t scale, unless we pay for it.
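To make the O(1) versus O(N) point concrete, here is a rough sketch. It assumes that a shared folder counts against the quota of every member of that folder, which is what the paragraph above implies; the quota, folder size and group size are made-up numbers for illustration, not my actual figures.

```python
# Toy model: every shared folder counts against each member's quota.
# All numbers below are hypothetical, chosen only to illustrate the scaling.

def storage_used(shared_folder_sizes_gb):
    """Total space one account needs: the sum of all folders shared with it."""
    return sum(shared_folder_sizes_gb)

def fits_in_quota(quota_gb, shared_folder_sizes_gb):
    """True if the account can hold every folder shared with it."""
    return storage_used(shared_folder_sizes_gb) <= quota_gb

QUOTA_GB = 20          # hypothetical free quota per account
FOLDER_GB = 3          # hypothetical size of one student's shared folder
NUM_STUDENTS = 8       # hypothetical group size

# A student shares one folder with the advisor: constant usage, O(1).
student_usage = storage_used([FOLDER_GB])

# The advisor is a member of every student's folder: usage grows with N.
advisor_usage = storage_used([FOLDER_GB] * NUM_STUDENTS)

print("student:", student_usage, "GB, fits:", fits_in_quota(QUOTA_GB, [FOLDER_GB]))
print("advisor:", advisor_usage, "GB, fits:",
      fits_in_quota(QUOTA_GB, [FOLDER_GB] * NUM_STUDENTS))
```

With these numbers each student is comfortably inside the quota while the advisor is already over it, and every new student makes it worse.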
Recently I received a notification from Dropbox informing me that the storage they gave me for free during the space race is about to be taken back, unless I convert to a paying customer. Sorry, Dropbox, but it’s not cool to give something with so much fanfare and then take it back. This notification was the last straw.
Commercial solutions are unsustainable for Academics, on an individual basis: one day they’re showering us with perks, the next day the CEO wakes up in a bad mood, or whatever, and the perks are gone. We have no control over the commercial interests of those companies, and therefore we are completely vulnerable to those companies’ whims. Free is never free of worries. I’m waiting for the day Github decides to take back the perks they gave to Academics too!
I would consider paying for all of this, but the pricing schemes don’t work well for me. This is not just a personal problem; this is a problem for my entire group and beyond. I can’t just solve my personal problem, I need to solve the problem of all my students too. I don’t have a fixed number of students; at any point in time, I may have some arbitrary number of students working with me. Plus, it’s not just one vendor; Dropbox is good for some things, Github is good for other things, Google Docs is good for something too. The cheap starter prices quickly add up once we start demanding more of these services: if I were to pay for them in a way that would solve ALL of my entire, unbounded, group’s problems, the bill would quickly reach $1,000/year or more. And I would still have to deal with the problems coming from the limits of ad-hoc collaborators, for whom I can’t pay.
If money needs to be spent, I’d rather consider other options, especially if some of those options address not just 2 or 3 pain points at a time, but all of them in one scoop!
I decided I need my own cloudet that does exactly what I need it to do. Here are my requirements:
- Unlimited storage for everyone: the only limit is money, and disks are cheap. I need all my students to use this storage for everything they do, so that I can look at it whenever I need to, and so that a copy of their data stays here, reliably, when they graduate.
- No limits on file size: we have very large data sets
- Minimal human intervention: no copying files by hand from a computer to another, synchronization must be done automatically and in the background
- Reliable backups: if a storage disk fails, we want to be able to recover the data
- High availability, must work offline: for trips, network outages, Starbucks, etc.
- High performance for data-intensive jobs: for when we run things like the Sourcerer tools
- Lightweight access control: we want to limit the likelihood of people in the group accidentally doing something wrong to someone else’s files
- Version control on anything that needs to be version controlled, including budget and all sorts of confidential documents
- Simple interface for sharing files with ad-hoc collaborators without requiring account creation anywhere.
- Simple interface for public-facing data of any size
I looked around, and realized that it is now possible to set up a cloudet that can support all of this for less than $5,000. Here are the ingredients:
- A Network-Attached Storage (NAS) device. Something like this is more than enough, and costs about $1,000. They have smaller models that cost less.
- As many hard drives as needed. Something like this will do, $100 each. We can buy 5 TB in RAID for about $1,000.
- A vanilla [Linux] server for running file sharing software. Something like this will do, for about $500.
- BTSync. Free and trivial to install everywhere, including on mobile devices.
- Plain Git or GitLab. Free and easy to install.
- Optionally, additional high-performance servers, if the projects require data- and computing-intensive operations. I don’t consider this to be part of the basic cloudet, but it’s something I need; others may not need it.
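As a back-of-the-envelope check on the numbers above, here is a small sketch that adds up the basic ingredients and compares them with the $1,000/year I estimated for commercial services. The prices are the approximate ones quoted in the list; the calculation deliberately ignores electricity, replacement drives, sys admin time and the optional high-performance servers, so treat it as a rough lower bound rather than a real budget.

```python
# Approximate prices quoted in the ingredient list (USD).
# Running costs and the optional high-performance servers are left out.
BASIC_CLOUDET_COSTS = {
    "NAS device": 1000,
    "hard drives (about 5 TB in RAID)": 1000,
    "Linux server": 500,
    "BTSync": 0,
    "Git or GitLab": 0,
}

BUDGET_CEILING = 5000            # "less than $5,000"
COMMERCIAL_COST_PER_YEAR = 1000  # the "$1,000/year or more" estimate

def total_cost(costs):
    """Sum the one-time purchase prices."""
    return sum(costs.values())

def break_even_years(upfront, yearly_alternative):
    """Years after which the one-time purchase is cheaper than paying yearly."""
    return upfront / yearly_alternative

total = total_cost(BASIC_CLOUDET_COSTS)
years = break_even_years(total, COMMERCIAL_COST_PER_YEAR)

print(f"basic cloudet: ${total}, under ceiling: {total < BUDGET_CEILING}")
print(f"break-even against commercial services: {years} years")
```

The basic setup comes to about $2,500, which leaves room under the $5,000 ceiling for more disks or a compute server.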
Here’s the picture of the cloudet I’m setting up:
Some observations about this:
- BTSync is my new best friend. It basically replaces the functionality of Dropbox, and does some things even better. For example, there are no limits on file size. When transferring large data (such as some of our public datasets) all peers that have a copy are involved in the transfer, so the transfer is much faster. It is also secure, and it doesn’t require any frickin’ centralized login: if you have a folder’s key, you can sync it locally. Keys can be shared by email with anyone without having to create accounts on my cloudet, so I can share folders not just with my students but with any collaborators around the world.
- BTSync is P2P file sharing software. Strictly speaking, my server doesn’t need to be involved when I’m sharing files between my workstations, or between my workstations and my students’ workstations. The server is just another peer in the network of peers; there’s nothing special about it. However, it brings two things to the table: (1) it’s always on, so if I work on my laptop while my desktop is sleeping, then I close the laptop and go work on the desktop, the changes will be available to the desktop, because the server peer got them; it serves as a reliable relay; (2) the server’s hard drive is backed up by people who know what they’re doing, so if disaster happens and data is lost on all workstations, it is possible to recover it with some time lag.
- I’m still not sure whether I’ll install GitLab or just plain Git. I dread software that requires account management. But the issue tracker of GitLab is definitely a big plus, so I may go for it.
- The usage scenarios of BTSync and Git[Lab] have some overlap. They both synchronize files between independent copies more or less transparently. In some cases, version control is overkill, and syncing is all that’s needed. In other cases, version control is necessary. However, the great thing about both of them is that they are designed under the P2P philosophy. Files are always local and remote at the same time, without us having to worry about it. This removes the enormous invisible wall of people having to decide which files they need locally (for performance or availability reasons) and which files they send up to the server side (for backup and sharing). In some cases, we can use both BTSync and git on the same folder.
- An alternative configuration on the hardware front is to replace the NAS and the cheap Linux server in front of it with just one Direct-Attached Storage server. I prefer to separate the two, because they serve different purposes; if either of them fails, it fails independently of the other. Plus, in my case, the NAS will be managed by my School’s IT staff, and they need to have complete control over it, whereas I want to have complete control over the software that serves that data storage to my group.
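The "always-on relay" role of the server in the observations above can be illustrated with a toy model. This is not how BTSync works internally; it only assumes that peers can exchange files when they are online at the same time. The `Peer` class, the `sync` function and the file name are all made up for the illustration.

```python
class Peer:
    """A machine holding a local copy of the shared folder."""
    def __init__(self, name, online=False):
        self.name = name
        self.online = online
        self.files = {}  # path -> version number

def sync(peers):
    """Bring every online peer up to the newest version held by any online peer."""
    online = [p for p in peers if p.online]
    newest = {}
    for peer in online:
        for path, version in peer.files.items():
            newest[path] = max(version, newest.get(path, 0))
    for peer in online:
        peer.files.update(newest)

def desktop_sees_laptop_edit(with_server):
    """Edit on the laptop while the desktop sleeps, then swap machines."""
    laptop = Peer("laptop", online=True)
    desktop = Peer("desktop", online=False)
    peers = [laptop, desktop]
    if with_server:
        peers.append(Peer("server", online=True))  # the always-on peer

    laptop.files["paper.tex"] = 1   # work on the laptop
    sync(peers)                     # desktop is asleep and misses this round

    laptop.online = False           # close the laptop
    desktop.online = True           # wake the desktop
    sync(peers)
    return desktop.files.get("paper.tex") == 1

print("without server:", desktop_sees_laptop_edit(with_server=False))
print("with server:   ", desktop_sees_laptop_edit(with_server=True))
```

Without the always-on peer, the edit stays stranded on the closed laptop; with it, the desktop picks the change up from the server.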
I’m interested to hear about other Academics’ experiences, problems and solutions.