Optimizing OpenSimulator, Part I
Back in September, I blogged about one of the most amazing conference experiences I have ever had, the OpenSimulator Community Conference (OSCC’13). This was a 2-day, purely virtual conference with a total of 360 attendees, held in an OpenSimulator virtual environment hosted here on one of my super-duper servers. I’m one of the core developers of OpenSimulator; I do it partly to keep my work in software research real [by being in the trenches of a complex server with a relatively large user community] and partly because I love virtual reality environments, and I use my vLab on a daily basis.
This post explains some of the optimizations that I made to OpenSimulator last summer so that it could actually support this event. It is a summary of a paper that will be presented at the Summer Simulation Conference, joint work with my student Eugenia. The preprint (pre-revision) version of the paper is available here.
(The work described in the paper and this post focuses on only one of many improvements that were made last summer by several developers.)
Multi-User (Quasi-)Real-Time Systems
A server for multi-user quasi-real-time interactions is a very different beast from a Web server. In order to convey a smooth 3D experience, some of the events (for example, movement) need to be communicated quickly and often. For that reason, most game clients, including the Second Life client used by OpenSimulator, don’t even try to be smart: they simply send a continuous stream of agent updates at a steady rate that varies between 5 and 60 per second, depending on the environment — even if the user’s avatar is not moving. This is so that no important user updates are ever late. It’s like streaming video from the users’ computers, except that the messages are much smaller than video frames. To make things interesting, multi-user environments have a nasty N^2 nature: each user who connects potentially increases the load not linearly but quadratically, because the events they generate need to be distributed to the other (N-1) users. These two things combined (event streams with N^2 growth in the number of users) make for some interesting engineering challenges!
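To make the N^2 point concrete, here is a rough back-of-the-envelope sketch (Python, purely for illustration) of how inbound and outbound message rates grow with the number of users, assuming every client sends 10 updates per second and the server naively relays each update to everyone else:

```python
# Back-of-the-envelope message rates for a naive multi-user server:
# every client sends `updates_per_sec` agent updates, and the server
# relays each one to the other N-1 clients.

def message_rates(n_users, updates_per_sec=10.0):
    inbound = n_users * updates_per_sec                   # grows linearly
    outbound = n_users * (n_users - 1) * updates_per_sec  # grows quadratically
    return inbound, outbound

for n in (10, 50, 100, 200):
    inbound, outbound = message_rates(n)
    print(f"{n:4d} users: {inbound:7.0f} updates/s in, {outbound:9.0f} updates/s out")
```

Going from 10 to 200 users multiplies the inbound rate by 20, but the naive outbound rate by roughly 440.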
A normal Second Life server can handle 80–100 users. As points of comparison, a World of Warcraft server can handle roughly 400 players, while a Google Hangout can serve 10 users. There are many interesting reasons for these differences, but they fall outside the scope of this post. Back in May, OpenSimulator was nowhere near Second Life levels. We could serve maybe up to 50 users, but things would already feel very rough at that number. If we were going to host a successful conference with hundreds of people, we needed to get serious about optimizing OpenSimulator. So a few of us, with the help of the OpenSimulator community, rolled up our sleeves and spent the months leading up to the conference making all sorts of improvements.
OpenSimulator Agent Updates and Server Load
The client viewers that connect to OpenSimulator servers send about 10 agent updates per second, even when the user’s avatar is doing nothing. Hence, with ten clients the server will receive about 100 packets/s, and with a hundred clients it will receive 1,000 packets/s. When the user’s avatar is not moving, most of these updates were already being discarded by a server-side filter that tested them for equality against the data from the previous update. It turns out that discarding agent update packets is a good thing for server load.
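Here is a minimal sketch of that equality filter, in Python just to show the idea; the real filter lives in OpenSimulator’s C# code and works on the actual agent update fields:

```python
# Minimal sketch of the equality filter: discard an agent update that is
# identical to the previous one received from the same agent.
# (Illustrative Python; not the actual OpenSimulator implementation.)

last_update = {}  # agent_id -> last update payload seen (e.g. a tuple of fields)

def should_discard(agent_id, update):
    """Return True if this update repeats the previous one exactly."""
    if last_update.get(agent_id) == update:
        return True
    last_update[agent_id] = update
    return False
```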
We had noticed unusually high CPU loads when certain client viewers connected to the server. This happened a lot: a person would come in and suddenly CPU usage was much higher than expected. In investigating the issue, I realized that the clients often got into some sort of unstable state where they would send position/rotation data that wasn’t exactly the same from one update to the next, but differed only by tiny deltas. This could be a bug in the client (we don’t control the client code; those are separate projects) but could also be a result of client-side physics doing its usual floating-point thing; in the worst case, it could be a malicious client abusing the server. Whatever it was, it was clear that the server needed to take control of the situation and get rid of those insignificant updates, instead of processing them, which is what it was doing.
Sometime in July, I added filters to get rid of these insignificant updates, and things got a lot better. But I didn’t have time to measure the change properly; I knew the filters had an effect, because our weekly load tests (chaotic events where members of the community would come in and load the server as much as possible) started getting less crashy and we were able to get more people in the server. After the conference, Eugenia and I developed a proper experimental setup that finally allowed us to quantify the benefits of the optimizations, and that will hopefully allow us to optimize OpenSimulator even more in time for OSCC’14.
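To show the idea behind the new filters, here is a sketch of a conservative filter that drops any update whose position and rotation differ from the last significant one by less than a small tolerance. The thresholds and data layout below are illustrative only; they are not the values or types used in OpenSimulator:

```python
import math

# Sketch of a conservative filter: drop updates whose position and rotation
# differ from the last significant update by less than a small tolerance.
# Thresholds and data layout are illustrative, not OpenSimulator's.

POS_TOLERANCE = 0.05   # metres (hypothetical)
ROT_TOLERANCE = 0.01   # per quaternion component (hypothetical)

last_significant = {}  # agent_id -> (position, rotation)

def is_significant(agent_id, position, rotation):
    previous = last_significant.get(agent_id)
    if previous is not None:
        prev_pos, prev_rot = previous
        pos_delta = math.dist(position, prev_pos)
        rot_delta = max(abs(a - b) for a, b in zip(rotation, prev_rot))
        if pos_delta < POS_TOLERANCE and rot_delta < ROT_TOLERANCE:
            return False   # insignificant: drop it before the heavy code path
    last_significant[agent_id] = (position, rotation)
    return True
```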
Controlled Experiments
The setup that Eugenia built uses an army of client bots that behave in specific ways, so we can measure and compare things. For this paper, we present results for sitting bots: we log the bots in and immediately have them sit down and do absolutely nothing. This behavior is meaningful because it emulates how people behaved at the conference.
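To give a flavor of the setup, here is a hypothetical sketch of the sitting-bot behavior. The BotClient class and its methods are made-up stand-ins for illustration; the real experiments used OpenSimulator’s own bot tooling:

```python
import threading
import time

# Hypothetical sketch of the sitting-bot behavior: log in, sit down, do
# nothing. BotClient and its methods are stand-ins for illustration.

class BotClient:
    def __init__(self, name):
        self.name = name

    def login(self, grid_url):
        print(f"{self.name}: logging in to {grid_url}")   # stub

    def sit_on_ground(self):
        print(f"{self.name}: sitting down")               # stub

def run_sitting_bot(index, grid_url="http://example-grid:9000", duration_s=600):
    bot = BotClient(f"bot{index:03d}")
    bot.login(grid_url)
    bot.sit_on_ground()
    time.sleep(duration_s)   # idle: the client generates only background traffic

# Launch, say, 50 sitting bots in parallel against the test server.
threads = [threading.Thread(target=run_sitting_bot, args=(i,)) for i in range(50)]
for t in threads:
    t.start()
```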
There is also the question of what to measure. During the load tests, we used CPU usage as the main metric for server load. It was clear that the lower the CPU usage, the more users we could serve, and that past a certain threshold of CPU usage, lag was unbearable. So in our controlled experiments we measured CPU usage.
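Measuring CPU usage itself is straightforward; something along these lines does the job (Python with the psutil package, as a sketch of the kind of sampling involved, not our actual measurement scripts):

```python
import time
import psutil  # third-party package: pip install psutil

# Sample the server host's CPU usage once per second for the duration of a
# run and report the average.

def average_cpu(duration_s=300, interval_s=1.0):
    samples = []
    end = time.time() + duration_s
    while time.time() < end:
        samples.append(psutil.cpu_percent(interval=interval_s))
    return sum(samples) / len(samples)

print(f"average CPU over the run: {average_cpu(duration_s=60):.1f}%")
```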
Here is the most relevant chart from our controlled experiments:
This chart shows the average CPU consumption of an OpenSimulator server with 10, 50, 100 and 200 sitting bots. The bars in each group represent (1) the baseline stable-clients scenario (clients always sending the same position/rotation values), (2) the unstable-clients scenario before the conservative server filter (i.e. what we had in May), and (3) the unstable-clients scenario with the conservative filter (i.e. the optimization I added).
As the chart shows, those insignificant updates were making the server extra busy! The code path from receiving the packet to processing it and acting on it is not too heavy, but the many insignificant agent update packets per second really take a toll. Eliminating those packets, and therefore avoiding that code path, restores server load to baseline levels, as expected.
We are now measuring load with other bot behaviors (standing and walking bots), and we’re finding further opportunities to cut the fat on the server, which will come in handy for the next instance of OSCC.
Software Engineering Issues
From a software engineering perspective, two really interesting things stand out about these kinds of servers: (1) testing them and (2) snowball effects. Let me mention them briefly.
Testing
Testing a multi-user quasi-real-time server is not easy, to put it mildly. In the months leading up to the conference, we engaged with the OpenSimulator community in weekly load tests, and that was extremely valuable. There’s nothing like reproducing the real event with real people driving the clients. But that has its drawbacks, too. People connect with all sorts of clients configured in all sorts of different ways; when things go wrong, it is impossible to link cause and effect. Load tests with real people make for a “holistic assessment”, i.e. we know whether the server is up to the task or not — and this is the only way of knowing that! — but if the test doesn’t go well, we don’t necessarily know what the cause of the problem is. Hence the need to run controlled experiments, too, with fake users (aka bots), where we can control everything independently.
There is a gap between user load tests and lab experiments: when things don’t work well, or fail, in the load test, we then need to figure out what to test in the lab, which is to say, we need to come up with hypotheses about the causes of the observed problems or failures. When our hypotheses are way off, we go off on tangents and waste a lot of time. There is no methodology for nailing this; it amounts to knowing the system well enough to have strong intuitions about what might be happening, and to failing fast when pursuing dead ends.
But while controlled experiments such as the one we report in the paper are absolutely necessary in order to quantify this optimization (and future ones), controlled experiments don’t represent the real thing. There are things that real people do that the bots can’t emulate, because we (the bot writers) can’t even imagine that people would do such things. And in general the user-driven viewers are a lot more demanding than the bots, so the CPU values shown in the chart above will be much higher for user-driven clients. Controlled experiments point us to inefficiencies in the server code. In the case of this experiment, the numbers we got explain fairly well the holistic behavior we saw last summer in the load tests.
Snowball Effects
Snowball effects happen when tiny pieces of code lead to dramatic changes in CPU usage, and consequently in how many users can be served by one CPU. The chart above is a typical example. The code that handles the agent update packets isn’t doing anything special: no fancy math, and no new packets were sent; the insignificant agent updates were being discarded higher up the stack anyway. But there is some code there. For 10 bots, the difference in CPU between the middle bar and the others is just 7%, and CPU usage is low anyway, so one might be tempted to ignore the problem; but for 200 bots the difference is ~35%, and running that code or not makes all the difference between placing CPU usage above 100% and keeping it well below that.
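The arithmetic behind the snowball is simple. Suppose, purely for illustration, that the insignificant-update code path costs a fraction of a millisecond per packet:

```python
# Illustrative snowball arithmetic: a cheap per-packet code path multiplied
# by a chatty packet stream. The per-packet cost is made up for illustration.

COST_PER_UPDATE_MS = 0.05        # hypothetical cost of the insignificant-update path
UPDATES_PER_CLIENT_PER_S = 10

for n_clients in (10, 50, 100, 200):
    packets_per_s = n_clients * UPDATES_PER_CLIENT_PER_S
    cpu_ms_per_s = packets_per_s * COST_PER_UPDATE_MS
    # ms of work per wall-clock second; dividing by 10 converts to % of one core
    print(f"{n_clients:4d} clients: {packets_per_s:5d} updates/s "
          f"-> {cpu_ms_per_s / 10:.1f}% of one core")
```

A cost that is negligible at 10 clients becomes a noticeable chunk of a core at 200, before counting any of the other work the server has to do.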
Ending up with a snowball or not depends on where those pieces of code sit; if they happen to sit in a path triggered by a very chatty kind of network packet (hundreds or thousands per second), the differences can be huge. Improving performance means, among other things, understanding the code paths that are activated by chatty packets and trimming them down or, better yet, avoiding them completely. So a lot of optimizations have to do with what’s called “interest management”, i.e. defining and maintaining a smaller set of events of interest from/to each user, instead of processing them all from/to everyone. The art here is to trim the events down without negatively impacting the user experience. And since user experience is often a qualitative affair, we’re back to art and good intuitions.
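As a toy example of interest management, consider a distance-based filter that relays an avatar’s updates only to users whose avatars are within a certain radius; the radius and data layout here are illustrative, not OpenSimulator’s actual scheme:

```python
import math

# Toy distance-based interest management: relay an avatar's updates only to
# users whose avatars are within a given radius, instead of to everyone.

INTEREST_RADIUS = 64.0   # metres (hypothetical)

def interested_recipients(sender_id, positions):
    """positions: dict mapping agent_id -> (x, y, z)."""
    sender_pos = positions[sender_id]
    return [
        agent_id
        for agent_id, pos in positions.items()
        if agent_id != sender_id and math.dist(sender_pos, pos) <= INTEREST_RADIUS
    ]

# Example: 'c' is too far from 'a' to receive its updates.
positions = {"a": (0, 0, 0), "b": (10, 0, 0), "c": (500, 0, 0)}
print(interested_recipients("a", positions))   # -> ['b']
```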