- 10,000 Bunnies hopping along just fine
- But 25,000 bunnies bends time itself
- And even 10,000 bunnies won't work with bandwidth limits
- So why limit bandwidth at all?
- There's not enough bandwidth and my bunnies are starving!
- MMO networking is a whole thing, honestly
Making an MMO is pretty hard. 
One of the things we are the MOST concerned about is the performance of the servers that we have to run to make the game work. A more performant and scalable server means we have to run (and pay for) fewer servers, and the servers we do run can support more players, more enemies, and more gameplay things at one time on a single shard.
We want to support several hundred players on a single shard without time dilation, and we do a lot of testing to make sure that we're as on track as we can be to meet this goal. One of those tests is our version of bunnymark...
10,000 Bunnies hopping along just fine
"Bunnymark" is an informal test of a game engine or rendering system that tries to give a rough estimate of performance limits in numbers of simple objects (bunnies) on screen.
Here's 10,000 of them!
Our version of this benchmark is similar to others, but ours is a complete end to end stress test of the shard and client. Each bunny in this video represents a complete, independent networked entity with physics and AI.
Each bunny that exists on the server causes the server to do a lot of work! It has to do (admittedly simple) physics and AI simulation for each bunny, but even more than that, it has to do a lot of work to tell the client the state of each bunny that the client can see. The server has to send a message to the client that says "create this entity, call it entity #2043", then it has to send messages to encode all of the different parts of that entity's state, then it has to send streams of updates to the client that say "entity #2043 position is now this", "entity #2043 bunny state is 'jumping'", etc., and it sends these updates for each bunny entity 60 times per second!
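To make that message flow a bit more concrete, here is a tiny sketch of it. This is purely illustrative and not the real wire format: the names `CreateEntity`, `UpdateComponent`, and `messages_for_tick` are made up for this example, and a real server would also handle things like despawns and only sending components that changed.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class CreateEntity:
    """Hypothetical message: "create this entity, call it entity #N"."""
    entity_id: int


@dataclass(frozen=True)
class UpdateComponent:
    """Hypothetical message: "entity #N <component> is now <value>"."""
    entity_id: int
    component: str  # e.g. "position" or "bunny_state"
    value: object


Message = Union[CreateEntity, UpdateComponent]


def messages_for_tick(entity_id, components, known_to_client):
    """Build the messages one entity generates for one client on one tick."""
    out = []
    # The client has to be told the entity exists before it gets any updates.
    if entity_id not in known_to_client:
        out.append(CreateEntity(entity_id))
        known_to_client.add(entity_id)
    # Then one update per component of the entity's state.
    for name, value in components.items():
        out.append(UpdateComponent(entity_id, name, value))
    return out


known = set()
state = {"position": (12.0, 3.5), "bunny_state": "jumping"}
first_tick = messages_for_tick(2043, state, known)   # create + 2 updates
second_tick = messages_for_tick(2043, state, known)  # just the 2 updates
```

Multiply the second case by 10,000 bunnies and 60 ticks per second and you get a feel for how much the server is being asked to say.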
There are no tricks here, the point of this test is to get a rough estimate of how much a single shard can handle. How many enemies / npcs etc can the server simulate at one time and still maintain 60 ticks per second? Bunny collision is a simple square, and their AI is also the picture of simplicity, but this is still a realistic test that we run to keep us from making any performance mistakes.
Now, the server can currently just barely handle 10k spawned bunnies while maintaining 60 ticks per second, and yes, the bunnies are simple, but to us this still means that we are on the right track. 10,000 of anything on screen at once is kind of a lot of things!
And look, because I apparently just hate my laptop's fans, the bunnies even react to damage and will, ahem, reproduce when hit by a projectile. Each bunny hit produces 2 more bunnies, resulting in fun exponential bun reproduction!
But 25,000 bunnies bends time itself
Well, if 10k bunnies just barely works, what happens when we spawn more?
The answer is that the server can't keep up! Still, it would be much, much better for nothing to break when this happens, since there are lots of reasons that the server might temporarily experience high load.
In the above video you might notice that the client is at < 60fps rendering approximately 25k bunnies, but that's actually not the major problem. The server tickrate at this moment is actually even lower at something like 25 ticks / sec. If I went and ran this on a computer with a better cpu / gpu I might get the client rendering at 60fps, but the rate of time itself has been slowed, because the server can't keep up. We call this effect "time dilation".
Now there are many many things going on in this video because 25,000 things is just a heckuva lot of things and most of them are on screen, so there are some other sources of stutter and jank here, but if I move somewhere else in the world the client will render a smooth 60fps with smooth animations, just everything moving at about 40% speed.
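The "about 40% speed" figure falls straight out of the tick rates. Here is a minimal sketch of the arithmetic, assuming every server tick advances the simulation by a fixed 1/60 of a second of game time (the function name `time_scale` is made up for this example):

```python
TARGET_TICK_RATE = 60.0  # ticks per second the server is trying to hit


def time_scale(actual_tick_rate, target=TARGET_TICK_RATE):
    """Fraction of real-time speed the world runs at.

    Assumes each tick advances game time by a fixed 1/target seconds, so
    falling behind slows the world down instead of skipping simulation.
    """
    return min(1.0, actual_tick_rate / target)


print(round(time_scale(25.0), 2))  # 0.42, i.e. roughly 40% speed
```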
We want to make the whole effect as smooth as we can so that when things start to go south, the game remains as playable as it can be. Something pretty interesting is probably happening if time dilation occurs, and you might need to get the heck out of Dodge... or maybe get the heck into Dodge, I won't judge.
And even 10,000 bunnies won't work with bandwidth limits
So before with the 10k bunny test when I said there were no tricks, that was a slight lie. Here's what the 10k bunny test currently looks like on an unmodified version of the game:
Hey what gives, that's super janky and juddery! This looks so much worse because I actually turned back on a feature of the server: client bandwidth limits.
You see, our goal here is not to make some extreme example of 10k entities spawned on a single player's screen actually work reasonably. Something like that should never realistically happen outside of some very exciting and interesting bugs.
Instead, what we want is for a single shard to be able to handle 200 players, each with their own set of, say, 20 or 30 entities on average. If you look at the above video and know how to read the very obtuse debugging output, you will see that the current network traffic to the client is somewhere around 600KB/s. This is already pretty high, and these limits are purposefully generous for testing purposes, but that is why everything looks so bad: the entity counts have blown way, way, way past the amount of data that is a realistic maximum for one client to receive.
If you were to look at the same debug output for the original 10k test at the top, you'd see that the server is sending something like 8.5MB/s of data to the client, and furthermore the client and server are running on the same machine with near zero latency and zero packet loss. The bunnymark test is not really a realistic test of how the game will perform on the real Internet with real clients, but it is a more realistic test of the limits of shard performance.
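To put those two numbers side by side, here is some back-of-the-envelope arithmetic. It assumes 1MB is 1,000,000 bytes, that essentially all of the traffic is bunny updates, and that the roughly 600KB/s seen with limits on is about what one client is allowed to receive. None of that is exact, and the function name is made up for this example.

```python
def bytes_per_entity_per_tick(bytes_per_second, entities, ticks_per_second=60):
    """Average bytes available to each entity on each tick."""
    return bytes_per_second / (entities * ticks_per_second)


# Unlimited run: ~8.5MB/s spread over 10,000 bunnies at 60 ticks per second.
print(round(bytes_per_entity_per_tick(8.5e6, 10_000), 1))  # 14.2

# Limited run: ~600KB/s spread over the same 10,000 bunnies.
print(round(bytes_per_entity_per_tick(600e3, 10_000), 1))  # 1.0
```

So under these assumptions each bunny wants around 14 bytes per tick, and with limits on there is only about 1 byte per bunny per tick to go around. No wonder they look unhappy.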
So why limit bandwidth at all?
So why even bother limiting the bandwidth to each client? Well, it's much better for 200 players to have an okay experience than it would be for one player to manage to eat all the bandwidth and CPU of the server and starve the others. And the 10k bunnymark is a thoroughly ridiculous test. Here's a somewhat more realistic (but still pretty ridiculous) test of what the game looks like when things butt up against bandwidth limits:
Things are a liiittle stuttery but it's hardly even noticeable! So if something very interesting starts happening on a shard and things start hitting the bandwidth limits, it's vastly better to be fair to each player than it would be to try to send far more data to a few "lucky" players. Also, the amount of CPU used to send 8.5MB/s of mostly position updates is non-negligible, so if something goes wrong that causes this to happen and we're not careful, it may cause cascading problems.
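As a rough sketch of the idea (not the real server code; every name here is made up for illustration), a per-client limit can be as simple as giving each client its own byte budget that refills every tick. The refill cap of two ticks' worth is an arbitrary choice for this example.

```python
class ByteBudget:
    """Hypothetical per-client send budget, refilled once per server tick."""

    def __init__(self, bytes_per_second, ticks_per_second=60):
        self.per_tick = bytes_per_second / ticks_per_second
        self.available = 0.0

    def start_tick(self):
        # Cap the carry-over so an idle client can't save up a huge burst.
        self.available = min(self.available + self.per_tick, 2 * self.per_tick)

    def try_send(self, size):
        """Spend `size` bytes if they fit; otherwise refuse (backpressure)."""
        if size > self.available:
            return False
        self.available -= size
        return True


# Each client gets its own budget, so one greedy client can't eat another's.
greedy = ByteBudget(600_000)
polite = ByteBudget(600_000)
greedy.start_tick()
polite.start_tick()

greedy.try_send(10_000)        # uses the whole tick's budget
print(greedy.try_send(1))      # False: greedy has to wait for the next tick
print(polite.try_send(500))    # True: polite is unaffected
```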
So is there anything realistic that could happen to cause similar sorts of shard
performance issues? Yes! Suppose 200+ people warp into a single shard and all
really want to stand in the same place for some reason, what happens? Well,
the shard will have to send each of the 200 players updates for each of the
other 200 players so that everyone can see everyone else on screen, that's
200 * 200 = 40,000 sets of updates. Well, if another 100 players show up to
that place because whatever is happening is just so important, what then? Well
300 * 300 = 90,000 sets of updates! Oh no! We have increased the
player count by 50% and the costs went up by 125%. This is what we in the biz
(the biz is Computer Science) call "quadratic" costs, the server
has to do "things" proportional to some number N (number of players) squared.
Scientifically speaking, it's not awesome.
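If you want to check our math, here it is as a couple of throwaway functions (the names are made up for this example, and like the numbers above it counts each player's updates about themselves too):

```python
def update_sets(players):
    """Every player needs updates about every player: N * N."""
    return players * players


def percent_increase(before, after):
    return (after - before) / before * 100


before = update_sets(200)  # 40,000
after = update_sets(300)   # 90,000

print(percent_increase(200, 300))       # 50.0  (player count)
print(percent_increase(before, after))  # 125.0 (cost)
```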
So no matter what we do, at some point players will find a way to stress the system, and we really want the system to respond by getting progressively, proportionally slower and generally less awesome, but never by falling over spectacularly. This is why we very scientifically test things by typing in ridonkulously large numbers: we want to know "what fails first?" and "what are our bottlenecks?". This is exactly how we found and fixed the next problem...
There's not enough bandwidth and my bunnies are starving!
So actually, the video above is not what we saw the first time we tried spawning 1000 entities. This is what actually happened:
This is pretty much the same test as before, 1k bunnies with the currently configured bandwidth limits. Why are some stuck in the air? Why are (some of?) the bunnies juddery again? Well, the answer has to do with starvation.
When we first implemented entity networking, we of course tried the simplest thing that could possibly work first: we looped through every entity that the client was aware of, then every component of that entity, and sent updates for those components to the client. Once we added bandwidth limits, "backpressure" suddenly became possible in this process, meaning that eventually the server will not be able to send as much data to a given client as it might need to. Each tick, the server loops and tries to send all of the updates for that tick until it reaches a point where no more updates will fit into the tubes, and this tends to happen near the end of the loop. So if every tick the server gets through, say, around 900 bunnies and then the tubes are full, you see where this is going: the entities at the end of the loop get "starved" of updates. To fix this, we had to add a fairness system that tries to evenly distribute updates in a round-robin fashion to known entities so as not to starve any particular entity of any particular update for longer than necessary.
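Here is a toy version of the before and after. It is a simplified sketch, not the actual fairness system: it works on whole entities rather than individual component updates, it pretends the budget is "N entities per tick" instead of bytes, and the names `naive_send` and `RoundRobinSender` are made up for this example.

```python
from collections import Counter


def naive_send(entity_ids, budget):
    """Always start from the front: anything past `budget` never gets sent."""
    return entity_ids[:budget]


class RoundRobinSender:
    """Remember where the last tick stopped and pick up from there."""

    def __init__(self):
        self.cursor = 0

    def send(self, entity_ids, budget):
        n = len(entity_ids)
        if n == 0:
            return []
        count = min(budget, n)
        start = self.cursor % n
        # Take `count` entities starting at the cursor, wrapping around.
        picked = [entity_ids[(start + i) % n] for i in range(count)]
        self.cursor = (start + count) % n
        return picked


bunnies = list(range(1000))
naive_counts = Counter()
fair_counts = Counter()
fair = RoundRobinSender()

for _ in range(10):  # ten ticks, room for 900 bunnies per tick
    naive_counts.update(naive_send(bunnies, 900))
    fair_counts.update(fair.send(bunnies, 900))

print(naive_counts[999])  # 0: the last 100 bunnies are frozen in mid-air
print(fair_counts[999])   # 9: every bunny got 9 of the 10 updates
```

The total amount of data sent is the same in both cases. The only difference is who gets left out on each tick, and with round robin nobody gets left out twice in a row.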
This is the sort of thing we discover by purposefully trying to stress the systems we build. What happens under perfectly normal conditions is usually not in question, but what happens as things go completely sideways is often very surprising! By building systems that degrade gracefully, we have more of a hope that when everyone just has to be in one place at one time for some reason, everything will still more or less work. Everyone can understand a bit of network slowdown or time dilation when obviously a bunch of things are happening all at once, but a repeatedly crashing or very weirdly behaving shard is not ever going to be fun.
MMO networking is a whole thing, honestly
This post is already kind of long and it has barely scratched the surface of
networking in Shattersong. There are so many other things that we could talk
about at this point: interpolation, prediction, input buffering, our
multi-channel networking system with independent bandwidth limits and
reliability settings, the difficulty of synchronizing time between two
machines, the incredible hacks (ahem, serious engineering work) that go into
getting something resembling UDP working in today's browsers, testing the game
using link conditioning, networking in the aether and how the portal networking
works, the list goes on and on and on. Both of us could talk about video game
networking for literal days.
This has been a pretty technical post, so I hope that even if you aren't as interested in the technical side of things, you're at least maybe a little more relaxed knowing that even though some of these things are pretty hard problems, we are at least more or less thinking about them and know what we're doing.
Let us know if you enjoyed this post or you want to hear more about any particular technical topic! You can reach out to us via Twitter or on Discord or even via Electronic Mail. Links are at the top of the page!
See ya next time!