Engineering a World Record at Magikcraft

Published in Magikcraft · 7 min read · Feb 16, 2017

On February 13, 2017, at MS Ignite on the Gold Coast, Australia, we set the world record for “The Most Minecraft Zombies killed using JavaScript lightning in 10-minutes”.

The world record is 13,584 zombies killed in ten minutes.

Magikcraft teaches kids to code in JavaScript using Minecraft and the metaphor of magic. We took on the world record as a way to harden our system against extreme scale.

To set the record, we built the most performant Minecraft server and logging system in the world. Here is what that looks like.

Nashorn Engine

To run JavaScript in Minecraft we make use of the Nashorn engine included in Java 8. Nashorn is a JSR-223 scripting engine that can execute JavaScript code.

Prior to the world record attempt we used a single Nashorn engine to evaluate all users’ code. The advantage of this is that it is memory-efficient, and also allows us to access the same global memory space in JavaScript. The downside is that it is single-threaded, and execution of one user’s program blocks the execution of another. Since the kids would ultimately be dropping massive blasts of lightning, we needed to take advantage of multi-threading to have their spells executing in parallel.

So we implemented an “engine-per-user” model, where each user gets their own Nashorn scripting engine. That meant that we lost the shared memory-mapped space, so we reimplemented that in Java.
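The trade-off can be illustrated with a minimal sketch. It is an analogy only: it uses Node's `vm` contexts rather than Nashorn, it does not demonstrate multi-threading, and the names `engineFor`, `runSpell`, and `sharedStore` are hypothetical. Each user gets an isolated global scope, so anything shared has to be provided explicitly by the host.

```javascript
const vm = require('vm');

// Host-side store, standing in for the shared space that the article says
// was reimplemented in Java. Hypothetical name.
const sharedStore = new Map();

// One isolated context ("engine") per user.
const engines = new Map();

function engineFor(userId) {
  if (!engines.has(userId)) {
    // The only thing every engine has in common is this host-provided object.
    const sandbox = {
      shared: {
        get: (key) => sharedStore.get(key),
        set: (key, value) => { sharedStore.set(key, value); },
      },
    };
    engines.set(userId, vm.createContext(sandbox));
  }
  return engines.get(userId);
}

function runSpell(userId, source) {
  // Globals declared by one user's code are invisible to other users.
  return vm.runInContext(source, engineFor(userId));
}

runSpell('alice', 'var kills = 1; shared.set("total", 1);');
runSpell('bob', 'var kills = 5; shared.set("total", shared.get("total") + 5);');
```

Here `kills` is separate for each user, while `total` is visible to both only because the host hands every engine the same store.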

Server Size

For the world record attempt, we made use of a bare metal server in IBM’s Bluemix cloud with 48 physical cores and 256GB of memory, running four 600GB SAS drives in a RAID 10 configuration.

That’s not a server that we can afford to run continuously in production at the moment, so our testing on that server prior to the event was limited.

Dockerization

We use Docker in development and in production. Our system is built using CI and we use a micro-services approach with composition into an appliance.

The Magikcraft system before we optimized it for the world record

There were two issues that we had to address for the world record attempt, both related to logging.

The first was intra-container communication, and the second was getting logs out of the system. We wanted to log every spell cast and every entity death, which led to a lot of traffic between our micro-services and our logging endpoints.

Optimizing intra-container communication

We optimized the intra-container communication of our dockerized microservices using Torusware’s Speedus. Speedus loads pre-emptively under libc and causes your containerized apps to communicate via a socket rather than via the TCP stack. No application code needs to be modified — it’s all transparent.

Speedus transparently transports TCP traffic over sockets

This required a switch from Alpine Linux (which uses musl) to Ubuntu (which uses glibc) for our container OS. It meant that our container images were larger, but our composed applications were an order of magnitude faster at communicating when deployed on the same host.

One of the composed microservices was the Zombie Scoreboard, a retro ’80s arcade-inspired piece built by Tim Marwick. The Zombie Scoreboard receives a real-time feed of spells cast and Zombies killed, and it sustained up to 700 messages per second during the session. In tests before we added Speedus, it would keep updating for seven minutes after a load test ended because of backed-up requests. With Speedus integrated, it updated in real time throughout the two hours of the world record attempt, without skipping a beat.

Logging

We partnered with Brisbane-based logging company Datalust, who produce Seq, a world-class structured logging server.

We sat down with Seq founder Nick Blumhardt and spent time optimizing our logging messages for maximum performance. Batching log messages using a structured logging library like Bunyan is one way to reduce the amount of traffic.
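To make the batching idea concrete, here is a minimal sketch in plain JavaScript. It does not use Bunyan's API, and `createBatcher`, `maxSize`, and `send` are hypothetical names. The point is only that many small events leave the process as a few larger payloads.

```javascript
// Collects events and hands them to `send` in groups of up to `maxSize`.
function createBatcher(maxSize, send) {
  let pending = [];

  function flush() {
    if (pending.length === 0) return;
    const batch = pending;
    pending = [];
    send(batch);
  }

  function add(event) {
    pending.push(event);
    if (pending.length >= maxSize) flush();
  }

  return { add, flush };
}

// Seven events become three payloads: two full batches and one partial.
const sent = [];
const batcher = createBatcher(3, (batch) => sent.push(JSON.stringify(batch)));
for (let i = 1; i <= 7; i++) {
  batcher.add({ event: 'zombieKilled', n: i });
}
batcher.flush(); // push out whatever is left
```

A real logger would probably also flush on a timer, so that a quiet period does not leave events stranded in memory.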

Brisbane-based logging guru Nick Blumhardt

We couldn’t predict the load that our system would experience, since this was literally an unprecedented engineering project. So we took a conservative approach and moved potential bottlenecks and failure points away from the core system. We had observed a single badly-behaved HTTP endpoint cause the Minecraft server to core dump, so we also opted to ship all data out of our system over UDP and use HTTP only from an off-system bridge.

Shipping Log Data out via UDP

At normal Magikcraft volumes of data, for example in a classroom setting, HTTP works fine for shipping log data out of the system. When you increase the volume of data to world-record-setting levels, however, like killing an average of 22 Zombies a second with lightning bolts (with spikes of hundreds), it’s a different story.

As well as using Speedus for intra-container communication, we switched to UDP to ship all logging data out of the system.

Prior to the world record attempt we logged to Slack (we’re into ChatOps). However, the sheer volume of logging made this impractical for two reasons: Slack is not designed for, and doesn’t allow, more than a couple of messages per second; and the overhead of HTTP synchronisation is a cost that we didn’t want our system paying.

So we shipped data out of the Magikcraft system via UDP using the Docker GELF driver, to two separate locations: to an Elasticsearch / Logstash / Kibana (ELK) machine, and to a GELF bridge we built called Gandelf.

Gandelf is designed to consume the UDP output of Docker containers using the GELF logging driver, and proxy it over HTTP to Slack or Seq, and we extended it to support Azure Table Storage Queue.
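As a rough illustration of the first step such a bridge has to perform, the sketch below decodes a single GELF datagram. It is not Gandelf's actual code. `decodeGelfDatagram` is a hypothetical name, the sample payload is made up, and chunked GELF messages are deliberately left unhandled. It assumes the sender compresses with gzip or zlib, or not at all.

```javascript
const zlib = require('zlib');

// Turns one GELF UDP datagram into a plain object.
function decodeGelfDatagram(buf) {
  // Chunked GELF starts with the magic bytes 0x1e 0x0f; reassembly is
  // out of scope for this sketch.
  if (buf.length >= 2 && buf[0] === 0x1e && buf[1] === 0x0f) {
    throw new Error('chunked GELF is not handled in this sketch');
  }

  let json;
  if (buf[0] === 0x1f && buf[1] === 0x8b) {
    json = zlib.gunzipSync(buf);   // gzip magic bytes
  } else if (buf[0] === 0x78) {
    json = zlib.inflateSync(buf);  // typical zlib header byte
  } else {
    json = buf;                    // assume uncompressed JSON
  }
  return JSON.parse(json.toString('utf8'));
}

// Made-up sample payload, compressed the way a sender might do it.
const sample = {
  version: '1.1',
  host: 'minecraft',
  short_message: 'zombie killed',
  level: 6,
  _container_name: 'magikcraft',
};
const datagram = zlib.gzipSync(Buffer.from(JSON.stringify(sample)));
const decoded = decodeGelfDatagram(datagram);
```

In a running bridge, a function like this would sit inside the `message` handler of a `dgram` UDP socket, and the decoded object would then be forwarded over HTTP to the consumer.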

Our initial naive implementation worked OK for classroom-scale, but wasn’t going to be a match for the Zombie Apocalypse

We grappled with integrating Azure Table Storage for record keeping, in parallel with the ELK store and our Seq instance. Azure Table Storage Queues have a rate limit of 2000 msg/minute. We were doing that in 3 seconds under load. We eventually addressed this by down-sampling the data going to Azure to 1 msg/second, passing in a stats summary only.
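For scale: 2000 messages in 3 seconds is roughly 667 messages per second, against a limit of about 33 per second. The down-sampling idea can be sketched as below. This is a simplified illustration, not the code used at the event. `createDownsampler` is a hypothetical name, and timestamps are passed in explicitly so the example runs deterministically.

```javascript
// Counts events per type and emits one summary per `intervalMs` window.
function createDownsampler(intervalMs, emit) {
  let windowStart = null;
  let counts = {};

  function flush() {
    if (windowStart === null) return;
    emit({ windowStart, counts });
    windowStart = null;
    counts = {};
  }

  function record(eventType, nowMs) {
    // Close the current window once it is at least intervalMs old.
    if (windowStart !== null && nowMs - windowStart >= intervalMs) flush();
    if (windowStart === null) windowStart = nowMs;
    counts[eventType] = (counts[eventType] || 0) + 1;
  }

  return { record, flush };
}

// Simulate 2000 events spread over about 3 seconds.
const summaries = [];
const sampler = createDownsampler(1000, (summary) => summaries.push(summary));
for (let i = 0; i < 2000; i++) {
  sampler.record('zombieKilled', Math.floor(i * 1.5));
}
sampler.flush(); // emit the final partial window
```

Two thousand events collapse into three summary messages, which is comfortably inside a 2000 msg/minute limit.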

This design isolates the Minecraft system from HTTP overhead and failing responses

This worked well — up until the moment that it didn’t. It performed well for us under load-testing, but suffered a critical failure during the world record attempt. A bug in the Azure Table Storage Queue Node library killed the entire logging process 60 minutes into the event (we submitted a patch following the world record).

Unfortunately, we were also logging to our Seq instance from there. Here you see our Seq data ingestion flat-lining after that Gandelf instance took one for the team:

On the upside — if that had happened during an HTTP request from inside Minecraft, it’s probable that the server would have thread-locked. So it’s good that we isolated the system from that failure, and just a bummer — and a bad design decision — that Seq was on the same pathway.

We do have discrete data from ELK for the entire session, to the tune of 2 million messages over two hours:

What we learned from that lesson is to keep all pathways isolated. For the future we’ll be developing a UDP fan-out solution, and bridging to HTTP consumers in parallel.
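One possible shape for such a fan-out, as a sketch under stated assumptions rather than a finished design: every message is copied to a list of independent targets, and a failure in one target is recorded without stopping delivery to the rest. `createFanOut` and `udpTarget` are hypothetical names, and the demo uses fake targets so it runs without a network.

```javascript
const dgram = require('dgram');

// Sends each message to every target; one failing target never blocks the others.
function createFanOut(targets) {
  return function fanOut(message) {
    const failures = [];
    for (const target of targets) {
      try {
        target.send(message);
      } catch (err) {
        failures.push({ name: target.name, error: err.message });
      }
    }
    return failures;
  };
}

// A real target might forward over UDP to a bridge running in its own
// process, so a crash in that bridge cannot take the other pathways down.
// Defined here for illustration, not called in the demo below.
function udpTarget(name, host, port) {
  const socket = dgram.createSocket('udp4');
  socket.on('error', () => {}); // fire-and-forget: ignore delivery errors
  return { name, send: (message) => socket.send(message, port, host) };
}

// Demo with fake targets: the middle one fails, the other two still deliver.
const delivered = [];
const fanOut = createFanOut([
  { name: 'elk', send: (m) => delivered.push(['elk', m]) },
  { name: 'azure-bridge', send: () => { throw new Error('bridge down'); } },
  { name: 'seq-bridge', send: (m) => delivered.push(['seq-bridge', m]) },
]);
const failures = fanOut('{"event":"zombieKilled"}');
```

The design choice that matters is that each HTTP consumer sits behind its own bridge, so a bug in one client library can only silence its own pathway.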

Setting a World Record

As of Thursday 16 February, 2017, our World Record submission is pending evaluation from RecordSetter.com. You can view it and leave a comment here. [Update: It’s official! We hold the world record!!]

We’re going back to the drawing board with the lessons we’ve learned, optimizing the weak points in the system, and we’re coming back to break the record that we set.

For us this is Magikcraft’s Bathurst. Every year in Bathurst, Australia’s car manufacturers race highly optimized next-generation versions of their street cars, as a way to push the envelope of performance. The innovations they develop while doing this are rolled back into production cars.

We are doing the same thing. The engineering optimizations we develop while breaking this world record are rolled back into Magikcraft — the world’s most advanced platform for teaching kids to code in JavaScript using Minecraft.

We’ll be at NDC Oslo in June, and also visiting Denmark at that time, and we’ll be looking to set a new world record while in Scandinavia. See you there!