Back in 2012 Stripe launched Capture the Flag 2.0 Web Edition. It was a security challenge that allowed participants to learn about security vulnerabilities in a sandboxed environment. The challenge was divided into 9 levels. Each level was a web application containing a particular security vulnerability that a participant had to successfully exploit in order to gain access to the password for unlocking the next level. The further the participant got the harder the levels became. Once all levels had been completed the participant had successfully captured the flag. The competition was immensely fun, and I learned a great deal during it. The challenge lasted for a week before it concluded, but thankfully Stripe open-source all the levels. A few weeks ago I wanted to run through it again as a refresher, and also allow colleagues who missed it the first time to experience it.
Greg Brockman, Stripes CTO, wrote an excellent article about the system design that provided isolation between participants and supported the security challenge. I am not going to rehash it here, if you are curious I recommend reading the original article. I, however, wanted to try a different approach. An approach that consisted of four main components: Mesos, Marathon, Docker, and HAProxy.
Mesos is a cluster manager that provides resource isolation and sharing across a cluster of machines. It can be thought of as an application scheduler for the data center. My idea was to leverage Mesos for scheduling and managing the levels. The problem is that Mesos provides a fairly low-level interface and does not have an out-of-the-box way of managing long-running applications. Luckily Mesosphere has released a solution in the form of Marathon.
<div class="caption">Mesos uses ZooKeeper for high availability, which means that the system can tolerate both master and individual slave failures</div>
Marathon is a Mesos framework for long-running services, and comes with a great Web UI and a REST interface for launching, scaling, and destroying applications. As soon as a participant needs a new level a request is made to Marathon that then handles the interaction with Mesos and makes sure that the level gets scheduled. Mesosphere has also open-sourced a way for Mesos to launch and interact with Docker.
<div class="caption">Like Mesos, Marathon can use ZooKeeper for high availability</div>
Docker is an open-source project to easily create lightweight, portable, self-sufficient containers from any application. Packaging the levels together with their respective dependencies as Docker containers allows me to isolate participants from each other and makes deployment much easier.
<div class="caption">Examples of how levels are packaged as Docker containers</div>
HAProxy is a robust battle-tested load-balancer/proxy. Using Mesos means that I do not have direct control over which machine is running a particular container so HAProxy is used to redirect the user to the correct container regardless of where it is running.
<div class="caption">A REST application, written in Go, enables HAProxy configuration changes over HTTP</div>
Full System Design
Additionally there is also a participant-facing Clojure web application that keeps track of progress and coordinates with Marathon and HAProxy when levels need to be started and stopped. Lets go through what happens when a user has just signed up and needs to have their first level started.
<div class="caption">Starting a level after signup</div>
The flow is as follows:
- The user sends a POST request to HAProxy.
- HAProxy passes the request on to the Clojure web application. HAProxy is used here to shield the web application from malicious clients and for load balancing purposes.
- The web application creates a new account and instructs Marathon to launch the first level.
- Marathon uses Mesos to allocate sufficient resources to launch the level.
- Mesos will, via the Mesos-Docker executor, instruct Docker to launch the correct container. It will also make sure to open the right ports, as specified in the Docker file, and assign them to random ports on the host system. Finally it feeds information about those ports back to Marathon for service discoverability.
- Docker fetches the container and launches it. Once the container is launched, the Clojure web application will fetch information about hostname and ports for the newly launched level from Marathon.
- The Clojure web application sends a request to a simple web application that presents a REST interface over the HAProxy configuration file.
- The HAProxy REST application creates a custom URL for accessing the newly launched level, which can then be returned to the user.
The process for launching subsequent levels is similar, with the main difference being that the previous level is stopped and its entry removed from HAProxy.
The system design is somewhat unusual, but it has proven to be very stable and allows for a great deal of elasticity. As user demand fluctuates I can easily add and remove Mesos slaves and automatically balance the load without worrying about dependencies or distribution. Routing all requests through HAProxy also means that the user is completely unaware of which machine is running the level.
When I started this project I was unsure of whether it would be feasible to run the security challenge this way, but so far the technologies involved have exceeded all my expectations and I look forward to learning more about them!