How I Built a Planet-Scale Failure

May 3, 2020

A few days ago I built a new website, though calling it such might be a touch too generous. It’s called 500 as a Service, or 500aaS for short. You can visit it at 500asaservice.com. Don’t sue me if you find it disappointing. It’s meant to be a failure, after all.

I figured that, with the wealth of things available to consume as a service nowadays, it was only appropriate to altruistically offer a piece of failure on demand, free of charge, to my fellow humans. I know, I may well have peaked with this idea right here.

Take it for what you may, but I probably got more out of building this website than you, dear reader, after contemplating it in befuddlement. I didn’t set out to build a lazy failure, mind you. I wanted to build a massively scalable one. And this poses a somewhat more interesting challenge. How did I do it?

The ingredients of a scalable failure

The vision for 500aaS is as follows:

Provide a planet-scale, elastic, resilient, secure and low-latency on-demand HTTP 500 service.

Planet-scale, elastic… those couple of buzzword bingo entries hint at a cloud service… AWS maybe? Correct! (it was the elastic part that gave it away, wasn’t it?). In times past, the simplest way to deploy a site like this would’ve been to get hold of a server box somewhere, set up Apache or Nginx on it and configure the web server to always return 500 errors, regardless of the URL pattern it receives. Open this server up to the public Internet via a static IP address and voila: you got yourself a homemade failure.

# Minimal nginx config that will get you 500 responses forever
# /etc/nginx/nginx.conf
events {}

http {
    server {
        location / {
            return 500;
        }
    }
}

But there is a big caveat here. This is just one physical server we’re talking about. What if 500aaS took off big time and people started swarming onto my site, anxiously seeking their daily fix of foobar? The server could become overwhelmed, unable to even muster the processing power to serve a faux HTTP 500, and start returning real ones, if anything at all. You could argue this is technically still OK, as the whole point of 500aaS is to fail, but I’m a bit of a purist, so I couldn’t accept that possibility. The question remains then: how do I deploy a service like this so that it can serve endless botched responses in a controlled manner, to anyone, under any circumstances? By taking it to the cloud, of course!

The easiest, most scalable way to host a site on AWS is to build it on top of their serverless stack: Lambda, DynamoDB… My application doesn’t need any state to remember it should be always serving a 500 response back so all I need is a simple Lambda function to run it. One like this maybe:

exports.handler = async () => {
    const htmlBody = `
<!doctype html>
<html>
    <head>
        <title>500 Internal Server Error</title>
    </head>
    <body>
        <h1>Internal Server Error</h1>
        <p>There was an error processing your request.</p>
    </body>
</html>
    `;
    const response = {
        status: '500',
        statusDescription: 'Internal Server Error',
        headers: {
            vary: [{
                key: 'Vary',
                value: '*',
            }],
            'last-modified': [{
                key: 'Last-Modified',
                value: '2017-01-13',
            }],
            'content-type': [{
               key: 'Content-Type',
               value: 'text/html',
            }],
        },
        body: htmlBody,
    };

    return response;
};

I want to return an error with the minimum amount of complexity and effort possible. Turns out it’s actually pretty hard to cause a Lambda function to truly crash, so I manually craft the 500 status codes instead. Is this cheating? Maybe, but it’s not like a user of this service would care. They just want to see a 500 error page, for God’s sake!

Despite the simplicity of its implementation, this approach would still require me to set up an API Gateway as a frontend, which is good but not very cheap in the long run, and not entirely hassle-free. There is another option: serve the content as close as possible to the location it was requested from, and generate the response directly where it’s served. Does this sound like I’m talking about a CDN? Because that’s exactly what I’m talking about.

If you’ve never come across this approach before, several cloud vendors and CDN providers let you ship your code directly to the servers at their edge points of presence, which means the client-server exchange journey is remarkably shortened. Instead of having the CDN as the middle-man that caches the content served from the actual web servers, the CDN now becomes the server. Wait a minute, aren’t CDNs just dumb caches serving static Internet files all over the planet? Well, not anymore! You can now run arbitrary code in them too which allows them to modify Internet payloads running through them on the fly, as well as generating new content dynamically!
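To make "modifying payloads on the fly" a bit more concrete, here is a minimal sketch. It assumes an edge runtime that exposes the standard Fetch API `Response` class (Node.js 18+ ships it too, which makes it easy to try locally). The `forceFailure` helper is a made-up name for illustration, not part of any vendor's API.

```javascript
// Hypothetical helper: take whatever response came back from the origin and
// turn it into a failure, keeping the original body and headers untouched.
function forceFailure(originResponse) {
  return new Response(originResponse.body, {
    status: 500,
    statusText: 'Internal Server Error',
    headers: originResponse.headers,
  });
}

// Usage: a perfectly healthy response goes in, a 500 comes out.
const healthy = new Response('<h1>All good</h1>', {
  status: 200,
  headers: { 'content-type': 'text/html' },
});
const broken = forceFailure(healthy);
console.log(broken.status, broken.headers.get('content-type'));
```

Only the status line changes here; a real edge function could just as well rewrite headers or the body before passing the response on.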

The first major vendor I know of that started offering this was Cloudflare, with Cloudflare Workers. Workers have evolved a fair bit as a technology since they were first launched a few years ago. You can deploy pretty useful applications straight to their CDN using JavaScript or WASM, which unlocks Rust and even COBOL! The technology that enables this is pretty interesting but beyond the scope of this article. Anyway, getting started with Cloudflare Workers is fairly easy nowadays. Here’s a sample Worker JS script I put together:

addEventListener('fetch', event => {
  event.respondWith(handleRequest(event.request));
});
/**
 * Respond with HTTP 500
 * @param {Request} request
 */
async function handleRequest(request) {
  const htmlBody = `
<html>
    <head>
        <title>500 - Internal Server Error</title>
    </head>
    <body>
        <h1>500</h1>
        <h2>Internal Server Error</h2>
    </body>
</html>
`;
  return new Response(htmlBody, {
    status: 500,
    headers: { 'content-type': 'text/html' },
  });
}

Deploying it was easy too. The problem came shortly after, when I tried to add my new Worker URL to my AWS Route53 DNS records so that I could use the 500asaservice.com domain for it. The bad news is that Cloudflare won’t allow you to do this, unless you pay them a shed load of money. So that was the end of my adventure with Cloudflare Workers. At this point I decided to return to AWS to see what they could do for me. And sure enough, they have something pretty similar to Cloudflare Workers. It’s called Lambda@Edge and it allows you to run Lambda functions within CloudFront itself.

With barely any changes to my original Lambda code, I set up a new CloudFront distribution. The origin for the distribution is inconsequential since every single response will be generated within the Lambda, so I just gave it a made-up one. Then, all I had to do was set up a CloudFront viewer-request event as the trigger for my Lambda and deploy the distribution. Once I got everything working, I encoded the configuration in a serverless.yml so it was easier to change and deploy. And that was pretty much it. I now have a Lambda function which runs atop Amazon’s ubiquitous and nearly infallible CDN. It costs me almost nothing to run it (provided it doesn’t start serving huge amounts of traffic) and requires no maintenance at all. I’m so confident in the performance and uptime (downtime?) of my application that I even published an SLA for it.

There are still a couple of bugs in my application. Excuse me, bugs in an app that was built to fail? Yup, it turns out 500aaS does not always return an HTTP 500 status code. It can still be susceptible to malformed HTTP requests, which will force CloudFront to step in and return an HTTP 400 error instead, bypassing the Lambda altogether (this is why my SLA does not promise 100% downtime). This is something I could perhaps fix by overriding the custom error responses CloudFront returns, but they seem to be set up as a function of the origin response, so I don’t know if they would work with Lambda@Edge. Still a work in progress.

If you’re interested in checking out how 500aaS was built, you can browse the repository on GitHub. Pull requests and suggestions welcome. I even set up a GitHub Actions pipeline to run a test to ensure it always fails. Because I have standards, you know?

Do you have any questions, comments or feedback about this article to share with me or the world?

You can message me on Mastodon. You can also reach out to me in a couple of other ways, if you'd prefer. I would love to hear your thoughts either way!
