For the last few months I have been maintaining fnbr.co, a Fortnite cosmetics directory.
My job has been to keep it online, make sure it responds fast, and update the backend code when needed. Ironically, I have never actually played Fortnite, so I couldn't tell you why or what people go on fnbr.co for, but trust me, they do...
So fnbr.co, or just 'fnbr' if you prefer, has been going more or less since Fortnite Battle Royale launched last year, although I only started working on it a few months ago.
I was brought in to rewrite the site to store cosmetics in a database rather than static HTML files. Almost immediately, the site randomly started getting more and more visitors - it may be SEO related, I am still not quite sure.
The Fortnite item shop changes daily at midnight UTC (or 1am UK Summer Time), which means every night at 1am there is a huge peak in traffic. At first it was small, around 2 or 3 thousand concurrent visitors, and it steadily grew from there.
Everything was going well. We had a simple $5 droplet from DigitalOcean (referral link, gives you $10 free sign-up credit and gives me free credit too) with 1GB of RAM and 1 vCPU. This was enough for a while, until we started wanting to expand the site and add new features.
The fnbr.co Discord server has over 2,000 members and we thought "why not create a Discord bot to post the updated item shop each day?". So I created a bot and, a bit like the site, it started to gain traction quite quickly - it is in nearly 3,000 servers at the time of writing.
Pretty soon after launching the bot, it started to cause issues for the website, so I set about separating it from the website code and moving it to its own VPS to avoid the bottleneck.
This bought a little time, but pretty soon after I split the two, the site had grown to nearly 10,000 concurrent visitors at 'reset' (a.k.a. 1am).
This was the start of my scaling mission...
Running everything through DigitalOcean (referral link, as above) made it easy to scale, split off components, and use the private network DO provides to communicate between all the droplets for free (no public bandwidth usage) and with low latency.
I started by moving the Redis and MongoDB instances to another droplet; they actually don't need many resources and run quite happily on another $5 server.
I doubled the resources of the main web server, giving it 2 vCPUs and 2GB of RAM. Again, this bought time, but you guessed it - the site grew once again! We were now getting over 15,000 concurrent visitors during the peak period!
Things started to get more complicated for me from here. Cloudflare started returning SSL handshake errors, which implied that the lonely web server wasn't able to terminate SSL for all those connections as well as serving up requests from the Node.js application.
One option I tried was a DigitalOcean load balancer, which on the surface had several advantages. The main aim was to offload as much of the CPU-intensive SSL termination from the web server as possible, and this seemed like the perfect solution for that.
I deployed a load balancer and immediately ran into issues - it struggled to cope at reset, which was disappointing as I had high hopes for it.
It had two main issues: it couldn't keep up with the spike in requests per second at reset, and there was no documented limit to tell me what I was actually hitting.
I contacted DO's support but couldn't resolve the issues, and they also couldn't tell me whether there were any concrete limits on the LB that I may have been hitting. After searching on their forums, I found a few users hitting similar bottlenecks with a high number of requests per second.
The frustrating thing throughout the experience was that the peak period was only about 5 minutes at 1am; for the rest of the day everything coped fine and the load balancer was completely redundant, yet it was costing $20 a month.
Unlike most websites that need to scale, this website had a predictable peak so I could prepare for it and didn't need to overprovision.
I needed a cost-efficient and effective way of scaling up far enough for just that period.
By this time I had written a small Node.js script which used cron to create a cluster of "boost" servers through DigitalOcean's API at 12:30am and destroy them at 1:29am, meaning they are active for 59 minutes - just short of an hour, in case of clock drift.
This was great: I could scale up app server capacity quickly, automatically and cheaply - each of those servers costs less than 1 cent per hour.
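In case it helps anyone doing something similar, the script boils down to two scheduled jobs talking to DigitalOcean's API. Here is a stripped-down sketch of the idea (the snapshot ID, droplet names, region, size and tags are placeholder values, and my real script does more error handling):

```javascript
// Stripped-down sketch of the nightly "boost" server scheduler.
// All IDs, names and sizes below are placeholders.
const cron = require('node-cron');
const axios = require('axios');

const DO_API = 'https://api.digitalocean.com/v2';
const HEADERS = { Authorization: `Bearer ${process.env.DO_TOKEN}` };

// 12:30am: create a small cluster of boost droplets from a pre-built snapshot
cron.schedule('30 0 * * *', async () => {
  await axios.post(`${DO_API}/droplets`, {
    names: ['boost-1', 'boost-2', 'boost-3'], // one API call can create several droplets
    region: 'nyc3',
    size: 's-1vcpu-1gb',
    image: 12345678,                          // snapshot ID of a ready-to-go web server
    tags: ['web', 'boost'],                   // 'web' is what the load balancer config looks for
  }, { headers: HEADERS });
});

// 1:29am: destroy everything tagged 'boost' - 59 minutes of uptime, just short of a billable hour
cron.schedule('29 1 * * *', async () => {
  await axios.delete(`${DO_API}/droplets?tag_name=boost`, { headers: HEADERS });
});
```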
However, the bottleneck was still the load balancer, more specifically SSL termination.
I was sure of this for a number of reasons, mainly the fact that New Relic APM showed response times did not increase at all throughout the peak period, despite requests per minute increasing hugely.
The app server itself is very optimised and uses Redis an awful lot, which isn't a problem for data freshness, as the data can stay cached for hours and can easily be purged if anything changes early.
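For anyone wondering what that looks like in practice, it's just a standard cache-aside pattern with a long TTL and an explicit purge. A minimal sketch (the key name, TTL and database helper are made up for illustration, not fnbr.co's actual code):

```javascript
// Minimal cache-aside sketch using the node_redis client.
// Key name, TTL and the database helper are illustrative only.
const redis = require('redis');
const client = redis.createClient();

const CACHE_KEY = 'shop:current';
const TTL_SECONDS = 6 * 60 * 60; // the data can safely live for hours

// Placeholder for the real MongoDB lookup
function fetchShopFromDb(callback) {
  callback(null, { featured: [], daily: [] });
}

function getShop(callback) {
  client.get(CACHE_KEY, (err, cached) => {
    if (!err && cached) return callback(null, JSON.parse(cached)); // cache hit
    fetchShopFromDb((dbErr, shop) => {                             // cache miss: go to the database
      if (dbErr) return callback(dbErr);
      client.setex(CACHE_KEY, TTL_SECONDS, JSON.stringify(shop));  // cache for hours
      callback(null, shop);
    });
  });
}

// If the shop changes early, purge the key and the next request repopulates it
function purgeShop() {
  client.del(CACHE_KEY);
}
```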
DigitalOcean itself doesn't yet offer an autoscaling service, but they do make it very easy to build your own through their API, snapshots and tags.
All I have to do to add a droplet to the load balancer is give it a tag; the health checks built into the load balancer wait for the droplet to come online and then seamlessly start sending traffic to it.
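Tagging a droplet is itself just one API call. As a rough sketch (the tag name and droplet ID are placeholders):

```javascript
// Sketch: attach an existing droplet to the 'web' tag so the load balancer picks it up.
// The tag name and droplet ID are placeholder values.
const axios = require('axios');

async function tagDroplet(dropletId) {
  await axios.post(
    'https://api.digitalocean.com/v2/tags/web/resources',
    { resources: [{ resource_id: String(dropletId), resource_type: 'droplet' }] },
    { headers: { Authorization: `Bearer ${process.env.DO_TOKEN}` } }
  );
}
```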
It really is a shame their load balancer seemingly has bottlenecks, which aren't documented anywhere official but appear quite easy to hit.
Now that we're getting over 20,000 concurrent visitors at 1am and nearly 200,000 unique visitors a day, scaling is even more important and a top priority.
As a result, I have rewritten my scaling script to be more of a management script, including health checks and more comprehensive interaction with DigitalOcean's API.
I decided to spin up a temporary 'High CPU' droplet for the same one-hour period and run a load balancer on that. The health checks in my management script wait for this load balancer to fully come online and then change a Floating IP to point to the new droplet.
Originally my plan was to update Cloudflare DNS, but then I realised Floating IPs are basically designed for this - switching traffic between droplets at a moment's notice.
If I ever need more than one load balancer in the future, I will switch to a round-robin setup via Cloudflare DNS, as Floating IPs are limited to one droplet as far as I know.
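The switch-over itself is simple: poll the new load balancer until it responds, then reassign the Floating IP through the API. A rough sketch of what my management script does (the IP, droplet ID and health-check URL are placeholders):

```javascript
// Sketch: wait for the new load balancer to respond, then move the Floating IP to it.
// The Floating IP, droplet ID and health-check URL are placeholder values.
const axios = require('axios');

const DO_API = 'https://api.digitalocean.com/v2';
const HEADERS = { Authorization: `Bearer ${process.env.DO_TOKEN}` };
const FLOATING_IP = '203.0.113.10';

async function waitUntilHealthy(url, attempts = 30) {
  for (let i = 0; i < attempts; i++) {
    try {
      await axios.get(url, { timeout: 2000 }); // any successful response means the LB is up
      return;
    } catch (err) {
      await new Promise((resolve) => setTimeout(resolve, 5000)); // retry every 5 seconds
    }
  }
  throw new Error('Load balancer never became healthy');
}

async function switchTraffic(newLbDropletId, newLbPublicIp) {
  await waitUntilHealthy(`http://${newLbPublicIp}/`);
  // Reassigning the Floating IP instantly moves all traffic to the new droplet
  await axios.post(`${DO_API}/floating_ips/${FLOATING_IP}/actions`, {
    type: 'assign',
    droplet_id: newLbDropletId,
  }, { headers: HEADERS });
}
```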
After looking at various load balancers, SSL terminators and reverse proxies, I settled on HAProxy (the underlying technology in DigitalOcean's LB). The only downside is that dynamic backends aren't natively supported, so I have had to create my own workaround for that.
The droplet I chose was "c-2", which has 2 dedicated vCPUs and 4GB of RAM. I was shocked to see the difference in performance between "Standard" and "CPU Optimised" droplets - this performance boost does, however, come at a large increase in monthly cost if you run one for the whole month.
Like the web boost servers, this new load balancer is created from a snapshot, meaning deployment is quick and uniform.
To get around HAProxy's lack of dynamic backends, the management script generates a new HAProxy configuration after it creates the new droplets, which is then automatically pulled from the management node when the load balancer starts up.
All droplets are tagged, so only those which are web servers are added to this config file.
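The workaround is nothing clever: the management script just lists every droplet carrying the web server tag and writes out a backend block for HAProxy. A rough sketch of the idea (the tag name, port and backend name are illustrative, not my actual config):

```javascript
// Sketch: build HAProxy backend entries from droplets carrying the 'web' tag.
// Tag name, port and backend name are illustrative only.
const axios = require('axios');

const HEADERS = { Authorization: `Bearer ${process.env.DO_TOKEN}` };

async function buildBackendConfig() {
  // List only the droplets tagged as web servers
  const res = await axios.get(
    'https://api.digitalocean.com/v2/droplets?tag_name=web',
    { headers: HEADERS }
  );

  // One "server" line per droplet, using its private network IP (free, low-latency traffic)
  const servers = res.data.droplets.map((droplet) => {
    const privateNet = droplet.networks.v4.find((net) => net.type === 'private');
    return `    server ${droplet.name} ${privateNet.ip_address}:80 check`;
  });

  return ['backend web_servers', '    balance roundrobin', ...servers].join('\n');
}

// The management script writes this out as part of haproxy.cfg; the load balancer
// pulls the file from the management node when it boots from its snapshot.
```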
During the day when the extra capacity isn't needed, the main web server handles SSL termination and reverse proxies to itself with no issues at all.
And the cost of all this? Less than $15 a month.
Permanent web infrastructure: the main web server and the Redis/MongoDB droplet.
Both of these have 1 vCPU and 1GB of RAM.
Temporary servers (~31 hours a month): the 'boost' web servers and the CPU-optimised load balancer droplet.
Due to the nature of a cloud platform, it is very easy for me to create an even bigger droplet for the peak when that need arises - thanks to my new script, I just change one line of code!
Hopefully my new plan will work well and provide adequate room for growth in the future.
DigitalOcean are planning to introduce per-second billing at some point in 2018. This will make my process even cheaper, as we only need the extra capacity for about 10-15 minutes per day, and I am sure that with even more optimisation of the startup times I could reduce that by a few more minutes.
Of course there are other ancillary costs such as bandwidth, S3 storage and other services like Cloudflare, but for the pure server side of things, this is a really cost-effective solution!
Now that I have resolved the bottleneck at reset, I want to improve performance in regions other than the US, such as Europe and APAC. I plan to do this with Cloudflare Workers and hopefully write another blog post about it!
I hope you found this interesting and if you have any questions or feedback let me know below!
It was a long post and if you're still here, thank you.