How I scaled to 10 million users in a minute without crashing


We usually define our company as a pure tech agency, but most of our customers treat us like IT firefighters: they call us when the s**t hits the fan and the house is burning.

Introduction

Almost a year ago, I got one of these middle-of-the-night phone calls that I'm ambivalent about:

  • I love it because it means a rush, and I'm an intellectual adrenaline junkie

  • I hate it because my wife wants to kill me

Back on track: a company was launching a new project, a fundraising platform for a charity, in a few hours, and they were going to fail because their system was not scalable. Why? Let's see.

The website had 6 pages:

  • Home page, where people land; it also shows how much has been raised so far

  • Donation page, where people can choose how much they want to give

  • Success page, where people are landing after a successful payment

  • Error page, where people are landing after an unsuccessful payment

  • Contact page, I think it's self-explanatory

  • About us page

As you can see, of all these pages, only a few are dynamic:

  • Home page, with the amount raised

  • Success and Error page

The original website

The entire website was based on PHP, with a MySQL database, sessions stored in MySQL, and a lot of content that was dynamic for no reason (all translations, etc.).

Of course, we helped as much as we could by:

  • Moving sessions to Redis

  • Modifying code to make it more efficient

  • Adding a DB cache

  • Spinning up new servers to get a bigger cluster
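
To illustrate the DB cache item, here is a minimal sketch of a time-based cache in front of an expensive query, written in Python for illustration only (the original site was PHP). The names `TTLCache` and `query_total_raised`, the 5-second TTL, and the returned amount are all hypothetical and do not come from the original code base.

```python
import time
from typing import Callable, Optional


class TTLCache:
    """Cache a single expensive value for `ttl_seconds` before recomputing it."""

    def __init__(self, loader: Callable[[], int], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader      # e.g. a function running the SQL query
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so the cache is easy to test
        self._value: Optional[int] = None
        self._expires_at = float("-inf")

    def get(self) -> int:
        now = self._clock()
        if now >= self._expires_at:
            # Stale or never loaded: hit the database once and remember the result.
            self._value = self._loader()
            self._expires_at = now + self._ttl
        return self._value


# Hypothetical stand-in for a query such as "SELECT SUM(amount) FROM donations".
db_hits = 0


def query_total_raised() -> int:
    global db_hits
    db_hits += 1
    return 1_250_000  # made-up amount, for illustration only


total_raised = TTLCache(query_total_raised, ttl_seconds=5.0)
for _ in range(10_000):   # 10,000 page views within the TTL window...
    total_raised.get()
print(db_hits)            # ...reach the database once per window
```

The trade-off is freshness: visitors may see a total that is up to one TTL old, which is usually acceptable for a counter on a home page.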

Unfortunately, even with some big improvements, it still could not handle the load, and it cost a fortune to host!

So this year, they asked us to deliver a brand new solution, and we did!

Kalvad's Power

We had some metrics from the previous launch, so we knew what to expect.

We decided not to follow the same programming pattern at all:

  • We hate PHP, we think it's outdated, and that nobody should use it in 2021

  • We love the planet Earth, so we don't want to spend more energy and money than required

  • We wanted something fast enough to answer in less than 10 ms, even under high traffic

Changing paradigm

Our first idea was to change the paradigm: no more big PHP cluster, welcome Hugo!

Hugo is one of the most popular open-source static site generators. With its amazing speed and flexibility, Hugo makes building websites fun again.

Why? Because we didn't need the big guns:

  • The donation amount could just be a separate API

  • The payment platform that we used (SmartDubai) was amazing, as it is based only on redirects

  • If you want to reduce your attack surface, why keep dynamic content when you can just generate static HTML, CSS and JS, with no server-side code to exploit?
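
To make the "separate API" idea concrete, here is a minimal sketch of a single JSON endpoint that static pages could call to display the amount raised. The real API was written in Elixir with Phoenix; this Python version is only an illustration. The `/api/total` path, the `STATE` dictionary, the `call` helper and the permissive CORS header are assumptions, not details from the original project.

```python
import json
from typing import Callable, Iterable, Tuple

# Hypothetical in-memory total; a real service would read it from its datastore.
STATE = {"total_raised": 0}


def app(environ: dict, start_response: Callable) -> Iterable[bytes]:
    """Minimal WSGI app: the only dynamic piece the static pages need."""
    is_get = environ.get("REQUEST_METHOD") == "GET"
    if is_get and environ.get("PATH_INFO") == "/api/total":
        status = "200 OK"
        payload = {"total_raised": STATE["total_raised"]}
    else:
        status = "404 Not Found"
        payload = {"error": "not found"}
    body = json.dumps(payload).encode("utf-8")
    start_response(status, [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
        # Assumption: the static site is served from a different origin.
        ("Access-Control-Allow-Origin", "*"),
    ])
    return [body]


def call(method: str, path: str) -> Tuple[str, dict]:
    """Invoke the app in-process, without opening a socket."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    environ = {"REQUEST_METHOD": method, "PATH_INFO": path}
    body = b"".join(app(environ, start_response))
    return captured["status"], json.loads(body)


STATE["total_raised"] = 1500
print(call("GET", "/api/total"))
```

To try it in a browser, the same `app` can be served locally with `wsgiref.simple_server.make_server` from the standard library. Everything else on the site stays plain generated files.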

We still needed an API, so we chose an amazing language and framework: Elixir and Phoenix.

Elixir is a dynamic, functional language for building scalable and maintainable applications.
Elixir leverages the Erlang VM, known for running low-latency, distributed, and fault-tolerant systems. Elixir is successfully used in web development, embedded software, data ingestion, and multimedia processing, across a wide range of industries.

Why didn't we choose Go, Rust, or Java?

  • Elixir has a very clear syntax for a functional programming language (Hello Rust)

  • The Erlang VM (BEAM), together with OTP, is, in our opinion, one of the most impressive pieces of software in our industry

  • Phoenix is inspired by Ruby On Rails, so the syntax is very easy and you can go from prototype to production very fast

  • OTP is super reliable, even under heavy load

This schema represents the default configuration, but as we deployed all our applications on Clever Cloud, we had autoscaling in place (we were able to scale up to 40 servers per cluster, each with 16 CPUs and 32 GB of RAM).

We wanted to prove the performance of the system, so we ran a load test with an amazing tool: Locust (an article explaining how we use it is in progress and should be released soon™).

Long story short: we were able to handle 10 million users during our load test, each making one request per second on the homepage, without any downtime.
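
The shape of such a test (many simulated users, each hitting the homepage at a fixed pace) can be sketched with the Python standard library alone. This is not Locust and not the test we ran; it is a small stand-in to show the idea. The function names and parameters are hypothetical.

```python
import time
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def fetch_status(url: str) -> int:
    """Fetch `url` and return the HTTP status code (0 on network failure)."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code
    except OSError:
        return 0


def simulate_user(fetch: Callable[[], int], requests: int, pause: float) -> Dict[int, int]:
    """One simulated user: `requests` calls with `pause` seconds between them."""
    counts: Dict[int, int] = {}
    for i in range(requests):
        status = fetch()
        counts[status] = counts.get(status, 0) + 1
        if i < requests - 1:
            time.sleep(pause)
    return counts


def run_load(fetch: Callable[[], int], users: int, requests_per_user: int,
             pause: float = 1.0) -> Dict[int, int]:
    """Run `users` simulated users concurrently and merge their status counts."""
    totals: Dict[int, int] = {}
    with ThreadPoolExecutor(max_workers=users) as pool:
        futures = [pool.submit(simulate_user, fetch, requests_per_user, pause)
                   for _ in range(users)]
        for future in futures:
            for status, count in future.result().items():
                totals[status] = totals.get(status, 0) + count
    return totals


# Dry run with a fake fetch, so the sketch runs without any network access.
print(run_load(lambda: 200, users=20, requests_per_user=3, pause=0.01))
```

To aim it at a real page, pass something like `lambda: fetch_status("http://localhost:8000/")` as `fetch`. A thread pool on one machine will not get anywhere near millions of users; that is where a dedicated tool such as Locust, spread over several load-generating machines, comes in.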

Real Numbers

Static Website (Hugo)

  • We got 22 million visitors in the first 10 minutes.

  • We got over 85 million unique visitors.

  • We got 7.934 billion requests.

  • The average response time was 4 ms.

  • The cost of the static hosting over 45 days was 22 USD.

API (Elixir)

On the Elixir side, during the entire 45 days of fundraising, we got:

  • Some attacks: yes, some people go after charity websites (no harm done).

  • We got 132 million requests.

  • Only 2 HTTP 500 (detected through Sentry, and fixed in 10 minutes).

  • The average time per request was 243 ms (as most requests had to communicate with the payment gateway).
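
For a sense of scale, the figures above can be turned into per-second rates and unit costs. The short calculation below uses only numbers quoted in this section, with one assumption flagged in the code: that the 7.934 billion static requests cover the same 45 days as the hosting bill.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures quoted in this section.
static_requests = 7.934e9       # requests served by the static site
static_cost_usd = 22            # static hosting cost over the campaign
campaign_days = 45
launch_visitors = 22e6          # visitors in the first 10 minutes
launch_window_s = 10 * 60
api_requests = 132e6            # requests served by the Elixir API
api_http_500 = 2                # HTTP 500 responses on the API

campaign_seconds = campaign_days * SECONDS_PER_DAY

# Assumption: the static request count spans the same 45 days as the cost.
avg_static_rps = static_requests / campaign_seconds
usd_per_million_requests = static_cost_usd / (static_requests / 1e6)

launch_visitors_per_s = launch_visitors / launch_window_s
avg_api_rps = api_requests / campaign_seconds
api_error_rate = api_http_500 / api_requests

print(f"static: {avg_static_rps:,.0f} requests/s on average")
print(f"static: {usd_per_million_requests:.4f} USD per million requests")
print(f"launch: {launch_visitors_per_s:,.0f} new visitors/s in the first 10 minutes")
print(f"API:    {avg_api_rps:,.1f} requests/s on average")
print(f"API:    {api_error_rate:.2e} of requests ended in an HTTP 500")
```

Under that assumption, the static site averaged roughly 2,000 requests per second for about 0.003 USD per million requests, while the launch peak was around 36,700 new visitors per second. Averages hide peaks, so these are orders of magnitude, not capacity figures.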

Conclusion

We could have launched a 200-node Kubernetes cluster to solve the original issue, as most people do these days, but I love Earth, I love efficiency, and I hate fixing my issues.

Furthermore, we, at Kalvad, think that elegance is important, even with code and infrastructure, and our mantra is clear: Excelsior.

If you have a problem, if no one else can help, and if you can find them, maybe you can hire the Kalvad Team.

Source: https://blog.kalvad.com/the-future-of-website-is-static