At one company I work for, we have 3 servers doing only frontend work (serving HTTP requests) and 4 others doing only background jobs. These are very powerful servers with many cores and tens of gigabytes of RAM. The problem with dedicated servers is that the background job servers sit underutilized most of the time, and although our frontends could handle 2-3x the normal traffic, there is never enough spare capacity to absorb bugs in frontend code (which can cause floods of incoming requests) or DoS attacks.
Problem: we can’t do rate limiting. Our application is unusual in that frontend requests are computationally heavy and we do little caching (caching would force a lot of precalculation, which would mean investing in hardware several times over). Since the majority (but not all) of requests are very slow (300-400 ms) and most clients come from the same IP address, we can’t really do rate limiting – or at least not in a simple way. Say we rate limit requests per user: either the limit is too small, and fast requests such as logins or API calls (at times we have more fast requests than slow ones) get delayed and become slow, or it is too large, and the slow requests can easily kill the whole cluster, since we don’t have the capacity to handle more than 2-3x of normal load.
In other words, we don’t want to deploy extra hardware just to have the capacity to handle DoS attacks and nothing more (in our case, that wouldn’t directly improve user experience). But we do want to be able to mitigate such attacks automatically. The striking thing is that we already have hardware sitting idle for most of the day – the background job servers are rarely under high load.
The solution with haproxy
This is where haproxy comes in. We decided to add the extra capacity by spreading the frontend work across all servers instead of just the dedicated frontend ones. We changed nothing on the background job side; we simply launched frontends on every server and started serving requests from all of them. But that wouldn’t be workable without haproxy’s agent-check ability.
On each frontend server, there is a simple, systemd-managed haproxy agent:
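The original script isn’t reproduced here, but a minimal sketch of such an agent in Ruby might look like the following. It assumes the script is launched per-connection (e.g. via a systemd socket unit with `Accept=yes`), so it only has to write one reply to stdout and exit; the core count and the `busy_sidekiq_count` helper are illustrative stand-ins, not the real implementation.

```ruby
# Minimal haproxy agent-check responder (illustrative sketch).
# Intended to be run once per connection, e.g. by a systemd socket
# unit with Accept=yes, so stdin/stdout are the TCP connection.

CORES = 24  # physical core count, hardcoded as in the setup described

# Hypothetical stand-in: a real agent would count busy Sidekiq workers,
# e.g. by inspecting the Sidekiq processes running on this host.
def busy_sidekiq_count
  4
end

# Map busy workers to a weight percentage: 0% once all cores are busy,
# scaled down proportionally below that.
def weight_percent(busy = busy_sidekiq_count)
  return 0 if busy >= CORES
  (CORES - busy) * 100 / CORES
end

# haproxy's agent-check protocol accepts an ASCII percentage ending in
# '%', terminated by a newline; it is applied relative to the server's
# configured weight (hence the 100 baseline mentioned below).
puts "#{weight_percent}%"
```

With a baseline weight of 100 on every server, a reply of `50%` results in an effective weight of 50, and `0%` effectively takes the server out of rotation.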
This script is just an example of what you can do with agent-checks in haproxy. I’ve seen people build haproxy management panels on top of agent-checks, but in our case it was even better to automate everything. The program responds to TCP connections with a percentage by which the backend server’s weight should be scaled (we set the baseline weight of every backend server to 100 instead of the default 1, so that the end result really is a percentage). To calculate that percentage, it looks at the total number of Sidekiq (our background job daemon) processes and the number of busy ones. If the busy count exceeds the number of physical cores in the system (e.g. 24, hardcoded), it removes the server from the pool by reporting a weight of 0%. If fewer than 24 Sidekiqs are busy, it reduces the weight proportionally.
haproxy is configured to round-robin requests, on the assumption that they are all more or less equally expensive (we also assume that IO is not a bottleneck for our application, which is true most of the time), so load is always distributed roughly evenly across the pool. We never send requests to servers that are busy with heavy background job processing, and since some servers run fewer Sidekiq processes than they have CPU cores, we always retain some frontend processing power, regardless of how many Sidekiqs are actually busy. Last, but the best: Sidekiq’s scheduling is a bit opaque to me, and I can see that it sometimes assigns x jobs to worker A, z jobs to worker B, and so on. One irrelevant oddity is that x or z is sometimes >1 even when other workers are idle. The other issue, very relevant here, is that Sidekiq sometimes schedules jobs onto the workers on frontend servers, leaving the background job servers under less load than the frontends – and therefore a better choice for serving frontend requests at that moment. The agent-checks automate all of this, giving us a lot more capacity to handle client-side anomalies on the same hardware.
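To make this concrete, a backend along these lines could be configured as below. The server names, addresses, and agent port are made up; the important parts are the `weight 100` baseline (so the agent’s percentage maps one-to-one to the effective weight) and the `agent-check` options.

```
backend app
    balance roundrobin
    # weight 100 as baseline, so the agent's reply is a true percentage
    server web1 10.0.0.1:3000 weight 100 check agent-check agent-port 9777 agent-inter 2s
    server job1 10.0.0.2:3000 weight 100 check agent-check agent-port 9777 agent-inter 2s
```

Every `agent-inter` interval, haproxy connects to the agent port on each server and adjusts that server’s effective weight from the reply.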