Some talk on cgi.d in benchmarks

Posted 2020-09-21

This week was primarily focused on the newborn baby, but while I could barely type most days, I did get some reading and planning done.

Core D Development Statistics

In the community

Community announcements

See more at the announce forum.

On cgi.d performance

On this HTTP bench thing - https://github.com/tchaloupka/httpbench - you'll find cgi.d listed near the bottom. No surprise there, since my focus is on reliability and ease of use rather than raw speed, but if you look at the details, you'll find it actually does have a consistently strong median performance, at about 0.3ms, even as the others drop off. What slaughters its average is the very poor upper 1% and the number of outright dropped requests at high concurrency.

The dropped requests are somewhat by design - if it is too busy, it doesn't respond, which means TCP takes over with a retry algorithm that actually works pretty well in practice (but that no benchmark really captures), at least if you have a burst of activity that subsides (which is the case 99.99% of the time). And the poor upper 1% is a consequence of this: those represent requests that actually did manage to resend and successfully complete before the benchmark's timeout. (This is why no max on the table is > 10000 - that's the benchmark's timeout. If you waited longer, more would probably eventually come in, but nobody wants to wait that long anyway.)

However, while I stand by the principle behind dropping like this - accepting a request you can't actually handle will break the back-pressure feedback and make your flooding worse - the server itself is obviously not actually overloaded, since you can see the other implementations handling 10x or even 20x the number of requests. So what's going on here?

Well, it is pretty simple actually: cgi.d's process server has a configurable but modest default number of workers, and its thread server has a hard-coded number of worker threads. These workers distribute incoming *connections*... but not *requests*. Thus, an HTTP keep-alive connection will simply put a worker thread to sleep, possibly for several seconds, as the connection remains open yet no request is coming in (the next request might be only a millisecond behind on the network buffer, but that millisecond is time spent sleeping when the worker could have been handling three other requests). Other pending connections may find no workers available at all before they time out, or one may become available at the last possible moment and cause those 99th percentile max spikes to show up on the chart.
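To make that failure mode concrete, here is a minimal sketch of the kind of connection-per-worker loop I'm describing. This is illustrative only - not cgi.d's actual code - and the names are made up:

    // Illustrative sketch - NOT cgi.d's actual implementation.
    // Each worker owns one whole connection at a time.
    import std.socket;

    void workerLoop(Socket listener) {
        while(true) {
            auto connection = listener.accept(); // take one pending connection
            scope(exit) connection.close();

            // HTTP keep-alive: loop over *requests* on this one connection.
            // If the client holds it open but sends nothing yet, receive()
            // blocks and this worker sleeps while other pending connections
            // starve in the accept queue.
            ubyte[4096] buffer;
            while(true) {
                auto got = connection.receive(buffer[]); // may block for seconds
                if(got <= 0)
                    break; // client closed or errored; worker is free again
                // ... parse the request and write the response here ...
            }
        }
    }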

So for the connections and requests at the top of the queue, cgi.d does a surprisingly good job answering them quickly, giving it that very solid median response time. But there are several more it just never gets around to, killing its overall rps at high concurrency.

What do we do about it?

Well, "nothing" is a perfectly acceptable answer, and there's two big reasons:

1) YAGNI. Seriously, even those "poor" numbers of 60,000 requests per second... is a lot. That's almost certainly not going to be your bottleneck in a real application, which needs to do actual work to answer those requests instead of returning a static string. And, let's be honest, do you have enough userbase to generate that much traffic? If so, monetize some small fraction of that and budget one of your developers a couple weeks of work time to implement some optimizations with me.

and 2) this problem also doesn't exist in different production scenarios. Recompile with scgi instead of embedded_httpd, as shown below. None of your code has to change - cgi.d provides a unified API for these models. Then, let the front-end server manage keep-alive connections and just send semi-processed request events back. Heck, you could even have a frontend server load balance over several backend machines if really needed. This also brings likely security benefits and more deployment flexibility. It's an option that should be seriously considered even if the (theoretical; I haven't actually proven it in practice) performance change didn't pan out.
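If you haven't seen it, the switch is just a compile-time version flag; the handler code stays identical. Something like this (see cgi.d's documentation for the full list of supported versions):

    import arsd.cgi;

    void hello(Cgi cgi) {
        cgi.write("Hello, world!");
    }
    mixin GenericMain!hello;

    # embedded http server:
    dmd hello.d cgi.d -version=embedded_httpd
    # exact same code, but speaking SCGI to a frontend like nginx:
    dmd hello.d cgi.d -version=scgi

On the nginx side, its standard scgi_pass directive then points at the backend's port.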

But, OK, these defenses don't *quite* pan out, because the worker monopolization problem can be triggered in less extreme scenarios too. Consider a user slowly uploading a file. Or a browser opening several keep-alive connections for a single user. These things happen, even on lower-traffic sites, and we should handle them well anyway.

That's why I started writing an event loop handler inside the thread handler a pretty long time ago... but I never finished it. Why? Because cgi.d uses immutable class members, and immutable class members must be initialized in the constructor. So it was awkward to try to retrofit an external event pump into it; it'd have to be able to yield from inside a constructor.
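For anyone unfamiliar with the D rule in play here: immutable members may only be assigned inside a constructor, so everything they depend on must be fully available before the constructor returns. A tiny illustration, with hypothetical, simplified names:

    string parseUri(string rawRequest) {
        return rawRequest; // stand-in for a real parser
    }

    class Cgi {
        immutable string requestUri;

        this(string rawRequest) {
            // An immutable member may only be assigned here, so the
            // complete request must be read *before* construction.
            // You can't half-build the object, hop back out to an
            // event loop, and finish the assignments later.
            requestUri = parseUri(rawRequest);
        }
    }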

yield... hmmm... where have I heard that term before? Why yes, if I were to put the constructor inside a fiber, I could make this work! And then I might be able to keep the fiber around for other tasks.
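Roughly, the idea would be to run the construction inside a core.thread Fiber, so a blocking read can Fiber.yield() back to the event pump instead of blocking a thread. A rough sketch, assuming a hypothetical readFullRequest helper:

    import core.thread : Fiber;

    // Hypothetical: in real code this would yield until the event loop
    // has buffered a complete request, then return it.
    string readFullRequest() {
        // while(!requestComplete) Fiber.yield();
        return "GET / HTTP/1.1\r\n\r\n"; // stand-in data for the sketch
    }

    class Cgi {
        immutable string rawRequest;
        this() {
            // Looks like a blocking call, but inside a fiber,
            // readFullRequest can yield rather than block the thread.
            rawRequest = readFullRequest();
        }
    }

    void main() {
        Cgi cgi;
        auto fiber = new Fiber(() { cgi = new Cgi(); });
        fiber.call(); // runs until the constructor finishes or yields
        // the event pump would fiber.call() again as more data arrives
    }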

I've been hesitant to embrace fibers in here before because I don't want to sacrifice my compatibility with existing third-party APIs. It is very nice to be able to just call random blocking functions without needing to have a special library compatible with my custom event loop.

But I don't actually have to sacrifice this! I can keep my spare worker threads and spread the fibers amongst them. In fact, I'm pretty sure I can spread one fiber across whatever random thread happens to be available to pick it up, since the Cgi API enforces some degree of data locality too. (Though I have offered embedded_httpd_threads as an option to loosen that requirement, it is still an explicit opt-in choice, called out as a special circumstance. So I have flexibility to change the implementation.)
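The scheduling idea, sketched with hypothetical names: paused fibers sit in a shared queue, and whichever worker thread frees up next resumes them. D fibers can migrate between threads as long as a fiber only runs on one thread at a time (and doesn't secretly depend on thread-local state) - a caveat I'd have to verify carefully in practice:

    import core.sync.mutex;
    import core.thread : Fiber, Thread;
    import core.time : msecs;

    // Hypothetical scheduler sketch - not cgi.d's actual code.
    __gshared Fiber[] runQueue;
    __gshared Mutex queueLock;

    shared static this() {
        queueLock = new Mutex();
    }

    Fiber takeFiber() {
        synchronized(queueLock) {
            if(runQueue.length == 0)
                return null;
            auto fiber = runQueue[0];
            runQueue = runQueue[1 .. $];
            return fiber;
        }
    }

    void workerThread() {
        while(true) {
            auto fiber = takeFiber();
            if(fiber is null) {
                Thread.sleep(1.msecs); // real code would block on the event loop
                continue;
            }
            fiber.call(); // resumes wherever it yielded, on *this* thread
            synchronized(queueLock) {
                // A finished fiber is dropped; a yielded one goes back in the
                // queue (simplified - really it'd wait until its I/O is ready).
                if(fiber.state != Fiber.State.TERM)
                    runQueue ~= fiber;
            }
        }
    }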

If I do it right, I should be able to handle these higher concurrency upper percentile situations without sacrificing that current strong median performance.

And if I do it wrong... well, I'll comment the code back out and maybe come back to it later.

We might have a new embedded_httpd_hybrid at some point that combines processes, threads, and fibers to get the biggest strengths out of each of them, just like how I have found some success forking off add-on servers for specialized tasks without sacrificing my current strengths. I just need to find the time to experiment with it.

Well, the baby let me actually type for a solid hour! But now she is waking up, so time for me to go again. Don't expect too much exciting code from me in the coming weeks - I gotta catch up on a lot of stuff and she will surely keep me busy for some time still. But fear not, I'll be back.