Thoughts on async io

Posted 2022-04-18

I'll write a few thoughts on async io.

Windows Overlapped I/O

On the forum last week, someone asked about async io in Phobos, and I said the Windows Overlapped I/O functions are actually quite nice, and I'd suggest basing interfaces on them.

Of course, there was also a discussion about whether it should be in Phobos or in dub. Let me just briefly say there's value in interoperability by putting interfaces in the standard library, but it indeed isn't required and comes with possible downsides, especially during the beta phase. But I want to talk more about the Windows api itself than the politics of D development today.

The way the Windows functions work is you open the file with the overlapped flag, then call the ReadFileEx/WriteFileEx (or parallels for sockets) functions to issue your command, giving them a buffer to use. The operating system tells you when it is done. I already like that basic outline - it is pretty simple to use, but we also need to get into the details.

So, how does the operating system tell you when it is done? There are a few options:

  • If you pass it a synchronous buffer, and the data is available immediately (e.g. already pending in a kernel buffer, or already in the RAM cache), it will return it immediately into that buffer. This can actually give best performance, as it gives the data as soon as possible without further followup. The return value of the function tells you if this was the case.
  • You can call GetOverlappedResultEx at any time, which can wait within a specified timeout for it to finish and return the result.
  • If, when calling the work request function, you pass a pointer to a function with some optional user data, the system can queue an asynchronous procedure call to your thread. It runs when you indicate you are ready to process it by calling one of a variety of functions. These include, but are not limited to, SleepEx, which puts your thread to sleep for a time and calls these functions as they come, and MsgWaitForMultipleObjectsEx, a multi-purpose wait for message loops that informs you of the arrival of these (giving you a chance to SleepEx to process them) as well as a variety of other things.

    The reason I like this more than other async i/o things I've seen is that it avoids cross-thread issues. Your same thread is processing the result when you know it isn't busy with anything else, so it is pretty predictable and easier to use without bugs.

    Emulating this on posix could be done with the select family of functions behind the scenes.

    This option is often the easiest to use in the middle of other work.

  • If, when calling the work request function, you pass a handle to an event object, the system will set the event when it is done. Other threads can wait on the event and thus be woken up when the work is ready. Please note that not all functions actually use this, in which case you can pass custom data through that handle member of the struct.

    This is also pretty nice for managing your own worker thread system. The Windows event object is another really nice facility for lightweight sync work. Druntime has a thin wrapper on Windows and emulation of it on posix already too, so that'd be usable out of the box for a hypothetical D api too.

    This option gives a lot of custom flexibility. Of course, a callback function could (and sometimes must) trigger an event, but letting the OS do it directly might be quicker than waiting for the original thread to be ready to call a function when it only has this one job.

  • You can also associate the file object (not an individual operation like with the others) with a system object called an "I/O completion port", which will issue the completion notification to a worker thread pool, which the OS manages automatically to maintain optimal cpu affinity and shared workload.

    If your work can be done by any arbitrary thread, this option gives excellent performance. It is the hardest to emulate, but something akin to a Linux epoll one shot comes close. I think this is what those event driven io libs typically do, but I haven't looked at their sources (but copying the Windows API is a good idea so no surprise if that's what they did!).
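
To make the callback option from the list above concrete, here is a minimal Windows-only sketch using the bindings in druntime's core.sys.windows. The ReadRequest struct, onReadDone and readStartOfFile are names made up for this illustration, not part of any existing library, and the error handling is deliberately thin.

```d
version (Windows):

import core.sys.windows.windows;

/// Hypothetical request record. OVERLAPPED comes first so the pointer the
/// system hands back to the callback can be cast to the whole record.
struct ReadRequest
{
    OVERLAPPED overlapped;
    ubyte[256] buffer;
    DWORD errorCode;
    DWORD bytesRead;
    bool done;
}

/// Completion routine. It runs on the thread that issued the read, but only
/// while that thread is in an alertable wait such as SleepEx.
extern (Windows) void onReadDone(DWORD errorCode, DWORD bytes, OVERLAPPED* o) nothrow @nogc
{
    auto request = cast(ReadRequest*) o;
    request.errorCode = errorCode;
    request.bytesRead = bytes;
    request.done = true;
}

/// Reads the start of a file; returns the bytes read, or null on failure.
ubyte[] readStartOfFile(const(wchar)* path)
{
    // Step 1: open the file with the overlapped flag.
    HANDLE file = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, null,
        OPEN_EXISTING, FILE_FLAG_OVERLAPPED, null);
    if (file == INVALID_HANDLE_VALUE)
        return null;
    scope (exit) CloseHandle(file);

    // Step 2: issue the command, giving it a buffer and a callback.
    // The request must stay alive and untouched until the callback runs.
    auto request = new ReadRequest;
    if (!ReadFileEx(file, request.buffer.ptr, cast(DWORD) request.buffer.length,
            &request.overlapped, &onReadDone))
        return null;

    // Step 3: the thread could do other work here. When it is ready, it
    // enters an alertable wait so the queued callback can run.
    while (!request.done)
        SleepEx(INFINITE, TRUE);

    return request.errorCode == 0 ? request.buffer[0 .. request.bytesRead] : null;
}
```

Calling readStartOfFile("some\\file.txt"w.ptr) from main would read up to 256 bytes. The point of the sketch is that the callback and the code waiting on it are the same thread, so the done flag needs no locking.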

As you can see, there are a few options, which can look complicated at first sight, but each one has its uses in different circumstances. Each is pretty easy to put into a cross-platform api and pretty easy to use, since the threads processing data are all explicit, so no surprises. They are also flexible to integrate into other event loops: you can trigger things on those custom events, or on Windows, the system functions all support the various options anyway. MsgWaitForMultipleObjectsEx is particularly nice for working some async io into an existing gui application.

Worth noting by the way that turning callbacks into things like promises or async/await syntaxes, or into pseudo-synchronous execution in fibers is all pretty trivial. See my conceptual overview of fibers here: http://arsd-official.dpldocs.info/arsd.fibersocket.html#conceptual-overview and notice how if the "on complete" callback is simply fiber.call() and you immediately fiber.yield() after issuing the read/write command, you've achieved the fiber illusion.
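
Here is a small sketch of that fiber trick using core.thread's Fiber. There is no real I/O in it: startFakeRead and processPendingCompletions are hypothetical stand-ins for an async read call and for the "process my completions now" call, and the fake read just fills the buffer with 42.

```d
import core.thread : Fiber;

/// Completions waiting to be delivered, standing in for the OS queue.
void delegate()[] pendingCompletions;

/// Hypothetical async read: accepts the command and returns immediately.
void startFakeRead(ubyte[] buffer, void delegate() onComplete)
{
    pendingCompletions ~= () {
        buffer[] = 42; // pretend the OS filled the buffer
        onComplete();
    };
}

/// Hypothetical stand-in for the call that runs queued callbacks.
void processPendingCompletions()
{
    auto ready = pendingCompletions;
    pendingCompletions = null;
    foreach (completion; ready)
        completion();
}

/// Looks synchronous to its caller, but must be called inside a Fiber.
ubyte[] readInFiber(ubyte[] buffer)
{
    auto self = Fiber.getThis();
    assert(self !is null, "readInFiber must run inside a Fiber");
    startFakeRead(buffer, () { self.call(); }); // "on complete" resumes us
    Fiber.yield();                              // suspend until then
    return buffer;
}

/// Runs the whole thing and returns the order in which things happened.
string[] runFiberDemo()
{
    string[] log;
    auto buffer = new ubyte[](4);
    auto fiber = new Fiber({
        log ~= "fiber: issuing read";
        auto data = readInFiber(buffer);
        assert(data[0] == 42);
        log ~= "fiber: read returned";
    });

    fiber.call();                // runs until the yield inside readInFiber
    log ~= "main: fiber is suspended";
    processPendingCompletions(); // the callback resumes the fiber
    log ~= "main: done";
    return log;
}
```

The code inside the fiber reads like a blocking call, while the thread itself stays free between the yield and the completion callback.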

For this reason, I don't think making the fiber an explicit goal for the basic API is necessary. If you build the pieces correctly, the fiber is a trivially easy add-on, as well as other useful things. If you don't do the pieces right, you'll find trouble later. When in doubt, copy something successful in the real world - even if Windows API isn't ideal, it is well-known through years of experience.

Additionally, please note that Windows' pending async i/o can be cancelled as well. Cancelling async things and dealing with other interruptions is something easy to forget when doing initial designs, but important to have as things mature.

Aside - other apis

I want to briefly describe some other options and why I like the Windows way better.

One option I used once was Boost's async I/O. This spawns a thread to do some work and calls your callback from there. This is relatively hard to use because your data can easily fall to race conditions, and any objects with thread affinity (including D's thread-local-by-default variables) might cause trouble.

But the main alternative is the lower level functions we see on Linux and friends. The key difference is that the Windows way is you ask the OS to do something, it takes care of the queuing and tells you when it is done. The Posix way is it won't accept the command until it is able to complete; it will either block until it can accept it, or you can have the OS tell you when a file descriptor is prepared to accept a command. At that point, you can issue it and it completes immediately.

So the big difference is the Windows way lets the OS manage the command queuing whereas the Posix way needs you to manage it. Since Windows manages more work for you, I say this makes it a bit easier to use. And it is a little more abstracted, so it is easier to emulate on other systems.
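
As a rough illustration of that emulation, here is a Posix-only sketch that puts a completion-style interface on top of readiness-style poll(). The PendingRead struct, queueRead and processPendingReads are hypothetical names for this example, and it ignores errors, hang-ups and writes.

```d
version (Posix):

import core.sys.posix.poll : nfds_t, poll, pollfd, POLLIN;
import core.sys.posix.unistd : close, pipe, read, write;

/// Hypothetical request record: what was asked for and who to tell.
struct PendingRead
{
    int fd;
    ubyte[] buffer;
    void delegate(ubyte[] filled) onComplete;
}

PendingRead[] pendingReads; // thread-local, so callbacks stay on this thread

/// Completion style on the outside: accept the command now, finish it later.
void queueRead(int fd, ubyte[] buffer, void delegate(ubyte[]) onComplete)
{
    pendingReads ~= PendingRead(fd, buffer, onComplete);
}

/// Readiness style on the inside: ask poll() which descriptors can be read
/// without blocking, read them, then call back on this same thread.
/// Returns the number of completions delivered.
int processPendingReads(int timeoutMs)
{
    if (pendingReads.length == 0)
        return 0;

    auto fds = new pollfd[](pendingReads.length);
    foreach (i, ref p; pendingReads)
    {
        fds[i].fd = p.fd;
        fds[i].events = POLLIN;
    }
    if (poll(fds.ptr, cast(nfds_t) fds.length, timeoutMs) <= 0)
        return 0;

    PendingRead[] ready, stillPending;
    foreach (i, p; pendingReads)
    {
        if (fds[i].revents & POLLIN)
            ready ~= p;
        else
            stillPending ~= p;
    }
    pendingReads = stillPending;

    foreach (p; ready)
    {
        auto n = read(p.fd, p.buffer.ptr, p.buffer.length);
        p.onComplete(n > 0 ? p.buffer[0 .. n] : null);
    }
    return cast(int) ready.length;
}

/// Queues a read on a pipe, then writes to it and collects the completion.
string runReadinessDemo()
{
    int[2] ends;
    if (pipe(ends) != 0)
        return null;
    scope (exit) { close(ends[0]); close(ends[1]); }

    string received;
    queueRead(ends[0], new ubyte[](64), (ubyte[] filled) {
        received = cast(string) filled.idup;
    });

    assert(processPendingReads(0) == 0); // nothing written yet, no callback
    string msg = "hello";
    write(ends[1], msg.ptr, msg.length);
    processPendingReads(1000);           // readable now: read, then callback
    return received;
}
```

The queue that Windows keeps inside the OS is the pendingReads array here, which is the extra bookkeeping the Posix model leaves to you.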

(There are some performance advantages to the Windows way too, like merging system calls, but I'm not as concerned about that here, especially since there are comparable alternatives on the other platforms anyway, and besides, the performance difference isn't that great regardless. And the posix way might be able to use stack buffers more easily, but async tends to need longer-lifetime buffers anyway, so meh.)

Hypothetical D wrapper

The best wrappers are often not that much different than the things they wrap. Adding things can be performance problems and removing things can harm flexibility, and besides, being similar to what is already known means it is easier to learn since the existing documentation can be adapted directly.

But, that said, you do want to use your language features to reduce the possibilities of errors using the api, and we'll want some helpers and supporting infrastructure to make it all work, especially given differences on other platforms.

I'd probably suggest making the api something along the lines of:

struct IoBuffer {
	private OVERLAPPED os_data;
	// plus other private data as needed

	// you provide the actual buffer
	this(ubyte[] buffer);

	int errorCode();

	// for reads, this returns the slice of the buffer
	// actually filled by the read.
	ubyte[] usedBuffer();

	// for writes, this would be the part of the buffer
	// that was left over after a partial write
	ubyte[] remainingBuffer();
}

/++
	This is slightly abstracted but also public because you want
	the user to understand how these things work for maximum integration.
+/
alias FileHandle = OsHandle;

/++
	You read and write by passing the buffers you've populated.

	You cannot touch the IoBuffer or anything it points to until
	the delegate is called, except for passing it to the cancel
	function.

	The delegate gets the pointer to the completed buffer, which
	you use to get the actual data read / remaining data to write.

	These forward straight to WriteFileEx and ReadFileEx. The
	delegate's context pointer is laundered through the hEvent
	member in the OVERLAPPED struct, which is a member of IoBuffer.
+/
int read(FileHandle, IoBuffer*, void delegate(IoBuffer*));
int write(FileHandle, IoBuffer*, void delegate(IoBuffer*));

/// Forwards to CancelIoEx
int cancelPendingOperation(FileHandle, IoBuffer*);

/// Forwards to GetOverlappedResultEx
int waitForOperationToComplete(FileHandle, IoBuffer*, Duration timeout);

/// Forwards to SleepEx
int processPendingIoCompletions();

You might also want a sleep function, but you really shouldn't call sleep in an event-driven application anyway; instead you'd want to set a timer event with a function. Windows has functions for this too which queue the async procedure call, just like the I/O notifications, and they can be simulated in other environments.

The tricky thing is integrating with other possible existing event loops. Of course, we could just try to provide a standard loop that works for everyone, but even on Windows, there's a difference between ui message pumps and io heavy worker threads, so it is hard to make one that works for everybody, and besides, we still have outside systems and legacy code to think about.

This is really the hard part. One idea is to have a function along the lines of:

/++
	This function doesn't have an analog on Windows; in fact,
	there, it does nothing since it is already integrated with
	the system event loop.

	But this would provide sufficient info for you to call inside
	another loop to set up the events.
+/
void getIntegrationsWithEventLoop(
	scope void delegate(FileHandle, int flags)
);

Or perhaps there could be an interface that provides all the functions the system calls, and you provide an implementation of the loop itself.

But Windows might provide inspiration for this too. The way those asynchronous procedure calls work is actually fairly similar to Phobos' std.concurrency - it posts messages to the originator thread's mailbox, which are processed when that thread checks its messages. Indeed, we could even actually use std.concurrency itself for one implementation (though I'm meh on that; I'm not a fan of it for various reasons that perhaps I'll write about some other day). And other ones just need to trigger an Event when something arrives.

But Event does have one problem: it works beautifully in Windows integrations, but on Posix, it is implemented with a pthread condition variable which, as far as I know at least, doesn't play well with functions like epoll. On Linux, it would be much easier to use an eventfd, but that is Linux-specific, so I don't know if there are equivalents on other systems. I think we might want to switch Event over to this though if there are, since triggering a single file descriptor really does simplify this kind of integration work - you'd just listen for reads, and when it is triggered, call the process function.

That's how I'd most likely want to do it.
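
A Linux-only sketch of that idea, assuming the eventfd binding in druntime's core.sys.linux.sys.eventfd. FdEvent is a hypothetical type for illustration, not the existing druntime Event, and a real loop would poll this descriptor alongside its sockets and files.

```d
version (linux):

import core.sys.linux.sys.eventfd : eventfd;
import core.sys.posix.poll : poll, pollfd, POLLIN;
import core.sys.posix.unistd : close, read, write;
import core.thread : Thread;

/// Hypothetical event backed by a file descriptor.
struct FdEvent
{
    int fd = -1;

    static FdEvent create()
    {
        return FdEvent(eventfd(0, 0));
    }

    /// Wakes whoever is polling on fd. Can be called from another thread.
    void set()
    {
        ulong one = 1;
        write(fd, &one, one.sizeof);
    }

    /// Consumes the pending wakeups; returns how many set() calls it saw.
    ulong reset()
    {
        ulong count;
        return read(fd, &count, count.sizeof) == count.sizeof ? count : 0;
    }
}

/// A worker thread triggers the event; the "event loop" sees it via poll().
ulong runEventFdDemo()
{
    auto event = FdEvent.create();
    if (event.fd < 0)
        return 0;
    scope (exit) close(event.fd);

    // The worker finishes some job and triggers the event.
    auto worker = new Thread({ event.set(); });
    worker.start();

    // To the loop, the event is just one more descriptor to listen on.
    pollfd pfd;
    pfd.fd = event.fd;
    pfd.events = POLLIN;
    auto ready = poll(&pfd, 1, 5000);
    worker.join();

    // When it is triggered, this is where the process function would run.
    return ready == 1 ? event.reset() : 0;
}
```

Because the wakeup arrives as a readable descriptor, the same poll or epoll call that watches the I/O can watch the event too.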