Thoughts on error handling

Posted 2021-08-16

Walter tweeted that exception handling was a mistake. I don't agree. But I also don't think D's exceptions are as good as they could be.

Core D Development Statistics

7 bugs fixed
12 bugs and enhancement requests opened
19 pull requests merged into the language: 12 into DMD, 5 into Phobos, and 2 into druntime.
4 pull requests merged into the website.

My thoughts on exceptions

There are a few ideas out there for error handling. Here, I want to talk about C-style return values, exceptions, type-system return values (like return nodiscard Result|Error), errors being included in object state, and some kind of passed-in handler.

C-style return values are a pain. They're easy to forget about entirely, hard to attach details to, and just tedious to handle if you do want to. You've gotta remember to propagate everything.
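
To make the "easy to forget" complaint concrete, here's a minimal sketch. parsePort and its error codes are made up for illustration; they aren't from any real library.

```d
import std.stdio;

// Hypothetical C-style API: returns 0 on success or a negative error
// code, and hands the result back through an out parameter.
enum E_EMPTY = -1;
enum E_NOT_A_NUMBER = -2;

int parsePort(string input, out int port) {
	if(input.length == 0)
		return E_EMPTY;
	int value = 0;
	foreach(ch; input) {
		if(ch < '0' || ch > '9')
			return E_NOT_A_NUMBER;
		value = value * 10 + (ch - '0');
	}
	port = value;
	return 0;
}

void main() {
	int port;

	// Nothing stops you from dropping the return value on the floor.
	// port silently stays 0 and the program carries on.
	parsePort("http", port);
	writeln("forgot to check, port = ", port);

	// Doing it properly means a check like this after every single
	// call, and all the caller ever learns is an int.
	int err = parsePort("8080", port);
	if(err != 0) {
		writeln("parse failed with code ", err);
		return;
	}
	writeln("checked, port = ", port);
}
```

The first call compiles without complaint even though the error is ignored, which is exactly the problem.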

Exceptions came around to fix this: they propagate automatically, have extended info right in the error object itself, and you can handle whole blocks at once. I have my problems with D's exceptions too: they encourage the use of strings instead of extended information in the object. I'll talk more about this later.

But anyway, plenty of people complained that because an exception propagates automatically, it is too easy to ignore. Documenters ignore exceptions, leaving them out of the API docs; users ignore them and don't handle cases they should handle. So they went back to the C-style return values, but now got the type system to force you to handle them. This often comes with some syntax sugar to make propagating the error a bit easier, though you still have to decide what to do at each point rather than for whole blocks, so you can't so easily ignore it entirely.
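
Here's roughly what the type-system flavor can look like in D today, as a sketch assuming Phobos' std.sumtype is available. ParseError, ParseResult, parsePort, and describe are hypothetical names for illustration. D has no propagation sugar for this, so passing the error along is done by hand in match.

```d
import std.conv : to;
import std.stdio;
import std.sumtype;

// Hypothetical error type carrying details instead of a bare int.
struct ParseError {
	string reason;
}

// Either an int or a ParseError, never both.
alias ParseResult = SumType!(int, ParseError);

ParseResult parsePort(string input) {
	if(input.length == 0)
		return ParseResult(ParseError("empty input"));
	int value = 0;
	foreach(ch; input) {
		if(ch < '0' || ch > '9')
			return ParseResult(ParseError("not a number: " ~ input));
		value = value * 10 + (ch - '0');
	}
	return ParseResult(value);
}

// The only way to reach the int is through match, and match refuses
// to compile unless every member type has a handler.
string describe(ParseResult r) {
	return r.match!(
		(int port) => "port " ~ port.to!string,
		(ParseError e) => "error: " ~ e.reason
	);
}

void main() {
	writeln(describe(parsePort("8080")));
	writeln(describe(parsePort("http")));
}
```

You can't forget the error case here, but notice the decision still has to be spelled out at every call site.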

Walter argued that poisoned state should indicate errors. The examples he gave are null pointers and NaN floats, and dmd uses __error__ AST nodes. This works reasonably well for those cases, but isn't great elsewhere: it has most of the same downsides as C-style checks, and since it also poisons state, it can make for more action at a distance! Though, that can be beneficial in async situations where the object is passed around instead of a code. I just don't think it would work that well in general outside cases like these; people have trouble with null and NaN too.
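
A tiny sketch of the NaN case shows both the appeal and the action at a distance. parseRatio and scaled are hypothetical helpers, not anything from dmd or Phobos.

```d
import std.math : isNaN;
import std.stdio;

// Hypothetical helper: signals failure by returning NaN, not a code.
double parseRatio(string input) {
	if(input == "half") return 0.5;
	if(input == "quarter") return 0.25;
	return double.nan;
}

// Knows nothing about errors. The poison just flows through the math.
double scaled(double ratio) {
	return ratio * 200 + 1;
}

void main() {
	double good = scaled(parseRatio("half"));
	double bad = scaled(parseRatio("third"));

	assert(good == 101);

	// The failure happened inside parseRatio, but this is the first
	// place anyone notices, and only because we thought to look.
	assert(bad.isNaN);
	writeln("good = ", good, ", bad is NaN = ", bad.isNaN);
}
```

No intermediate code had to propagate anything, which is nice, but the check is now arbitrarily far from the cause and nothing forces you to write it.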

What I really want to talk about here though is recovery-centric errors, where some kind of handler is set to handle various conditions.... or something. The basic idea is you think of various things that can happen and how you want the program to proceed for each of them.

For example, a function to open a file. This can succeed and give you the open file. At this point, you surely want to read or write to it. It might fail because the file doesn't exist, in which case you may either create it or choose a different file. It might fail because the file exists but you don't have permission to access it, which you can handle by choosing a different file or changing your access. Or, perhaps, the file is there and you should be able to read it, but couldn't due to an i/o error or a signal interruption or something like that, and your reply is to again choose a different file, to just try again and wait out the problem, or to forget about the whole thing.

In C-style code, these decisions look like this:

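// D-flavored pseudocode: terminating, return_with_cleanup,
// return_with_error, and the dialog helpers are placeholders.
string filename = "file.txt";
try_again:
// toStringz (from std.string) makes the C string fopen expects
FILE* fp = fopen(filename.toStringz, "rt");
if(fp is null) {
	switch(errno) {
		case EINTR: // interrupted by a signal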
			// if the interruption is because the user
			// decided to cancel, we need to clean up and
			// exit. but if not, just try again to finish
			// the task.
			if(terminating)
				return_with_cleanup;
			else
				goto try_again;
		case EACCES: // access denied
		case ENOENT: // file not found
			final switch(dialog_with_user("What do you want to do to resolve " ~ errno.to!string ~ "?")) {
				case new_filename:
					change_file_dialog();
					goto try_again;
				case change_user:
					change_user_dialog();
					goto try_again;
				case abort:
					// forget about the whole task
					return_with_cleanup;
				case retry:
					// maybe they fixed it externally
					goto try_again;
				case fail:
					// perhaps the rest of the task can be salvaged
					return_with_error;
			}
		default:
			return_with_error;
			// you don't know how to handle it
			// so the decision is moved up a level
	}
} else {
	// success, can now read from fp
}

As you can see, the concept is doable, but quite a pain. Whatever task this was part of needs to handle things similarly at the next level. The blocking calls need to be handled somehow (though I do think making some kind of resumable task object, like a D fiber, can make this kinda nice). And the error details live in a separate errno instead of being enumerated as part of the return value, which adds to the hassle. Ideally the return value would be a limited set of well-defined options, something like an enum with values selected from a general list, so you can both recognize them kinda generically and final switch over all of them. This individual function isn't bad, but if this function is part of a greater whole, you start to feel some pain.
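
The "limited set of well-defined options" idea might look something like this. It's a sketch only: the FileError members, fileErrorFromErrno, and suggestion are names I'm making up here, and the errno constants come from druntime's core.stdc.errno.

```d
import core.stdc.errno : EACCES, EINTR, ENOENT;
import std.stdio;

// Hypothetical closed set of outcomes for "open a file".
enum FileError {
	none,
	notFound,
	accessDenied,
	interrupted,
	other
}

// Squash the open-ended errno space down to the closed set.
FileError fileErrorFromErrno(int code) {
	switch(code) {
		case 0: return FileError.none;
		case ENOENT: return FileError.notFound;
		case EACCES: return FileError.accessDenied;
		case EINTR: return FileError.interrupted;
		default: return FileError.other;
	}
}

// final switch has no default: add a member to FileError and this
// stops compiling until the new case is handled.
string suggestion(FileError error) {
	final switch(error) {
		case FileError.none: return "carry on";
		case FileError.notFound: return "create it or pick another file";
		case FileError.accessDenied: return "pick another file or change user";
		case FileError.interrupted: return "try again";
		case FileError.other: return "give up and report it";
	}
}

void main() {
	writeln(suggestion(fileErrorFromErrno(ENOENT)));
	writeln(suggestion(fileErrorFromErrno(EINTR)));
}
```

The catch-all other member is the price of mapping from errno; a function that returned FileError directly could leave it out.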

The exception version alleviates that pain by batching everything. Your handlers are in catch blocks at the top level of a task you are trying to do. With automatic propagation, this looks fairly ok, but the problem with this concept is it doesn't help much on recovery. You can restart the entire task from scratch or abandon the whole thing, but you can't fix something and retry very well from the top level. Suppose your task opens two files and you get file not found on the second. An exception would close the first file as part of the stack unwinding, then the catch can change the filename... and if it retries, it will re-open the first file to get back there. It can work but it is a bit awkward. Especially for the "interrupted, but not canceled" case, where you just want it to retry the most recent thing!
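
The two-file scenario can be sketched like this. Everything here is a hypothetical stand-in (FileNotFound, FakeFile, openFake, and the file names), with "opening" simulated by a counter so it runs without touching the disk.

```d
import std.stdio;

// Hypothetical exception that carries the filename as data.
class FileNotFound : Exception {
	string filename;
	this(string filename) {
		super("file not found: " ~ filename);
		this.filename = filename;
	}
}

int opens; // how many opens have succeeded so far

struct FakeFile {
	string name;
}

FakeFile openFake(string name) {
	if(name == "missing.txt")
		throw new FileNotFound(name);
	opens++;
	return FakeFile(name);
}

void task(string first, string second) {
	auto a = openFake(first);
	auto b = openFake(second);
	// ... work with both files ...
}

int runWithRetry() {
	opens = 0;
	string second = "missing.txt";
	foreach(attempt; 0 .. 2) {
		try {
			task("first.txt", second);
			break;
		} catch(FileNotFound e) {
			// The stack is already unwound by the time we get here,
			// so all we can do is fix the name and start over.
			second = "fallback.txt";
		}
	}
	return opens;
}

void main() {
	writeln("successful opens: ", runWithRetry());
}
```

The task only needs two files, but runWithRetry counts three successful opens, because first.txt gets opened again on the second attempt.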

So I'd want some way to pass in recovery procedures from the top level of a task to actually run in the middle of a task and possibly resume or try again. Hardware exceptions tend to allow this: a page fault, for example, can load the page and then rerun the instruction. But it is relatively rare in higher level code; resumable handlers were deemed useless in practice in the '80s and the only language I know of that still keeps them around is Common Lisp. I'm not sure I agree, because a lot of these tasks indeed can be resumed, at least if the context hasn't already been destroyed by a stack unwind. Does make me wonder though: what did the C++ designers see that I'm missing?

While that tempts me to abandon this, I'm gonna ignore that exception and just keep going :)

To implement this, delegates are certainly one option. The task could pass one down as part of the formal parameter list to handle each category. I think this would work reasonably well, with the delegate saying what to do via its return value, but this gets tedious again. It needs to be passed all the way down and each use needs to inspect the return and do what needs to be done.

Here's what it might look like:

FILE* workWithFile(string filename, string mode, ResumeCommand delegate(scope string* filename, scope string* mode, FileError error) onError) {
	try_again:
	// toStringz (from std.string) makes the C strings fopen expects
	FILE* fp = fopen(filename.toStringz, mode.toStringz);
	if(fp is null)
		final switch(onError(&filename, &mode, FileErrorFromErrno(errno))) {
			case ResumeCommand.retry:
				goto try_again;
			case ResumeCommand.abort:
				throw new FileException(filename, mode, errno);
			case ResumeCommand.ignore:
				break;
		}

	// do whatever with fp now
	// (it is still null if the handler said to ignore the error)
	return fp;
}

// and the call site

workWithFile("test.txt", "rt", (filename, mode, error) {
	if(error == FileError.FileNotFound) {
		auto result = new_file_dialog();
		if(result is null)
			return ResumeCommand.abort;
		*filename = result;
		return ResumeCommand.retry;
	}

	throw new FileException(....);
});

Or something like that. You can see I'm still using the exception to cancel the task, but most recovery is done without it. The delegate lets you move the decision to a higher level without the control flow actually going anywhere.

BTW notice how there's argument forwarding here. Another wish for my argument tuple from a few weeks ago!

This is still awfully verbose. It is worth noting that you can reuse the error handling delegates, which mitigates it a little. But I think if you wanted to really reduce the verbosity, you'd probably end up going with something like a D fiber, attaching a little error handling state and being able to pause and resume the task. It'd probably still require a mixin in D to do the flow control though. I don't think it will ever look nice without at least some help from the language.
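
For what it's worth, here's a rough sketch of the fiber direction using druntime's core.thread Fiber. Task, drive, ResumeCommand, and the simulated "open" are all hypothetical, and the retry loop is written out by hand rather than hidden behind a mixin.

```d
import core.thread : Fiber;
import std.stdio;

enum ResumeCommand { retry, abort }

// Hypothetical task object: a fiber plus a little error handling state.
class Task : Fiber {
	string filename;       // the handler may edit this while we're paused
	string pendingError;   // set while the task waits for a decision
	ResumeCommand command; // the handler's answer
	int attempts;
	bool succeeded;

	this(string filename) {
		super(&run);
		this.filename = filename;
	}

	private void run() {
		while(true) {
			attempts++;
			// Simulated open: only "missing.txt" fails.
			if(filename != "missing.txt") {
				succeeded = true;
				return;
			}
			pendingError = "file not found: " ~ filename;
			Fiber.yield(); // pause right here, stack intact
			pendingError = null;
			if(command == ResumeCommand.abort)
				return;
		}
	}
}

// Runs the task, consulting the handler every time it pauses.
bool drive(Task task, ResumeCommand delegate(Task) onError) {
	task.call();
	while(task.state != Fiber.State.TERM) {
		task.command = onError(task);
		task.call();
	}
	return task.succeeded;
}

void main() {
	// One handler, reusable across any number of tasks.
	auto pickFallback = delegate ResumeCommand(Task t) {
		t.filename = "fallback.txt";
		return ResumeCommand.retry;
	};
	auto task = new Task("missing.txt");
	writeln(drive(task, pickFallback), " after ", task.attempts, " attempts");
}
```

The handler runs at the top level, but the task picks up from the exact point where it paused instead of starting over. The explicit yield-then-check dance inside run is the part I'd expect to need a mixin or language help to hide.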

Anyway, I'm going to leave it at this half-baked-idea stage for now. It does still concern me that Common Lisp's condition system seems to be the closest this concept has ever actually come to the mainstream. Makes me fairly certain I missed something important as to why this either doesn't actually work or is otherwise useless.

But if D is going to abandon exceptions - which I think is a big mistake - I do think we should step back and try to think about what would ACTUALLY be better instead of just falling back to the past, which, let's remember, people had real reasons to dislike.

Perhaps next week - assuming I have time to write - I will talk about some smaller improvements we can make to D's exceptions as is, then hopefully I can find more time to experiment with this delegate/fiber concept in practice.

D's strength btw has been that it takes a bit of an "all of the above" approach. I think it would be nice to strengthen exceptions. And make the type system handle the return values better. And make poisoned objects easier to do. And maybe look into something new. Even if we did decide something sucks, D still needs to integrate with code and people that do a different style. We'd be fools to throw away our big advantage of flexibility.
