simpleaudio dev work, rasp pi gpio module, static foreach rant, gcc 10's D support upped

Posted 2020-05-11

I didn't write blog text last week, so a lot to talk about this week. Meanwhile, gcc 10's D updates can be read here (though the changelog isn't formally merged, the code is!), including static foreach support: https://gcc.gnu.org/pipermail/gcc-patches/2020-May/545326.html

Core D Development Statistics
In the community

Community announcements

What Adam is working on

simpleaudio.d
gpio.d
game.d and gamehelpers.d
script.d

dmd and static foreach

Conclusion

Another dmd thought
Last thought - string switch

Core D Development Statistics

6 bugs fixed
22 bugs and enhancement requests opened
32 pull requests merged into the language: 22 into DMD, 5 into Phobos, and 5 into druntime.
6 pull requests merged into the website.

In the community

Community announcements

See more at the announce forum.

What Adam is working on

simpleaudio.d

arsd.simpleaudio is one of those modules that I did the bare minimum years ago for a quick task, then just kinda left there. I have always intended to do more with it, but never actually got around to it.

Until now. simpleaudio's design calls for 7 main pieces, each of which is supposed to work on both Linux and Windows (except one):

PCM raw out. This is probably what first comes to mind - playing back digital audio given a waveform. My API there wasn't very good, but it worked.
PCM out helpers. This builds on the raw output to do some common tasks in a "just make it work for me" style. This was pretty massively changed recently.
PCM raw in. This is the building block for recording from a microphone or line in or whatever. I just implemented this recently. With a file header, this can be made into a .wav file.
MIDI raw out. You feed the system or some hardware midi events. I was fairly happy with my implementation already.
MIDI raw in. I just wrote this a couple of days ago. It reads midi messages from a piece of hardware like an electronic piano keyboard. You should be able to record this to a .midi file with some light processing.
MIDI helper, a higher level object that lets you play a midi file in the background. I haven't implemented this yet.
Sound mixer. This is the exception to my cross-platform commitment: all it does is control the volume on my personal computer.

Of those seven goals, only one isn't implemented yet. My pcm out helper still needs more convenience methods (for example, it has playOgg, but I want to add playMp3 as well), and I will probably do the file format support pieces too, but this is almost done! Contrast to a couple weeks ago when I was only at half the goals... and the half that were done just weren't done all that well.

For example, the old high-level interface looked something like this:

auto audio = new AudioPcmOutThread(0);
scope(exit) { audio.stop(); audio.join(); }

And now it is wrapped up in a refcounted struct, so it is just:

auto audio = AudioOutputThread(0);

And I had an ugly bug in there where if the GC triggered while the output thread was active, it was liable to mess up your whole program due to masked signals. That's better now (though runs the risk of GC hiccups in the sound, that hasn't bothered me... yet. When/if it does I'll tweak this again and be more strictly @nogc in all the callbacks.)

Input, on the other hand, doesn't assume a new thread. You can just give it a callback then call .record and it passes you slices into a buffer to process.
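Those slices are raw PCM samples, and as noted in the list above, a file header is what turns them into a .wav file. Here is a minimal sketch of that header, assuming uncompressed integer PCM and the canonical 44-byte RIFF layout. The name wavHeader and its defaults are hypothetical and not part of arsd.simpleaudio.

```d
// Hypothetical helper, not part of arsd.simpleaudio.
// Builds the canonical 44-byte RIFF/WAVE header for uncompressed PCM.
ubyte[] wavHeader(uint dataBytes, uint sampleRate = 44100,
                  ushort channels = 2, ushort bitsPerSample = 16) {
        ubyte[] h;

        // append the ASCII bytes of a four-character chunk tag
        void tag(string s) { h ~= cast(const(ubyte)[]) s; }
        // append the low `count` bytes of v, least significant byte first
        void le(uint v, int count) {
                foreach(i; 0 .. count)
                        h ~= cast(ubyte) (v >> (8 * i));
        }

        // bytes per sample frame (one sample for every channel)
        ushort blockAlign = cast(ushort) (channels * bitsPerSample / 8);

        tag("RIFF"); le(36 + dataBytes, 4); // bytes remaining after this field
        tag("WAVE");
        tag("fmt "); le(16, 4);             // size of the fmt chunk body
        le(1, 2);                           // format 1 = integer PCM
        le(channels, 2);
        le(sampleRate, 4);
        le(sampleRate * blockAlign, 4);     // bytes per second
        le(blockAlign, 2);
        le(bitsPerSample, 2);
        tag("data"); le(dataBytes, 4);      // the samples follow immediately
        return h;
}

void main() {
        import std.stdio : File;
        // one second of stereo silence stands in for recorded slices
        short[] samples = new short[](44100 * 2);
        auto f = File("silence.wav", "wb");
        f.rawWrite(wavHeader(cast(uint) (samples.length * short.sizeof)));
        f.rawWrite(samples);
}
```

The header needs the total data size up front, so a recorder would either buffer everything first or seek back and patch the two size fields when recording stops.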

I also - per my "simple" designation - previously forced a particular format on you, and it would throw if the system couldn't provide it. It now allows you to configure that through the constructor, but otherwise still throws if it can't match, and the device names remain hardcoded (in theory the 0 you pass to the ctor there is to select different devices, but I still haven't implemented that. You can just modify the source if your needs differ right now.)

simpleaudio exists primarily for my game and secondarily for my custom volume control application, but now I also made a little home intercom/baby monitor for my raspberry pi with it - a simple program that on one side, runs record and shoves the data into udp packets, and the other that receives those packets and puts them in the speaker. (it is my LAN so that let me simplify it a lot lol)
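The two halves of such an intercom can be sketched with Phobos' std.socket alone. This is only an illustration under assumptions: roundTrip is a hypothetical name, both ends run in one process over loopback, and the actual record and playback calls are left out.

```d
import std.socket;

// Hypothetical demo: sends one chunk of samples to ourselves over loopback
// UDP and returns what arrived. In a real intercom the two halves would be
// separate programs on separate machines.
short[] roundTrip(const short[] samples) {
        // receiving side: bind to any free port on 127.0.0.1
        auto receiver = new UdpSocket();
        scope(exit) receiver.close();
        receiver.bind(new InternetAddress("127.0.0.1", InternetAddress.PORT_ANY));

        // sending side: what a record callback could do with each slice
        auto sender = new UdpSocket();
        scope(exit) sender.close();
        sender.sendTo(samples, receiver.localAddress);

        // receiving side again: one datagram in, ready to hand to the speaker
        auto buffer = new short[](samples.length);
        Address from;
        auto got = receiver.receiveFrom(buffer, from);
        if(got <= 0)
                return null; // error or empty datagram
        return buffer[0 .. cast(size_t) got / short.sizeof];
}

void main() {
        short[] chunk = [100, -100, 200, -200];
        assert(roundTrip(chunk) == chunk);
}
```

UDP gives no delivery or ordering guarantees, which is usually acceptable for live audio on a quiet LAN: a lost packet is a short glitch rather than a growing delay.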

I'm tempted to add a simplecamera.d and throw in video capabilities too... perhaps...

gpio.d

For my little raspberry pi intercom, I wanted to stick on a LED status indicator and a push button to activate it, so I wired up a very simple circuit to the GPIO pins then set off to process that in D. This led to arsd.gpio.

I considered a few options and almost went with the libgpiod library that is maintained by the kernel developers, but I found the new kernel interface itself easier for me. (That happens a lot - for me to understand a library, I need to know how it is implemented anyway and then I might as well just implement my own version.)

This interface is quite new indeed (the C library abstracts this and the older interface for broader compatibility but since this is primarily for me I can just ignore that), so you might have to update the kernel on your raspberry pi to use it, but as you can see in the source if you look, it is pretty simple. You request pin access or events, the kernel spits back a file descriptor. For pin access, you send it back to the library to ioctl query/command it, and for events, you can do regular fd poll/read/etc operations on it, so that will be easy to integrate with other Linux code.
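The request-then-ioctl flow for pin access looks roughly like the sketch below. It is not the arsd.gpio source. It assumes the character-device ABI from linux/gpio.h as it stood at the time (the structs are transcribed by hand), the generic _IOWR encoding used on ARM and x86, and a chip at /dev/gpiochip0. The names setLine and iowr and the pin number are made up for illustration.

```d
import core.stdc.config : c_ulong;
import core.sys.posix.fcntl : open, O_RDWR;
import core.sys.posix.sys.ioctl : ioctl;
import core.sys.posix.unistd : close;

// Layouts transcribed by hand from <linux/gpio.h>.
struct gpiohandle_request {
        uint[64] lineoffsets;
        uint flags;
        ubyte[64] default_values;
        char[32] consumer_label = 0;
        uint lines;
        int fd; // filled in by the kernel
}
struct gpiohandle_data { ubyte[64] values; }

enum GPIOHANDLE_REQUEST_OUTPUT = 1 << 1;

// Generic Linux _IOWR(): read|write direction, argument size, type, number.
c_ulong iowr(uint type, uint nr, size_t size) {
        return cast(c_ulong) ((3u << 30) | (size << 16) | (type << 8) | nr);
}
enum GPIO_GET_LINEHANDLE_IOCTL = iowr(0xB4, 0x03, gpiohandle_request.sizeof);
enum GPIOHANDLE_SET_LINE_VALUES_IOCTL = iowr(0xB4, 0x09, gpiohandle_data.sizeof);

// Hypothetical helper: drive one line high or low.
void setLine(uint line, bool on) {
        int chip = open("/dev/gpiochip0", O_RDWR);
        if(chip < 0) throw new Exception("cannot open /dev/gpiochip0");
        scope(exit) close(chip);

        // ask for the pin; the kernel hands back a new fd in req.fd
        gpiohandle_request req;
        req.lineoffsets[0] = line;
        req.lines = 1;
        req.flags = GPIOHANDLE_REQUEST_OUTPUT;
        req.consumer_label[0 .. 4] = "demo";
        if(ioctl(chip, GPIO_GET_LINEHANDLE_IOCTL, &req) < 0)
                throw new Exception("line request failed");
        scope(exit) close(req.fd); // closing the fd releases the line again

        // command the pin through the fd we were given
        gpiohandle_data data;
        data.values[0] = on ? 1 : 0;
        if(ioctl(req.fd, GPIOHANDLE_SET_LINE_VALUES_IOCTL, &data) < 0)
                throw new Exception("set value failed");
}

void main() {
        import std.stdio : writeln;
        try setLine(17, true); // 17 is an arbitrary example pin
        catch(Exception e) writeln(e.msg);
}
```

A long-running program would keep the handle fd open for as long as it wants to own the pin, since the line is released as soon as the fd is closed.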

But with the code there, I was able to do the simple task I wanted to do in D. I am tempted to make a little character cell lcd display driver with it too, perhaps I will, but I'll decide later.

Cool fact about gdc on the raspberry pi: I just installed it from the system's package manager. It is an old version, but was good enough for me since the modules I needed to use compiled easily anyway!

game.d and gamehelpers.d

A while ago, I wrote about my arsd.gamehelpers, which used to be a collection of a framework class and some stand-alone algorithm functions. Well, I wanted the algorithm functions without the framework baggage, so I decided to separate those two parts.

Now arsd.gamehelpers has the low-dependency helpers while the high-dependency (it imports arsd.joystick, arsd.simpleaudio, and arsd.simpledisplay!) framework has been moved into a new module, arsd.game. I'm also expanding that out to be easier to use as I discover patterns I like while making my game.

Fun fact btw: my game right now compiles to a 1 MB exe on Windows, with no dll requirements. Still a bit large, but not too bad! Nicer than my older ones that needed a few megs of SDL to run.

script.d

I wrote about this last time, with the new subclassable facility, but I've also fixed a few more bugs in that and added support for default arguments in class methods (but NOT in freestanding things because default args are lost through pointers, so when you do var.foo = &foo;, it is currently impossible in D to get that information out. You have to pass by alias which means using a different mechanism. The class already reflects over an alias so it can do it, but otherwise you assign through the pointer/delegate. I'll probably add a helper thing later.)
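To illustrate the pointer versus alias difference in isolation: a function pointer type such as int function(int, int) has nowhere to record a default, while reflection over the symbol itself can still find it. The sketch below uses std.traits.ParameterDefaults. The names scale and callWithDefault are hypothetical, and this is not how arsd.script does it.

```d
import std.traits : ParameterDefaults;

int scale(int value, int factor = 10) { return value * factor; }

// Hypothetical helper: receives the function by alias, so the symbol and
// its default arguments are still visible to compile-time reflection.
int callWithDefault(alias fn)(int first) {
        return fn(first, ParameterDefaults!fn[1]);
}

void main() {
        // a direct call can use the default
        assert(scale(2) == 20);

        // a pointer of plain function type only knows the signature
        int function(int, int) fp = &scale;
        static assert(!__traits(compiles, fp(2)));
        assert(fp(2, 10) == 20);

        // by alias, the default is recoverable...
        static assert(ParameterDefaults!scale[1] == 10);
        // ...and a parameter without one reports void
        static assert(is(ParameterDefaults!scale[0] == void));
        assert(callWithDefault!scale(2) == 20);
}
```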

arsd.script now also, on master branch, supports const parameters better than before. The way I accomplished this was through a static_foreach expression function:

mixin(static_foreach(fargs.length, 1, -1,
`__traits(getOverloads, obj, memberName)[idx]
(`,``,` < vargs.length ? vargs[`,`].get!(typeof(fargs[`,`])) :
ParamDefault!(__traits(getOverloads, Class, memberName)[idx], `,`)(),`,`)`));

Yikes, I know. But my old approach of foreach(ref param; Parameters!fn) hit several problems: if something was const, I'd have to cast it away on the lhs. Error-prone hassle... and if it is ref moreover, it might be impossible!

All of this just works if you can pass the tuple directly, without the intermediate. Manu and Stefan's static_map... proposal could help with that, but library staticMap can't (regardless of its other problems) since template args cannot process runtime variables... though hmmm... maybe I could static map it to a lambda call. idk, even if it can be done through clever tricks, it is non-trivial however you do it now.

But my static_foreach there, while kinda ugly (and I can probably improve that, this is my first draft), does have performance benefits. I'll go into this more in the dmd and static foreach section up next.

Fun fact: simpleaudio is also scriptable with this module. I'm using this fact to play with making new ad-hoc sound effect functions. Though the playOgg function, being a template to hide the dependency so you don't pay for it if you don't use it, cannot be auto-scripted. I might add some new facility to handle those cases, but I don't know what yet.

dmd and static foreach

static foreach, as implemented today, has quadratic time complexity and eats a lot of compiler RAM. In dmd, it eats time according to (number_of_iterations * foreach_body_size)^2. Yes, a longer body OR more iterations ends up being squared. On ldc (or dmd with codegen disabled, but that's not useful irl), it is just number_of_iterations^2. I didn't try gdc because the version I have doesn't have the feature implemented (though the new gcc 10 version does, I don't have that version installed).

This is remarkable. It is OK for small loops, but it grows very quickly, and combined with the work being done in its body, it can lead to slow builds fast.

And what's crazy about this is it doesn't have to be this way! I compared five programs: one using foreach over a tuple, one using static foreach over a tuple, one over the code written out long-form, one with a naive string mixin, and one with an optimized string mixin. This is all in function scope, and I also tried one in declaration scope (though the tuple foreach is impossible there so it was excluded).

All would have the same effect as this:

void foo() {}

void main() {
        static foreach(i; 0 .. 10000)
                foo();
}

or for the declaration, defining functions foo1 ... foo9999.

The static foreach was the worst of the bunch, taking almost five seconds to compile. Even the naive string mixin was better, but not by much. By contrast, all the others took under one half of one second, a full 10x faster and using 1/10 the memory too! That proved to me it wasn't an intrinsic problem; it wasn't slow because the compiler had to do that much work to compile the code, it was just an implementation issue with static foreach.

And the optimized string mixin using 1/10th the memory of the naive one while being much faster was also an interesting observation. Let me show you the two versions:

// slow impl
string helper2()(string[] t...) {
        import std.conv;
        string code;
        foreach(i; 0 .. 10000)
        foreach(idx, part; t) {
                if(idx)
                        code ~= to!string(i);
                code ~= part;
        }
        return code;
}

5.2s compile, 11,749,440 KB RAM.

// fast impl
string helper(string[] t...) pure {
        assert(__ctfe);
        // first pass: measure, so the buffer can be allocated exactly once
        int tlen;
        foreach(idx, i; t) {
                if(idx)
                        tlen += 5; // room for the 5-character counter
                tlen += i.length;
        }

        char[] a = new char[](tlen * 10000);

        int loc;
        // decimal counter kept as text, padded with spaces on the left
        char[5] stringCounter = "    0";

        foreach(i; 0 .. 10000) {
                foreach(idx, part; t) {
                        if(idx) {
                                a[loc .. loc + stringCounter.length] = stringCounter[];
                                loc += stringCounter.length;
                        }
                        a[loc .. loc + part.length] = part[];
                        loc += part.length;
                }

                // increment the counter in place, starting at the last digit
                auto pos = stringCounter.length;
                while(pos > 0) {
                        pos--;
                        if(stringCounter[pos] == '9') {
                                stringCounter[pos] = '0'; // carry to the left
                        } else if(stringCounter[pos] == ' ') {
                                stringCounter[pos] = '1'; // carry into padding
                                break;
                        } else {
                                stringCounter[pos] ++;
                                break;
                        }
                }
        }

        return a;
}

1.3s compile, 138,376 KB RAM. Barely more RAM and slightly less time [!] (perhaps less I/O time) than the minimum I could achieve (having an external script generate the file I compile): 1.4s compile 127,760 KB RAM. An excellent result.

By pre-allocating the string buffer, the CTFE engine spent 1/4 the time, and ate, yes, I had to double check that too, 1/100th of the RAM of the naive ctfe version. Remarkable.

(I also found some bugs in here, including this fun one: https://issues.dlang.org/show_bug.cgi?id=20811 but that can be fixed without invoking an allocation either.)

static foreach on the same test eats 6.2 seconds and 606,400 KB of RAM. It loses on both metrics to the optimized string mixin.

struct A {
        static foreach(o; 0 .. 10000)
                void array(int[o] arg) { int[o] a = arg;};
	// vs
	/+
        mixin(helper( // or helper2 for the slow version
                `void array(int[`,`] arg) { int[`,`] a = arg;}`
        ));
	+/
}

void main() {
        A a;
}

I did the overloads of static array lengths to avoid having to mix in a name to make a more pure comparison. If you have to mixin a string for the identifier in static foreach, it hurts but not too bad - concats only really get bad when you are doing lots of them, I imagine it is intermediate garbage not being cleaned up.

In a quick test here, it upped the static foreach numbers only to 7.2s and 719,000 KB. So not too bad, most of the cost is indeed static foreach itself.

That tracks with my tests in function scope too, where you can easily just remove the static keyword and iterate over a tuple to get the same result, but in a fraction of the time.

Conclusion

static foreach, as implemented today, is very slow and you should probably avoid it for any large body OR number of iterations. If it is inside a function, try just removing the static keyword - if you are iterating over a tuple, it will probably just work, giving the same result but at improved compile speeds (if it is large, that is. If it is small, it doesn't matter anyway so don't stress that).
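As a small illustration of dropping the keyword, the two functions below loop over the same compile-time tuple and give the same result. This sketch uses a tiny tuple from std.meta, so it shows the equivalence but not the speed difference. The names are hypothetical.

```d
import std.meta : AliasSeq;

alias Types = AliasSeq!(byte, short, int, long);

size_t withStatic() {
        size_t total;
        static foreach(T; Types)
                total += T.sizeof;
        return total;
}

size_t withoutStatic() {
        size_t total;
        // still unrolled at compile time, because Types is a tuple
        foreach(T; Types)
                total += T.sizeof;
        return total;
}

// same answer either way: 1 + 2 + 4 + 8
static assert(withStatic() == withoutStatic());

void main() {
        assert(withoutStatic() == 15);
}
```

One difference to keep in mind: the plain foreach gives each iteration its own scope, so declarations made in the body do not leak out the way they do with static foreach.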

When writing CTFE string generation code, try to precalculate your allocation and do it all at once instead of using the ~ operator. It can reduce uncollected CT garbage significantly. I might put my static_foreach thing, once cleaned up, online so you can copy/paste it too.
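The same measure-then-fill pattern applies to smaller generators too. Below is a minimal sketch with two hypothetical functions, joinNaive and joinPrealloc. They produce identical text, but the second allocates once instead of growing the string piece by piece.

```d
// Naive: every ~= may leave an intermediate copy behind during CTFE.
string joinNaive(const string[] parts, string sep) {
        string s;
        foreach(i, p; parts) {
                if(i)
                        s ~= sep;
                s ~= p;
        }
        return s;
}

// Preallocated: one pass to measure, one allocation, one pass to copy.
string joinPrealloc(const string[] parts, string sep) {
        size_t len;
        foreach(i, p; parts)
                len += p.length + (i ? sep.length : 0);

        auto buf = new char[](len);
        size_t loc;
        foreach(i, p; parts) {
                if(i) {
                        buf[loc .. loc + sep.length] = sep[];
                        loc += sep.length;
                }
                buf[loc .. loc + p.length] = p[];
                loc += p.length;
        }
        return cast(string) buf; // fine here: nothing else references buf
}

// both run at compile time and agree on the generated code
enum decls = joinPrealloc(["int first;", "int second;"], "\n");
static assert(decls == joinNaive(["int first;", "int second;"], "\n"));
mixin(decls);

void main() {
        first = 1;
        second = 2;
        assert(first + second == 3);
}
```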

Lastly, factoring things out of templates can help generate less code too by making their bodies smaller, and may improve reusability of data structures in the compiler while shortening symbols too. It depends on your specifics, but there really is a benefit to small, simple templates in today's D compiler.

Another dmd thought

dmd does a lot of work it doesn't have to do, and this wastes time and can push more unnecessary work on the linker. Cutting this off at the source can sometimes help a lot.

My PR here is one step to reduce some of that junk by skipping it in cases where it is obviously not needed. In most of my projects, it made small to no difference, but in one significant project I work on, it cut 3 GB off the build RAM requirements, leading me to believe there's potential in improving the quality of our current implementation.

I'd like to go even further and free intermediate AST nodes too, but was unable to do that yet without introducing bugs. But that could help even more with CT RAM use - which is my #2 concern (after error message quality) using D day to day right now.

Last thought - string switch

It was brought to my attention recently that string switches are lowered to a list of strings passed to a template and this can lead to a large symbol name. Yikes! Normally, this doesn't really matter, but in some cases it can explode. And since dmd uses symbol names to compare types, these lengths can hit in the compiler too.
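The effect is easy to see with any template that takes strings as arguments. The sketch below uses a hypothetical struct template Cases as a stand-in for the real lowering, whose exact form is not shown here, and compares mangled name lengths.

```d
// Hypothetical stand-in for a lowering that passes every case label
// as a template argument.
struct Cases(labels...) {}

alias Two   = Cases!("alpha", "beta");
alias Three = Cases!("alpha", "beta", "gamma");

// every extra label is encoded into the mangled symbol name
static assert(Three.mangleof.length > Two.mangleof.length);

void main() {
        import std.stdio : writeln;
        writeln(Two.mangleof);
        writeln(Three.mangleof);
}
```

With a handful of short labels this is harmless. A switch with hundreds of long case strings produces a correspondingly long name.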

We've gotta be careful in these lowering implementations not to introduce pathological cases.
