How NO_SCAN makes for shorter GC pauses

Posted 2022-11-07


GC optimization tips in today's implementation

One of the worst-case scenarios for D's GC right now is a big, active heap: the program runs close to the edge of available memory, collects (scanning the whole heap), frees a little bit, allocates, gets back close to the edge, and repeats. You end up spending the majority of your time in the GC without even collecting successfully! You end up scanning all the memory over and over again just to free a little bit each time. Of course, concepts discussed last week, like a generational GC enabled by write barriers, offer a solution to this. Perhaps the GC could also detect that it is not freeing much in a run and defer its next run for a while - this might be easy to work into the current implementation, though hard to get just right.

But today, I don't want to talk too much about new implementations; instead, let's look at what you can do in your own code to avoid these cases in today's implementation.

Strategic GC.disable

Something similar can come up at smaller scales too. Consider a straight-up allocation loop, like while(true) array ~= stuff;. The GC's share of the total runtime will grow and grow, but it will rarely actually free anything (maybe some array fragments as they are copied to new, larger blocks).

So if the GC could detect that it is in this situation and stop trying to collect for a while, it might actually improve overall performance: just give the application all the memory it can for a while before trying to collect again - and with some luck, the burst of work will be done by then and things can be reclaimed. The GC can't detect this right now... but you can.

The programmer can know they're in this situation and call GC.disable for a while themselves: do your pure allocation loop, then call GC.enable again at the end. You can improve the run time, sometimes significantly, just by avoiding scans you know are unnecessary and deferring the work.

Of course, it can be harder to know you are in a pure allocation loop if there are multiple threads running, but you can still consider the option with your knowledge of the program's big picture.
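One wrinkle: if the loop might throw, it is safer to pair the calls with scope(exit) so the GC is guaranteed to come back on. Here's a minimal sketch of the pattern (burstOfWork and its particular allocation are made up for illustration):

import core.memory;

void burstOfWork(ref int[][] results, int count) {
	GC.disable(); // collections during this burst would find little to free anyway
	scope(exit) GC.enable(); // runs even on exception or early return

	foreach(i; 0 .. count)
		results ~= new int[](64); // pure allocation; nothing becomes garbage yet
}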

The test rig

Here's a test program that we'll modify for a few tests today:

class A {
	Object[64] x;
}

void main(string[] args) {
	import core.memory;
	import core.time;
	import std.conv;
	import std.stdio;

	A[] array;

	// this is a pure allocation loop
	MonoTime allocStart = MonoTime.currTime;
	GC.disable();
	array.length = to!int(args[1]);
	foreach(ref a; array)
		a = new A;
	GC.enable();
	writeln("Alloc time: ", MonoTime.currTime - allocStart);

	// and this is a pure GC collection, and with nothing freed,
	// it is essentially measuring the scan time
	MonoTime start = MonoTime.currTime;
	GC.collect();
	writeln("Collect time: ", MonoTime.currTime - start);
}

If we look at the alloc time with 100,000 items, you can cut the time in half (at least I could on my computer) by using the GC.disable/enable block versus commenting it out. Why? Because we stopped doing unnecessary scans. The GC implementation doesn't know that it has no useful work to do, but we do, so we can help it with selective disables.

Scanned vs unscanned heap allocations

Next, let's focus solely on the collect times. I'm going to run the program with various sizes and just show the collect time output here:

$ ./test3 30000
8 ms and 623 μs
$ ./test3 40000
11 ms, 570 μs, and 2 hnsecs
$ ./test3 50000
14 ms, 361 μs, and 3 hnsecs
$ ./test3 60000
17 ms, 368 μs, and 2 hnsecs
$ ./test3 70000
20 ms, 222 μs, and 9 hnsecs
$ ./test3 80000
23 ms, 42 μs, and 2 hnsecs

Notice how it grows linearly with the scanned heap size. This is where much of the worst-case performance comes from: if your heap is actually growing, you scan as it grows, and each scan takes longer, so the total time spent scanning grows quadratically.

But let me show you something very important: edit class A to have long[64] as its member instead of Object[64]. That is:
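class A {
	long[64] x; // same 512-byte payload as before, but no pointers anywhere in the block
}

Run the test again: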

$ ./test3 10000
369 μs
$ ./test3 20000
336 μs and 4 hnsecs
$ ./test3 50000
673 μs and 7 hnsecs
$ ./test3 100000
1 ms, 383 μs, and 4 hnsecs

There's still growth - the array is longer - but it is considerably less pronounced, since the class instances themselves no longer need to be scanned, just the array of references to them. Sure, this is still linear growth, but the constant multiplier is much smaller; it is doing less than 5% of the work of the original example!

Lesson learned: if you want to reduce GC pauses, you want to reduce the scanned heap size specifically (and actually the stack size too, though the stack is generally much smaller than the heap anyway).

You can test the stack size concern with code like this (these declarations go inside main, alongside the heap array):

A[64000] sarr; // fixed-size arrays of references live directly on the stack,
A[64000] sarr2; // so the GC must scan all 128,000 of these slots every collection
import std.conv;
array.length = to!int(args[1]);
foreach(ref a; array)
	a = new A;
foreach(ref a; sarr)
	a = new A;
foreach(ref a; sarr2)
	a = new A;

All that adds about 1 ms to the collect time even with a heap array length as small as 9, roughly doubling it; the stack is conservatively scanned too. (Though the compiler's dead code eliminator has an easier time defeating this benchmark, so be careful drawing conclusions if you see no difference.)

I believe the number of live individual objects also has an impact, but as far as I know, the raw amount of memory scanned matters much more. Check this out, going back to the original test, but with long[640] instead of Object[64]:

$ ./test3 1000000
16 ms, 545 μs, and 5 hnsecs

That's 1 million class objects in an array being scanned, each one being about 5 kilobytes (640 longs at 8 bytes each), so over a gigabyte of memory used. You might notice that 16 ms is a full frame in game terms. Each object there eats a good chunk of memory, but if you change it to Object[640] instead, you end up with a pause of over a second!

Why? Because it now has to scan the contents of each A too, so it is a lot more work.

So this is why image and video assets don't really matter. Sure, holding arrays of a million references to them will add time, but the size of the assets themselves isn't important. They don't contain pointers, so they don't add significant work to the GC. It knows they don't have to be scanned, so it doesn't look at them.
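For example, a hypothetical asset type might keep its bulk data in a plain byte array:

class Image {
	int width, height;
	ubyte[] pixels; // ubytes aren't pointers, so this block is never scanned
}

Image loadImage(int width, int height) {
	auto img = new Image; // this small object is scanned (it holds a slice)
	img.width = width;
	img.height = height;
	// the big allocation contains no pointers, so the GC skips its contents
	img.pixels = new ubyte[](width * height * 4);
	return img;
}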

Now lemme show you something:

class A {
	//Object y;
	long[640] x;
}

$ ./test3 1000000
14 ms, 992 μs, and 2 hnsecs

Uncomment Object y and run it again:

$ ./test3 1000000
1 sec, 608 ms, 980 μs, and 6 hnsecs

Wow, the time exploded. Why does this happen? When you allocate memory from the GC, the implementation sets some flags on the block. One of these is NO_SCAN, which is set if the type contains zero members that could point back into GC memory. new ubyte[](n) gets NO_SCAN since ubytes aren't pointers. The original class A contains only a pointer to its typeinfo, which isn't in GC memory, so the compiler passes NO_SCAN. But once you add Object y; to the type, there is a potential pointer in there, so NO_SCAN is defeated.

And since these flags apply to the whole memory block, a pointer anywhere is a scan everywhere.
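For instance, in a hypothetical layout like this, a single reference poisons the whole allocation:

struct Mixed {
	ubyte[4096] bytes; // plain data, harmless on its own
	Object link; // one potential pointer, so every new Mixed is scanned in full
}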

Please note that you never have to say the words NO_SCAN in your code (well, unless you are calling GC.malloc directly); it is figured out automatically from the type you are allocating with new.
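If you are calling GC.malloc directly, you pass the flag yourself. A minimal sketch:

import core.memory;

void manualBuffer() {
	// we promise this block holds no pointers into GC memory,
	// so collections will never scan its contents
	void* buffer = GC.malloc(4096, GC.BlkAttr.NO_SCAN);
}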

And this is important: the precise collector option doesn't actually make much of a difference on this point! You can test it yourself by turning it on with the --DRT-gcopt switch all D programs have (at least those that use druntime):

$ ./test3 --DRT-gcopt=gc:precise 1000000
1 sec, 516 ms, 797 μs, and 4 hnsecs
$ ./test3 --DRT-gcopt=gc:conservative 1000000
1 sec, 600 ms, 766 μs, and 9 hnsecs

Why isn't there much of a difference? The precise scan is still a scan. The GC knows part of the object has to be scanned, so it can't mark the whole block NO_SCAN. Precise scanning just changes how the scan is done, not whether it is done at all. A precise scan will skip over parts of the memory block as it goes, but since that access pattern just isn't fast, the gains are small, if they exist at all. Indeed, it is possible for a precise scan to be slower than a conservative scan.

Precise scanning is more about eliminating false pointers - ints that just happen to look like they point into a block - than it is about making the scan faster.

Conclusion

By far, the biggest win you can get from tweaking for D's GC is making your bulk allocations completely pointer-free, since then they are entirely removed from the scan job, saving potentially significant amounts of time. Sometimes this is very easy to do, by just separating the asset memory out from the rest of the object.

A smaller benefit that sometimes helps is disabling the GC when you know it will not be able to collect much, then re-enabling it when you're done with that burst of work.

Always weigh the complication of the code against your performance requirements.

Homework

As a take-home exercise, consider this: some game programmers talk about arrays of structs versus structs of arrays. With what I just said about NO_SCAN blocks, which do you hypothesize might give better performance with the D GC? How will you test that? Were you right? (Bonus question: is the code still easy to work with? Part of the benefit of a GC, of course, is keeping your code a bit simpler, so you always want to weigh the hassle of manual tweaks against their benefit.)
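If those terms are new to you, here is a minimal sketch of the two layouts (the Particle fields are made up for illustration):

// array of structs: each element mixes a reference with plain data,
// so the one big block holding them all must be scanned
struct Particle {
	Object owner;
	float x, y;
}

// struct of arrays: references and plain data live in separate blocks;
// think about which of those blocks can get NO_SCAN
struct Particles {
	Object[] owners;
	float[] xs;
	float[] ys;
}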

And finally, in the same vein, you might see me pushing back on people trying to reduce template instances or GC allocations, reminding them to confirm the actual goal. Can you think of a case where multiple GC allocations might actually work better overall than reducing them to just one? Test it. Were you right? What if you made it zero GC allocations, everything manually managed? Is the performance benefit worth the code complication?