Benchmarking My Game Engine • Ritchie

It's something I wish I started doing earlier... because profilers drive me crazy.

We’ve all been there, right? You’re writing an app, it feels sluggish, something feels wrong. It’s not a bug, the app works fine, it’s just…there’s no reason typing in a text field should be this laggy on this hardware. You know what I’m talking about, right?

My game UIs feel that way a lot. So does my engine’s editor. I’ve always wanted to figure out why, and fucking fix it. I never could.

My Specs

I’m not just posting my PC specs to brag. Hardware is just as important to understanding what’s going on with slow code as it is understanding what the code is doing. A slow memory leak is going to take longer on a supercomputer with over 1TB of RAM than it would take to cause problems on your phone. CPU and GPU usage is also going to be less of an issue on a game developer’s workstation than it might be on your garden variety cash register.

With that in mind, this PC ain’t slow.

Hardware	What I have
CPU	AMD Ryzen 7 3700X
GPU	AMD Radeon RX 7600XT
Memory	64 GB available
Disks	2x 1TB NVMe SSDs, 2x 4TB spinning disks

I’m not on the bleeding edge because I don’t need to be. But this isn’t a slow computer either. It handles any 1080p game I throw at it, and I’m too blind for 4K, HDR or raytracing to matter. I also don’t need to take advantage of AI upscaling because my eyes do a better job of screwing up text readability. Point being, if something is slow, I’m either compiling Plasma or something in the app is poorly optimized.

Having an extremely fast computer is a double-edged sword however. If something feels slow on my PC, it’s slow. But I may also do something that feels perfectly fast on my PC but kills your laptop in minutes.

And I can’t debug what I can’t see.

Profiling with dotTrace

This has been generally unsuccessful for me. I would typically build my engine’s editor for release, get it running on the profiler, and try to do things in the UI that feel slow. Things would either stop feeling slow under the profiler or… I just couldn’t figure out what the problem really was.

That’s not a dock against JetBrains’ tools. I just struggle to navigate them. They don’t display things in a way I know how to read.

Sometimes, I would get lucky and find out that something in my code genuinely is problematic. I’d rewrite it, and be worse off - or not feel a difference.

What I Care About

My engine, Inertia, is written in C#. I will die on that hill. I love C# as a language. I find it fun to write, easy to read, and I do not feel limited by it. I can make it do scary things.

The problem with C# and performance is how the language used to be used. C# started with a lot of inspiration from Java and object-oriented programming. This means that the runtime has a garbage collector because that’s what we did back then. Most objects are stored on the heap because that’s how we did it back then. Newer versions of the language, and of the runtime, allow you to do things differently.

GC Hits

The number one concern I have when tracking down slow code in my engine is GC hits. If you’ve played a Unity game and ever felt it hitch every now and then, every few seconds/minutes, GC hits are very likely what you are feeling.

The runtime knows you have a finite amount of RAM and the garbage collector makes sure things get cleaned up. Every now and then, the GC will decide to clean up any unused memory in the application, freeing space for new heap allocations. Your code cannot execute while the GC is collecting. No matter how fast your code is in other languages, if you trigger GC, it’s moot.

You can’t stop GC hits from happening outright. The only thing you can do is allocate less memory on the heap, and keep references active until you’re genuinely done with them. Keep the memory you need but don’t allocate more than you do.

Frame Time

I find most people who talk about C# are thinking about its use on the web with ASP.NET. If that’s all you’ve ever done, let me introduce you to something really annoying. Frame time is how much time the UI thread in an application spends working on a single frame in one window. We generally want it to be as little time as possible. In some cases, high (or unstable) frame time can make someone feel ill. As someone who cares about accessibility, I do not want that happening.

Knowing frame time is helpful for understanding that there’s a problem. But it’s not very helpful to measure it if your brain is already telling you it’s too high. What’s more important is understanding what is going on in that frame and how long it takes to process each thing going on. If you’re rendering lots of text and the engine feels sluggish when there’s lots of text, you probably want to know how long it takes to render text.

I have a display with 120hz refresh rate. I would prefer my games be able to build frames at that refresh rate. Most of the time I run the display at 60hz, which means I’m willing to accept 60 frames per second. Anything below that in my own game and we’re in trouble.

My engine, especially in editor mode, supports multiple windows. For best results, each window should be able to fully update and repaint in less than 16.7ms. That’s one second divided over 60 frames. This is how much frame time the engine has to work with.

Runtime Performance

This is why I was having trouble with dotTrace. It doesn’t matter how fast my code is, if I’m using the runtime APIs incorrectly or they’re just slow.

Maybe a dictionary isn’t the best way to track inventory? I should probably measure.

It Started With This Video

I’ve been listening to Nick Chapsas videos while trying to sleep.

In this video, he uses BenchmarkDotNet to compare different ways to iterate a collection in .NET. This is where I decided to start in my engine, because I have an entity component system. You will typically be iterating through an ECS every frame, making collection iteration operations a hot path. Prime real estate for performance gains.

Ways to Iterate Lists in C#

Benchmarking is about comparison. If I want to know how fast my code is, I need a frame of reference to measure it against. And there are quite a few ways to iterate collections in C#.

With foreach loops
With for and while loops
With LINQ
In parallel
With a Span<T>
With a ReadOnlySpan<T>
With a raw pointer
With fancy unsafe runtime methods

To make matters worse, there are different kinds of collections in C#. Arrays, lists, hash sets, dictionaries, custom collection types… even a Span can be thought of as a collection if it represents a stack-allocated buffer.

So I Wrote A Tool

I wrote a tool for my engine that benchmarks it using BenchmarkDotNet and a series of common tasks my code might perform. I started with collection iterations.

I decided to test with both a standard int[] array and a List<int>. This allows me to compare things like foreach on a list vs. foreach on an array. Each collection is filled with random data, and each test sums up the value of each element in the collection. I also tested with collections with varying sizes, starting with powers of two and then a torture test of 100 thousand items. Gradually, the test would grow the collections larger in size to see how bigger collections affect the measurements.

After my tool runs the benchmark, it prints it out to a fancy-ish Markdown table.

And here’s a report.

Method	ElementCount	Mean	Error	StdDev	Median	Allocated
ListForEachMethod	4	758.40 ns	17.448 ns	35.247 ns	760.00 ns	88 B
ListEnumeration	4	258.70 ns	10.686 ns	30.489 ns	260.00 ns	-
ListIndexing	4	144.28 ns	7.098 ns	20.252 ns	140.00 ns	-
ArrayEnumeration	4	95.76 ns	5.509 ns	15.717 ns	94.50 ns	-
ArrayIndexing	4	57.32 ns	4.529 ns	13.283 ns	55.00 ns	-
ArraySpanIndexing	4	85.70 ns	5.092 ns	13.854 ns	80.00 ns	-
ListSpanIndexing	4	110.29 ns	6.202 ns	17.186 ns	105.00 ns	-
ArraySpanEnumeration	4	136.65 ns	7.926 ns	23.120 ns	130.00 ns	-
ListSpanEnumeration	4	157.24 ns	6.188 ns	18.050 ns	160.00 ns	-
UnsafeSpanIterationOverArray	4	148.07 ns	9.543 ns	27.226 ns	149.00 ns	-
UnsafeSpanIterationOverList	4	140.42 ns	6.304 ns	18.189 ns	140.00 ns	-
ListForEachMethod	64	1,095.43 ns	79.855 ns	234.202 ns	1,025.00 ns	88 B
ListEnumeration	64	632.28 ns	14.883 ns	15.925 ns	629.50 ns	-
ListIndexing	64	504.56 ns	11.184 ns	10.985 ns	501.00 ns	-
ArrayEnumeration	64	381.32 ns	9.158 ns	15.798 ns	380.00 ns	-
ArrayIndexing	64	419.53 ns	10.819 ns	18.948 ns	419.50 ns	-
ArraySpanIndexing	64	469.56 ns	11.462 ns	17.845 ns	470.50 ns	-
ListSpanIndexing	64	460.67 ns	11.385 ns	21.661 ns	460.00 ns	-
ArraySpanEnumeration	64	447.27 ns	10.258 ns	9.595 ns	440.00 ns	-
ListSpanEnumeration	64	450.79 ns	10.940 ns	15.690 ns	445.00 ns	-
UnsafeSpanIterationOverArray	64	387.31 ns	9.792 ns	14.353 ns	385.00 ns	-
UnsafeSpanIterationOverList	64	422.63 ns	9.875 ns	10.976 ns	420.00 ns	-
ListForEachMethod	256	1,431.36 ns	30.789 ns	57.829 ns	1,425.00 ns	88 B
ListEnumeration	256	1,774.31 ns	28.270 ns	27.765 ns	1,780.00 ns	-
ListIndexing	256	1,616.08 ns	15.308 ns	12.783 ns	1,620.00 ns	-
ArrayEnumeration	256	1,383.40 ns	22.728 ns	21.260 ns	1,380.00 ns	-
ArrayIndexing	256	1,418.36 ns	18.865 ns	16.723 ns	1,415.50 ns	-
ArraySpanIndexing	256	1,426.33 ns	16.600 ns	15.527 ns	1,425.00 ns	-
ListSpanIndexing	256	1,454.45 ns	31.634 ns	36.429 ns	1,445.00 ns	-
ArraySpanEnumeration	256	1,430.43 ns	13.062 ns	11.579 ns	1,430.50 ns	-
ListSpanEnumeration	256	1,450.00 ns	20.753 ns	18.397 ns	1,445.00 ns	-
UnsafeSpanIterationOverArray	256	1,313.85 ns	17.989 ns	15.021 ns	1,310.00 ns	-
UnsafeSpanIterationOverList	256	1,355.33 ns	25.191 ns	23.563 ns	1,360.00 ns	-
ListForEachMethod	4096	9,856.92 ns	134.013 ns	111.907 ns	9,840.00 ns	88 B
ListEnumeration	4096	12,783.20 ns	272.413 ns	754.855 ns	12,610.00 ns	-
ListIndexing	4096	12,878.00 ns	191.451 ns	169.716 ns	12,885.00 ns	-
ArrayEnumeration	4096	11,104.36 ns	223.734 ns	198.334 ns	11,070.50 ns	-
ArrayIndexing	4096	9,212.45 ns	209.589 ns	566.637 ns	9,050.00 ns	-
ArraySpanIndexing	4096	10,224.40 ns	277.808 ns	792.602 ns	9,910.00 ns	-
ListSpanIndexing	4096	10,089.07 ns	339.951 ns	991.652 ns	9,711.00 ns	-
ArraySpanEnumeration	4096	12,085.38 ns	614.500 ns	1,792.526 ns	12,255.00 ns	-
ListSpanEnumeration	4096	10,211.55 ns	247.290 ns	705.531 ns	10,089.50 ns	-
UnsafeSpanIterationOverArray	4096	8,239.89 ns	164.241 ns	426.884 ns	8,230.00 ns	-
UnsafeSpanIterationOverList	4096	8,472.12 ns	256.992 ns	733.214 ns	8,285.50 ns	-
ListForEachMethod	16384	35,095.62 ns	376.907 ns	314.734 ns	35,091.00 ns	88 B
ListEnumeration	16384	23,245.47 ns	215.295 ns	201.387 ns	23,180.00 ns	-
ListIndexing	16384	19,277.77 ns	224.528 ns	187.491 ns	19,370.00 ns	-
ArrayEnumeration	16384	15,680.08 ns	183.495 ns	153.227 ns	15,650.00 ns	-
ArrayIndexing	16384	14,256.98 ns	276.218 ns	538.742 ns	14,180.00 ns	-
ArraySpanIndexing	16384	15,741.64 ns	232.688 ns	206.272 ns	15,730.00 ns	-
ListSpanIndexing	16384	15,957.92 ns	210.260 ns	175.577 ns	15,931.00 ns	-
ArraySpanEnumeration	16384	14,579.69 ns	465.352 ns	1,327.675 ns	14,145.00 ns	-
ListSpanEnumeration	16384	14,329.85 ns	248.622 ns	513.447 ns	14,310.50 ns	-
UnsafeSpanIterationOverArray	16384	13,323.21 ns	208.472 ns	184.805 ns	13,304.50 ns	-
UnsafeSpanIterationOverList	16384	11,643.54 ns	218.602 ns	532.106 ns	11,630.00 ns	-
ListForEachMethod	65536	136,913.00 ns	464.877 ns	362.946 ns	136,820.50 ns	88 B
ListEnumeration	65536	59,566.67 ns	638.983 ns	597.705 ns	59,650.00 ns	-
ListIndexing	65536	43,622.77 ns	775.087 ns	647.233 ns	43,790.00 ns	-
ArrayEnumeration	65536	33,895.24 ns	665.930 ns	888.998 ns	33,495.00 ns	-
ArrayIndexing	65536	34,435.62 ns	330.000 ns	275.565 ns	34,500.00 ns	-
ArraySpanIndexing	65536	32,694.86 ns	1,072.178 ns	3,144.512 ns	32,620.00 ns	-
ListSpanIndexing	65536	34,444.92 ns	602.428 ns	503.055 ns	34,560.00 ns	-
ArraySpanEnumeration	65536	34,462.54 ns	393.631 ns	328.699 ns	34,590.00 ns	-
ListSpanEnumeration	65536	34,462.69 ns	320.949 ns	268.007 ns	34,420.00 ns	-
UnsafeSpanIterationOverArray	65536	24,682.77 ns	492.278 ns	751.760 ns	24,160.00 ns	-
UnsafeSpanIterationOverList	65536	25,561.85 ns	439.303 ns	366.838 ns	25,410.00 ns	-
ListForEachMethod	100000	213,432.96 ns	4,237.520 ns	5,940.411 ns	211,737.00 ns	88 B
ListEnumeration	100000	85,497.21 ns	873.409 ns	774.254 ns	85,666.00 ns	-
ListIndexing	100000	60,998.97 ns	1,200.999 ns	1,334.907 ns	60,940.50 ns	-
ArrayEnumeration	100000	46,505.93 ns	958.795 ns	2,766.343 ns	47,311.00 ns	-
ArrayIndexing	100000	47,451.87 ns	575.375 ns	538.206 ns	47,291.00 ns	-
ArraySpanIndexing	100000	45,763.64 ns	909.068 ns	2,212.798 ns	45,870.00 ns	-
ListSpanIndexing	100000	46,116.79 ns	913.460 ns	1,985.786 ns	46,340.00 ns	-
ArraySpanEnumeration	100000	46,233.10 ns	924.958 ns	2,593.682 ns	46,381.00 ns	-
ListSpanEnumeration	100000	47,106.38 ns	699.318 ns	583.962 ns	47,286.00 ns	-
UnsafeSpanIterationOverArray	100000	34,374.02 ns	904.218 ns	2,550.363 ns	33,150.00 ns	-
UnsafeSpanIterationOverList	100000	37,568.35 ns	734.672 ns	1,005.627 ns	37,889.50 ns	-

If you just scrolled past that and went “jesus fuck that’s a large table,” I agree. Jesus fuck, that’s a large table. But it’s giving me a lot of insight.

Allocations

The Allocations column tells you how much memory has been allocated on the heap during the test (not counting any allocations while preparing the test). None of the measured list iteration methods allocate heap memory, except for one: ListForEachMethod. That would be code that looks like this.

var list = new List<int>();
list.Add(1);
list.Add(2);
list.Add(3);
list.Add(4);

var sum = 0;
list.ForEach(value => sum += value);

I never knew this function existed on List<T> until I measured it. Allegedly, people think it looks elegant to use. But you are paying an 88-byte heap tax when you use it. So…I wouldn’t use this method.

Execution time

The most important columns in my opinion to read are Mean compared with ElementCount. You statistician folks can read the other numbers, I don’t really understand them yet.

Each test in the report was executed several (likely thousands of) times by BenchmarkDotNet. The Mean column is telling me, on average, how long each test run took. It is in fact directly telling me how much frame time I would need to iterate a collection of that size in that way.

My Code’s The Fastest

I can tell you for certain that my code, out of all tests, is the fastest. Here’s how I know.

Enumeration and Iteration (Indexing)

In the chart, Enumeration is any time I use a foreach loop. There is an enumeration test for arrays, lists, and spans.

Iteration is any time I’m using a for loop and a loop counter. There is a test for lists, arrays, and spans.

The Unsafe Span Iteration tests are my code.

If we look at a 4-element array, iteration with a for loop is almost twice as fast on average as enumerating with foreach. If you compare enumeration vs. iteration of List, Array Span, and List Span, you will notice iteration is generally faster on average for a 4-element collection.

We generally don’t iterate over collections that small, so what matters more is if this pattern scales with larger element counts. Even in the torture test, iteration generally wins over enumeration. In most tests, the difference is negligible. Enumeration does win over iteration of an array, but iteration wins by tens of microseconds over enumeration on a list.

Unsafe Iteration

Across all tests, my Unsafe Span Iteration method is generally fastest. It is also faster on array spans than on list spans.

But it looks like this.

[Benchmark]
public int UnsafeSpanIterationOverArray()
{
    var span = array.AsSpan();
    var sum = 0;

    ref var reference = ref span[0];
    for (var i = 0; i < ElementCount; i++)
    {
        sum += reference;
        Unsafe.Add(ref reference, 1);
    }

    return sum;
}

[Benchmark]
public int UnsafeSpanIterationOverList()
{
    var span = CollectionsMarshal.AsSpan(list);
    var sum = 0;

    ref var reference = ref span[0];
    for (var i = 0; i < ElementCount; i++)
    {
        sum += reference;
        Unsafe.Add(ref reference, 1);
    }

    return sum;
}

I’m accessing the span of the collection, taking a reference to its first element, and using Unsafe.Add to add one element worth of offset to the reference every time I iterate. This is effectively like doing this in C.

int length = 100000;
int* array = malloc(sizeof(int) * length);
int* reference = array;
int sum = 0;

for (int i = 0; i < length; i++) {
}
    sum += *reference;
    reference++;
}

…And exactly as dangerous, too. If the array memory is ever moved or freed during iteration, this is going to be problematic. But it is the fastest, and if I can guarantee I won’t lose the memory mid-iteration, I’m inclined to use this in my engine’s ECS to squeexe as much performance out of it as possible.

Should I Be Writing C#?

If I care enough about performance to benchmark different ways of iterating over memory at a low level in the language, why am I not using C, Zig, Rust, or something more low-level for my engine?

The answer is because I love C#. It gives me the freedom to choose how I write my code. It doesn’t fail to compile because I mis-typed a space, nor does it yell at me about borrowing. And besides…

Method	ElementCount	Mean	Error	StdDev	Median	Allocated
UnsafeSpanIterationOverMatrix4x4Array	4	1,696.88 ns	32.863 ns	32.28 ns	1,694.50 ns	-
UnsafeSpanIterationOverMatrix4x4Array	64	19,622.29 ns	46.542 ns	41.26 ns	19,620.00 ns	-
UnsafeSpanIterationOverMatrix4x4Array	256	77,102.17 ns	103.041 ns	80.45 ns	77,080.50 ns	-
UnsafeSpanIterationOverMatrix4x4Array	4096	323,963.70 ns	3,547.935 ns	3,318.74 ns	322,388.50 ns	-
UnsafeSpanIterationOverMatrix4x4Array	16384	111,209.92 ns	321.903 ns	251.32 ns	111,230.00 ns	-
UnsafeSpanIterationOverMatrix4x4Array	65536	423,498.53 ns	5,247.143 ns	4,908.18 ns	425,014.00 ns	-
UnsafeSpanIterationOverMatrix4x4Array	100000	661,515.67 ns	12,987.558 ns	18,206.74 ns	652,446.00 ns	-

The language ain’t gonna stop you from doing 100,000 matrix multiplications on the CPU every frame and completely neglecting that poor GPU over there just dying to do this instead. And you know what? It’s still decently fast for a one-off.

Inertia Benchmarking Tool: Coming Soon.

What this is missing is fancy graphs and a GUI. I plan on integrating BenchmarkDotNet into the Inertia editor, so you can benchmark your own game code within the editor and get a detailed report of how things are doing.