It's something I wish I started doing earlier... because profilers drive me crazy.
We’ve all been there, right? You’re writing an app, it feels sluggish, something feels wrong. It’s not a bug, the app works fine, it’s just…there’s no reason typing in a text field should be this laggy on this hardware. You know what I’m talking about, right?
My game UIs feel that way a lot. So does my engine’s editor. I’ve always wanted to figure out why, and fucking fix it. I never could.
My Specs
I’m not just posting my PC specs to brag. Hardware is just as important to understanding what’s going on with slow code as it is understanding what the code is doing. A slow memory leak is going to take longer on a supercomputer with over 1TB of RAM than it would take to cause problems on your phone. CPU and GPU usage is also going to be less of an issue on a game developer’s workstation than it might be on your garden variety cash register.
With that in mind, this PC ain’t slow.
| Hardware | What I have | |:--------:|:----------------------------------------} | CPU | AMD Ryzen 7 3700X | | GPU | AMD Radeon RX 7600XT | | Memory | 64 GB available | | Disks | 2x 1TB NVMe SSDs, 2x 4TB spinning disks |
I’m not on the bleeding edge because I don’t need to be. But this isn’t a slow computer either. It handles any 1080p game I throw at it, and I’m too blind for 4K, HDR or raytracing to matter. I also don’t need to take advantage of AI upscaling because my eyes do a better job of screwing up text readability. Point being, if something is slow, I’m either compiling Plasma or something in the app is poorly optimized.
Having an extremely fast computer is a double-edged sword however. If something feels slow on my PC, it’s slow. But I may also do something that feels perfectly fast on my PC but kills your laptop in minutes.
And I can’t debug what I can’t see.
Profiling with dotTrace
This has been generally unsuccessful for me. I would typically build my engine’s editor for release, get it running on the profiler, and try to do things in the UI that feel slow. Things would either stop feeling slow under the profiler or… I just couldn’t figure out what the problem really was.
That’s not a dock against JetBrains’ tools. I just struggle to navigate them. They don’t display things in a way I know how to read.
Sometimes, I would get lucky and find out that something in my code genuinely is problematic. I’d rewrite it, and be worse off - or not feel a difference.
What I Care About
My engine, Inertia, is written in C#. I will die on that hill. I love C# as a language. I find it fun to write, easy to read, and I do not feel limited by it. I can make it do scary things.
The problem with C# and performance is how the language used to be used. C# started with a lot of inspiration from Java and object-oriented programming. This means that the runtime has a garbage collector because that’s what we did back then. Most objects are stored on the heap because that’s how we did it back then. Newer versions of the language, and of the runtime, allow you to do things differently.
GC Hits
The number one concern I have when tracking down slow code in my engine is GC hits. If you’ve played a Unity game and ever felt it hitch every now and then, every few seconds/minutes, GC hits are very likely what you are feeling.
The runtime knows you have a finite amount of RAM and the garbage collector makes sure things get cleaned up. Every now and then, the GC will decide to clean up any unused memory in the application, freeing space for new heap allocations. Your code cannot execute while the GC is collecting. No matter how fast your code is in other languages, if you trigger GC, it’s moot.
You can’t stop GC hits from happening outright. The only thing you can do is allocate less memory on the heap, and keep references active until you’re genuinely done with them. Keep the memory you need but don’t allocate more than you do.
Frame Time
I find most people who talk about C# are thinking about its use on the web with ASP.NET. If that’s all you’ve ever done, let me introduce you to something really annoying. Frame time is how much time the UI thread in an application spends working on a single frame in one window. We generally want it to be as little time as possible. In some cases, high (or unstable) frame time can make someone feel ill. As someone who cares about accessibility, I do not want that happening.
Knowing frame time is helpful for understanding that there’s a problem. But it’s not very helpful to measure it if your brain is already telling you it’s too high. What’s more important is understanding what is going on in that frame and how long it takes to process each thing going on. If you’re rendering lots of text and the engine feels sluggish when there’s lots of text, you probably want to know how long it takes to render text.
I have a display with 120hz refresh rate. I would prefer my games be able to build frames at that refresh rate. Most of the time I run the display at 60hz, which means I’m willing to accept 60 frames per second. Anything below that in my own game and we’re in trouble.
My engine, especially in editor mode, supports multiple windows. For best results, each window should be able to fully update and repaint in less than 16.7ms. That’s one second divided over 60 frames. This is how much frame time the engine has to work with.
Runtime Performance
This is why I was having trouble with dotTrace. It doesn’t matter how fast my code is, if I’m using the runtime APIs incorrectly or they’re just slow.
Maybe a dictionary isn’t the best way to track inventory? I should probably measure.
It Started With This Video
I’ve been listening to Nick Chapsas videos while trying to sleep.
In this video, he uses BenchmarkDotNet to compare different ways to iterate a collection in .NET. This is where I decided to start in my engine, because I have an entity component system. You will typically be iterating through an ECS every frame, making collection iteration operations a hot path. Prime real estate for performance gains.
Ways to Iterate Lists in C#
Benchmarking is about comparison. If I want to know how fast my code is, I need a frame of reference to measure it against. And there are quite a few ways to iterate collections in C#.
- With
foreachloops - With
forandwhileloops - With LINQ
- In parallel
- With a
Span<T> - With a
ReadOnlySpan<T> - With a raw pointer
- With fancy unsafe runtime methods
To make matters worse, there are different kinds of collections in C#. Arrays, lists, hash sets, dictionaries, custom collection types… even a Span can be thought of as a collection if it represents a stack-allocated buffer.
So I Wrote A Tool
I wrote a tool for my engine that benchmarks it using BenchmarkDotNet and a series of common tasks my code might perform. I started with collection iterations.
I decided to test with both a standard int[] array and a List<int>. This allows me to compare things like foreach on a list vs. foreach on an array. Each collection is filled with random data, and each test sums up the value of each element in the collection. I also tested with collections with varying sizes, starting with powers of two and then a torture test of 100 thousand items. Gradually, the test would grow the collections larger in size to see how bigger collections affect the measurements.
After my tool runs the benchmark, it prints it out to a fancy-ish Markdown table.
And here’s a report.
| Method | ElementCount | Mean | Error | StdDev | Median | Allocated |
|---|---|---|---|---|---|---|
| ListForEachMethod | 4 | 758.40 ns | 17.448 ns | 35.247 ns | 760.00 ns | 88 B |
| ListEnumeration | 4 | 258.70 ns | 10.686 ns | 30.489 ns | 260.00 ns | - |
| ListIndexing | 4 | 144.28 ns | 7.098 ns | 20.252 ns | 140.00 ns | - |
| ArrayEnumeration | 4 | 95.76 ns | 5.509 ns | 15.717 ns | 94.50 ns | - |
| ArrayIndexing | 4 | 57.32 ns | 4.529 ns | 13.283 ns | 55.00 ns | - |
| ArraySpanIndexing | 4 | 85.70 ns | 5.092 ns | 13.854 ns | 80.00 ns | - |
| ListSpanIndexing | 4 | 110.29 ns | 6.202 ns | 17.186 ns | 105.00 ns | - |
| ArraySpanEnumeration | 4 | 136.65 ns | 7.926 ns | 23.120 ns | 130.00 ns | - |
| ListSpanEnumeration | 4 | 157.24 ns | 6.188 ns | 18.050 ns | 160.00 ns | - |
| UnsafeSpanIterationOverArray | 4 | 148.07 ns | 9.543 ns | 27.226 ns | 149.00 ns | - |
| UnsafeSpanIterationOverList | 4 | 140.42 ns | 6.304 ns | 18.189 ns | 140.00 ns | - |
| ListForEachMethod | 64 | 1,095.43 ns | 79.855 ns | 234.202 ns | 1,025.00 ns | 88 B |
| ListEnumeration | 64 | 632.28 ns | 14.883 ns | 15.925 ns | 629.50 ns | - |
| ListIndexing | 64 | 504.56 ns | 11.184 ns | 10.985 ns | 501.00 ns | - |
| ArrayEnumeration | 64 | 381.32 ns | 9.158 ns | 15.798 ns | 380.00 ns | - |
| ArrayIndexing | 64 | 419.53 ns | 10.819 ns | 18.948 ns | 419.50 ns | - |
| ArraySpanIndexing | 64 | 469.56 ns | 11.462 ns | 17.845 ns | 470.50 ns | - |
| ListSpanIndexing | 64 | 460.67 ns | 11.385 ns | 21.661 ns | 460.00 ns | - |
| ArraySpanEnumeration | 64 | 447.27 ns | 10.258 ns | 9.595 ns | 440.00 ns | - |
| ListSpanEnumeration | 64 | 450.79 ns | 10.940 ns | 15.690 ns | 445.00 ns | - |
| UnsafeSpanIterationOverArray | 64 | 387.31 ns | 9.792 ns | 14.353 ns | 385.00 ns | - |
| UnsafeSpanIterationOverList | 64 | 422.63 ns | 9.875 ns | 10.976 ns | 420.00 ns | - |
| ListForEachMethod | 256 | 1,431.36 ns | 30.789 ns | 57.829 ns | 1,425.00 ns | 88 B |
| ListEnumeration | 256 | 1,774.31 ns | 28.270 ns | 27.765 ns | 1,780.00 ns | - |
| ListIndexing | 256 | 1,616.08 ns | 15.308 ns | 12.783 ns | 1,620.00 ns | - |
| ArrayEnumeration | 256 | 1,383.40 ns | 22.728 ns | 21.260 ns | 1,380.00 ns | - |
| ArrayIndexing | 256 | 1,418.36 ns | 18.865 ns | 16.723 ns | 1,415.50 ns | - |
| ArraySpanIndexing | 256 | 1,426.33 ns | 16.600 ns | 15.527 ns | 1,425.00 ns | - |
| ListSpanIndexing | 256 | 1,454.45 ns | 31.634 ns | 36.429 ns | 1,445.00 ns | - |
| ArraySpanEnumeration | 256 | 1,430.43 ns | 13.062 ns | 11.579 ns | 1,430.50 ns | - |
| ListSpanEnumeration | 256 | 1,450.00 ns | 20.753 ns | 18.397 ns | 1,445.00 ns | - |
| UnsafeSpanIterationOverArray | 256 | 1,313.85 ns | 17.989 ns | 15.021 ns | 1,310.00 ns | - |
| UnsafeSpanIterationOverList | 256 | 1,355.33 ns | 25.191 ns | 23.563 ns | 1,360.00 ns | - |
| ListForEachMethod | 4096 | 9,856.92 ns | 134.013 ns | 111.907 ns | 9,840.00 ns | 88 B |
| ListEnumeration | 4096 | 12,783.20 ns | 272.413 ns | 754.855 ns | 12,610.00 ns | - |
| ListIndexing | 4096 | 12,878.00 ns | 191.451 ns | 169.716 ns | 12,885.00 ns | - |
| ArrayEnumeration | 4096 | 11,104.36 ns | 223.734 ns | 198.334 ns | 11,070.50 ns | - |
| ArrayIndexing | 4096 | 9,212.45 ns | 209.589 ns | 566.637 ns | 9,050.00 ns | - |
| ArraySpanIndexing | 4096 | 10,224.40 ns | 277.808 ns | 792.602 ns | 9,910.00 ns | - |
| ListSpanIndexing | 4096 | 10,089.07 ns | 339.951 ns | 991.652 ns | 9,711.00 ns | - |
| ArraySpanEnumeration | 4096 | 12,085.38 ns | 614.500 ns | 1,792.526 ns | 12,255.00 ns | - |
| ListSpanEnumeration | 4096 | 10,211.55 ns | 247.290 ns | 705.531 ns | 10,089.50 ns | - |
| UnsafeSpanIterationOverArray | 4096 | 8,239.89 ns | 164.241 ns | 426.884 ns | 8,230.00 ns | - |
| UnsafeSpanIterationOverList | 4096 | 8,472.12 ns | 256.992 ns | 733.214 ns | 8,285.50 ns | - |
| ListForEachMethod | 16384 | 35,095.62 ns | 376.907 ns | 314.734 ns | 35,091.00 ns | 88 B |
| ListEnumeration | 16384 | 23,245.47 ns | 215.295 ns | 201.387 ns | 23,180.00 ns | - |
| ListIndexing | 16384 | 19,277.77 ns | 224.528 ns | 187.491 ns | 19,370.00 ns | - |
| ArrayEnumeration | 16384 | 15,680.08 ns | 183.495 ns | 153.227 ns | 15,650.00 ns | - |
| ArrayIndexing | 16384 | 14,256.98 ns | 276.218 ns | 538.742 ns | 14,180.00 ns | - |
| ArraySpanIndexing | 16384 | 15,741.64 ns | 232.688 ns | 206.272 ns | 15,730.00 ns | - |
| ListSpanIndexing | 16384 | 15,957.92 ns | 210.260 ns | 175.577 ns | 15,931.00 ns | - |
| ArraySpanEnumeration | 16384 | 14,579.69 ns | 465.352 ns | 1,327.675 ns | 14,145.00 ns | - |
| ListSpanEnumeration | 16384 | 14,329.85 ns | 248.622 ns | 513.447 ns | 14,310.50 ns | - |
| UnsafeSpanIterationOverArray | 16384 | 13,323.21 ns | 208.472 ns | 184.805 ns | 13,304.50 ns | - |
| UnsafeSpanIterationOverList | 16384 | 11,643.54 ns | 218.602 ns | 532.106 ns | 11,630.00 ns | - |
| ListForEachMethod | 65536 | 136,913.00 ns | 464.877 ns | 362.946 ns | 136,820.50 ns | 88 B |
| ListEnumeration | 65536 | 59,566.67 ns | 638.983 ns | 597.705 ns | 59,650.00 ns | - |
| ListIndexing | 65536 | 43,622.77 ns | 775.087 ns | 647.233 ns | 43,790.00 ns | - |
| ArrayEnumeration | 65536 | 33,895.24 ns | 665.930 ns | 888.998 ns | 33,495.00 ns | - |
| ArrayIndexing | 65536 | 34,435.62 ns | 330.000 ns | 275.565 ns | 34,500.00 ns | - |
| ArraySpanIndexing | 65536 | 32,694.86 ns | 1,072.178 ns | 3,144.512 ns | 32,620.00 ns | - |
| ListSpanIndexing | 65536 | 34,444.92 ns | 602.428 ns | 503.055 ns | 34,560.00 ns | - |
| ArraySpanEnumeration | 65536 | 34,462.54 ns | 393.631 ns | 328.699 ns | 34,590.00 ns | - |
| ListSpanEnumeration | 65536 | 34,462.69 ns | 320.949 ns | 268.007 ns | 34,420.00 ns | - |
| UnsafeSpanIterationOverArray | 65536 | 24,682.77 ns | 492.278 ns | 751.760 ns | 24,160.00 ns | - |
| UnsafeSpanIterationOverList | 65536 | 25,561.85 ns | 439.303 ns | 366.838 ns | 25,410.00 ns | - |
| ListForEachMethod | 100000 | 213,432.96 ns | 4,237.520 ns | 5,940.411 ns | 211,737.00 ns | 88 B |
| ListEnumeration | 100000 | 85,497.21 ns | 873.409 ns | 774.254 ns | 85,666.00 ns | - |
| ListIndexing | 100000 | 60,998.97 ns | 1,200.999 ns | 1,334.907 ns | 60,940.50 ns | - |
| ArrayEnumeration | 100000 | 46,505.93 ns | 958.795 ns | 2,766.343 ns | 47,311.00 ns | - |
| ArrayIndexing | 100000 | 47,451.87 ns | 575.375 ns | 538.206 ns | 47,291.00 ns | - |
| ArraySpanIndexing | 100000 | 45,763.64 ns | 909.068 ns | 2,212.798 ns | 45,870.00 ns | - |
| ListSpanIndexing | 100000 | 46,116.79 ns | 913.460 ns | 1,985.786 ns | 46,340.00 ns | - |
| ArraySpanEnumeration | 100000 | 46,233.10 ns | 924.958 ns | 2,593.682 ns | 46,381.00 ns | - |
| ListSpanEnumeration | 100000 | 47,106.38 ns | 699.318 ns | 583.962 ns | 47,286.00 ns | - |
| UnsafeSpanIterationOverArray | 100000 | 34,374.02 ns | 904.218 ns | 2,550.363 ns | 33,150.00 ns | - |
| UnsafeSpanIterationOverList | 100000 | 37,568.35 ns | 734.672 ns | 1,005.627 ns | 37,889.50 ns | - |
If you just scrolled past that and went “jesus fuck that’s a large table,” I agree. Jesus fuck, that’s a large table. But it’s giving me a lot of insight.
Allocations
The Allocations column tells you how much memory has been allocated on the heap during the test (not counting any allocations while preparing the test). None of the measured list iteration methods allocate heap memory, except for one: ListForEachMethod. That would be code that looks like this.
var list = new List<int>();
list.Add(1);
list.Add(2);
list.Add(3);
list.Add(4);
var sum = 0;
list.ForEach(value => sum += value);
I never knew this function existed on List<T> until I measured it. Allegedly, people think it looks elegant to use. But you are paying an 88-byte heap tax when you use it. So…I wouldn’t use this method.
Execution time
The most important columns in my opinion to read are Mean compared with ElementCount. You statistician folks can read the other numbers, I don’t really understand them yet.
Each test in the report was executed several (likely thousands of) times by BenchmarkDotNet. The Mean column is telling me, on average, how long each test run took. It is in fact directly telling me how much frame time I would need to iterate a collection of that size in that way.
My Code’s The Fastest
I can tell you for certain that my code, out of all tests, is the fastest. Here’s how I know.
Enumeration and Iteration (Indexing)
In the chart, Enumeration is any time I use a foreach loop. There is an enumeration test for arrays, lists, and spans.
Iteration is any time I’m using a for loop and a loop counter. There is a test for lists, arrays, and spans.
The Unsafe Span Iteration tests are my code.
If we look at a 4-element array, iteration with a for loop is almost twice as fast on average as enumerating with foreach. If you compare enumeration vs. iteration of List, Array Span, and List Span, you will notice iteration is generally faster on average for a 4-element collection.
We generally don’t iterate over collections that small, so what matters more is if this pattern scales with larger element counts. Even in the torture test, iteration generally wins over enumeration. In most tests, the difference is negligible. Enumeration does win over iteration of an array, but iteration wins by tens of microseconds over enumeration on a list.
Unsafe Iteration
Across all tests, my Unsafe Span Iteration method is generally fastest. It is also faster on array spans than on list spans.
But it looks like this.
[Benchmark]
public int UnsafeSpanIterationOverArray()
{
var span = array.AsSpan();
var sum = 0;
ref var reference = ref span[0];
for (var i = 0; i < ElementCount; i++)
{
sum += reference;
Unsafe.Add(ref reference, 1);
}
return sum;
}
[Benchmark]
public int UnsafeSpanIterationOverList()
{
var span = CollectionsMarshal.AsSpan(list);
var sum = 0;
ref var reference = ref span[0];
for (var i = 0; i < ElementCount; i++)
{
sum += reference;
Unsafe.Add(ref reference, 1);
}
return sum;
}
I’m accessing the span of the collection, taking a reference to its first element, and using Unsafe.Add to add one element worth of offset to the reference every time I iterate. This is effectively like doing this in C.
int length = 100000;
int* array = malloc(sizeof(int) * length);
int* reference = array;
int sum = 0;
for (int i = 0; i < length; i++) {
}
sum += *reference;
reference++;
}
…And exactly as dangerous, too. If the array memory is ever moved or freed during iteration, this is going to be problematic. But it is the fastest, and if I can guarantee I won’t lose the memory mid-iteration, I’m inclined to use this in my engine’s ECS to squeexe as much performance out of it as possible.
Should I Be Writing C#?
If I care enough about performance to benchmark different ways of iterating over memory at a low level in the language, why am I not using C, Zig, Rust, or something more low-level for my engine?
The answer is because I love C#. It gives me the freedom to choose how I write my code. It doesn’t fail to compile because I mis-typed a space, nor does it yell at me about borrowing. And besides…
| Method | ElementCount | Mean | Error | StdDev | Median | Allocated |
|---|---|---|---|---|---|---|
| UnsafeSpanIterationOverMatrix4x4Array | 4 | 1,696.88 ns | 32.863 ns | 32.28 ns | 1,694.50 ns | - |
| UnsafeSpanIterationOverMatrix4x4Array | 64 | 19,622.29 ns | 46.542 ns | 41.26 ns | 19,620.00 ns | - |
| UnsafeSpanIterationOverMatrix4x4Array | 256 | 77,102.17 ns | 103.041 ns | 80.45 ns | 77,080.50 ns | - |
| UnsafeSpanIterationOverMatrix4x4Array | 4096 | 323,963.70 ns | 3,547.935 ns | 3,318.74 ns | 322,388.50 ns | - |
| UnsafeSpanIterationOverMatrix4x4Array | 16384 | 111,209.92 ns | 321.903 ns | 251.32 ns | 111,230.00 ns | - |
| UnsafeSpanIterationOverMatrix4x4Array | 65536 | 423,498.53 ns | 5,247.143 ns | 4,908.18 ns | 425,014.00 ns | - |
| UnsafeSpanIterationOverMatrix4x4Array | 100000 | 661,515.67 ns | 12,987.558 ns | 18,206.74 ns | 652,446.00 ns | - |
The language ain’t gonna stop you from doing 100,000 matrix multiplications on the CPU every frame and completely neglecting that poor GPU over there just dying to do this instead. And you know what? It’s still decently fast for a one-off.
Inertia Benchmarking Tool: Coming Soon.
What this is missing is fancy graphs and a GUI. I plan on integrating BenchmarkDotNet into the Inertia editor, so you can benchmark your own game code within the editor and get a detailed report of how things are doing.