Throw Away Your Copy Engine

(with apologies to the Red Hot Chili Peppers)

Pop quiz! It’s possible to use each of the following engines to implement image copies (vkCmdCopyImage2) on NVIDIA hardware.

Which will be the fastest?

1. The copy engine
2. The 2D engine
3. The 3D engine
4. The compute engine

(see below)

Did you answer “1. The copy engine”? Congratulations, that’s what we assumed for NVK!

It’s also the wrong answer.

I think gfxstrand and I had both assumed that the copy engine was just the obvious choice. It’s pretty general and can handle things like tile layouts and row strides that you need for proper 2D image copies. The copy engine is what gfxstrand used in her original implementation and I didn’t question it until karolherbst mentioned it on IRC:

(log from #nouveau on 2026-04-07)

karolherbst:

ohh right.. at that topic.. do we use copy for VRAM to VRAM copies?

...

mhenning:

karolherbst: yes

karolherbst:

okay, because we shouldn't do that

copy is fast enough to hide PCIe latencies, but not to saturate VRAM bandwidth.

mhenning:

what do we use instead?

karolherbst:

2D

mhenning:

really? I had assumed that 2d was slower than copy

karolherbst:

nah, 2D acts with the same access as 3D or compute

it got even faster on Ada

or ampere?

anyway, dma-copy is mostly good for transfers over PCIe, for anything else we should use 2D or something else

and that kicked off the following work.

Let’s look at some plots of data from a microbenchmark. We’ll be using mareko’s gpu-ratemeter project, which provides benchmarks for a variety of operations. Specifically, we’ll be looking at the benchmarks that measure vkCmdCopyImage2 on 2D and 1D copies. I’m going to make scatter plots where the measured memory bandwidth (higher is better) for one implementation is on the x axis and the other implementation on the y axis.
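To make "which side of the y=x line" a bit more concrete, here is a minimal sketch of how a set of paired measurements could be summarized. It is not part of gpu-ratemeter or of my plotting scripts; the function name, the 2% tie tolerance, and the sample numbers are all made up for illustration.

```python
import math

def compare_bandwidths(pairs, tolerance=0.02):
    """Summarize (x_bw, y_bw) bandwidth pairs relative to the y=x line.

    Each pair holds the bandwidth of the x-axis implementation and the
    y-axis implementation for the same test case. `tolerance` is an
    assumed noise margin: ratios within it count as ties.
    """
    if not pairs:
        raise ValueError("need at least one measurement pair")
    above = below = tied = 0
    log_sum = 0.0
    for x_bw, y_bw in pairs:
        ratio = y_bw / x_bw
        log_sum += math.log(ratio)
        if ratio > 1.0 + tolerance:
            above += 1   # upper-left of y=x: the y-axis implementation wins
        elif ratio < 1.0 - tolerance:
            below += 1   # lower-right of y=x: the x-axis implementation wins
        else:
            tied += 1
    return {
        "above": above,
        "below": below,
        "tied": tied,
        # Geometric mean of y/x, so a 2x win and a 2x loss cancel out.
        "geomean_ratio": math.exp(log_sum / len(pairs)),
    }

# Hypothetical bandwidths in GB/s, NOT real measurements.
samples = [(100.0, 200.0), (150.0, 300.0), (200.0, 100.0), (120.0, 120.0)]
summary = compare_bandwidths(samples)
```

A geometric mean is used here rather than an arithmetic one because the quantities being averaged are ratios.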

Try the 2D engine?

Karol thinks the 2D engine could be a good fit, and so I wrote a new image copy implementation using NV902D_SET_PIXELS_FROM_MEMORY. Let’s compare that to the copy engine. Look at this graph:

Since most of the points here are on the upper-left side of the y=x line (graphed as a dotted purple line), the 2D engine is the pretty clear winner. The copy engine seems to do better with some of the Linear2Linear points (in purple), but I think the implementation I wrote for the 2D engine has some room to be optimized there.

What about the 3D engine?

We can also do an image copy by using the 3D engine. This implementation uses mesa’s vk_meta framework which draws triangles on the destination image like any user application could. Since the 2D engine won the last round, let’s compare the 3D engine to the 2D engine.

The points are clustered closely enough around the y=x line (graphed in purple) that it’s hard to call this in one direction or the other. The 3D engine seems to be pulling ahead in some 1D image test cases, but the 2D engine seems to pull ahead in a cluster of multisampled Optimal2Optimal points (in blue at the lower right). I was looking at this data and thinking that we should maybe call this one a tie or use a mix of the two implementations, until Karol chimed in again:

(log from #nouveau on 2026-05-18)

karolherbst:

_but_ I think on higher end GPUs meta _might_ be faster

...

karolherbst:

what GPU did you test on?

mhenning:

5060

karolherbst:

mhh okay there I'd expect 2D and meta to not make any difference really

mhenning:

Yeah, they seem pretty close

karolherbst:

might be worth to check on higher end GPUs

I know that 2D can push out enough work for X number of GPCs or something

So let’s test on someone else’s machine. The above graphs are all for data on my RTX 5060. karolherbst ran some testing on his A6000, which is a larger GPU:

and we see the 3D engine starting to pull ahead.

RedSheep volunteered to do some runs on his 4090, which is an even larger GPU:

and we see an even larger win for the 3D engine. So the hypothesis looks correct: on larger GPUs, the 2D engine cannot saturate the memory system. My best guess is that this is because the 3D engine has seen either more engineering resources or more die area dedicated to it.

A bonus graph

For fun, let’s also look at the copy engine vs the 3D engine on the 4090. As a reminder, the copy engine implementation is the one that we’re shipping right now.

This is a staggering win for the 3D engine.

The compute engine?

I prototyped an implementation for the compute engine, also using vk_meta, but it was kind of slow and I haven’t figured out why yet. Also, my understanding is that writes from compute shaders don’t get to participate in lossless memory compression the way that 2D/3D engine color targets do, which means they’re likely not great for the image copy case (but I still need to double check that assumption). I’ll revisit compute when I get around to buffer copies.

I want to play gaymes, not microbenchmarks

phomes benchmarked his usual set of games on my branch, and the results look good:

So this is a pretty major win for XCOM 2 and a more minor win on a few other titles. This might not look huge from the graph above, but recently I’ve been having trouble making driver changes translate into real-world performance improvements, so I’m happy with how this turned out.

What now?

The 3D engine seems to be the clear choice in the data above. I’ve published an MR that switches a lot of vkCmdCopyImage cases to copies on the 3D engine. I’m still working on generalizing this for more cases; depth/stencil, compressed formats, and YCbCr are all still copy engine operations. For now, copy queues are also still implemented on the copy engines, since that’s required for those queues to be properly asynchronous under the nouveau kernel driver. We have more copy engines than 3D engines so it might make sense to keep async copies on the copy engine long-term, but we should double-check that assumption.
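The split described above can be sketched as a small decision function. This is only an illustration of the rules as stated in this post, not the actual NVK code (which is C); the `Engine`, `CopyRequest`, and `pick_engine` names are hypothetical, and the real MR may keep additional cases on the copy engine.

```python
from dataclasses import dataclass
from enum import Enum

class Engine(Enum):
    COPY = "copy"
    THREE_D = "3d"

@dataclass(frozen=True)
class CopyRequest:
    """Hypothetical description of one image copy, for illustration only."""
    on_copy_queue: bool = False        # submitted on a dedicated copy queue
    is_depth_stencil: bool = False
    is_block_compressed: bool = False
    is_ycbcr: bool = False

def pick_engine(req: CopyRequest) -> Engine:
    # Copy queues stay on the copy engine so that they remain asynchronous.
    if req.on_copy_queue:
        return Engine.COPY
    # Cases not yet generalized to the 3D path fall back to the copy engine.
    if req.is_depth_stencil or req.is_block_compressed or req.is_ycbcr:
        return Engine.COPY
    # Assumed default for this sketch: everything else goes to the 3D engine.
    return Engine.THREE_D

color_copy = pick_engine(CopyRequest())
depth_copy = pick_engine(CopyRequest(is_depth_stencil=True))
async_copy = pick_engine(CopyRequest(on_copy_queue=True))
```

Checking the copy queue first reflects that the queue requirement wins regardless of format.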

But anyway, all of this is a good reminder that just because you have a dedicated hardware block for something doesn’t mean you should necessarily use it.