So it turns out I was doing something dumb in that Vulkan benchmark post I shared yesterday (I hadn't considered that I might be hitting a perf bottleneck that was skewing the results...)

I've since updated the tests, re-ran everything, and posted an updated article on my site:

Hoooorraaaayy for extremely public mistakes!

@khalladay Nice! It's interesting that push constants are the slowest for the bistro scene. Maybe they scale badly with the number of draw calls/push constant calls?

I see you're resetting the existing command buffer, so it should keep enough allocated space to hold all the push constant data.

What's the reason for choosing HOST_CACHED memory btw? Wouldn't HOST_VISIBLE be better for CPU->GPU data? That's what I am using at least, but I could be doing things wrong.

@khalladay One thing that may affect performance in a bad way is that you're binding a separate vertex and index buffer for every draw call. If you have many draw calls, this can become expensive.

You could allocate one (or 2) huge buffer for all vertex and index data, and then pass offsets into that buffer in vkCmdDrawIndexed (firstIndex and vertexOffset arguments).
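
A minimal sketch of that packing idea, assuming each mesh is simply appended to one shared vertex buffer and one shared index buffer. `MeshSizes`, `MeshRange` and `packMeshes` are hypothetical names for illustration, not code from the benchmark:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-mesh sizes; not taken from the benchmark's code.
struct MeshSizes {
    uint32_t vertexCount;
    uint32_t indexCount;
};

// Where one mesh lives inside the shared vertex/index buffers.
struct MeshRange {
    uint32_t indexCount;   // indexCount argument of vkCmdDrawIndexed
    uint32_t firstIndex;   // firstIndex argument (counted in indices, not bytes)
    int32_t  vertexOffset; // vertexOffset argument (counted in vertices, not bytes)
};

// Lay the meshes out back to back and record each one's offsets.
std::vector<MeshRange> packMeshes(const std::vector<MeshSizes>& meshes) {
    std::vector<MeshRange> ranges;
    ranges.reserve(meshes.size());
    uint32_t nextIndex = 0;
    uint32_t nextVertex = 0;
    for (const MeshSizes& m : meshes) {
        ranges.push_back({m.indexCount, nextIndex, static_cast<int32_t>(nextVertex)});
        nextIndex += m.indexCount;
        nextVertex += m.vertexCount;
    }
    return ranges;
}
```

With the two big buffers bound once per frame, each draw would then look something like `vkCmdDrawIndexed(cmd, r.indexCount, 1, r.firstIndex, r.vertexOffset, 0)` for a range `r`.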

@Gohla Totally agreed, my vertex buffer handling could be improved. For this test it didn't feel super important, because I wasn't going for raw performance, but rather, trying to see perf differences across different vertex shaders, so as long as the scene stayed constant, I was ok with the reduced performance.

Ideally I'd like to organize my data to avoid as many bind calls as possible.

@khalladay Right, I don't think it affects the experiment with different binding strategies.

@Gohla I went with HOST_CACHED because then I could update the transform data directly on the mapped pointer, and then flush all the changes at once. I only needed write access to that pointer, so I *think* that HOST_VISIBLE is unnecessary (is that correct? that feels like an unfounded assumption I made)
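
For the "flush all the changes at once" part: when the mapped memory is not HOST_COHERENT, the range handed to vkFlushMappedMemoryRanges has to respect the device's nonCoherentAtomSize limit. A rough sketch of widening a dirty byte range to satisfy that, assuming the whole allocation is mapped (`FlushRange` and `alignedFlushRange` are hypothetical helper names):

```cpp
#include <cstdint>

// Byte range to put into VkMappedMemoryRange::offset and ::size.
struct FlushRange {
    uint64_t offset;
    uint64_t size;
};

// atomSize stands in for VkPhysicalDeviceLimits::nonCoherentAtomSize.
// The offset is rounded down and the end is rounded up to a multiple of
// atomSize; the end is clamped to the allocation size, which is allowed
// when the range then reaches the end of the allocation.
FlushRange alignedFlushRange(uint64_t dirtyOffset, uint64_t dirtySize,
                             uint64_t atomSize, uint64_t allocationSize) {
    uint64_t begin = (dirtyOffset / atomSize) * atomSize;
    uint64_t end = ((dirtyOffset + dirtySize + atomSize - 1) / atomSize) * atomSize;
    if (end > allocationSize) {
        end = allocationSize;
    }
    return {begin, end - begin};
}
```

Passing VK_WHOLE_SIZE as the size sidesteps the arithmetic if flushing everything from the offset onward is acceptable.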

I actually found that some cases performed better with HOST_VISIBLE | HOST_CACHED, while other cases performed better with HOST_CACHED. I'm not sure of the reasons for that yet.

@Gohla It seems like that perf difference might be specific to Nvidia cards (which have a lot of extra memory types hidden behind the scenes), since I think that according to the spec, the only HOST_CACHED or HOST_COHERENT memory types you can find are also HOST_VISIBLE

(as per:

@Gohla Or it could also just be another mistake I've made somewhere in the program that's making it look like a perf difference there XD

@khalladay You're right! On my Nvidia card all host visible memory is either coherent, or coherent and cached (, so I use one of those types.

I think you only need to request HOST_VISIBLE to be able to map/flush on the host, but HOST_CACHED also caches the memory on the host (which may improve performance?).

I was under the impression that you'd only need caching when reading GPU->CPU data, but have no data to support that claim :)

@Gohla Interesting, so this would suggest that instead of using HOST_CACHED | HOST_VISIBLE for the matrix data, the better choice would have been HOST_VISIBLE | DEVICE_LOCAL

@khalladay Yes, but only on AMD; Nvidia cards do not have that memory type :(
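
One way to handle that split is to ask for the preferred flags first and fall back when no memory type matches. A sketch under stated assumptions: the flag constants mirror the VK_MEMORY_PROPERTY_*_BIT values, `typeFlags` stands in for the propertyFlags of each entry returned by vkGetPhysicalDeviceMemoryProperties, `typeBits` stands in for VkMemoryRequirements::memoryTypeBits, and both function names are made up for illustration:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Values mirror the VK_MEMORY_PROPERTY_*_BIT flags in the Vulkan headers.
constexpr uint32_t DEVICE_LOCAL  = 0x1;
constexpr uint32_t HOST_VISIBLE  = 0x2;
constexpr uint32_t HOST_COHERENT = 0x4;
constexpr uint32_t HOST_CACHED   = 0x8;

// Index of the first memory type that typeBits allows and that has every
// flag in `wanted`, or -1 if there is none.
int findMemoryType(const std::vector<uint32_t>& typeFlags,
                   uint32_t typeBits, uint32_t wanted) {
    for (std::size_t i = 0; i < typeFlags.size(); ++i) {
        bool allowed = (typeBits & (1u << i)) != 0;
        if (allowed && (typeFlags[i] & wanted) == wanted) {
            return static_cast<int>(i);
        }
    }
    return -1;
}

// Prefer DEVICE_LOCAL | HOST_VISIBLE for per-frame matrix data; otherwise
// fall back to plain HOST_VISIBLE | HOST_COHERENT memory.
int pickUploadMemoryType(const std::vector<uint32_t>& typeFlags, uint32_t typeBits) {
    int preferred = findMemoryType(typeFlags, typeBits, DEVICE_LOCAL | HOST_VISIBLE);
    if (preferred >= 0) {
        return preferred;
    }
    return findMemoryType(typeFlags, typeBits, HOST_VISIBLE | HOST_COHERENT);
}
```

Which of the two actually performs better for a given workload is still something to measure rather than assume.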
