InterlockedAdd HLSL potential optimization


Problem description


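Does anyone know whether HLSL InterlockedAdd undergoes some kind of optimization, particularly when a large number of threads call it on a single global atomic counter and the value added is the same constant for every thread?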

Some information I found online suggests that atomic adds can cause serious contention problems: https://developer.nvidia.com/blog/cuda-pro-tip-optimized-filtering-warp-aggregated-atomics/

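Of course, that article was written for CUDA (and is somewhat old, dating back to 2014), whereas I am interested in HLSL InterlockedAdd. To test this, I wrote a dummy HLSL shader for Unity (compiled to d3d11 through FXC, as far as I know) in which I call InterlockedAdd on a single global atomic counter, so that the value added is always the same for every shaded fragment. The snippet in question (run in http://shader-playground.timjones.io/, compiled with FXC, optimization level 3, shader model 5.0):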

**HLSL**:
RWStructuredBuffer<int> counter : register(u1);
void PSMain()
{
    InterlockedAdd(counter[0], 1);
}
‑‑‑‑
**Assembly**:
ps_5_0
dcl_globalFlags refactoringAllowed
dcl_uav_structured u1, 4
atomic_iadd u1, l(0, 0, 0, 0), l(1)
ret 

I then modified the code slightly: instead of always adding a constant, I now add a value that varies between fragments, like this:

**HLSL**:
RWStructuredBuffer<int> counter : register(u1);
void PSMain(float4 pixel_pos : SV_Position)
{
    InterlockedAdd(counter[0], int(pixel_pos.x));
}
‑‑‑‑
**Assembly**:
ps_5_0
dcl_globalFlags refactoringAllowed
dcl_uav_structured u1, 4
dcl_input_ps_siv linear noperspective v0.x, position
dcl_temps 1
ftoi r0.x, v0.x
atomic_iadd u1, l(0, 0, 0, 0), r0.x
ret 

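I implemented the equivalents of the snippets above in Unity and used them as fragment shaders to render a full-screen quad (there are no output semantics, of course, but that does not matter). I profiled the resulting shaders with Nsight Graphics. Suffice it to say that the difference between the two draw calls is huge: the fragment shader based on the second snippet (InterlockedAdd with a varying value) is much slower.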

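I also took captures with RenderDoc to inspect the assembly, and it looks the same as shown above. Nothing in the assembly suggests such a large difference, yet the difference is there.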

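So my question is: is some kind of optimization applied when HLSL InterlockedAdd is used on a single global atomic counter and the value added is a constant? Is it possible that the GPU driver rearranges the code in some way?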

System specs:


  • Reference solution

    Method 1:

    The pixel shader on the GPU runs pixels in SIMD groups, called wavefronts (AMD) or warps (NVIDIA). If the code being executed does not depend on which pixel is being rendered, it only has to run once for the entire group. If it does depend on the pixel, each pixel in the group needs its own work.

    In the first version, a 64-pixel wavefront can execute the code as a single SIMD operation, conceptually InterlockedAdd<64>(counter[0], 1); (not real HLSL syntax), or the driver might even optimize it into InterlockedAdd(counter[0], 64);. In the second example it turns into a series of serial, non-SIMD adds and becomes 64 times as expensive.
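
    The aggregation described above can also be written by hand. The following is a minimal sketch, not what the driver is confirmed to do: it assumes Shader Model 6.0 or later compiled with DXC (wave intrinsics such as WaveActiveSum are not available in the ps_5_0 / FXC setup from the question), and the entry point names PSMainConstant and PSMainVarying are hypothetical.

```hlsl
// Sketch only: assumes Shader Model 6.0+ and DXC (for example, target ps_6_0).
// Wave intrinsics do not exist in ps_5_0 / FXC, which the question uses.
RWStructuredBuffer<int> counter : register(u1);

// Constant case: every active lane would add 1, so one lane adds the
// number of active lanes in a single atomic operation.
void PSMainConstant()
{
    // Counts the active lanes for which the expression is true (here: all of them).
    // Helper lanes are expected not to contribute to the result.
    uint activeLanes = WaveActiveCountBits(true);

    if (WaveIsFirstLane())
    {
        InterlockedAdd(counter[0], int(activeLanes));
    }
}

// Varying case: each lane has its own value, so the values are first
// summed across the wave, then one lane issues a single atomic add.
void PSMainVarying(float4 pixel_pos : SV_Position)
{
    int value = int(pixel_pos.x);

    // Sum of 'value' over all active lanes in the wave.
    int waveSum = WaveActiveSum(value);

    if (WaveIsFirstLane())
    {
        InterlockedAdd(counter[0], waveSum);
    }
}
```

    With this pattern each wave issues one atomic to the global counter instead of one per pixel. Whether it actually closes the gap measured in the question depends on the hardware and driver, so it should be profiled rather than assumed.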

    This is an oversimplification, and there are other tricks the GPU uses to share computing resources. But a good general rule of thumb is to make as much code as possible sharable by every nearby pixel.
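
    On Shader Model 5.0, where wave intrinsics are unavailable, a similar idea can be approximated in a compute shader by accumulating into groupshared memory first. This is a sketch under assumptions: it presumes the work can be moved from the pixel shader to a compute shader, the names CSMain and groupSum are hypothetical, and the per-thread value is a stand-in for int(pixel_pos.x).

```hlsl
// Sketch only: a cs_5_0 compute shader, not the pixel shader from the question.
RWStructuredBuffer<int> counter : register(u1);

// One partial sum per thread group, held in on-chip shared memory.
groupshared int groupSum;

[numthreads(64, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID, uint gi : SV_GroupIndex)
{
    // The first thread of the group clears the partial sum.
    if (gi == 0)
    {
        groupSum = 0;
    }
    GroupMemoryBarrierWithGroupSync();

    // Hypothetical per-thread value that varies between threads.
    int value = int(dtid.x);

    // Atomic add to groupshared memory: contention stays within the group.
    InterlockedAdd(groupSum, value);
    GroupMemoryBarrierWithGroupSync();

    // Only one thread per group touches the global counter.
    if (gi == 0)
    {
        InterlockedAdd(counter[0], groupSum);
    }
}
```

    This replaces 64 global atomics per group with one, at the cost of two barriers and 64 groupshared atomics. It is not guaranteed to be faster on every GPU, so it is worth measuring against the original.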

    (by haykoandri, George Davison)

    References

    1. InterlockedAdd HLSL potential optimization (CC BY‑SA 2.5/3.0/4.0)

#hlsl #direct3d11 #gpgpu #gpu





