SSE: reciprocal if not zero

Problem description

How can I take the reciprocal (inverse) of floats with SSE instructions, but only for non-zero values?

Background below:

I want to normalize an array of vectors so that each dimension has the same average. In C this can be coded as:

#include <string.h> // memset

float vectors[num * dim]; // input data: num vectors of dim floats each

// step 1. compute the sum on each dimension
float norm[dim];
memset(norm, 0, dim * sizeof(float));
for(int i = 0; i < num; i++) for(int j = 0; j < dim; j++)
    norm[j] += vectors[i * dim + j];
// step 2. convert sums to reciprocal of average (zero sums are left at zero)
for(int j = 0; j < dim; j++) if(norm[j]) norm[j] = (float)num / norm[j];
// step 3. normalize the data
for(int i = 0; i < num; i++) for(int j = 0; j < dim; j++)
    vectors[i * dim + j] *= norm[j];

Now, for performance reasons, I want to do this using SSE intrinsics. Step 1 and step 3 are easy, but I'm stuck at step 2. I can't find any code sample or obvious SSE instruction that takes the reciprocal of a value only if it is not zero. For the division, _mm_rcp_ps does the trick, perhaps combined with a conditional move, but how do I get a mask indicating which components are zero?

I don't need code for the whole algorithm described above, just the "inverse if not zero" function:

__m128 rcp_nz_ps(__m128 input) {
    // ????
}

Thanks!

Reference solution

Method 1:

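__m128 rcp_nz_ps(__m128 input) {
    // all-ones in lanes where input == 0, all-zeros elsewhere
    __m128 mask = _mm_cmpeq_ps(_mm_setzero_ps(), input);
    // approximate reciprocal of every lane (infinity where input == 0)
    __m128 recip = _mm_rcp_ps(input);
    // (~mask) & recip: clear the lanes whose input was zero
    return _mm_andnot_ps(mask, recip);
}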

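Each lane of mask is set to b111...11 if the corresponding input is zero, and to b000...00 otherwise. The and-not with that mask replaces the elements of the reciprocal that correspond to a zero input with zero, and leaves the other elements unchanged. Note that _mm_rcp_ps only returns an approximation (relative error up to 1.5 * 2^-12); if full precision is needed, the same masking works on the result of _mm_div_ps.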
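To tie this back to step 2 of the question, here is a minimal sketch of how the masking idea might be applied to the norm array. The names div_nz_ps and sums_to_scale are hypothetical helpers introduced only for illustration, and the loop assumes that dim is a multiple of 4; neither comes from the original answer. The sketch uses _mm_div_ps rather than _mm_rcp_ps so that the result matches the scalar code exactly.

```c
#include <assert.h>
#include <xmmintrin.h> /* SSE intrinsics */

/* From the answer: approximate 1/x per lane, 0 where x == 0. */
static inline __m128 rcp_nz_ps(__m128 input) {
    __m128 mask  = _mm_cmpeq_ps(_mm_setzero_ps(), input);
    __m128 recip = _mm_rcp_ps(input);
    return _mm_andnot_ps(mask, recip);
}

/* Hypothetical variant: full-precision num/x per lane, 0 where x == 0.
   Lanes with x == 0 hold +/-infinity after the division and are then
   cleared by the and-not, exactly as in rcp_nz_ps. */
static inline __m128 div_nz_ps(__m128 num, __m128 input) {
    __m128 mask = _mm_cmpeq_ps(_mm_setzero_ps(), input);
    __m128 quot = _mm_div_ps(num, input);
    return _mm_andnot_ps(mask, quot);
}

/* Hypothetical helper for step 2: norm[j] = num / norm[j] if norm[j] != 0.
   Assumption: dim is a multiple of 4 (no scalar tail loop in this sketch). */
static inline void sums_to_scale(float *norm, int dim, int num) {
    assert(dim % 4 == 0);
    __m128 n = _mm_set1_ps((float)num);
    for (int j = 0; j < dim; j += 4) {
        __m128 sums = _mm_loadu_ps(norm + j); /* unaligned load is always safe */
        _mm_storeu_ps(norm + j, div_nz_ps(n, sums));
    }
}
```

Because the comparison treats -0.0 as equal to 0.0, both helpers return +0.0 for negative-zero lanes as well. If the approximate reciprocal is accurate enough, replacing div_nz_ps(n, sums) with _mm_mul_ps(n, rcp_nz_ps(sums)) should be faster on most hardware.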

(by Antoine, Stephen Canon)

References

SSE: reciprocal if not zero (CC BY-SA 3.0/4.0)
