Problem description
Simulating packusdw functionality with SSE2
I'm implementing a fast x888 -> 565 pixel conversion function in pixman according to the algorithm described by Intel [pdf]. Their code converts x888 -> 555, while I want to convert to 565. Unfortunately, converting to 565 means that the high bit (bit 15) of each 16-bit result can be set, which means I can't use the signed-saturation pack instructions. The unsigned pack instruction, packusdw, wasn't added until SSE4.1. I'd like to implement its functionality with SSE2 or find another way of doing this.
This function takes two XMM registers containing four 32-bit pixels each and outputs a single XMM register containing the eight converted RGB565 pixels.
<pre class="lang‑c prettyprint‑override"><code>static force_inline __m128i
pack_565_2packedx128_128 (__m128i lo, __m128i hi)
{
    /* Isolate the top 5 bits of red and blue in each pixel. */
    __m128i rb0 = _mm_and_si128 (lo, mask_565_rb);
    __m128i rb1 = _mm_and_si128 (hi, mask_565_rb);

    /* Move red and blue into place with a single multiply-add. */
    __m128i t0 = _mm_madd_epi16 (rb0, mask_565_pack_multiplier);
    __m128i t1 = _mm_madd_epi16 (rb1, mask_565_pack_multiplier);

    /* Isolate the top 6 bits of green and merge them in. */
    __m128i g0 = _mm_and_si128 (lo, mask_green);
    __m128i g1 = _mm_and_si128 (hi, mask_green);

    t0 = _mm_or_si128 (t0, g0);
    t1 = _mm_or_si128 (t1, g1);

    /* Shift the 565 value down to the low 16 bits of each 32-bit lane. */
    t0 = _mm_srli_epi32 (t0, 5);
    t1 = _mm_srli_epi32 (t1, 5);

    /* XXX: maybe there's a way to do this relatively efficiently with SSE2? */
    return _mm_packus_epi32 (t0, t1);
}
</code></pre>
Ideas I've thought of:
<ul>
<li>Subtracting 0x8000, calling _mm_packs_epi32, then re-adding 0x8000 to each 565 pixel. I've tried this, but I can't make it work.
<pre class="lang‑c prettyprint‑override">t0 = _mm_sub_epi16 (t0, mask_8000);
t1 = _mm_sub_epi16 (t1, mask_8000);
t0 = _mm_packs_epi32 (t0, t1);
return _mm_add_epi16 (t0, mask_8000);
</pre>
</li>
<li>Shuffling the data instead of packing it. This works for MMX, but since the SSE 16-bit shuffles operate on only the high or low 64 bits, it would get messy.</li>
<li>Saving the high bits, setting them to zero, doing the pack, and restoring them afterwards. This seems quite messy.</li>
</ul>
Is there some other (hopefully more efficient) way I could do this?
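One plausible reason the first idea fails as written: a 16-bit subtract cannot borrow across the 16-bit boundary, so the 32-bit lane is not a sign-extended 16-bit value when _mm_packs_epi32 sees it, and the pack saturates. The sketch below is a hedged variant that biases with a 32-bit subtract instead. The function and helper names are hypothetical, and it assumes every 32-bit lane already holds a value in [0, 0xFFFF].

```c
#include <assert.h>     /* for callers that want to assert on the check helper */
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: pack eight 32-bit lanes, each assumed to be in
 * [0, 0xFFFF], into eight unsigned 16-bit lanes using only SSE2. */
static inline __m128i
packus_epi32_bias_sse2 (__m128i lo, __m128i hi)
{
    const __m128i bias32 = _mm_set1_epi32 (0x8000);
    const __m128i bias16 = _mm_set1_epi16 (-0x8000);

    /* 32-bit subtract: maps [0, 0xFFFF] to [-0x8000, 0x7FFF], so the
     * upper half of each lane becomes a proper sign extension. */
    lo = _mm_sub_epi32 (lo, bias32);
    hi = _mm_sub_epi32 (hi, bias32);

    /* Signed saturation leaves in-range values unchanged. */
    lo = _mm_packs_epi32 (lo, hi);

    /* 16-bit add wraps around and undoes the bias. */
    return _mm_add_epi16 (lo, bias16);
}

/* Returns the number of output lanes that differ from the input value. */
static inline int
bias_pack_mismatches (uint32_t v)
{
    uint16_t out[8];
    __m128i x = _mm_set1_epi32 ((int) v);
    int bad = 0;

    _mm_storeu_si128 ((__m128i *) out, packus_epi32_bias_sse2 (x, x));
    for (int i = 0; i < 8; i++)
        bad += (out[i] != (uint16_t) v);
    return bad;
}
```

Calling bias_pack_mismatches with values on both sides of 0x8000 is a quick way to confirm that the high bit survives the pack.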
‑‑‑‑‑
Reference solution
Method 1:
You could sign-extend the values first and then use _mm_packs_epi32:
<pre class="lang‑c prettyprint‑override">t0 = _mm_slli_epi32 (t0, 16);
t0 = _mm_srai_epi32 (t0, 16);
t1 = _mm_slli_epi32 (t1, 16);
t1 = _mm_srai_epi32 (t1, 16);
t0 = _mm_packs_epi32 (t0, t1);
</pre>
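To see why the sign extension helps: after the shift pair, each 32-bit lane holds its own low 16 bits reinterpreted as a signed 16-bit number, so the signed pack never has to saturate and simply emits those 16 bits. The following is a minimal sketch of that step in isolation. The function names are hypothetical. Unlike packusdw, it truncates out-of-range inputs rather than saturating them, which is fine when the lanes are already known to fit in 16 bits.

```c
#include <assert.h>     /* for callers that want to assert on the check helper */
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: keep the low 16 bits of each 32-bit lane and
 * pack the eight results into one register, using only SSE2. */
static inline __m128i
pack_low16_signext_sse2 (__m128i lo, __m128i hi)
{
    /* Shift left then arithmetic-shift right: sign-extends bit 15. */
    lo = _mm_srai_epi32 (_mm_slli_epi32 (lo, 16), 16);
    hi = _mm_srai_epi32 (_mm_slli_epi32 (hi, 16), 16);

    /* Every lane is now in [-0x8000, 0x7FFF]; no saturation occurs. */
    return _mm_packs_epi32 (lo, hi);
}

/* Returns the number of output lanes that differ from the expected
 * low 16 bits of a (lanes 0..3) and b (lanes 4..7). */
static inline int
signext_pack_mismatches (uint32_t a, uint32_t b)
{
    uint16_t out[8];
    __m128i lo = _mm_set1_epi32 ((int) a);
    __m128i hi = _mm_set1_epi32 ((int) b);
    int bad = 0;

    _mm_storeu_si128 ((__m128i *) out, pack_low16_signext_sse2 (lo, hi));
    for (int i = 0; i < 4; i++) {
        bad += (out[i] != (uint16_t) a);
        bad += (out[i + 4] != (uint16_t) b);
    }
    return bad;
}
```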
You could actually fold the earlier _mm_srli_epi32 (..., 5) shifts into the left shift, which saves two instructions:
<pre class="lang‑c prettyprint‑override">t0 = _mm_slli_epi32 (t0, 16 - 5);
t0 = _mm_srai_epi32 (t0, 16);
t1 = _mm_slli_epi32 (t1, 16 - 5);
t1 = _mm_srai_epi32 (t1, 16);
t0 = _mm_packs_epi32 (t0, t1);
</pre>
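Putting the pieces together, here is a hedged end-to-end sketch of the conversion using the combined shift, checked against a scalar reference. The mask and multiplier values are assumptions chosen to match the bit layout described in the question (they are not given in the text), and the function names are hypothetical.

```c
#include <assert.h>     /* for callers that want to assert on the check helper */
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Scalar reference: x8r8g8b8 -> r5g6b5. */
static inline uint16_t
x888_to_565_scalar (uint32_t p)
{
    return (uint16_t) (((p >> 8) & 0xF800) |
                       ((p >> 5) & 0x07E0) |
                       ((p >> 3) & 0x001F));
}

/* SSE2-only sketch: two registers of four x888 pixels -> eight 565 pixels. */
static inline __m128i
pack_565_sse2 (__m128i lo, __m128i hi)
{
    /* Assumed constants, not taken from the text above. */
    const __m128i mask_rb    = _mm_set1_epi32 (0x00F800F8);
    const __m128i multiplier = _mm_set1_epi32 (0x20000004);
    const __m128i mask_g     = _mm_set1_epi32 (0x0000FC00);

    /* blue * 4 + red * 0x2000 puts blue at bits 5..9, red at 16..20. */
    __m128i t0 = _mm_madd_epi16 (_mm_and_si128 (lo, mask_rb), multiplier);
    __m128i t1 = _mm_madd_epi16 (_mm_and_si128 (hi, mask_rb), multiplier);

    /* Green's top 6 bits already sit at bits 10..15. */
    t0 = _mm_or_si128 (t0, _mm_and_si128 (lo, mask_g));
    t1 = _mm_or_si128 (t1, _mm_and_si128 (hi, mask_g));

    /* The 565 value occupies bits 5..20: move it to bits 16..31, then
     * shift back arithmetically so each lane is sign-extended. */
    t0 = _mm_srai_epi32 (_mm_slli_epi32 (t0, 16 - 5), 16);
    t1 = _mm_srai_epi32 (_mm_slli_epi32 (t1, 16 - 5), 16);

    return _mm_packs_epi32 (t0, t1);
}

/* Returns the number of pixels where SSE2 and scalar results differ. */
static inline int
pack_565_mismatches (const uint32_t px[8])
{
    uint16_t out[8];
    __m128i lo = _mm_loadu_si128 ((const __m128i *) px);
    __m128i hi = _mm_loadu_si128 ((const __m128i *) (px + 4));
    int bad = 0;

    _mm_storeu_si128 ((__m128i *) out, pack_565_sse2 (lo, hi));
    for (int i = 0; i < 8; i++)
        bad += (out[i] != x888_to_565_scalar (px[i]));
    return bad;
}
```

Pixels with a bright red channel are the interesting test cases, since they set bit 15 of the 565 result, which is exactly where a plain signed pack would saturate.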
References