How to load 96 bits from memory into an XMM register?


Problem description

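Suppose I have a pointer to memory in rsi, and I'd like to load the 12-byte value it points to into the low 96 bits of xmm0. I don't care what happens to the high 32 bits. What is an efficient way to do this?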

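(Side question: the best approach I've come up with involves the movlpd "move low packed double-precision floating-point value" instruction. Is anything about this instruction specific to floating-point values? I don't understand why it is documented that way; surely it should work for integers too.)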


Reference solutions

Method 1:

If a 16-byte load won't cross into another page and fault, then use movups. The high 4 bytes will be whatever garbage is in memory there. Causing a cache miss for the 4 bytes you don't care about may be a problem, as might a cache‑line split.

Otherwise use movq / pinsrd (SSE4.1), or some other way of doing two loads + a shuffle. movq + pinsrd is going to be 3 fused‑domain uops on Intel SnB‑family CPUs, because pinsrd can't micro‑fuse. (And its ALU uop requires the shuffle port (p5)).
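For the "two loads + a shuffle" route, here is a minimal C intrinsics sketch. It is an illustration under stated assumptions, not the exact sequence named above: it sticks to SSE2 (roughly movq + movd + punpcklqdq) instead of the SSE4.1 pinsrd, and the names load96_sse2 and load96_bytes are hypothetical. The #else branch is only a byte-copy fallback so the snippet still compiles on a non-x86 host. The compiler is free to pick different instructions than the ones named in the comments.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h> /* SSE2 intrinsics */

/* Hypothetical helper: puts the 12 bytes at p into the low 96 bits of an
 * XMM value using an 8-byte and a 4-byte load. Bytes p[12..15] are never
 * read, so this cannot fault past the end of the buffer. */
__m128i load96_sse2(const void *p)
{
    int32_t third;
    /* 4-byte load of the third dword; memcpy avoids alignment assumptions. */
    memcpy(&third, (const unsigned char *)p + 8, sizeof third);
    __m128i lo64 = _mm_loadl_epi64((const __m128i *)p); /* movq: low 64 bits */
    __m128i hi32 = _mm_cvtsi32_si128(third);            /* movd: dword, rest 0 */
    return _mm_unpacklo_epi64(lo64, hi32);              /* punpcklqdq */
}

/* Stores the register image to memory so it can be inspected. */
void load96_bytes(const void *p, unsigned char out[16])
{
    assert(p != NULL && out != NULL);
    _mm_storeu_si128((__m128i *)out, load96_sse2(p));
}
#else
/* Assumption: not an SSE2 target. Plain copy with the same visible result. */
void load96_bytes(const void *p, unsigned char out[16])
{
    assert(p != NULL && out != NULL);
    memset(out, 0, 16);
    memcpy(out, p, 12);
}
#endif
```

With this variant the high 32 bits come out as zero rather than garbage, which is harmless given that the question doesn't care about them.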


Another possibility: AVX VMASKMOVPS xmm1, xmm2, m128.

Conditionally moves packed data elements from the second source operand into the corresponding data element of the destination operand, depending on the mask bits associated with each data element (MSB of 1st src operand).

... Faults will not occur due to referencing any memory location if the corresponding mask bit for that memory location is 0.

Intel Haswell: 3 fused‑domain uops (one load and two shuffle (p5)). 4c latency, one per 2c throughput.

It's probably not very good by comparison, especially if the surrounding code also has to shuffle, since it would compete for the same shuffle port (p5).


Your very‑rarely‑taken conditional branch that uses movups any time it's guaranteed not to fault is also 3 fused‑domain uops on the fast‑path, and one of them can run on port6 (not competing with vector ALUs at all). The LEA isn't on the critical path either.


movlpd is safe to use on any data. It will never fault or be slow with data that represents a floating‑point NaN, or anything like that. You only have to worry about that with instructions that are listed in the instruction set reference manual with a non‑empty "SIMD Floating‑Point Exceptions" section. For example, addps can generate "Overflow, Underflow, Invalid, Precision, Denormal" exceptions, but shufps says "None".

Method 2:

Peter Cordes's answer helped by making me think of pages, and I wound up just checking whether there was any chance we'd fault:

 // We'd like to perform only a single load from memory, but there's no 96‑bit
 // load instruction and it's not necessarily safe to load the full 128 bits
 // since this may read beyond the end of the buffer.
 //
 // However, observe that memory protection applies with granularity of at
 // most 4 KiB (the smallest page size). If the full 16 bytes lie within a
 // single 4 KiB page, then we're fine. If the 12 bytes we are to read
 // straddle a page boundary, then we're also fine (because the next four
 // bytes must lie in the second page, which we're already reading). The only
 // time we're not guaranteed to be okay to read 16 bytes is if the 12 bytes
 // we want to read lie near the end of one page, and some or all of the
 // following four bytes lie within the next page.
 //
 // In other words, the only time there's a risk is when the pointer mod 4096
 // is in the range [4081, 4085). This is <0.1% of addresses. Check for this
 // and handle it specially.
 //
 // We perform the check by adding 15 and then checking for the range [0, 4).
 lea rax, [rsi+15]
 test eax, 0xffc
 jz slow_read

 // Hooray, we can load from memory just once.
 movdqu xmm0, XMMWORD PTR [rsi]

done_reading:
 [...]

slow_read:
 movq xmm0, QWORD PTR [rsi]
 pinsrd xmm0, DWORD PTR [rsi+8], 2
 jmp done_reading
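
The range arithmetic behind the lea/test check can be sanity-checked exhaustively. The C sketch below is only an illustration under the same 4 KiB page assumption. The function names and PAGE_BYTES are hypothetical. It compares the bit test against the plain-language definition (the last wanted byte and the last loaded byte fall in different pages) for every offset within a page.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BYTES 4096u /* assumption: smallest page size is 4 KiB */

/* Same arithmetic as `lea rax, [rsi+15]` / `test eax, 0xffc` / `jz`:
 * bits 2..11 of p+15 are all zero exactly when (p+15) mod 4096 is 0..3. */
int needs_slow_read(uintptr_t p)
{
    return ((p + 15) & 0xffcu) == 0;
}

/* Definition from the prose: the 12 wanted bytes end in one page (p+11)
 * while the 16-byte load ends in the next one (p+15). */
int padding_crosses_page(uintptr_t p)
{
    return (p + 11) / PAGE_BYTES != (p + 15) / PAGE_BYTES;
}

/* Asserts that both definitions agree for every offset within a page. */
void check_all_offsets(void)
{
    for (uintptr_t off = 0; off < PAGE_BYTES; off++) {
        uintptr_t p = PAGE_BYTES + off; /* any page-aligned base works */
        assert(needs_slow_read(p) == padding_crosses_page(p));
    }
}

/* Counts how many of the 4096 offsets take the slow path. */
unsigned count_risky_offsets(void)
{
    unsigned n = 0;
    for (uintptr_t off = 0; off < PAGE_BYTES; off++)
        n += (unsigned)needs_slow_read(PAGE_BYTES + off);
    return n;
}
```

Calling check_all_offsets() passes silently, and count_risky_offsets() comes out as 4 (offsets 4081 through 4084), which is 4/4096 of addresses and matches the "<0.1%" figure in the comment.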

Method 3:

    movss xmm0, [rdx+8]         // load the third dword (byte offset 8, i.e. bit 64)
    pshufd xmm0, xmm0, 0x00     // broadcast it to all four dword lanes
    movlps xmm0, [rdx]          // overwrite the low 64 bits with the first two dwords

It worked in my case with floats; I'm not sure whether it suits yours.
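
To see which lanes end up where, here is a rough C intrinsics equivalent of that three-instruction sequence. It is a sketch under assumptions: load3_floats is a hypothetical name, the broadcast uses shufps (_mm_shuffle_ps) rather than pshufd, and the #else branch is a scalar fallback that only exists so the snippet compiles on non-x86 hosts.

```c
#include <assert.h>
#include <string.h>

#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h> /* SSE intrinsics */

/* Hypothetical helper: loads three floats with a 4-byte and an 8-byte load
 * and writes the resulting four lanes to out. */
void load3_floats(const float *p, float out[4])
{
    assert(p != NULL && out != NULL);
    __m128 v = _mm_load_ss(p + 2);         /* movss: lane 0 = p[2], others 0 */
    v = _mm_shuffle_ps(v, v, 0x00);        /* broadcast lane 0 to all lanes */
    v = _mm_loadl_pi(v, (const __m64 *)p); /* movlps: lanes 0..1 = p[0..1] */
    _mm_storeu_ps(out, v);
}
#else
/* Assumption: not an SSE target. Scalar code with the same lane layout. */
void load3_floats(const float *p, float out[4])
{
    assert(p != NULL && out != NULL);
    memcpy(out, p, 3 * sizeof(float));
    out[3] = p[2];
}
#endif
```

The lanes come out as p[0], p[1], p[2], p[2]: the top lane holds a copy of the third element instead of garbage or zero, which is fine when the high 32 bits don't matter.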

(by jacobsa, Peter Cordes, jacobsa, hobbyAndroidDev)

References

  1. How to load 96 bits from memory into an XMM register? (CC BY‑SA 2.5/3.0/4.0)

#sse #intel #assembly #sse4 #sse2





