Is it more efficient to multiply within the address displacement or outside it?


Problem Description


I'm writing an x86 assembly routine that takes the following arguments:

  1. [ESP+4]: the number of 32-bit integers that follow.
  2. [ESP+8] onward: the list of 32-bit integers to add together.

It returns the sum of all the integers passed starting at [ESP+8]. So, basically, the C prototype of the function would be:

int addN(int numberofitems, ...);
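
For illustration, a hypothetical cdecl-style call site would push the arguments right to left, so that after the call the count sits at [ESP+4] and the integers follow it:

    push 3          ; last integer
    push 2
    push 1          ; first integer
    push 3          ; numberofitems
    call addN       ; returns 1+2+3 = 6 in eax
    add  esp, 16    ; caller pops the arguments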

I can choose between two ways of writing this x86 assembly routine:

The first way (multiplying by the item size inside the address displacement):

addN PROC
    mov ecx, dword ptr [esp+4] ; Dec ecx per argument processed
    mov edx, 2 ; Offset into variable length argument list
    xor eax, eax ; Accumulator
    AdderLoop:
        add eax, dword ptr [esp+edx*4]
        inc edx
        dec ecx
        jnz AdderLoop
    ret
addN ENDP

The second way (adding the item size to edx itself):

addN PROC
    mov ecx, dword ptr [esp+4] ; Dec ecx per argument processed
    mov edx, 8 ; Offset into variable length argument list
    xor eax, eax ; Accumulator
    AdderLoop:
        add eax, dword ptr [esp+edx]
        add edx, 4
        dec ecx
        jnz AdderLoop
    ret
addN ENDP

Reference Solutions

Solution 1:

With modern CPUs it is very tricky to theorize about the performance of a particular piece of source code. But I will burn myself anyway.

Especially as I haven't bothered to learn anything about ASM performance coding in the last decade, so most of my comments are based on tiny glimpses of things here and there, not on any thorough knowledge and experience.

Step zero: figure out how you will profile your code. Without real-world profiling you will get nowhere, as everything I describe next can make the result both faster and slower, obviously on different target CPUs, but even on the same target machine: it depends on how the rest of the executable lands around it, i.e. how the alignment turns out for other functions and how the cache covers the other functions' code.

First thing: use an align directive at the start of the loop (or align the procedure start in such a way that the first instruction of the loop is aligned). How much? 16 looks like it will generally speed things up on most current CPUs. This may have a real effect on performance, but not an exclusively positive one; using it only for frequently branched-to addresses is recommended.
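
For example, a minimal sketch (assuming the enclosing segment's alignment permits align 16 inside the procedure):

addN PROC
    mov ecx, dword ptr [esp+4] ; count
    mov edx, 2                 ; index of the first integer
    xor eax, eax               ; accumulator
    align 16                   ; pad with NOPs so AdderLoop starts on a 16-byte boundary
    AdderLoop:
        add eax, dword ptr [esp+edx*4]
        inc edx
        dec ecx
        jnz AdderLoop
    ret
addN ENDP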

Subtleties:

Let's test a few variants and see how they compile into machine code:

0:  8b 04 94                mov    eax,DWORD PTR [esp+edx*4]
3:  8b 04 14                mov    eax,DWORD PTR [esp+edx*1]
6:  8b 04 24                mov    eax,DWORD PTR [esp]
9:  8b 44 95 00             mov    eax,DWORD PTR [ebp+edx*4+0x0]
d:  8b 44 15 00             mov    eax,DWORD PTR [ebp+edx*1+0x0]
11: 8b 45 00                mov    eax,DWORD PTR [ebp+0x0] 

As you can see, the *4 and *1 variants have the same length, and their performance will be equal, so you don't have to worry about that *4 in the addressing.

So use whichever mode makes the rest of the code shorter/faster. inc edx is a 1-byte opcode, add edx,4 is 3 bytes long, so I would go for the first one, simply because in a complex executable the shorter machine code fits into the cache better, and there shouldn't be any performance difference between inc and add on modern x86, when considered in isolation from the rest of the code. When not considered in isolation, inc was evil on the Intel Pentium 4 CPUs a few years back (it updates only part of the flags register, leaving CF untouched, which caused false dependencies there), but the recent generations are OK again, so it should be as fast as add.

(Now, I did notice you use add eax,..., so here, one more time, are the different addressing variants for that one:)

0:  42                      inc    edx
1:  83 c2 04                add    edx,0x4
4:  03 04 94                add    eax,DWORD PTR [esp+edx*4]
7:  03 44 95 00             add    eax,DWORD PTR [ebp+edx*4+0x0]
b:  03 04 14                add    eax,DWORD PTR [esp+edx*1]
e:  03 44 15 00             add    eax,DWORD PTR [ebp+edx*1+0x0]
12: 03 45 00                add    eax,DWORD PTR [ebp+0x0] 

Now, I thought I had seen something about addressing through esp requiring an additional prefix byte, but I don't see it here, so maybe that was in 16-bit code? That's why I tested the ebp variants too, to get rid of esp. (In fact, any esp-based addressing requires a SIB byte, which is already included in the lengths above, while an ebp base forces a displacement byte, even a +0x0 one.) But as esp gives the shorter machine code, I would keep it just as you are using it now.


On some older CPUs it would probably be faster to interleave the dependent instructions:

AdderLoop:
    add eax, dword ptr [esp+edx*4]
    dec ecx                    ; sets ZF for the jnz below
    lea edx, [edx+1]           ; increments edx without touching the flags
    jnz AdderLoop

But the latest architectures use something called "macro-fusion" of instructions, so dec + jnz pairs should now be kept adjacent.


And if you know the number of arguments will most of the time be considerably big (unlikely, as that would overflow the result value), you can of course unroll the loop for a few iterations (4, 8 or 16; I wouldn't go higher due to cache pollution from large code), as sketched below.
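
A minimal sketch of a 4x unroll (assuming, for simplicity, that the count is always a non-zero multiple of 4):

    AdderLoop:
        add eax, dword ptr [esp+edx*4]
        add eax, dword ptr [esp+edx*4+4]
        add eax, dword ptr [esp+edx*4+8]
        add eax, dword ptr [esp+edx*4+12]
        add edx, 4               ; advance four elements
        sub ecx, 4               ; four arguments processed per iteration
        jnz AdderLoop

Note that all four adds still feed a single accumulator, so the chain remains latency-bound; Solution 2 below shows how multiple accumulators avoid that.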


Then again, if the number of arguments were considerably high, you would probably end up waiting for memory loads for most of the loop anyway.

And then any of the code variants above would end up with the same performance, as a memory cache miss costs tens to hundreds of instructions, dwarfing any single-instruction nuance in the addressing mode.

Did I warn you that it's very tricky? I did, in the first sentence.


Conclusion:

Don't bother with this, you are wasting your time.

Simply write the simplest, most readable source that is correct (in your particular case I prefer the *4 variant with its "index"-like source).

Once you have your application done, profile it.

Fix real bottlenecks.

Solution 2:

In the binary machine code, the scale factor is encoded as a 2-bit shift count (which is why only the powers of 2 from 1 to 8 are supported, i.e. shift counts 0 to 3, not arbitrary multipliers). So [esp+edx] in machine code is really encoded as [esp+edx*1]: there's still a shift amount, but it's set to 0.

A shift count of 0 (i.e. a scale factor of 1) is not a special case for the hardware, because shifting is really easy for hardware to do. So really, both your loops are using the same addressing mode as far as the hardware's internal behaviour is concerned.
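
For example, breaking down the SIB bytes from the dump above (bit fields are scale:2, index:3, base:3):

8b 04 94    mov eax, [esp+edx*4]    ; SIB = 94h = 10 010 100b: shift 2 (*4), index edx, base esp
8b 04 14    mov eax, [esp+edx*1]    ; SIB = 14h = 00 010 100b: shift 0 (*1), index edx, base esp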

So @Ped7g is right: the difference between your loops just comes down to saving code size by using inc instead of add.


Actual speedups

See the x86 tag wiki for performance links, especially Agner Fog's guides.

Obviously, summing an array will go much faster with SSE2 or AVX2 vectors. Use PADDD. (And since you need to go 16 bytes at a time, you can't use INC and a scale factor. You could ADD by 4, and use a scale factor of 4.)
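
A minimal, untested SSE2 sketch of such an inner loop (assuming edx points at the data, ecx holds the one-past-the-end address, and the count is a multiple of 4):

    pxor   xmm0, xmm0                   ; vector accumulator = {0,0,0,0}
    VecLoop:
        movdqu xmm1, xmmword ptr [edx]  ; unaligned load: stack args are only 4-byte aligned
        paddd  xmm0, xmm1               ; add four dwords at once
        add    edx, 16
        cmp    edx, ecx
        jb     VecLoop
    ; horizontal sum of the four lanes into eax
    pshufd xmm1, xmm0, 0EEh             ; move the high qword down
    paddd  xmm0, xmm1
    pshufd xmm1, xmm0, 055h             ; move lane 1 down to lane 0
    paddd  xmm0, xmm1
    movd   eax, xmm0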

It can be more efficient to avoid using an indexed addressing mode altogether. Intel Sandybridge-family CPUs before Skylake can't micro-fuse indexed addressing modes.

addN PROC
    mov   ecx, dword ptr [esp+4] ; length
    lea   edx, [esp+8]           ; start of args
    lea   ecx, [edx + ecx*4]     ; end pointer
    xor   eax, eax    ; Accumulator
    AdderLoop:                      ; do{
        add    eax, dword ptr [edx]
        add    edx, 4
        cmp    edx, ecx
        jb     AdderLoop            ; } while(p < endp)
    ret
addN ENDP

The add eax, dword ptr [edx] can stay micro-fused even on Sandybridge, and CMP/JB can macro-fuse on more CPUs than DEC/JNZ (AMD, and Intel Core2/Nehalem, can only fuse CMP/JB). Note that this costs us an extra instruction outside the loop to set up the end pointer.


You can even reduce the number of instructions inside the loop by counting an index up towards zero and using that counter to index from the end of the array. Or, since you're just summing the array, order doesn't matter and you can simply loop downward:

addN PROC
    mov   ecx, dword ptr [esp+4] ; length
    xor   eax, eax    ; Accumulator
    AdderLoop:                      ; do{
        add    eax, dword ptr [esp+8 + ecx*4 - 4]  ; the +8 and -4 reduce down to just +4, but written this way for clarity.
        dec    ecx
        jnz    AdderLoop            ; } while(idx != 0)
    ret
addN ENDP

Since modern x86 CPUs can do two loads per clock, we're only getting half the available load throughput without unrolling; that holds for all of these indexing methods.

(The following version is not actually optimal: it demonstrates the counting-upward-towards-zero technique I mentioned before. I didn't have time to rewrite it after realizing that looping downward would be best.)

;; untested: unroll by two with a clever way to handle the odd element
addN PROC
    mov   ecx, dword ptr [esp+4] ; length
    lea   edx, [esp+8 + ecx*4]   ; one-past-the-end

    xor   eax, eax        ; sum1
    push  esi
    xor   esi, esi        ; sum2

    ;; Unrolling means extra work to handle the case where the length is odd
    shr   ecx, 1          ; ecx /= 2, shifting the low bit into CF
    cmovc eax, dword ptr [esp+12] ; sum1 = first element if count was odd
                                  ; ([esp+12], not [esp+8]: push esi moved esp down by 4)

    neg   ecx             ; edx + ecx*8 == 1st or 2nd remaining element
    jz    TailDone        ; no full pairs to sum (count was 0 or 1)

    AdderLoop:                   ; do{
        add    eax, dword ptr [edx + ecx*8]
        add    esi, dword ptr [edx + ecx*8 + 4]
        inc    ecx
        jl    AdderLoop          ; } while(idx < 0)

    TailDone:
    add   eax, esi
    pop   esi
    ret
addN ENDP

This should run twice as fast (if the data is hot in L1 cache) on some CPUs. Using multiple accumulators (in this case EAX and ESI) is a very useful technique for higher-latency operations, like FP add. We only needed two here because integer ADD has 1-cycle latency on every x86 microarchitecture.
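
(A hypothetical scalar-SSE sketch of the same multiple-accumulator idea for FP, where the 3+ cycle latency of addss would otherwise limit the loop to one add per 3+ cycles:)

    xorps  xmm0, xmm0                    ; sum1
    xorps  xmm1, xmm1                    ; sum2
    FpLoop:
        addss  xmm0, dword ptr [edx]     ; two independent dependency chains,
        addss  xmm1, dword ptr [edx+4]   ; so the adds' latencies overlap
        add    edx, 8
        cmp    edx, ecx                  ; ecx = one-past-the-end
        jb     FpLoop
    addss  xmm0, xmm1                    ; combine the partial sums at the end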

On Intel pre-Skylake, using non-indexed addressing modes (and add edx, 8) would be better, since there are two memory-addressing operations per loop, but still only one branch (which would need a CMP/JB instead of testing flags set by incrementing the index).
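
A sketch of the unrolled loop body rewritten that way (reusing ecx as the end pointer, as in the cmp/jb version above):

    AdderLoop:
        add    eax, dword ptr [edx]     ; base-only / base+disp addressing modes
        add    esi, dword ptr [edx+4]   ; stay micro-fused on Sandybridge
        add    edx, 8
        cmp    edx, ecx                 ; ecx = one-past-the-end
        jb     AdderLoop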

When unrolling, it's more common to just use a not-unrolled loop to handle the first or last left-over iterations. Here I was able to use a shift and CMOV to initialize one of the accumulators, since we're only unrolling by 2 and indexed addressing modes go up to a scale factor of 8. (I could also have masked off ecx's low bit (and ecx, -2) instead of shifting it, compensating with a scale factor of 4 instead of 8.)

(by Govind Parmar, Ped7g, Peter Cordes)

References

  1. Is it more efficient to multiply within the address displacement or outside it? (CC BY-SA 2.5/3.0/4.0)

#offset #assembly #cpu-registers #x86 #optimization





