Floating point number calculation using a 32 bit word


Problem description


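I'm reading the fifth edition of Patterson's Computer Organization book, and I'm confused by two pages of the text. The first page:

[image of the textbook page, not reproduced here]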

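Is the first word equal to 0.5 in decimal? I see a sign of 0, an exponent of -1, and a fraction of 0, with an implied 1 in the significand. So 1.0_two * 2^-1 = 0.5? Is that right?

Why is 1.0 * 2^1 the "smaller binary number"? Isn't the second word larger? It has a sign of 0, an exponent of 1, and a significand of just the implied 1, so 1.0 * 2^1 = 2? Is that right?

I don't understand the following paragraph:

The desirable notation must therefore represent the most negative exponent as 00 ... 00_two and the most positive as 11 ... 11_two. This convention is called biased notation, with the bias being the number subtracted from the normal, unsigned representation to determine the real value.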


Reference solution

Method 1:

If you look at them just as binary numbers, the first one is 0x7f800000 while the second is 0x00800000, so the second is a smaller binary number even though it represents a larger floating point number. So using a binary comparison or sort would do the wrong thing.
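To make those two hex values concrete, here is a minimal sketch that packs the fields by hand. It assumes the hypothetical layout the book uses for illustration (1 sign bit, an 8-bit two's-complement exponent, a 23-bit fraction); the helper name `pack_twos_complement_exponent` is made up for this example and is not a real format or library call.

```python
def pack_twos_complement_exponent(sign, exponent, fraction=0):
    """Pack a 32-bit word using a hypothetical (non-IEEE) layout:
    1 sign bit, 8-bit two's-complement exponent, 23-bit fraction."""
    return (sign << 31) | ((exponent & 0xFF) << 23) | fraction

half = pack_twos_complement_exponent(0, -1)  # 1.0_two * 2^-1 = 0.5
two = pack_twos_complement_exponent(0, 1)    # 1.0_two * 2^1  = 2.0

print(f"{half:#010x}")  # 0x7f800000
print(f"{two:#010x}")   # 0x00800000

# As plain integers the ordering is backwards: the word for 0.5 is larger.
print(half > two)       # True
```

The exponent -1 becomes 0xFF in 8 bits, which is why the word for the smaller value looks like the bigger integer.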

So instead the biased representation for the exponent is used, which means the binary value for 0.5 is 0x3f000000 and the binary value for 2.0 is 0x40000000, and the binary comparison "works" for comparing and sorting floating point numbers.
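The biased values can be checked directly, since Python's standard `struct` module can produce the IEEE 754 single-precision bit pattern of a number. This is only a quick sketch; `float_bits` and `biased_exponent` are helper names chosen for this example.

```python
import struct

def float_bits(x):
    """Return the IEEE 754 single-precision bit pattern of x as an unsigned int."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def biased_exponent(bits):
    """Extract the 8-bit exponent field (stored with a bias of 127)."""
    return (bits >> 23) & 0xFF

half = float_bits(0.5)
two = float_bits(2.0)

print(f"{half:#010x}", biased_exponent(half) - 127)  # 0x3f000000 -1
print(f"{two:#010x}", biased_exponent(two) - 127)    # 0x40000000 1

# With the bias, integer order matches floating point order for positive values.
print(half < two)  # True
```

The stored exponent fields are 126 and 128, and subtracting the bias of 127 gives the real exponents -1 and 1.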

The remaining problem is that this is still a sign+magnitude representation, so you need a sign+magnitude binary comparison, while most hardware uses two's-complement for integers. So you still end up needing special floating point comparison instructions/hardware.
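A rough sketch of the sign+magnitude issue: reinterpreting the bit patterns as two's-complement integers orders negative values backwards. The `sortable_key` function shows one commonly used software workaround (flip every bit of negative values, set the sign bit of non-negative ones); it is not part of the answer above, all helper names are made up for this example, and NaNs are ignored.

```python
import struct

def float_bits(x):
    """Return the IEEE 754 single-precision bit pattern of x as an unsigned int."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def as_signed32(bits):
    """Reinterpret a 32-bit pattern as a two's-complement integer."""
    return bits - (1 << 32) if bits & 0x80000000 else bits

def sortable_key(x):
    """Map a float to an unsigned integer whose order matches float order."""
    bits = float_bits(x)
    if bits & 0x80000000:
        return bits ^ 0xFFFFFFFF   # negative: flip all bits to reverse the order
    return bits | 0x80000000       # non-negative: move above all negatives

values = [2.0, -1.0, 0.5, -2.0]

naive = sorted(values, key=lambda v: as_signed32(float_bits(v)))
fixed = sorted(values, key=sortable_key)

print(naive)  # [-1.0, -2.0, 0.5, 2.0]  (negatives in the wrong order)
print(fixed)  # [-2.0, -1.0, 0.5, 2.0]
```

Note that this key treats -0.0 as smaller than 0.0, whereas a floating point comparison considers them equal.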

(by Jwan622, Chris Dodd)

References

  1. Floating point number calculation using a 32 bit word (CC BY‑SA 2.5/3.0/4.0)

#floating-point #computer-science





