Problem Description
How to sort by value efficiently in PySpark?
I want to sort my K,V tuples by V, i.e. by value. I know that takeOrdered can do this if you know how many elements you need:
b = sc.parallelize([('t',3),('b',4),('c',1)])
Using takeOrdered:
b.takeOrdered(3,lambda atuple: atuple[1])
Using lambda:
b.map(lambda aTuple: (aTuple[1], aTuple[0])).sortByKey().map(
    lambda aTuple: (aTuple[1], aTuple[0])).collect()
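To see what the swap-sort-swap pipeline above actually computes, here is a pure-Python sketch of the same three steps (no Spark required; plain lists stand in for RDDs):

```python
data = [('t', 3), ('b', 4), ('c', 1)]

# Step 1: swap each tuple so the value becomes the key.
swapped = [(v, k) for (k, v) in data]
# Step 2: sort by the new key -- what sortByKey() does across partitions.
swapped.sort(key=lambda pair: pair[0])
# Step 3: swap back to restore the original (key, value) schema.
result = [(k, v) for (v, k) in swapped]

print(result)  # [('c', 1), ('t', 3), ('b', 4)]
```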
I have checked the question here, which suggests the latter. I find it hard to believe that takeOrdered is so concise and yet requires the same number of operations as the lambda solution.
Does anyone know of a simpler, more concise transformation in Spark to sort by value?
Reference Solutions
Approach 1:
I think sortBy() is more concise:
b = sc.parallelize([('t', 3),('b', 4),('c', 1)])
bSorted = b.sortBy(lambda a: a[1])
bSorted.collect()
...
[('c', 1),('t', 3),('b', 4)]
It's actually no more efficient, as it still involves keying by the values, sorting by the keys, and then grabbing the values, but it looks prettier than your latter solution. In terms of efficiency, I don't think you'll find a better approach, since any solution needs a way to transform your data so that the values become your keys (and then eventually transform that data back to the original schema).
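The claim above can be illustrated with a pure-Python sketch of the keyBy → sortByKey → values pipeline that sortBy amounts to (the helper name `sort_by` is just for illustration; plain lists stand in for RDDs):

```python
def sort_by(records, keyfunc):
    # Mimics sortBy conceptually: key each record by keyfunc...
    keyed = [(keyfunc(rec), rec) for rec in records]   # like keyBy(keyfunc)
    # ...sort by that key...
    keyed.sort(key=lambda pair: pair[0])               # like sortByKey()
    # ...then drop the key, restoring the original schema.
    return [rec for (_, rec) in keyed]                 # like values()

b_local = [('t', 3), ('b', 4), ('c', 1)]
print(sort_by(b_local, lambda a: a[1]))  # [('c', 1), ('t', 3), ('b', 4)]
```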
Approach 2:
Just wanted to add this tip, which helped me out a lot.
Ascending:
bSorted = b.sortBy(lambda a: a[1])
Descending:
bSorted = b.sortBy(lambda a: -a[1])
(by makansij, Rohan Aletty, REZ)