Problem Description
How to sort by value efficiently in PySpark?
I want to sort my K,V tuples by V, i.e. by value. I know that takeOrdered can do this if you know how many elements you need:
b = sc.parallelize([('t',3),('b',4),('c',1)])
Using takeOrdered:
b.takeOrdered(3,lambda atuple: atuple[1])
Using lambda:
b.map(lambda aTuple: (aTuple[1], aTuple[0])).sortByKey().map(
    lambda aTuple: (aTuple[1], aTuple[0])).collect()
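To see what the swap-sort-swap pipeline above actually computes, here is a pure-Python sketch of the same three steps (no Spark required; plain lists stand in for RDDs):

```python
data = [('t', 3), ('b', 4), ('c', 1)]

# Step 1: swap each tuple so the value becomes the key.
swapped = [(v, k) for (k, v) in data]
# Step 2: sort by the new key -- what sortByKey() does across partitions.
swapped.sort(key=lambda pair: pair[0])
# Step 3: swap back to restore the original (key, value) schema.
result = [(k, v) for (v, k) in swapped]

print(result)  # [('c', 1), ('t', 3), ('b', 4)]
```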
I have checked the question here, which suggests the latter. I find it hard to believe that takeOrdered is so concise and yet requires the same number of operations as the lambda solution.
Does anyone know of a simpler, more concise transformation in Spark to sort by value?
Reference Solutions
Approach 1:
I think sortBy() is more concise:
b = sc.parallelize([('t', 3),('b', 4),('c', 1)])
bSorted = b.sortBy(lambda a: a[1])
bSorted.collect()
...
[('c', 1),('t', 3),('b', 4)]
It's actually no more efficient, as it still involves keying by the values, sorting by the keys, and then grabbing the values, but it looks prettier than your latter solution. In terms of efficiency, I don't think you'll find a better approach, since any solution needs a way to transform your data so that the values become your keys (and then eventually transform that data back to the original schema).
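The claim above can be illustrated with a pure-Python sketch of the keyBy → sortByKey → values pipeline that sortBy amounts to (the helper name `sort_by` is just for illustration; plain lists stand in for RDDs):

```python
def sort_by(records, keyfunc):
    # Mimics sortBy conceptually: key each record by keyfunc...
    keyed = [(keyfunc(rec), rec) for rec in records]   # like keyBy(keyfunc)
    # ...sort by that key...
    keyed.sort(key=lambda pair: pair[0])               # like sortByKey()
    # ...then drop the key, restoring the original schema.
    return [rec for (_, rec) in keyed]                 # like values()

b_local = [('t', 3), ('b', 4), ('c', 1)]
print(sort_by(b_local, lambda a: a[1]))  # [('c', 1), ('t', 3), ('b', 4)]
```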
Approach 2:
Just wanted to add this tip, which helped me out a lot.
Ascending:
bSorted = b.sortBy(lambda a: a[1])
Descending:
bSorted = b.sortBy(lambda a: -a[1])
(by makansij, Rohan Aletty, REZ)