Problem description
Pandas - assign histogram bucket to each row
Here is my DataFrame:
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3, 4, 6, 4, 3, 2, 7]})
buckets = [(0,3),(3,5),(5,9)]
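I also have the histogram buckets shown above. Now I want to assign each row of the DataFrame to a bucket index, so I want a new column containing the following: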
df['buckets_index'] = [0,0,0,1,2,1,0,0,2]
Of course, I could do this with a loop, but the DataFrame is fairly large (2.5 million rows), so it needs to be fast.
Any ideas?
Reference solutions
Method 1:
You can use pd.cut, with labels=False if you only want the index. Note that the bins are passed as a flat list of edges rather than as tuples:
buckets = [0,3,5,9]
df['bucket'] = pd.cut(df['A'], bins=buckets)
df['bucket_idx'] = pd.cut(df['A'], bins=buckets, labels=False)
The resulting output:
A bucket bucket_idx
0 1 (0, 3] 0
1 2 (0, 3] 0
2 3 (0, 3] 0
3 4 (3, 5] 1
4 6 (5, 9] 2
5 4 (3, 5] 1
6 3 (0, 3] 0
7 2 (0, 3] 0
8 7 (5, 9] 2
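As a rough sketch of how pd.cut treats the edges: with the default right=True each bin is closed on the right, so 3 falls in (0, 3], and values outside the outermost edges get no bin at all. The name `edges` and the sample values 0 and 10 are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 6, 4, 3, 2, 7]})
edges = [0, 3, 5, 9]  # flat edges equivalent to (0,3), (3,5), (5,9)

# labels=False returns the integer bin index instead of an Interval
bucket_idx = pd.cut(df['A'], bins=edges, labels=False)
print(bucket_idx.tolist())

# Values outside (0, 9] fall in no bin, so pd.cut returns NaN for them.
# 0 is excluded too, because include_lowest defaults to False.
outside = pd.cut(pd.Series([0, 10]), bins=edges, labels=False)
print(outside.isna().tolist())
```

If the lower edge should count as part of the first bin, passing include_lowest=True to pd.cut is one option.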
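Method 2:
You could use np.searchsorted on the upper limits of the buckets. Here buckets is the original list of tuples from the question:
import numpy as np
df['buckets_index'] = np.asarray(buckets)[:,1].searchsorted(df.A.values)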
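A self-contained sketch of the same idea, assuming the tuples are contiguous and sorted as in the question. The intermediate name `upper_edges` is introduced here only for readability.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 6, 4, 3, 2, 7]})
buckets = [(0, 3), (3, 5), (5, 9)]

# Second column of the tuple list: the upper limit of each bucket
upper_edges = np.asarray(buckets)[:, 1]

# The default side='left' returns the first index i with value <= upper_edges[i],
# which matches the right-closed intervals (0, 3], (3, 5], (5, 9]
df['buckets_index'] = upper_edges.searchsorted(df['A'].values)
print(df['buckets_index'].tolist())
```

Because only the upper limits are used, this approach relies on the buckets having no gaps between them.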
Runtime test:
In [522]: df = pd.DataFrame({'A': np.random.randint(1,8,(10000))})
In [523]: buckets = [0,3,5,9]
In [524]: %timeit pd.cut(df['A'], bins=buckets, labels=False)
1000 loops, best of 3: 460 µs per loop # @root's soln
In [525]: buckets = [(0,3),(3,5),(5,9)]
In [526]: %timeit np.asarray(buckets)[:,1].searchsorted(df.A.values)
10000 loops, best of 3: 166 µs per loop
Out-of-range cases: a value above the last upper limit would get index len(buckets), which is not a valid bucket, so for such cases we need to clip the result, like so:
np.asarray(buckets)[:,1].searchsorted(df.A.values).clip(max=len(buckets)-1)
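To make the out-of-range behaviour concrete, the sketch below runs searchsorted on a few sample values. Clipping folds values above the top edge into the last bucket, while values at or below the lowest edge already land in bucket 0. The `flagged` variant that marks out-of-range values with -1 is a hypothetical alternative, not something the original answer proposes.

```python
import numpy as np

buckets = [(0, 3), (3, 5), (5, 9)]
arr = np.asarray(buckets)
lower, upper = arr[0, 0], arr[:, 1]   # lowest edge 0, upper limits [3, 5, 9]

values = np.array([-1, 0, 2, 9, 10])  # illustrative sample values

raw = upper.searchsorted(values)          # 10 gets index 3, past the last bucket
clipped = raw.clip(max=len(buckets) - 1)  # 10 is folded into the last bucket

# Hypothetical alternative: mark anything outside (0, 9] with -1 instead
in_range = (values > lower) & (values <= upper[-1])
flagged = np.where(in_range, raw, -1)

print(raw.tolist(), clipped.tolist(), flagged.tolist())
```

Which behaviour is preferable depends on whether out-of-range rows should be counted in the edge buckets or excluded.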
(by user1700890, root, Divakar)