Pandas - assign histogram bucket to each row


Problem description

Here is my DataFrame:

import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3, 4, 6, 4, 3, 2, 7]})
buckets = [(0,3),(3,5),(5,9)]

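I also have the histogram buckets shown above. Now I want to assign each row of the DataFrame to a bucket index, so I want a new column containing the following: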

df['buckets_index'] = [0, 0, 0, 1, 2, 1, 0, 0, 2]

Of course, I could do this with a loop, but my DataFrame is fairly large (2.5 million rows), so it needs to be fast.

Any ideas?


Reference solutions

Method 1:

You can use pd.cut, with labels=False if you only want the index:

buckets = [0,3,5,9]
df['bucket'] = pd.cut(df['A'], bins=buckets)
df['bucket_idx'] = pd.cut(df['A'], bins=buckets, labels=False)

The resulting output:

   A  bucket  bucket_idx
0  1  (0, 3]           0
1  2  (0, 3]           0
2  3  (0, 3]           0
3  4  (3, 5]           1
4  6  (5, 9]           2
5  4  (3, 5]           1
6  3  (0, 3]           0
7  2  (0, 3]           0
8  7  (5, 9]           2
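
The output above shows right-closed intervals, which is the default for pd.cut. As a minimal sketch of how the interval edges can be controlled (the names edges, idx_right, idx_left and lowest are illustrative and not part of the original answer), the right and include_lowest parameters change which side of each bin is closed:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 6, 4, 3, 2, 7]})
edges = [0, 3, 5, 9]  # illustrative name for the bin edges

# Default right=True: bins are (0, 3], (3, 5], (5, 9]
idx_right = pd.cut(df['A'], bins=edges, labels=False)

# right=False: bins are [0, 3), [3, 5), [5, 9), so a value of 3 moves to bucket 1
idx_left = pd.cut(df['A'], bins=edges, labels=False, right=False)

# include_lowest=True also admits the lowest edge (0) into the first bin
lowest = pd.cut(pd.Series([0, 3]), bins=edges, labels=False, include_lowest=True)

print(idx_right.tolist())
print(idx_left.tolist())
print(lowest.tolist())
```

Which variant is appropriate depends on how the bucket boundaries are meant to be read; the expected column in the question corresponds to the default right-closed bins.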

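Method 2:

You could use np.searchsorted on the upper edge of each bucket:

import numpy as np
# buckets is the original list of (low, high) tuples from the question
df['buckets_index'] = np.asarray(buckets)[:,1].searchsorted(df.A.values)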
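
To make the one-liner easier to follow, here is a step-by-step sketch on the question's data. The intermediate name upper_edges is introduced only for illustration. It assumes the buckets are sorted and contiguous, so only their upper edges matter:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 6, 4, 3, 2, 7]})
buckets = [(0, 3), (3, 5), (5, 9)]

# Second column of the (low, high) pairs: array([3, 5, 9])
upper_edges = np.asarray(buckets)[:, 1]

# searchsorted returns, for each value, the position at which it would be
# inserted to keep upper_edges sorted. With the default side='left', a value
# equal to an upper edge stays in that bucket, i.e. buckets behave as (low, high].
df['buckets_index'] = upper_edges.searchsorted(df['A'].values)

print(df)
```

On this data the result agrees with the column requested in the question.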

Runtime test:

In [522]: df = pd.DataFrame({'A': np.random.randint(1,8,(10000))})

In [523]: buckets = [0,3,5,9]

In [524]: %timeit pd.cut(df['A'], bins=buckets, labels=False)
1000 loops, best of 3: 460 µs per loop # @root's soln

In [525]: buckets = [(0,3),(3,5),(5,9)]

In [526]: %timeit np.asarray(buckets)[:,1].searchsorted(df.A.values)
10000 loops, best of 3: 166 µs per loop
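
The timings above come from an IPython session on 10,000 rows. As a rough sketch of how the comparison could be repeated outside IPython at the size mentioned in the question, the standard timeit module can be used. The helper names with_cut and with_searchsorted are made up for this example, and the actual numbers will depend on the machine and library versions:

```python
import timeit

import numpy as np
import pandas as pd

n_rows = 2_500_000  # size mentioned in the question
rng = np.random.default_rng(0)
df = pd.DataFrame({'A': rng.integers(1, 8, size=n_rows)})

edges = [0, 3, 5, 9]
upper_edges = np.asarray(edges[1:])  # array([3, 5, 9])


def with_cut():
    # Method 1: pd.cut with integer labels
    return pd.cut(df['A'], bins=edges, labels=False)


def with_searchsorted():
    # Method 2: binary search against the upper edges
    return upper_edges.searchsorted(df['A'].values)


for fn in (with_cut, with_searchsorted):
    best = min(timeit.repeat(fn, number=1, repeat=3))
    print(f"{fn.__name__}: {best * 1e3:.1f} ms")

# Both methods should agree when every value lies inside the bin range
same = np.array_equal(with_cut().to_numpy(), with_searchsorted())
print("results identical:", same)
```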

Out-of-range cases: values above the last upper edge would get the index len(buckets), which is not a valid bucket, so for such cases we need to use clipping, like so:

np.asarray(buckets)[:,1].searchsorted(df.A.values).clip(max=len(buckets)-1)
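
A small sketch of what clipping does on out-of-range input, using made-up sample values. Note that clipping silently folds out-of-range values into the first or last bucket, whereas pd.cut marks them as NaN. The last step, which flags them with -1 instead, is an alternative added here for illustration and is not part of the original answer:

```python
import numpy as np
import pandas as pd

buckets = [(0, 3), (3, 5), (5, 9)]
upper_edges = np.asarray(buckets)[:, 1]       # array([3, 5, 9])
values = np.array([-2, 0, 3, 9, 10, 25])      # made-up sample values

raw = upper_edges.searchsorted(values)        # 10 and 25 get index 3 (invalid)
clipped = raw.clip(max=len(buckets) - 1)      # forced into the last bucket

# pd.cut leaves anything outside (0, 9] as NaN instead
cut_idx = pd.cut(pd.Series(values), bins=[0, 3, 5, 9], labels=False)

# Alternative (assumption): flag out-of-range values with -1
low, high = buckets[0][0], buckets[-1][1]
flagged = np.where((values <= low) | (values > high), -1, clipped)

print(raw, clipped, flagged, cut_idx.tolist(), sep="\n")
```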

(by user1700890, root, Divakar)

References

  1. Pandas - assign histogram bucket to each row (CC BY-SA 2.5/3.0/4.0)

#Python #performance #histogram #pandas #vectorization





