Create a frequency matrix for bigrams from a list of tuples, using numpy or pandas


Problem description

I am very new to Python. I have a list of tuples from which I created bigrams.

This question is very close to what I need:

my_list = [('we', 'consider'), ('what', 'to'), ('use', 'the'), ('words', 'of')]

Now I am trying to convert it into a frequency matrix.

The desired output is:

          consider  of  the  to  use  we  what  words
consider         0   0    0   0    0   0     0      0
of               0   0    0   0    0   0     0      0
the              0   0    0   0    0   0     0      0
to               0   0    0   0    0   0     0      0
use              0   0    1   0    0   0     0      0
we               1   0    0   0    0   0     0      0
what             0   0    0   1    0   0     0      0
words            0   1    0   0    0   0     0      0

How can I do this using numpy or pandas? Unfortunately, I can only find solutions that use nltk.


Reference solutions

Method 1:

You can create a frequency DataFrame and address its cells by word labels:

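import pandas as pd

# Sorted vocabulary of every word that appears in any bigram
words = sorted({item for t in my_list for item in t})
df = pd.DataFrame(0, columns=words, index=words)
# Row = first word of the bigram, column = second word
for first, second in my_list:
    df.at[first, second] += 1

Output: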

          consider  of  the  to  use  we  what  words
consider         0   0    0   0    0   0     0      0
of               0   0    0   0    0   0     0      0
the              0   0    0   0    0   0     0      0
to               0   0    0   0    0   0     0      0
use              0   0    1   0    0   0     0      0
we               1   0    0   0    0   0     0      0
what             0   0    0   1    0   0     0      0
words            0   1    0   0    0   0     0      0

Note that in this approach the order of words within each bigram matters. If you don't care about order, sort the contents of each tuple first:

my_list = [tuple(sorted(i)) for i in my_list]
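As a small illustration of what that sort does, here is a sketch with a hypothetical list (`pairs` is not from the question) that contains the same two words in both orders. After sorting, both orderings become the same tuple, so they would land in the same cell of the matrix:

```python
# Hypothetical sample: the same two words appear in both orders.
pairs = [('we', 'consider'), ('consider', 'we'), ('use', 'the')]

# Sorting each tuple alphabetically makes the pair order-insensitive.
normalized = [tuple(sorted(p)) for p in pairs]

print(normalized)
# [('consider', 'we'), ('consider', 'we'), ('the', 'use')]
```

The consequence is that only the upper triangle of the (alphabetically sorted) matrix is ever filled.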

Another way is to use Counter to do the counting, though I expect similar performance (again, if the order within bigrams matters, remove sorted from frequency_list):

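from collections import Counter
import pandas as pd

# Count each bigram once, ignoring word order inside the pair
frequency_list = Counter(tuple(sorted(i)) for i in my_list)
words = sorted({item for t in my_list for item in t})
df = pd.DataFrame(0, columns=words, index=words)
# Write each aggregated count into its (row, column) cell
for (first, second), count in frequency_list.items():
    df.at[first, second] = count

Output: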

          consider  of  the  to  use  we  what  words
consider         0   0    0   0    0   1     0      0
of               0   0    0   0    0   0     0      1
the              0   0    0   0    1   0     0      0
to               0   0    0   0    0   0     1      0
use              0   0    0   0    0   0     0      0
we               0   0    0   0    0   0     0      0
what             0   0    0   0    0   0     0      0
words            0   0    0   0    0   0     0      0
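Since the question also asks about numpy, the order-sensitive count can be sketched without pandas. This is one possible approach, not part of the answer above: it assumes the whole list fits in a 2-column string array, and it uses `np.unique(..., return_inverse=True)` to turn words into integer codes and `np.add.at` to accumulate counts. The names `pairs`, `codes`, and `matrix` are chosen for illustration.

```python
import numpy as np

my_list = [('we', 'consider'), ('what', 'to'), ('use', 'the'), ('words', 'of')]

pairs = np.array(my_list)  # shape (n_bigrams, 2)

# words: sorted vocabulary; codes: position of every token inside words.
words, codes = np.unique(pairs, return_inverse=True)
# The inverse may come back flattened depending on the numpy version.
codes = codes.reshape(pairs.shape)

matrix = np.zeros((len(words), len(words)), dtype=int)
# np.add.at accumulates correctly even if the same bigram occurs repeatedly.
np.add.at(matrix, (codes[:, 0], codes[:, 1]), 1)

print(words)
print(matrix)
```

Rows and columns of `matrix` follow the order of `words`, so `pd.DataFrame(matrix, index=words, columns=words)` would give a labeled version if needed.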

Method 2:

If you do not care too much about speed, you could use a for loop.

import pandas as pd
import numpy as np
from itertools import product

my_list = [('we', 'consider'), ('what', 'to'), ('use', 'the'), ('words', 'of')]

index = pd.DataFrame(my_list)[0].unique()
columns = pd.DataFrame(my_list)[1].unique()
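# Rows follow index (first words); columns follow columns (second words)
df = pd.DataFrame(np.zeros(shape=(len(index), len(columns)), dtype=int),
                  columns=columns, index=index)

for idx, col in product(index, columns):
    # .loc[row, col] avoids chained assignment, which may not write through
    df.loc[idx, col] = my_list.count((idx, col))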

print(df)

Output:

       consider  to  the  of
we            1   0    0   0
what          0   1    0   0
use           0   0    1   0
words         0   0    0   1
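The table above has first words as rows and second words as columns, so it is 4×4 rather than the square 8×8 matrix shown in the question. As a hedged alternative that avoids the explicit loop, the following sketch assumes `pd.crosstab` is acceptable for the counting and pads the result to the full vocabulary with `reindex`. The names `first`, `second`, and `freq` are chosen for illustration.

```python
import pandas as pd

my_list = [('we', 'consider'), ('what', 'to'), ('use', 'the'), ('words', 'of')]

# Sorted vocabulary over both positions of every bigram.
words = sorted({w for pair in my_list for w in pair})

first = pd.Series([a for a, _ in my_list], name="first")
second = pd.Series([b for _, b in my_list], name="second")

# crosstab counts each (first, second) pair;
# reindex pads the table to a square matrix, filling missing cells with 0.
freq = pd.crosstab(first, second).reindex(index=words, columns=words, fill_value=0)

print(freq)
```

Like Method 1, this keeps the order within each bigram; sorting each tuple beforehand would make it order-insensitive.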

(by Anakin Skywalker, Ehsan, sszokoly)

References

  1. Create a frequency matrix for bigrams from a list of tuples, using numpy or pandas (CC BY‑SA 2.5/3.0/4.0)
