How to read files in HDFS in R without losing column and row names


Problem description

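My problem is that when I read a CSV file that contains column names (i.e., a header), the column names disappear and are replaced with "V1", "V2", ...

I have the mtcars dataset in CSV format. Here is a preview: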

model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,21,6,160,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,21,6,160,110,3.9,2.875,17.02,0,1,4,4
Datsun 710,22.8,4,108,93,3.85,2.32,18.61,1,1,4,1

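I want to upload it to HDFS and read it from there, so I uploaded the file through the HUE platform. I can view it in the file manager. Here is a small preview:

[screenshot: preview of mtcars.csv in the HUE file manager]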

Then, in an R session using plyrmr, I ran the following code:

filename3 <- "/user/sgerony/mtcars.csv"
input(filename3, format = make.input.format(format = "csv", sep = ","))

The result is:

                V1   V2  V3    V4  V5   V6    V7    V8 V9 V10  V11  V12
1    Chrysler Imperial 14.7   8   440 230 3.23 5.345 17.42  0   0    3    4
2             Fiat 128 32.4   4  78.7  66 4.08   2.2 19.47  1   1    4    1
3          Honda Civic 30.4   4  75.7  52 4.93 1.615 18.52  1   1    4    2
4       Toyota Corolla 33.9   4  71.1  65 4.22 1.835  19.9  1   1    4    1

As you can see, the column names have disappeared. What am I doing wrong?

Thanks
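For comparison, the same "V1", "V2", ... naming can be reproduced locally in base R whenever a CSV is parsed without treating its first line as a header. The sketch below only illustrates the symptom with `read.csv` on an inline string; whether the HDFS csv input format behaves the same way internally is an assumption, not something confirmed here.

```r
# Small inline sample shaped like the mtcars preview (illustrative data only).
csv_text <- "model,mpg,cyl
Mazda RX4,21,6
Datsun 710,22.8,4"

# Parsed without a header: default names are assigned and the header
# line ends up as an ordinary data row.
no_header <- read.csv(text = csv_text, header = FALSE, stringsAsFactors = FALSE)
print(names(no_header))

# Parsed with a header: the first line supplies the column names.
with_header <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE)
print(names(with_header))
```

In the first case the header text is still present, just as a row of data, which is one reason a distributed reader that sees the file in pieces cannot safely guess which line is the header.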


Reference solution

Method 1:

This is the solution I found (I really don't like it, so if you have a better one, please share it).

I split the CSV file into two CSV files, one containing only the column names (mtcars_names.csv) and the other containing only the data (mtcars_no_names.csv). Then I uploaded both through the file manager.
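The split itself can be done in base R on a local copy before uploading. This is a minimal sketch, assuming the file is small enough to read into memory; `split_header` and the temporary file names are hypothetical helpers introduced for illustration, not part of plyrmr.

```r
# Write the first line (header) and the remaining lines (data) of a CSV
# to two separate files. Returns the number of data rows invisibly.
split_header <- function(csv_path, names_path, data_path) {
  lines <- readLines(csv_path)
  writeLines(lines[1], names_path)   # header only
  writeLines(lines[-1], data_path)   # everything except the header
  invisible(length(lines) - 1L)
}

# Demo on a temporary file standing in for mtcars.csv.
src <- tempfile(fileext = ".csv")
writeLines(c("model,mpg,cyl", "Mazda RX4,21,6", "Datsun 710,22.8,4"), src)

names_file <- tempfile(fileext = ".csv")  # plays the role of mtcars_names.csv
data_file  <- tempfile(fileext = ".csv")  # plays the role of mtcars_no_names.csv
n_rows <- split_header(src, names_file, data_file)
```

The two resulting files can then be uploaded in place of the original.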

filename <- "/user/sgerony/mtcars_no_names.csv"
filename.names <- "/user/sgerony/mtcars_names.csv"
# read the one-line names file and keep it as a data frame
filename.names <- as.data.frame(input(filename.names,
  format = make.input.format(format = "csv", sep = ",")))

# convert every column to type "character"
for (i in seq_len(ncol(filename.names))) {
  filename.names[, i] <- as.character(filename.names[, i])
}
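The column-by-column conversion above can also be written without an explicit loop. The sketch below uses a hand-built one-row data frame as a stand-in for what `input()` returns for the names file; that it comes back with factor columns is an assumption made for the demo.

```r
# Stand-in for the one-row data frame read from the names file
# (factor columns are assumed here for illustration).
names_df <- data.frame(V1 = factor("model"), V2 = factor("mpg"), V3 = factor("cyl"))

# Convert every column to character in one step, keeping the data frame shape.
names_df[] <- lapply(names_df, as.character)

# Flatten the first row into a plain character vector of column names.
col_names <- unlist(names_df[1, ], use.names = FALSE)
print(col_names)
```

A plain character vector such as `col_names` is the form that `read.table` documents for its `col.names` argument, so it may be a safer value to pass than a one-row data frame.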

Now every time I write or read the file, I use:

### column name information is lost once more when writing,
### so col.names has to be supplied again when reading back
output(input(filename,
    format = make.input.format(format = "csv", sep = ",",
      col.names = filename.names[1, ])),
  path = "/user/sgerony/mtcars_output_csv")

input("/user/sgerony/mtcars_output_csv",
  format = make.input.format(format = "csv", sep = ",",
    col.names = filename.names[1, ]))

This can get quite messy if I generate data subsets: for each subset with different column names, a new file containing the column names has to be generated.
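One way to reduce the bookkeeping might be to carry the header as a single text line and apply it after the data is back in an ordinary data frame, so that subsets inherit their names through normal R subsetting. This is only a sketch in base R: `apply_header` is a hypothetical helper, and the data frame below stands in for whatever `as.data.frame(input(...))` returns.

```r
# Parse a header line and use it to name the columns of a data frame.
apply_header <- function(df, header_line, sep = ",") {
  col_names <- scan(text = header_line, what = "", sep = sep, quiet = TRUE)
  stopifnot(length(col_names) == ncol(df))  # refuse a mismatched header
  names(df) <- col_names
  df
}

# Stand-in for data read back with default V1, V2, ... names.
df <- data.frame(V1 = c("Mazda RX4", "Datsun 710"),
                 V2 = c(21, 22.8),
                 V3 = c(6, 4),
                 stringsAsFactors = FALSE)

named <- apply_header(df, "model,mpg,cyl")

# A subset keeps the matching names without a separate names file.
small <- named[, c("model", "mpg")]
print(names(small))
```

The header line itself still has to come from somewhere, for example the first line of the original file read once before upload.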

(by SamSam)

References

  1. How to read files in HDFS in R without loosing column and row names (CC BY‑SA 2.5/3.0/4.0)

#R #hadoop #CSV #hdfs





