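Problem description
How to read files in HDFS in R without losing column and row names
My problem is that when I read a CSV file that contains column names (i.e. a header), the column names disappear and are replaced by "V1", "V2", ...
I have the mtcars dataset in CSV format; here is a preview: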
model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,21,6,160,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,21,6,160,110,3.9,2.875,17.02,0,1,4,4
Datsun 710,22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
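I want to upload it to HDFS and read it from there, so I uploaded the file through the HUE platform. I can see it in the file manager.
Then, in an R session using plyrmr, I run the following code: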
filename3 <- "/user/sgerony/mtcars.csv"
input(filename3, format = make.input.format(format = "csv", sep = ","))
The result is as follows:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
1 Chrysler Imperial 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
2 Fiat 128 32.4 4 78.7 66 4.08 2.2 19.47 1 1 4 1
3 Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
4 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.9 1 1 4 1
As you can see, the column names are gone. What am I doing wrong?
Thanks
Reference solution
Method 1:
This is the solution I found (I really don't like it, so if you have a better one, please do share).
I split the CSV file into two CSV files, one containing only the column names (mtcars_names.csv) and the other containing the data (mtcars_no_names.csv), then uploaded both through the file manager.
filename <- "/user/sgerony/mtcars_no_names.csv"
filename.names <- "/user/sgerony/mtcars_names.csv"
filename.names <- as.data.frame(input(filename.names,
                                      format = make.input.format(format = "csv", sep = ",")))
# convert every column to type "character"
for (i in 1:dim(filename.names)[2]) {
  filename.names[, i] <- as.character(filename.names[, i])
}
Now every time I write or read the file, I use:
### column name information is lost once more on output
output(input(filename,
             format = make.input.format(format = "csv", sep = ",",
                                        col.names = filename.names[1, ])),
       path = "/user/sgerony/mtcars_output_csv")
input("/user/sgerony/mtcars_output_csv",
      format = make.input.format(format = "csv", sep = ",",
                                 col.names = filename.names[1, ]))
This can get quite messy if I generate data subsets: for each subset with different column names, a new file containing those column names has to be generated.
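A possible way to avoid the second file is to keep the column names as a character vector in the script and leave the header line in the original CSV. This assumes that the "csv" input format forwards extra arguments such as col.names to read.table, as the calls above suggest. The sketch below is not the plyrmr call itself: it uses plain base R on an in-memory chunk (the names mtcars_cols, chunk, raw and cars are made up for illustration) to show the idea of naming the columns explicitly and then dropping the row that repeats the header.

```r
# Column names kept in the script instead of in a second HDFS file.
mtcars_cols <- c("model", "mpg", "cyl", "disp", "hp", "drat",
                 "wt", "qsec", "vs", "am", "gear", "carb")

# Stand-in for one chunk of the file that still contains the header line.
chunk <- c(
  "model,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb",
  "Mazda RX4,21,6,160,110,3.9,2.62,16.46,0,1,4,4",
  "Mazda RX4 Wag,21,6,160,110,3.9,2.875,17.02,0,1,4,4",
  "Datsun 710,22.8,4,108,93,3.85,2.32,18.61,1,1,4,1"
)

# Read every column as character so the header line does not
# interfere with type detection.
raw <- read.table(text = chunk, sep = ",", col.names = mtcars_cols,
                  colClasses = "character", stringsAsFactors = FALSE)

# Drop the row that repeats the header, then restore numeric types
# for every column except the model name.
cars <- raw[raw$model != "model", ]
cars[-1] <- lapply(cars[-1], as.numeric)
rownames(cars) <- NULL
```

If the same arguments are accepted by make.input.format, the filter on the header row could be applied after input() in the same way. Whether that works on a given cluster has not been verified here.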