Problem description
ValueError: Found input variables with inconsistent numbers of samples: [2935848, 2935849]
When I run this code:
feature_names = ["date","shop_id", "item_id", "item_price", "item_cnt_day"]
feature_names
X_train = train[feature_names]
print(X_train.shape)
X_train.head()
X_sales = sales[feature_names]
print(X_sales.shape)
X_sales.head()
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
X_train, X_sales, y_train, y_sales = train_test_split(X_train, X_sales, test_size=0.3)
(2935848, 5)
(2935849, 5)
I get this ValueError:
ValueError                                Traceback (most recent call last)
     13 from sklearn.metrics import mean_squared_error
     14
---> 15 X_train, X_sales, y_train, y_sales = train_test_split(X_train, X_sales, test_size=0.3)
     16

~/anaconda3/envs/aiffel/lib/python3.7/site-packages/sklearn/model_selection/_split.py in train_test_split(*arrays, **options)
   2125         raise TypeError("Invalid parameters passed: %s" % str(options))
   2126
-> 2127     arrays = indexable(*arrays)
   2128
   2129     n_samples = _num_samples(arrays[0])

~/anaconda3/envs/aiffel/lib/python3.7/site-packages/sklearn/utils/validation.py in indexable(*iterables)
    291     """
    292     result = [_make_indexable(X) for X in iterables]
--> 293     check_consistent_length(*result)
    294     return result
    295

~/anaconda3/envs/aiffel/lib/python3.7/site-packages/sklearn/utils/validation.py in check_consistent_length(*arrays)
    255     if len(uniques) > 1:
    256         raise ValueError("Found input variables with inconsistent numbers of"
--> 257                          " samples: %r" % [int(l) for l in lengths])
    258
    259

ValueError: Found input variables with inconsistent numbers of samples: [2935848, 2935849]
Reference solutions
Method 1:
The error occurs because your two DataFrames (train and sales) have different lengths: train has 2935848 rows and sales has 2935849. train_test_split requires every array passed to it to have the same number of samples. Find out why the lengths differ, then add or drop a row so that they match.
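One way to look for the odd row is sketched below. This is only an illustration on tiny made-up frames: the values are hypothetical stand-ins for the question's train and sales data, and it assumes the extra row can be found by comparing rows across the two frames.

```python
import pandas as pd

# Hypothetical stand-ins for the question's `train` and `sales` frames.
train = pd.DataFrame({"shop_id": [0, 1, 2], "item_id": [10, 11, 12]})
sales = pd.DataFrame({"shop_id": [0, 1, 2, 3], "item_id": [10, 11, 12, 13]})

# Row counts differ by one, which is what triggers the ValueError.
print(len(train), len(sales))

# Exact duplicate rows are one possible source of an extra row.
n_duplicates = int(sales.duplicated().sum())

# An outer merge with indicator=True labels each row as present in
# the left frame only, the right frame only, or both.
merged = train.merge(sales, how="outer", indicator=True)
only_in_one = merged[merged["_merge"] != "both"]
print(only_in_one)
```

On the real data, the same merge would be run on the two full frames; rows marked left_only or right_only are candidates for the mismatch.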
Second, and no less important, make sure you understand what train_test_split does and what you want from it. The function takes X and y as inputs and returns X_train, X_test, y_train, y_test. In your code you pass two X inputs (X_train and X_sales) with the same 5 features. You may have a reason for doing this, but be aware of it.
X holds all the samples with their features, and y holds the corresponding output values you want to predict. Check that, and decide whether train_test_split is really the function you are looking for.
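The usual X / y call pattern looks roughly like the sketch below. The frame is a made-up miniature, and treating item_cnt_day as the target is an assumption for illustration only; the question does not say which column should be predicted.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical miniature of the question's `train` frame.
train = pd.DataFrame({
    "shop_id": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    "item_id": [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
    "item_price": [5.0, 7.5, 3.0, 9.9, 4.2, 6.1, 8.8, 2.5, 7.0, 5.5],
    "item_cnt_day": [1, 2, 1, 3, 1, 1, 2, 4, 1, 2],
})

# Assumption: item_cnt_day is the value to predict, the rest are features.
X = train[["shop_id", "item_id", "item_price"]]
y = train["item_cnt_day"]

# X and y come from the same frame, so their lengths always match.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

Because both inputs are sliced from one DataFrame, the consistency check in train_test_split cannot fail on length.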
Method 2:
I have this error while I'm trying to do my confusion matrix: Found input variables with inconsistent numbers of samples: [1527, 1]
This is my code:
x = df[['gender', 'age', 'hypertension', 'ever_married', 'work_type', 'Residence_type', 'avg_glucose_level', 'bmi', 'smoking_status', 'work_type_cat', 'gender_cat', 'Residence_type_cat']]
y = df['stroke']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=20)
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)
scaler = StandardScaler()
x_train_scale = scaler.fit_transform(x_train)
x_test_scale = scaler.fit_transform(x_test)
KNN = KNeighborsClassifier()
x = df[['gender', 'age', 'hypertension', 'heart_disease', 'ever_married', 'work_type', 'Residence_type', 'avg_glucose_level', 'bmi', 'smoking_status', 'work_type_cat', 'gender_cat', 'Residence_type_cat']]
y = df['stroke']
print(x.head())
print(y.head())
KNN = KNN.fit(x, y)
test = pd.DataFrame()
test['gender'] = [2]
test['age'] = [3]
test['hypertension'] = [0]
test['heart_disease'] = [0]
test['ever_married'] = [2]
test['work_type'] = [4]
test['Residence_type'] = [2]
test['avg_glucose_level'] = [95.12]
test['bmi'] = [18]
test['smoking_status'] = [2]
test['work_type_cat'] = [4]
test['gender_cat'] = [1]
test['Residence_type_cat'] = [1]
y_predict = KNN.predict(test)
print(y_predict)
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_predict))
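The sizes in the message, [1527, 1], suggest that y_test has 1527 rows while y_predict holds the single prediction made for the one-row test frame. confusion_matrix needs one prediction per entry in y_test. A minimal sketch of that pattern follows, using synthetic data in place of the stroke dataset (the array shapes and the labelling rule are made up for illustration):

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the stroke data: 200 samples, 4 features.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 4))
y = (x[:, 0] + x[:, 1] > 0).astype(int)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=20
)

# Fit the scaler on the training split only, then reuse it on the test split.
scaler = StandardScaler()
x_train_scale = scaler.fit_transform(x_train)
x_test_scale = scaler.transform(x_test)

knn = KNeighborsClassifier().fit(x_train_scale, y_train)

# Predict on the whole test split so y_predict lines up with y_test.
y_predict = knn.predict(x_test_scale)

cm = confusion_matrix(y_test, y_predict, labels=[0, 1])
print(cm)
```

A prediction for a single hand-built row can still be made separately; it just should not be passed to confusion_matrix together with y_test.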
(by 이진규、Alex Serra Marrugat、Rachel)