〈AI Study Notes〉 An Introduction to Automated Machine Learning and a Hands-On Auto-Sklearn Walkthrough

1. Introduction to Automated Machine Learning

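Automated Machine Learning (AutoML), as the name suggests, automates the steps of training a model. It covers data preprocessing, feature processing, model selection (or, for neural networks, neural architecture search, NAS), and hyperparameter optimization, with the goal of making the whole workflow as painless as possible.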
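To make "automating model selection and hyperparameter optimization" concrete, here is a minimal sketch of the idea in plain Python. It is not how Auto-Sklearn works internally; the toy dataset, the two candidate model families, and the random search loop are all assumptions made up for illustration.

```python
import random

# Toy 1-D dataset of (feature, label) pairs. Purely illustrative.
DATA = [(0.1, 0), (0.3, 0), (0.4, 0), (0.45, 0), (0.6, 1), (0.7, 1), (0.8, 1), (0.9, 1)]
TRAIN, VALID = DATA[::2], DATA[1::2]

def threshold_model(threshold):
    """Candidate family 1: predict 1 when x exceeds a threshold."""
    return lambda x: int(x > threshold)

def knn_model(k):
    """Candidate family 2: majority vote among the k nearest training points."""
    def predict(x):
        nearest = sorted(TRAIN, key=lambda p: abs(p[0] - x))[:k]
        return int(sum(label for _, label in nearest) * 2 > k)
    return predict

def accuracy(model, rows):
    """Fraction of rows the model labels correctly."""
    return sum(model(x) == y for x, y in rows) / len(rows)

def random_search(n_trials=20, seed=0):
    """Sample (model family, hyperparameter) pairs and keep the best on VALID."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        if rng.random() < 0.5:
            config = ("threshold", round(rng.uniform(0.0, 1.0), 2))
            model = threshold_model(config[1])
        else:
            config = ("knn", rng.choice([1, 3]))
            model = knn_model(config[1])
        score = accuracy(model, VALID)
        if best is None or score > best[0]:
            best = (score, config)
    return best

best_score, best_config = random_search()
print(best_score, best_config)
```

An AutoML tool replaces this naive loop with a much larger search space and a smarter search strategy, but the shape of the problem (try configurations, score them on held-out data, keep the best) is the same.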

2. Hands-On with Auto-Sklearn and the Fetal Health Dataset

Fetal Health is a dataset available on Kaggle. It has 21 feature columns and one target column, ['fetal_health']. The model's goal is to classify three health states: 1-Normal, 2-Suspect, 3-Pathological.
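Since the target is stored as a numeric code, a small lookup makes predictions easier to read. The helper below is a hypothetical convenience, not part of any library; it accepts floats as well as ints on the assumption that the codes may be stored as values such as 1.0.

```python
# Label names taken from the dataset description above.
FETAL_HEALTH_LABELS = {1: "Normal", 2: "Suspect", 3: "Pathological"}

def decode_label(value):
    """Map a numeric class code (int, or float such as 1.0) to its name."""
    code = int(value)
    if code != value or code not in FETAL_HEALTH_LABELS:
        raise ValueError(f"unexpected fetal_health code: {value!r}")
    return FETAL_HEALTH_LABELS[code]

print([decode_label(v) for v in (1.0, 2, 3.0)])
# → ['Normal', 'Suspect', 'Pathological']
```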

Auto-Sklearn is an AutoML package; this walkthrough uses Version 1. It does not seem to support operating systems other than Linux, so for convenience I used Google Colab:

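First, install the package with pip install. An error shows up because the system cannot retrieve the meta-feature data; we will deal with that later.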

!pip install auto-sklearn  # restart the runtime after this finishes

Next, import the basic data-handling packages and load the dataset.

import numpy as np
import pandas as pd

df = pd.read_csv('/content/fetal_health.csv')
df # 'fetal_health' encoded as 1-Normal; 2-Suspect; 3-Pathological.

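Then comes data preprocessing. All we need to do is separate the feature columns from the target column; Auto-Sklearn automatically handles simple preprocessing such as missing-value imputation and one-hot encoding.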

from sklearn.model_selection import train_test_split
X = df.drop(labels=['fetal_health'], axis=1).values  # drop the fetal_health target and keep the remaining feature columns
y = df['fetal_health'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print('train shape:', X_train.shape)
print('test shape:', X_test.shape)

Output:

train shape: (1700, 21)
test shape: (426, 21)
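The shapes above follow from how a fractional test_size is turned into row counts. The sketch below reproduces the arithmetic, assuming the test share is rounded up and the remainder goes to training; that assumption is consistent with the printed shapes. The helper name split_sizes is hypothetical.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumes the test share is rounded up and the rest goes to training,
    which is consistent with the shapes printed above.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_total = 1700 + 426  # total rows implied by the two printed shapes
print(split_sizes(n_total, 0.2))
# → (1700, 426)
```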

Build and fit the model

import autosklearn.classification
autoclassifier = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=180,
    per_run_time_limit=40,
    resampling_strategy='cv',
    resampling_strategy_arguments={'folds': 5},
    include = {
        'classifier': ["random_forest",'sgd','adaboost','bernoulli_nb','gradient_boosting'],
        'feature_preprocessor': ["pca",'kernel_pca','fast_ica','liblinear_svc_preprocessor'],
    },
    initial_configurations_via_metalearning=0
)
autoclassifier.fit(X_train, y_train)
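The resampling_strategy='cv' setting with 'folds': 5 asks for 5-fold cross-validation on the training data. As a rough sketch of what that means, the hypothetical helper below splits row indices into contiguous folds. Auto-Sklearn's actual splitting may shuffle or stratify the rows, so treat this only as an illustration of the idea.

```python
def kfold_indices(n_samples, n_folds=5):
    """Yield (train_idx, valid_idx) pairs for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)  # spread the remainder over the first folds
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, valid
        start += size

# 1700 training rows, as in the split above.
fold_sizes = [len(valid) for _, valid in kfold_indices(1700, n_folds=5)]
print(fold_sizes)
# → [340, 340, 340, 340, 340]
```

Each candidate configuration is therefore trained five times, on 1360 rows each, which is worth keeping in mind when choosing time_left_for_this_task and per_run_time_limit.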

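As mentioned above, Google Colab may not be able to retrieve the meta-feature data, so error messages keep popping up during fit. Setting
initial_configurations_via_metalearning=0
removes them: it tells Auto-Sklearn not to use meta-learning to pick the initial configurations.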

We can use include to define the search space, restricting it to the classifiers, regressors, or preprocessors we want. A list of the available options is provided here.
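Because include is just a dictionary of step names mapped to component lists, it is easy to mistype a key. The helper below is a hypothetical sketch (it is not part of Auto-Sklearn) that validates the keys before the dictionary is handed to the classifier. The set of step names is limited to the ones mentioned in this article and should be treated as an assumption.

```python
# Step names limited to those mentioned in this article; treat the set as an assumption.
KNOWN_STEPS = {"classifier", "regressor", "feature_preprocessor"}

def build_include(**steps):
    """Hypothetical helper: validate and assemble an `include` dictionary."""
    include = {}
    for step, components in steps.items():
        if step not in KNOWN_STEPS:
            raise ValueError(f"unknown pipeline step: {step}")
        if not components:
            raise ValueError(f"step {step!r} needs at least one component")
        include[step] = sorted(set(components))  # de-duplicate and sort
    return include

include = build_include(
    classifier=["random_forest", "sgd", "adaboost", "bernoulli_nb", "gradient_boosting"],
    feature_preprocessor=["pca", "kernel_pca", "fast_ica", "liblinear_svc_preprocessor"],
)
print(include)
```

The resulting dictionary can then be passed as include=include when constructing AutoSklearnClassifier, in the same way as the literal dictionary used earlier.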

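Also pay attention to the type of the data passed in. A pandas DataFrame carries a per-column dtype, so there is no need to tell the model the type of each column (although specifying it is still recommended, so the model can tell numerical columns from categorical ones). A NumPy array has no per-column dtype, and since .values above turned X into a NumPy array, we use feat_type in fit to declare the type of each column.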

feat_type = ['numerical']*21
autoclassifier.fit(X_train, y_train, X_test, y_test, feat_type=feat_type)
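Here every column is numerical, so the list can be written by hand. For a dataset with mixed columns, a list like this could be derived from the data instead. The helper below is a hypothetical sketch in plain Python, and the sample rows (including the string column) are made up for illustration; they are not taken from the Fetal Health data.

```python
def infer_feat_type(rows):
    """Hypothetical helper: label each column 'numerical' or 'categorical'.

    A column counts as numerical only if every value in it is an int or float.
    """
    n_columns = len(rows[0])
    feat_type = []
    for col in range(n_columns):
        values = [row[col] for row in rows]
        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        )
        feat_type.append("numerical" if is_numeric else "categorical")
    return feat_type

# Made-up rows: two numeric columns and one string column.
sample_rows = [
    (120.0, 0.003, "low"),
    (132.0, 0.000, "high"),
]
print(infer_feat_type(sample_rows))
# → ['numerical', 'numerical', 'categorical']
```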

Finally, let's look at the results.

print('autoclassifier training set: ', autoclassifier.score(X_train, y_train))
print('autoclassifier test set: ', autoclassifier.score(X_test, y_test))

Output:

autoclassifier training set:  0.9994117647058823
autoclassifier test set:  0.903755868544601
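A single score can hide how each of the three classes is doing, which matters when the rarer classes are the clinically important ones. The sketch below computes per-class recall in plain Python. The labels are made up for illustration and are not results from the model above.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Return {class: fraction of that class's samples predicted correctly}."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in sorted(totals)}

# Made-up labels for illustration only (1-Normal, 2-Suspect, 3-Pathological).
y_true = [1, 1, 1, 1, 1, 1, 2, 2, 3, 3]
y_pred = [1, 1, 1, 1, 1, 1, 1, 2, 3, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, per_class_recall(y_true, y_pred))
# → 0.8 {1: 1.0, 2: 0.5, 3: 0.5}
```

In this toy case the overall accuracy is 0.8 even though half of the Suspect and Pathological samples are missed, so it is worth checking per-class numbers alongside the score.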

To inspect the trained models, use autoclassifier.leaderboard().

autoclassifier.leaderboard(detailed = True, ensemble_only=True)
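With ensemble_only=True the leaderboard lists only the models that ended up in the final ensemble, each with a weight. As a rough sketch of how such weights can combine the members' outputs, the hypothetical function below averages class probabilities by weight. The probabilities and weights are made up, and the real ensemble logic in Auto-Sklearn may differ in detail.

```python
def weighted_soft_vote(member_probas, weights):
    """Combine per-model class probabilities with ensemble weights.

    member_probas: one probability list per model, all for the same sample.
    weights: one non-negative weight per model; normalised here to sum to 1.
    """
    total = sum(weights)
    n_classes = len(member_probas[0])
    combined = [
        sum(w * probas[c] for w, probas in zip(weights, member_probas)) / total
        for c in range(n_classes)
    ]
    return combined, combined.index(max(combined))

# Made-up probabilities for three hypothetical ensemble members over three classes.
probas = [[0.7, 0.2, 0.1], [0.4, 0.5, 0.1], [0.2, 0.6, 0.2]]
weights = [0.6, 0.3, 0.1]
combined, winner = weighted_soft_vote(probas, weights)
print([round(p, 2) for p in combined], winner)
# → [0.56, 0.33, 0.11] 0
```

The winning index 0 corresponds to the first class; a model with a large weight dominates the vote even when the other two members prefer a different class.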

3. Conclusion

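Because this is a simple example, the overall accuracy leaves room for improvement. Even so, Auto-Sklearn is far more convenient to use than a traditional hand-built machine learning workflow, and the newer Auto-Sklearn 2.0 is reported to reach higher accuracy. I will write about it when I get the chance.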

tueric32

你可能感興趣的文章

留言討論

〈 AI 學習筆記 〉Automated Machine Learning 簡介與Auto-Sklearn實作紀錄

一、簡介 Automated Machine Learning

二、用Auto-Sklearn和Fetal Health資料集實作

三、結語

tueric32

你可能感興趣的文章

Day 7 - Hangman

1193. Monthly Transactions I

如何充滿熱情地學習 - 以資料結構為例

留言討論

〈 AI 學習筆記〉Automated Machine Learning 簡介與Auto-Sklearn實作紀錄