
[ML] Tree Based Learning Algorithms - Gradient Boosting

by 유일리 2022. 11. 30.
  • Like random forests, boosting is another regression/classification technique that aggregates the outcomes of multiple decision trees.
  • Rather than building random, independent variants of a decision tree in parallel, gradient boosting is a sequential method that aims to improve performance with each subsequent tree.
  • It works by evaluating the performance of weak models and then concentrating subsequent models on the instances that were misclassified in earlier rounds.
  • Instances that were classified correctly in the previous round therefore give way to a higher proportion of the instances that were not accurately classified.
  • Strong learners are developed from weak learners by weighting each new tree toward the cases the previous tree got wrong (see the sketch after this list).
  • The algorithm's ability to learn from its mistakes makes gradient boosting one of the most popular algorithms in machine learning today.
  • For complex datasets with a large number of outliers, random forests may be a preferable alternative to boosting. The other main downside of boosting is the slow processing speed that comes with training trees sequentially rather than in parallel.
  • The final downside, which applies to boosting as well as to random forests, is the loss of the visual simplicity and ease of interpretation that come with a single decision tree. With hundreds of decision trees, it becomes difficult to visualize and interpret the overall decision structure.
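
As a rough illustration of the sequential idea, here is a minimal from-scratch sketch of gradient boosting for regression with squared loss, where each new tree is fitted to the residual errors of the current ensemble (the toy data and hyperparameters are made up for illustration only):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# toy 1-D regression data (illustration only)
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo).ravel() + rng.normal(0, 0.3, 200)

learning_rate = 0.1
prediction = np.full(len(y_demo), y_demo.mean())  # start from the mean
trees = []

for _ in range(100):
    residual = y_demo - prediction  # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X_demo, residual)
    prediction += learning_rate * tree.predict(X_demo)  # small corrective step
    trees.append(tree)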

Boosting_classifier Example

1-2. Import libraries / dataset

import pandas as pd 
from sklearn.model_selection import train_test_split
from sklearn import ensemble
from sklearn.metrics import classification_report, confusion_matrix


df = pd.read_csv('/content/advertising.csv')

3-4. Convert non-numeric variables / Remove columns

df = pd.get_dummies(df, columns=['Country', 'City'])

del df['Ad Topic Line']
del df['Timestamp']

5. Set X and y variables

X = df.drop('Clicked on Ad', axis=1)
y = df['Clicked on Ad']
 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10, shuffle=True)

6. Set algorithm

model = ensemble.GradientBoostingClassifier(
    n_estimators=200,     # number of boosting stages (trees)
    learning_rate=0.1,    # contribution of each tree
    max_depth=5,          # maximum depth of each tree
    min_samples_split=4,  # minimum samples required to split a node
    min_samples_leaf=6,   # minimum samples required at a leaf node
    max_features=0.6,     # fraction of features considered per split
    loss='log_loss'       # logistic loss; named 'deviance' in scikit-learn < 1.1
)
model.fit(X_train, y_train)

7. Evaluate

model_predict = model.predict(X_test)

print(confusion_matrix(y_test, model_predict))
print(classification_report(y_test, model_predict))

 ∴ Better performance than the random forest classifier trained on the same data (a baseline for that comparison is sketched below)
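
For reference, here is one way such a comparison could be run: a random forest baseline on the same split, evaluated with the same metrics. The hyperparameters below are an untuned assumption, not settings taken from an earlier experiment.

rf_model = ensemble.RandomForestClassifier(n_estimators=200, random_state=10)
rf_model.fit(X_train, y_train)

rf_predict = rf_model.predict(X_test)
print(confusion_matrix(y_test, rf_predict))
print(classification_report(y_test, rf_predict))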

 

Boosting_regressor Example

1-2. Import libraries / dataset

Dataset: listings_berlin.csv (2.81MB)

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import ensemble
from sklearn.metrics import mean_absolute_error

df = pd.read_csv('/content/listings_berlin.csv')

3-4. Convert non-numeric variables / Remove columns

del df['id']
del df['name']
del df['host_name']
del df['last_review']
del df['calculated_host_listings_count']
del df['availability_365']
del df['longitude']
del df['neighbourhood']
del df['latitude']

df = pd.get_dummies(df, columns=['neighbourhood_group', 'room_type'])

df.dropna(inplace=True)  # drop rows with any missing values

5. Set X and y variables

X = df.drop('price', axis = 1)
y = df['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10, shuffle=True)

6. Set algorithm

model = ensemble.GradientBoostingRegressor(
    n_estimators=350,     # number of boosting stages (trees)
    learning_rate=0.1,    # contribution of each tree
    max_depth=5,          # maximum depth of each tree
    min_samples_split=4,  # minimum samples required to split a node
    min_samples_leaf=6,   # minimum samples required at a leaf node
    max_features=0.6,     # fraction of features considered per split
    loss='huber'          # robust to outliers, useful for skewed prices
)

model.fit(X_train, y_train)

mae_train = mean_absolute_error(y_train, model.predict(X_train))
print("Training Set Mean Absolute Error: %.2f" % mae_train)

mae_test = mean_absolute_error(y_test, model.predict(X_test))
print("Test Set Mean Absolute Error: %.2f" % mae_test)

7. Predict new data

new_property = [
    3176426, # host_id
    2, # minimum_nights
    19, # number_of_reviews
    1.08, # reviews_per_month
    0, # neighbourhood_group_Charlottenburg-Wilm
    0, # neighbourhood_group_Friedrichshain-Kreuzberg
    0, # neighbourhood_group_Lichtenberg
    0, # neighbourhood_group_Marzahn-Hellersdorf
    1, # neighbourhood_group_Mitte
    0, # neighbourhood_group_Neukolln
    0, # neighbourhood_group_Pankow
    0, # neighbourhood_group_Reinickendorf
    0, # neighbourhood_group_Spandau
    0, # neighbourhood_group_Steglitz-Zehlendorf
    0, # neighbourhood_group_Tempelhof-Schoneberg
    0, # neighbourhood_group_Treptow-Kopenick
    1, # room_type_Entire home/apt
    0, # room_type_Hotel room
    0, # room_type_Private room
    0  # room_type_Shared room
]

new_pred = model.predict([new_property])
print(new_pred)
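
Note that the list above has to follow the exact column order of X. A safer variant (assuming the same features) is to wrap it in a DataFrame with X's column names, which also avoids scikit-learn's feature-name warning:

new_df = pd.DataFrame([new_property], columns=X.columns)
print(model.predict(new_df))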

https://github.com/erica00j/machinelearning/blob/main/boosting_regressor.ipynb

 
