
[ML] Tree Based Learning Algorithms - Gradient Boosting

by 유일리 2022. 11. 30.
  • Like random forests, boosting is another regression/classification technique that aggregates the outcomes of multiple decision trees.
  • Rather than building random, independent variants of a decision tree in parallel, gradient boosting is a sequential method that aims to improve performance with each subsequent tree.
  • It works by evaluating the performance of weak models and then concentrating subsequent models on the instances that were misclassified in earlier rounds.
  • Instances that were classified correctly in the previous round therefore give way to a higher proportion of the instances that were not accurately classified.
  • Strong learners are developed from weak learners by weighting each new tree toward the cases the previous tree got wrong (see the sketch after this list).
  • The algorithm's ability to learn from its mistakes makes gradient boosting one of the most popular algorithms in machine learning today.
  • For complex datasets with a large number of outliers, random forests may be a preferable alternative to boosting. The other main downside of boosting is the slow processing speed that comes with training trees sequentially rather than in parallel.
  • The final downside, which applies to boosting as well as to random forests, is the loss of the visual simplicity and ease of interpretation that come with a single decision tree. With hundreds of decision trees, it becomes difficult to visualize and interpret the overall decision structure.
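
As a rough illustration of the sequential idea, here is a minimal from-scratch sketch of gradient boosting for regression with squared loss, where each new tree is fitted to the residual errors of the current ensemble (the toy data and hyperparameters are made up for illustration only):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# toy 1-D regression data (illustration only)
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo).ravel() + rng.normal(0, 0.3, 200)

learning_rate = 0.1
prediction = np.full(len(y_demo), y_demo.mean())  # start from the mean
trees = []

for _ in range(100):
    residual = y_demo - prediction  # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X_demo, residual)
    prediction += learning_rate * tree.predict(X_demo)  # small corrective step
    trees.append(tree)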

Boosting_classifier Example

1-2. Import libraries / dataset

import pandas as pd 
from sklearn.model_selection import train_test_split
from sklearn import ensemble
from sklearn.metrics import classification_report, confusion_matrix


df = pd.read_csv('/content/advertising.csv')

3-4. Convert non-numeric variables / Remove columns

df = pd.get_dummies(df, columns=['Country', 'City'])

del df['Ad Topic Line']
del df['Timestamp']

5. Set X and y variables

X = df.drop('Clicked on Ad', axis=1)
y = df['Clicked on Ad']
 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10, shuffle=True)

6. Set algorithm

model = ensemble.GradientBoostingClassifier(
    n_estimators=200,     # number of boosting stages (trees)
    learning_rate=0.1,    # contribution of each tree
    max_depth=5,          # maximum depth of each tree
    min_samples_split=4,  # minimum samples required to split a node
    min_samples_leaf=6,   # minimum samples required at a leaf node
    max_features=0.6,     # fraction of features considered per split
    loss='log_loss'       # logistic loss; named 'deviance' in scikit-learn < 1.1
)
model.fit(X_train, y_train)

7. Evaluate

model_predict = model.predict(X_test)

print(confusion_matrix(y_test, model_predict))
print(classification_report(y_test, model_predict))

 ∴ Better performance than the random forest classifier trained on the same data (a baseline for that comparison is sketched below)
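
For reference, here is one way such a comparison could be run: a random forest baseline on the same split, evaluated with the same metrics. The hyperparameters below are an untuned assumption, not settings taken from an earlier experiment.

rf_model = ensemble.RandomForestClassifier(n_estimators=200, random_state=10)
rf_model.fit(X_train, y_train)

rf_predict = rf_model.predict(X_test)
print(confusion_matrix(y_test, rf_predict))
print(classification_report(y_test, rf_predict))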

 

Boosting_regressor Example

1-2. Import libraries / dataset

Dataset: listings_berlin.csv (2.81MB)

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import ensemble
from sklearn.metrics import mean_absolute_error

df = pd.read_csv('/content/listings_berlin.csv')

3-4. Convert non-numeric variables / Remove columns

del df['id']
del df['name']
del df['host_name']
del df['last_review']
del df['calculated_host_listings_count']
del df['availability_365']
del df['longitude']
del df['neighbourhood']
del df['latitude']

df = pd.get_dummies(df, columns=['neighbourhood_group', 'room_type'])

df.dropna(inplace=True)  # drop rows with any missing values

5. Set X and y variables

X = df.drop('price', axis = 1)
y = df['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10, shuffle=True)

6. Set algorithm

model = ensemble.GradientBoostingRegressor(
    n_estimators=350,     # number of boosting stages (trees)
    learning_rate=0.1,    # contribution of each tree
    max_depth=5,          # maximum depth of each tree
    min_samples_split=4,  # minimum samples required to split a node
    min_samples_leaf=6,   # minimum samples required at a leaf node
    max_features=0.6,     # fraction of features considered per split
    loss='huber'          # robust to outliers, useful for skewed prices
)

model.fit(X_train, y_train)

mae_train = mean_absolute_error(y_train, model.predict(X_train))
print("Training Set Mean Absolute Error: %.2f" % mae_train)

mae_test = mean_absolute_error(y_test, model.predict(X_test))
print("Test Set Mean Absolute Error: %.2f" % mae_test)

7. Predict new data

new_property = [
    3176426, # host_id
    2, # minimum_nights
    19, # number_of_reviews
    1.08, # reviews_per_month
    0, # neighbourhood_group_Charlottenburg-Wilm
    0, # neighbourhood_group_Friedrichshain-Kreuzberg
    0, # neighbourhood_group_Lichtenberg
    0, # neighbourhood_group_Marzahn-Hellersdorf
    1, # neighbourhood_group_Mitte
    0, # neighbourhood_group_Neukolln
    0, # neighbourhood_group_Pankow
    0, # neighbourhood_group_Reinickendorf
    0, # neighbourhood_group_Spandau
    0, # neighbourhood_group_Steglitz-Zehlendorf
    0, # neighbourhood_group_Tempelhof-Schoneberg
    0, # neighbourhood_group_Treptow-Kopenick
    1, # room_type_Entire home/apt
    0, # room_type_Hotel room
    0, # room_type_Private room
    0  # room_type_Shared room
]

new_pred = model.predict([new_property])
print(new_pred)
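
Note that the list above has to follow the exact column order of X. A safer variant (assuming the same features) is to wrap it in a DataFrame with X's column names, which also avoids scikit-learn's feature-name warning:

new_df = pd.DataFrame([new_property], columns=X.columns)
print(model.predict(new_df))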

https://github.com/erica00j/machinelearning/blob/main/boosting_regressor.ipynb

 
