- Like random forests, boosting is another regression/classification technique that aggregates the outcomes of multiple decision trees.
- Rather than building random, independent variants of a decision tree in parallel, gradient boosting is a sequential method that aims to improve performance with each subsequent tree.
- It works by evaluating the performance of each weak model and then reweighting the training instances so that subsequent models concentrate on the cases misclassified in earlier rounds.
- At each round, instances that were classified correctly receive less emphasis, while a higher proportion of the emphasis shifts to the instances that were not accurately classified.
- In this way, a strong learner is developed from a sequence of weak learners, with each new tree weighted according to the cases misclassified by the previous tree (see the sketch after this list).
- The ability of the algorithm to learn from its mistakes makes gradient boosting one of the most popular algorithms in machine learning today.
- For complex datasets with a large number of outliers, random forests may be a preferable alternative to boosting: because boosting keeps concentrating on the hardest-to-fit instances, it tends to chase outliers. Another downside of boosting is the slow processing speed that comes with training trees sequentially, which, unlike random forests, cannot be parallelized.
- The final downside, which applies to boosting as well as random forests, is the loss of visual simplicity and ease of interpretation that comes with using a single decision tree. With hundreds of decision trees, it becomes much harder to visualize and interpret the overall decision structure.
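To make the reweighting idea concrete, here is a minimal sketch of adaptive (AdaBoost-style) reweighting with decision stumps as the weak learners. Note that gradient boosting proper fits each new tree to the gradient of the loss rather than to reweighted samples; the toy data and the number of rounds here are arbitrary choices for illustration.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: two features, binary labels
rng = np.random.RandomState(0)
X_toy = rng.randn(200, 2)
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

weights = np.full(len(X_toy), 1 / len(X_toy))  # start with uniform instance weights
stumps, alphas = [], []
for _ in range(5):  # five boosting rounds
    stump = DecisionTreeClassifier(max_depth=1)  # a weak learner
    stump.fit(X_toy, y_toy, sample_weight=weights)
    pred = stump.predict(X_toy)
    err = np.sum(weights * (pred != y_toy)) / np.sum(weights)
    alpha = 0.5 * np.log((1 - err) / (err + 1e-10))  # accurate stumps get bigger votes
    weights *= np.exp(alpha * (pred != y_toy))  # upweight the misclassified instances
    weights /= weights.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: a weighted vote of all the weak learners
votes = sum(a * (2 * s.predict(X_toy) - 1) for a, s in zip(alphas, stumps))
print("Training accuracy:", ((votes > 0).astype(int) == y_toy).mean())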

Boosting_classifier Example
1-2. Import libraries / dataset
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import ensemble
from sklearn.metrics import classification_report, confusion_matrix
df = pd.read_csv('/content/advertising.csv')
3-4. Convert non-numeric variables / Remove columns
df = pd.get_dummies(df, columns=['Country', 'City'])
del df['Ad Topic Line']
del df['Timestamp']
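For reference, pd.get_dummies performs one-hot encoding: each category becomes its own 0/1 indicator column. A tiny, hypothetical frame illustrates the effect:

import pandas as pd

# Hypothetical mini-frame to show what get_dummies produces
demo = pd.DataFrame({'City': ['Berlin', 'Paris', 'Berlin']})
print(pd.get_dummies(demo, columns=['City']))
# Output has City_Berlin and City_Paris indicator columns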
5. Set X and y variables
X = df.drop('Clicked on Ad',axis=1)
y = df['Clicked on Ad']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10, shuffle=True)
6. Set algorithm
model = ensemble.GradientBoostingClassifier(
    n_estimators=200,      # number of boosting stages (trees)
    learning_rate=0.1,     # shrinks the contribution of each tree
    max_depth=5,           # maximum depth of each individual tree
    min_samples_split=4,   # minimum samples required to split a node
    min_samples_leaf=6,    # minimum samples required at a leaf node
    max_features=0.6,      # fraction of features considered per split
    loss='log_loss',       # logistic loss; called 'deviance' in scikit-learn < 1.1
)
model.fit(X_train, y_train)
7. Evaluate
model_predict = model.predict(X_test)
print(confusion_matrix(y_test, model_predict))
print(classification_report(y_test, model_predict))
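Beyond the confusion matrix and classification report, the fitted model exposes impurity-based feature importances, which hint at which variables drive the predictions. A short sketch reusing the model and X defined above:

importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))  # ten most influential features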

∴ Better performance than Random Forest classifier
Boosting_regressor Example
1-2. Import libraries / dataset
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import ensemble
from sklearn.metrics import mean_absolute_error
df = pd.read_csv('/content/listings_berlin.csv')
3-4. Convert non-numeric variables / Remove columns
del df['id']
del df['name']
del df['host_name']
del df['last_review']
del df['calculated_host_listings_count']
del df['availability_365']
del df['longitude']
del df['neighbourhood']
del df['latitude']
df = pd.get_dummies(df, columns = ['neighbourhood_group', 'room_type'])
df.dropna(inplace=True)  # drop rows with any missing values
5. Set X and y variables
X = df.drop('price', axis = 1)
y = df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 10, shuffle=True)
6. Set algorithm
model = ensemble.GradientBoostingRegressor(
    n_estimators=350,      # number of boosting stages (trees)
    learning_rate=0.1,     # shrinks the contribution of each tree
    max_depth=5,           # maximum depth of each individual tree
    min_samples_split=4,   # minimum samples required to split a node
    min_samples_leaf=6,    # minimum samples required at a leaf node
    max_features=0.6,      # fraction of features considered per split
    loss='huber',          # Huber loss is robust to outliers in price
)
model.fit(X_train, y_train)
mae_train = mean_absolute_error(y_train, model.predict(X_train))
print("Training Set Mean Absolute Error: %.2f" % mae_train)
mae_test = mean_absolute_error(y_test, model.predict(X_test))
print("Test Set Mean Absolute Error: %.2f" % mae_test)

7. Predict a new data point
new_property = [
3176426, # host_id
2, # minimum_nights
19, # number_of_reviews
1.08, # reviews_per_month
0, # neighbourhood_group_Charlottenburg-Wilm
0, # neighbourhood_group_Friedrichshain-Kreuzberg
0, # neighbourhood_group_Lichtenberg
0, # neighbourhood_group_Marzahn-Hellersdorf
1, # neighbourhood_group_Mitte
0, # neighbourhood_group_Neukolln
0, # neighbourhood_group_Pankow
0, # neighbourhood_group_Reinickendorf
0, # neighbourhood_group_Spandau
0, # neighbourhood_group_Steglitz-Zehlendorf
0, # neighbourhood_group_Tempelhof-Schoneberg
0, # neighbourhood_group_Treptow-Kopenick
1, # room_type_Entire home/apt
0, # room_type_Hotel room
0, # room_type_Private room
0 # room_type_Shared room
]
# Wrap the new instance in a DataFrame with matching column names to
# avoid scikit-learn's feature-name warning
new_pred = model.predict(pd.DataFrame([new_property], columns=X.columns))
print(new_pred)

https://github.com/erica00j/machinelearning/blob/main/boosting_regressor.ipynb