Artificial Intelligence/Machine Learning

[ML] Tree Based Learning Algorithms - Random Forests

by 유일리 2022. 11. 30.
  • The decision tree technique is prone to overfitting.
  • Decision trees are accurate at decoding patterns in the training data.
  • But because each tree follows a fixed sequence of decision paths, any variance in the test data or new data can result in poor predictions.
  • Relying on a single tree design also limits the method's flexibility to manage variance and future outliers.
  • A solution for mitigating overfitting is to grow multiple trees using a different technique called random forests.
  • This method involves growing multiple decision trees using a randomized selection of input data for each tree and combining the results by averaging the output for regression or class voting for classification.
  • If the entire forest inspected a full set of variables, each tree would look similar, as the trees would each attempt to maximize information gain at the subsequent layer and thereby select the optimal variable at each split.
  • Unlike a standard decision tree, though, which has a full set of variables to draw from, the random forests algorithm has an artificially limited set of variables available to build decisions.
  • Because each tree is shown fewer variables and a randomized sample of the data, random forests are less likely to generate a collection of similar trees.
  • By embracing randomness and volume, random forests can provide a reliable result with potentially less variance and overfitting than a single decision tree.
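The contrast above can be sketched with scikit-learn. This is a minimal illustration on synthetic data (not the post's advertising dataset): a single decision tree considers every variable at every split, while each forest tree trains on a bootstrap sample and only a random subset of variables (`max_features='sqrt'`) per split, which decorrelates the trees.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the actual post uses advertising.csv
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=10)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=10)

# A single tree draws from the full set of variables at every split
tree = DecisionTreeClassifier(random_state=10).fit(X_train, y_train)

# Each forest tree sees a bootstrap sample and only sqrt(n_features)
# candidate variables per split, so the trees end up different from
# one another; their class votes are combined at prediction time
forest = RandomForestClassifier(n_estimators=100, max_features='sqrt',
                                random_state=10).fit(X_train, y_train)

print('single tree accuracy:', tree.score(X_test, y_test))
print('forest accuracy:     ', forest.score(X_test, y_test))
```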

Example

1-2. Import libraries / dataset

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

df = pd.read_csv('/content/advertising.csv')

3. Convert non-numeric variables

df = pd.get_dummies(df, columns=['Country', 'City'])
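To see what `get_dummies` does here, a small made-up frame (standing in for the advertising data) shows each category value becoming its own 0/1 indicator column, while non-encoded columns are kept as-is:

```python
import pandas as pd

# Hypothetical mini-frame; 'Country' mimics the column encoded above
demo = pd.DataFrame({'Country': ['Peru', 'Norway', 'Peru'],
                     'Age': [35, 42, 29]})

# One indicator column per category value, appended after the
# untouched columns
encoded = pd.get_dummies(demo, columns=['Country'])
print(encoded.columns.tolist())
# → ['Age', 'Country_Norway', 'Country_Peru']
```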

4. Remove columns

del df['Ad Topic Line']
del df['Timestamp']

5. Set X and y variables

X = df.drop('Clicked on Ad',axis=1) 
y = df['Clicked on Ad'] 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10, shuffle=True)
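With `test_size=0.3`, the split above reserves 30% of rows for testing. A tiny self-contained sketch (toy arrays, not the advertising data) makes the proportions concrete:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 10 toy samples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# 30% held out for testing; random_state fixes the shuffle so the
# split is reproducible across runs
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=10, shuffle=True)

print(len(X_train), len(X_test))  # → 7 3
```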

6. Set algorithm

model = RandomForestClassifier(n_estimators=150)
model.fit(X_train, y_train)
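Once fitted, a forest can also rank its input variables via `feature_importances_`, which is often useful alongside prediction. A sketch on synthetic data (the attribute works the same on the fitted model above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 5 features; illustrative only
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

model = RandomForestClassifier(n_estimators=150, random_state=0)
model.fit(X, y)

# One non-negative score per column; the scores sum to 1
print(model.feature_importances_)
```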

7. Predict and evaluate

model_predict = model.predict(X_test)
 
print(confusion_matrix(y_test, model_predict))
print(classification_report(y_test, model_predict))
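For reading the confusion matrix printed above: rows are actual classes and columns are predicted classes, so for binary labels the layout is [[TN, FP], [FN, TP]]. A tiny example with made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true vs predicted labels
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

# Rows = actual, columns = predicted: [[TN, FP], [FN, TP]]
print(confusion_matrix(y_true, y_pred))
# → [[1 1]
#    [1 2]]
```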
