What is the k-nearest neighbors algorithm?
In pattern recognition, the k-nearest neighbors algorithm (k-NN for short) is a non-parametric method used for classification and regression. For example, if k is 3, a purple diamond (the new data point) looks at its 3 nearest neighbors (1 star and 2 circles) and is classified as class B. If k is 7, the 7 nearest neighbors are 4 stars and 3 circles, so it would instead be classified as class A.
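That decision rule is nothing more than "sort by distance, then take a majority vote." A minimal sketch of it in plain Python (the coordinates and labels below are made up for illustration):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k):
    # Euclidean distance from the query point to every training point
    dists = np.linalg.norm(X_train - query, axis=1)
    # Labels of the k nearest points; the majority vote decides the class
    nearest_labels = y_train[np.argsort(dists)[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy data: a class 'A' cluster and a class 'B' cluster (made-up points)
X_toy = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0], [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]])
y_toy = np.array(['B', 'B', 'B', 'A', 'A', 'A'])
print(knn_predict(X_toy, y_toy, np.array([1.5, 1.5]), k=3))  # prints 'B'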
k-nearest neighbors Example
We will practice using k-nearest neighbors to predict the outcome of a user clicking on an online advertisement based on the class of nearby data points.
1-2. Import Libraries / Dataset
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Load the advertising dataset
df = pd.read_csv('/content/advertising.csv')
3. Remove Variables
- We remove the discrete variables from the dataframe, including Ad Topic Line, Timestamp, Male, Country, and City.
- k-NN generally works best with continuous variables such as age and area income.
# Drop the discrete columns so only continuous features remain
del df['Ad Topic Line']
del df['Timestamp']
del df['Male']
del df['Country']
del df['City']
df.head()
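If this is the familiar Kaggle advertising dataset (an assumption, since only the file name appears above), df.head() should now show just the continuous columns Daily Time Spent on Site, Age, Area Income, and Daily Internet Usage, plus the Clicked on Ad label.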
4-5. Scale Data / Set X and y Values
- We use StandardScaler() from Scikit-learn to standardize the independent variables (while dropping the dependent variable Clicked on Ad).
- Because k-NN classifies by distance, this transformation keeps a variable with a large range from unfairly dominating the distance calculation.
scaler = StandardScaler()
scaler.fit(df.drop('Clicked on Ad',axis=1))
scaled_features = scaler.transform(df.drop('Clicked on Ad',axis=1))
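As a side note, Scikit-learn's fit_transform() collapses the separate fit and transform calls above into a single step; this one-liner would produce the same result:

scaled_features = StandardScaler().fit_transform(df.drop('Clicked on Ad', axis=1))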
X = scaled_features
y = df['Clicked on Ad']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10, shuffle=True)

6. Set algorithm
- Note that setting k to an odd number helps to eliminate the possibility of a prediction stalemate in the case of a binary prediction.

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

7. Evaluate
- The confusion matrix counts correct predictions on its diagonal (rows are the true labels, columns the predictions), and the classification report summarizes precision, recall, and F1-score for each class.

model_predict = model.predict(X_test)
print(confusion_matrix(y_test, model_predict))
print(classification_report(y_test, model_predict))
8. Optimize
- We can experiment with the number of neighbors chosen in step 6 and attempt to reduce the number of incorrectly predicted outcomes, as in the sweep below.
- Based on manual trial and error, we can improve the model by opting for 3 neighbors (an odd k is often a good choice).
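As a sketch of that trial-and-error search (reusing X_train, X_test, y_train, and y_test from above), we can loop over odd values of k and print the misclassification rate on the test set:

import numpy as np

# Compare test-set error rates for a range of odd k values
for k in range(1, 20, 2):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    error_rate = np.mean(knn.predict(X_test) != y_test)
    print(f'k={k}: error rate = {error_rate:.3f}')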
9. Predict
- We can deploy our model (n_neighbors=3) on the first 10 rows of the scaled_features array to predict the likely outcomes, refitting with k=3 first since the model above was fit with k=5.
model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)  # refit with the chosen k=3
model.predict(scaled_features)[0:10]
https://github.com/erica00j/machinelearning/blob/main/KNN_advertising.ipynb