
[ML] k-NEAREST NEIGHBORS (k-NN Algorithm)

by 유일리 2022. 11. 15.
What is the k-nearest neighbors algorithm?

 

In pattern recognition, the k-nearest neighbors algorithm (k-NN for short) is a non-parametric method used for classification and regression. As in the figure below, if k is 3, the purple diamond (the new data point) looks at its 3 nearest neighbors (1 star, 2 circles) and is classified as class B. If k were 7, the neighbors would be 4 stars and 3 circles, so it would be classified as class A.
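To make the voting idea concrete, below is a minimal from-scratch sketch of the majority vote (the toy points and labels are made up for illustration, not taken from the figure); with this data, k=3 and k=7 give different answers, just like the example above.

import numpy as np
from collections import Counter

def knn_predict(X, y, query, k):
    # Euclidean distance from the query point to every training point.
    distances = np.linalg.norm(X - query, axis=1)
    # Take the labels of the k closest points and return the majority vote.
    nearest_labels = y[np.argsort(distances)[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy 2-D data: class 'A' (stars) and class 'B' (circles).
X = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [7, 7], [3, 3], [3, 4]])
y = np.array(['A', 'A', 'A', 'A', 'B', 'B', 'B'])
query = np.array([3.5, 3.5])

print(knn_predict(X, y, query, k=3))  # 'B' -- 2 of the 3 nearest are class B
print(knn_predict(X, y, query, k=7))  # 'A' -- 4 of the 7 points are class A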

k-Nearest Neighbors Example

We will practice using k-nearest neighbors to predict the outcome of a user clicking on an online advertisement based on the class of nearby data points.

Dataset: advertising.csv (0.10 MB)

1-2. Import Libraries / Dataset

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix

df = pd.read_csv('/content/advertising.csv')

3. Remove Variables

  • We remove the discrete and non-numeric variables from the dataframe, including Ad Topic Line, Timestamp, Male, Country, and City.
  • k-NN generally works best with continuous variables such as Age and Area Income.
del df['Ad Topic Line']
del df['Timestamp'] 
del df['Male'] 
del df['Country'] 
del df['City']

df.head()

4-5. Scale Data / Set X and y Values

  • We use StandardScaler() from Scikit-learn to standardize the independent variables to zero mean and unit variance (while dropping the dependent variable Clicked on Ad).
  • This transformation helps prevent one or more variables with a large range from unfairly dominating the model.
scaler = StandardScaler()
scaler.fit(df.drop('Clicked on Ad',axis=1))
scaled_features = scaler.transform(df.drop('Clicked on Ad',axis=1))

X = scaled_features
y = df['Clicked on Ad']
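As a quick sanity check (this snippet is an addition, not part of the original notebook), we can confirm that each scaled column now has a mean of roughly 0 and a standard deviation of roughly 1:

import numpy as np

# Each standardized column should have mean ~0 and std ~1.
print(np.round(scaled_features.mean(axis=0), 3))
print(np.round(scaled_features.std(axis=0), 3))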

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10, shuffle=True)

6. Set algorithm

  • We assign KNeighborsClassifier as the algorithm, with n_neighbors (k) set to 5, and fit it to the training data. Note that setting k to an odd number helps to eliminate the possibility of a prediction stalemate in the case of a binary prediction, as illustrated below.
model = KNeighborsClassifier(n_neighbors=5)

model.fit(X_train, y_train)
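To see why an even k can stall, here is a tiny illustration (the neighbor labels are hypothetical) of a 2-2 vote that produces no majority:

from collections import Counter

# Hypothetical labels of the 4 nearest neighbors: a 2-2 split with no winner.
neighbor_labels = [0, 0, 1, 1]
print(Counter(neighbor_labels).most_common())  # [(0, 2), (1, 2)] -> tied vote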
 
7. Evaluate

  • We predict the test data and evaluate the results with a confusion matrix and a classification report.
model_predict = model.predict(X_test)

print(confusion_matrix(y_test, model_predict))
print(classification_report(y_test, model_predict))


8. Optimize

  • We can experiment with the number of neighbors chosen in step 6 and attempt to reduce the number of incorrectly predicted outcomes, as sketched after this list.
  • Based on manual trial and error, we can improve the model by opting for 3 neighbors. (An odd k is often a good choice.)
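Rather than relying purely on manual trial and error, we can loop over candidate values of k and compare test-set error rates (this loop is an addition, reusing the train/test split from above):

import numpy as np

# Try odd values of k and record the test-set error rate for each.
for k in range(1, 20, 2):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    error_rate = np.mean(knn.predict(X_test) != y_test)
    print(f'k={k}: error rate={error_rate:.3f}')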

9. Predict

  • We can deploy our optimized model (n_neighbors=3) on the first 10 rows of the scaled_features array to predict the likely outcomes. Since the model above was fit with 5 neighbors, we refit it with 3 first.
model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
model.predict(scaled_features)[0:10]

https://github.com/erica00j/machinelearning/blob/main/KNN_advertising.ipynb

 
