
[ML] Support Vector Machines, SVM

by 유일리 2022. 11. 15.
What is SVM?

SVM is a supervised learning model used in machine learning for pattern recognition and data analysis, primarily for classification and regression. Given a set of data points, each belonging to one of two categories, the SVM algorithm builds a non-probabilistic binary linear classification model that decides which category a new data point belongs to. The model is represented as a boundary in the space onto which the data are mapped, and the SVM algorithm finds the boundary with the largest margin among the candidates.

      • SVM is used as a classification technique for predicting categorical outcomes, similar to logistic regression.
      • SVM is one of the best classifiers in supervised learning for analyzing complex data and downplaying the influence of outliers.
      • Unlike logistic regression, SVM places its decision boundary at the maximum possible distance from the partitioned data points of each class.
      • Its key feature is the margin: the gap between the boundary line and the nearest data points on either side, i.e., twice the distance from the boundary to the closest point. (A minimal sketch of this idea follows the figure below.)
      • The margin provides a buffer for coping with new data points and outliers that would otherwise push past a logistic regression boundary line.

[Figure] A new data point is added to the scatterplot
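To make the margin concrete, here is a minimal sketch (not part of this post's ad example) that fits a linear SVM on toy blobs and computes the margin width as 2/||w||; make_blobs and the variable names are illustrative assumptions.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy, linearly separable data (illustrative only; not the advertising dataset)
X_toy, y_toy = make_blobs(n_samples=60, centers=2, random_state=6)

clf = SVC(kernel='linear', C=1000)  # a large C approximates a hard margin
clf.fit(X_toy, y_toy)

w = clf.coef_[0]                                # weight vector of the separating hyperplane
print('margin width:', 2 / np.linalg.norm(w))   # distance between the two margin lines
print('support vectors:', len(clf.support_vectors_))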

SVM Example
Use the SVM algorithm as a binary classifier to predict whether a user will click on an online advertisement.

 

Dataset: advertising.csv (0.10 MB)

1-2. Import Libraries / Dataset

import pandas as pd 
from sklearn.model_selection import train_test_split 
from sklearn.svm import SVC 
from sklearn.metrics import classification_report, confusion_matrix 
from sklearn.model_selection import GridSearchCV

df = pd.read_csv('/content/advertising.csv')
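An optional check before dropping columns, not in the original notebook, to glance at the raw data:

df.head()   # preview the first five rows
df.info()   # column names, dtypes, and non-null counts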

3. Remove Variables

del df['Ad Topic Line']
del df['Timestamp']
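An equivalent one-liner, if you prefer pandas' drop API (same effect as the two del statements above):

df = df.drop(columns=['Ad Topic Line', 'Timestamp'])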

4. Convert non-numeric values into numeric values using one-hot encoding

df = pd.get_dummies(df, columns=['Country', 'City'])
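One-hot encoding adds a column per unique country and city, so the frame becomes much wider; a quick optional check:

print(df.shape)  # the column count grows by roughly one per unique Country/City value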

5. Set X and y variables

X = df.drop('Clicked on Ad', axis=1)
y = df['Clicked on Ad']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)
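As an optional sanity check on the 70/30 split:

print(X_train.shape, X_test.shape)  # ~70% of rows for training, ~30% held out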

6. Set algorithm

model = SVC()
model.fit(X_train, y_train)
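Note that SVC() with no arguments uses the RBF kernel; the call above is equivalent to the following under recent scikit-learn defaults (an assumption worth checking against your installed version):

model = SVC(kernel='rbf', C=1.0, gamma='scale')  # scikit-learn defaults since 0.22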

7. Evaluate

model_predict = model.predict(X_test)

#Confusion matrix 
print(confusion_matrix(y_test, model_predict)) 

#Classification report 
print(classification_report(y_test, model_predict))

  • The current performance of the model isn’t as accurate as we might hope.
  • The confusion matrix reports a high occurrence of false negatives (68), and the classification report states that precision, recall, and the f1-score are all below 0.75. (A sketch for reading accuracy off the confusion matrix follows below.)
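For reference, a short sketch for unpacking the binary confusion matrix into counts and an accuracy figure (variable names follow the cells above):

tn, fp, fn, tp = confusion_matrix(y_test, model_predict).ravel()
print('false negatives:', fn)                       # 68 in the run described above
print('accuracy:', (tp + tn) / (tn + fp + fn + tp))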

8. Grid search

hyperparameters = {'C': [10, 25, 50], 'gamma': [0.001, 0.0001, 0.00001]}
grid = GridSearchCV(SVC(), hyperparameters)
grid.fit(X_train, y_train)
grid.best_params_

  • We can improve the accuracy of our model using a technique called “grid search” to help us find the optimal hyperparameters for this algorithm.
  • While SVC has many hyperparameters, we will focus on C and gamma, which generally have the biggest impact on prediction accuracy with this algorithm.
  • The hyperparameter C regulates how heavily misclassified cases (those placed on the wrong side of the margin) are penalized, and therefore how many of them are ignored.
  • This flexibility in the model is referred to as a “soft margin,” and ignoring cases that cross over the soft margin can lead to a better fit.
  • The lower C is, the more errors the soft margin is permitted to ignore.
  • As C approaches 0, effectively no penalty is enforced on misclassified cases (scikit-learn requires C to be strictly positive).
  • Gamma sets the reach of a single training example under the Gaussian radial basis function (RBF) kernel, i.e., how far the influence of each support vector extends.
  • In general, a small gamma produces a high-bias, low-variance model; conversely, a large gamma leads to low bias and high variance.
  • Grid search allows us to list a range of values to test for each hyperparameter.
  • After testing every provided combination of C and gamma, grid search found that 50 for C and 0.0001 for gamma are the ideal hyperparameters for this model. (The sketch after this list shows how to read the results off the fitted grid object.)
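A brief sketch for inspecting what the search found (grid is the fitted GridSearchCV object from step 8; best_score_ is its mean cross-validated score):

print(grid.best_params_)  # e.g., {'C': 50, 'gamma': 0.0001}
print(grid.best_score_)   # mean cross-validated accuracy of the best combination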

9. Grid search predict

grid_predictions = grid.predict(X_test)

#Confusion matrix 
print(confusion_matrix(y_test,grid_predictions))

#Classification report 
print(classification_report(y_test,grid_predictions))

  • Let’s feed the test data to the model refitted with the new hyperparameters supplied by grid search, and review the prediction results in a new cell.
  • As evidenced in the confusion matrix and classification report, the new hyperparameters have improved the prediction performance of this model, with an almost evenly split number of false positives (17) and false negatives (15), and 0.89 for precision, recall, and f1-score. (A final scoring sketch follows below.)
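Finally, a small sketch of scoring a single unseen user with the tuned model; it reuses one row of the held-out test set, since any new input must carry the same one-hot-encoded columns as X:

one_user = X_test.iloc[[0]]      # double brackets keep it a one-row DataFrame
print(grid.predict(one_user))    # 1 = predicted to click the ad, 0 = predicted not to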

https://github.com/erica00j/machinelearning/blob/main/svm.ipynb
