[ML] Split-Validation / Machine Learning Model Design

by 유일리 2022. 10. 17.

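Split validation is the most basic way to validate a model: the data is partitioned, and the model is checked on the part it was not trained on.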

  • Training data : used to build the prediction model
  • Test data : used to assess the accuracy of the model developed from the training data

The data is divided into training data and test data, typically in a 70:30 or 80:20 ratio.

 

1. Perform split validation in Python

To perform split validation in Python, we can use train_test_split from Scikit-learn, which must first be imported from the sklearn.model_selection module.

from sklearn.model_selection import train_test_split

2. Set X and y values

Before using this function, we first need to set our X and y values.

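import pandas as pd

# Load the advertising dataset (path as used by the author)
df = pd.read_csv('C:\\Users\\JiMyeong\\Desktop\\Advertising\\advertising.csv')

# X holds the independent variables (features), y the target to predict
X = df[['Daily Time Spent on Site', 'Age', 'Area Income', 'Daily Internet Usage',
        'Ad Topic Line', 'Country']]
y = df['Clicked on Ad']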

3. Create training data & test data

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=10, shuffle=True)
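To see what the call above returns without needing the advertising CSV, here is a minimal sketch on made-up data. The arrays `X_demo` and `y_demo` are hypothetical stand-ins, not part of the original dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 3 features, binary target
X_demo = np.arange(300).reshape(100, 3)
y_demo = np.array([0, 1] * 50)

# Same arguments as in the post: 30% of the rows go to the test set
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=10, shuffle=True
)

print(X_train.shape, X_test.shape)  # (70, 3) (30, 3)
print(y_train.shape, y_test.shape)  # (70,) (30,)
```

With 100 rows and test_size=0.3, 70 rows end up in the training set and 30 in the test set, which matches the 70:30 ratio mentioned earlier.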

4. Validation set

  • Scikit-learn does not provide a specific function to create a three-way train/validation/test split.
  • One quick solution is to split the test data into two partitions as demonstrated below.
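# First split: 60% training data, 40% held out
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
# Second split: divide the held-out 40% evenly into test and validation data
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.5)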
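The two-step split can be checked on a small example. The sketch below uses hypothetical arrays (`X_demo`, `y_demo`) and an intermediate name (`X_hold`) that are not in the original post; it simply confirms that the result is roughly a 60/20/20 partition.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 2 features
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0, 1] * 50)

# Step 1: keep 60% for training, hold out 40%
X_train, X_hold, y_train, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=10
)

# Step 2: split the held-out 40% in half -> 20% test, 20% validation
X_test, X_val, y_test, y_val = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=10
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

Setting random_state here is optional; it is only included so that the same rows land in each partition on every run.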

Machine Learning Model Design

1-2. Import Libraries/Dataset

  • Scikit-learn ships with built-in datasets such as the breast cancer dataset, and you can also generate a random dataset yourself using a function called make_blobs, as used with k-means clustering.
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
from sklearn.datasets import make_blobs
data = make_blobs(n_samples=200, n_features=2, centers=4, cluster_std=1.8,
                  random_state=101)
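It can help to look at what these two calls actually return. A short sketch, using the same arguments as above; the names `X_blobs` and `y_blobs` are introduced here only for illustration.

```python
from sklearn.datasets import load_breast_cancer, make_blobs

# Built-in dataset: a Bunch object with .data (features) and .target (labels)
cancer = load_breast_cancer()
print(cancer.data.shape)    # (569, 30)
print(cancer.target.shape)  # (569,)

# Generated dataset: make_blobs returns a tuple (features, cluster labels)
data = make_blobs(n_samples=200, n_features=2, centers=4, cluster_std=1.8,
                  random_state=101)
X_blobs, y_blobs = data
print(X_blobs.shape, y_blobs.shape)  # (200, 2) (200,)
```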

3. Exploratory data analysis

  • EDA provides an opportunity to get familiar with your data including distribution and the state of missing values.
  • EDA drives the next stage of data scrubbing and your choice of algorithm.
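A first pass at EDA with pandas might look like the sketch below. The DataFrame `df_demo` and its columns are made up for illustration; in practice you would run the same calls on your own dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset with a few missing values
df_demo = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 29],
    "income": [52000.0, 61000.0, 58000.0, np.nan, 47000.0],
    "clicked": [0, 1, 0, 1, 0],
})

print(df_demo.head())      # first rows of the data
print(df_demo.describe())  # summary statistics (distribution of each column)

# State of missing values: count of NaN per column
missing = df_demo.isnull().sum()
print(missing)

# Class balance of the target variable
class_counts = df_demo["clicked"].value_counts()
print(class_counts)
```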

4. Data scrubbing

  • The data scrubbing stage, as detailed in week 5, usually consumes the most time and effort in developing a prediction model.
  • It involves cleaning up the data, inspecting its values, making repairs, and knowing when to discard data.
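As a rough illustration of "making repairs" and "throwing it out", the sketch below removes a duplicate row, drops a row with a missing value, and fills another gap with the median. The DataFrame `raw` is hypothetical, and which repair is appropriate depends on the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: one duplicated row and two missing values
raw = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 41],
    "income": [52000.0, np.nan, 58000.0, 61000.0, 61000.0],
    "clicked": [0, 1, 0, 1, 1],
})

# Throw out exact duplicate rows
clean = raw.drop_duplicates()

# Throw out rows where 'age' is missing
clean = clean.dropna(subset=["age"])

# Repair missing 'income' values with the column median
clean = clean.assign(income=clean["income"].fillna(clean["income"].median()))

print(clean)
```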

5. Pre-model algorithm (optional)

  • As an optional extension of the data scrubbing process, unsupervised learning techniques, including k-means clustering and dimensionality reduction algorithms, are sometimes used in preparation for analyzing large and complex datasets.
  • This step is optional and does not apply to every model, particularly for small datasets with a low number of dimensions (features) or rows.
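One possible version of this step, sketched under assumptions: PCA is used here as the dimensionality reduction algorithm (the post does not name a specific one), followed by k-means clustering on the reduced data. The generated dataset is a stand-in.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Stand-in data: 200 rows with 5 features
X_blobs, _ = make_blobs(n_samples=200, n_features=5, centers=4,
                        cluster_std=1.8, random_state=101)

# Reduce 5 features to 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_blobs)

# Group the reduced data into 4 clusters
kmeans = KMeans(n_clusters=4, n_init=10, random_state=101)
cluster_labels = kmeans.fit_predict(X_reduced)

print(X_reduced.shape)         # (200, 2)
print(cluster_labels[:10])     # cluster index assigned to the first 10 rows
```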

6. Split validation

  • Split validation is used to partition the data for the purpose of training and test analysis.
  • It’s also useful to randomize your data at this point using the shuffle feature and to set a random state if you want to replicate the model’s output in the future.
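The effect of shuffle and random_state can be seen on a tiny example. The arrays below are hypothetical; the point is only that the same random_state reproduces the same split, and that shuffle=False keeps the original row order.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 rows, target equals the row index
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# Two calls with the same random_state give the same partition
_, _, _, y_test_a = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=10, shuffle=True)
_, _, _, y_test_b = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=10, shuffle=True)
same_split = np.array_equal(y_test_a, y_test_b)
print(same_split)  # True

# Without shuffling, the test set is simply the last 30% of the rows
_, _, _, y_test_ordered = train_test_split(
    X_demo, y_demo, test_size=0.3, shuffle=False)
print(y_test_ordered)  # [14 15 16 17 18 19]
```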

7. Set algorithm

  • Algorithms are the headline act for every machine learning model and must be chosen carefully.
  • By executing a series of steps defined by the algorithm, the model reacts to input variables to interpret patterns, make calculations, and reach decisions.
  • For context, the algorithm should not be confused or mistaken for the model.
  • The model is the final state of the algorithm (after hyperparameters are consolidated in response to patterns learned from the data) and the combination of data scrubbing, split validation, and evaluation techniques.
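The distinction between algorithm and model can be sketched in code. Logistic regression is chosen here only as an example algorithm (the post does not prescribe one), and the variable names `algorithm` and `model` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=10)

# The algorithm: an estimator with hyperparameters set, but nothing learned yet
algorithm = LogisticRegression(max_iter=10000)

# The model: the fitted state, holding coefficients learned from the data.
# In Scikit-learn, fit() returns the same estimator object, now fitted.
model = algorithm.fit(X_train, y_train)

print(model.coef_.shape)  # (1, 30): one learned weight per feature
```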

8. Predict

  • After devising an initial model using patterns extracted from the training data, the predict function is called on the test data to validate the model. 

- In the case of regression problems, the predict function generates a numeric value such as price or a numeric indicator of correlation.

- In the case of classification, the predict function is used to generate discrete classes, such as the movie category or spam/non-spam classification.
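The difference between the two kinds of output can be shown with toy data. Both datasets below are made up, and linear and logistic regression are used only as example algorithms.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: toy data following y = 2x + 1, so predict returns a number
X_reg = np.array([[1.0], [2.0], [3.0], [4.0]])
y_reg = np.array([3.0, 5.0, 7.0, 9.0])
reg = LinearRegression().fit(X_reg, y_reg)
value_pred = reg.predict(np.array([[5.0]]))
print(value_pred)  # approximately [11.]

# Classification: small inputs are class 0, large inputs are class 1,
# so predict returns discrete class labels
X_clf = np.array([[0.0], [1.0], [2.0], [8.0], [9.0], [10.0]])
y_clf = np.array([0, 0, 0, 1, 1, 1])
clf = LogisticRegression().fit(X_clf, y_clf)
label_pred = clf.predict(np.array([[0.5], [9.5]]))
print(label_pred)  # [0 1]
```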

9. Evaluate

  • This step is to evaluate the results.
  • The method of evaluation will depend on whether it is a classification or regression model.

 - In the case of classification, the common evaluation methods are the confusion matrix, classification report, and accuracy score.

  • Accuracy Score: This is a simple metric measuring how many cases the model classified correctly divided by the full number of cases.
  • Confusion Matrix: A confusion matrix, also known as an error matrix, is a simple table that summarizes the performance of the model, including the exact number of false-positives and false-negatives.
  • Classification Report: generates three evaluation metrics as well as support.

[Figure: A classification report generated using Scikit-learn]
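The three classification evaluation methods can be tried on a small set of hand-written labels. The `y_true` and `y_pred` lists below are invented for illustration and do not come from a trained model.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical labels: 4 actual positives, 6 actual negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# Accuracy: correctly classified cases divided by all cases (7 of 10)
acc = accuracy_score(y_true, y_pred)
print(acc)  # 0.7

# Confusion matrix: rows are actual classes, columns are predicted classes
# [[true negatives, false positives],
#  [false negatives, true positives]]
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[4 2]
#  [1 3]]

# Classification report: precision, recall, F1-score and support per class
print(classification_report(y_true, y_pred))
```

For the positive class in this example, precision is 3/5 = 0.6 (three true positives out of five predicted positives) and recall is 3/4 = 0.75 (three true positives out of four actual positives).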

  • Precision: the ratio of correctly predicted true-positives to the total of predicted positive cases. A high precision score translates to a low occurrence of false-positives.

  • Recall: similar to precision but represents the ratio of correctly predicted true-positives to the total of actual positive cases.
  • F1-score: a weighted average (specifically, the harmonic mean) of precision and recall. It’s typically used as a metric for model-to-model comparison rather than for stand-alone model accuracy. F1-score is often lower than the accuracy score because it is pulled toward the lower of precision and recall and does not count true-negatives.
  • Support: not an evaluation metric but rather a tally of the number of actual cases in each class (positive and negative respectively).

 - In the case of regression (predicting continuous variables), the two most common measures are mean absolute error (MAE) and root mean square error (RMSE). MAE measures the average absolute size of the errors in a set of predictions, i.e. how far the regression line is from the actual data points on average. RMSE is the square root of the mean of the squared errors, so it penalizes large errors more heavily than MAE.
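For regression, MAE and RMSE can be computed as sketched below. The actual and predicted values are invented for illustration; RMSE is obtained here by taking the square root of Scikit-learn's mean squared error.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical actual values and predictions (errors: -1, 0, +1, +3)
y_actual = np.array([3.0, 5.0, 7.0, 9.0])
y_predicted = np.array([2.0, 5.0, 8.0, 12.0])

# MAE: mean of the absolute errors = (1 + 0 + 1 + 3) / 4
mae = mean_absolute_error(y_actual, y_predicted)
print(mae)  # 1.25

# RMSE: square root of the mean squared error = sqrt((1 + 0 + 1 + 9) / 4)
rmse = np.sqrt(mean_squared_error(y_actual, y_predicted))
print(round(rmse, 4))  # 1.6583
```

RMSE comes out larger than MAE here because the single error of 3 is squared before averaging.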

10. Optimize

Model optimization can be performed manually using a trial and error system or via automation using a method like grid search. Grid search allows you to specify a range of candidate values for each hyperparameter and then methodically tests every combination of those values.
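A minimal grid search sketch using Scikit-learn's GridSearchCV follows. The decision tree and the candidate values in `param_grid` are assumptions chosen for illustration; the post does not specify an algorithm or a grid.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=10)

# Candidate values for each hyperparameter: 3 x 2 = 6 combinations
param_grid = {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5]}

# Each combination is scored with 5-fold cross-validation on the training data
search = GridSearchCV(DecisionTreeClassifier(random_state=10), param_grid,
                      cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print(search.best_params_)           # best combination found
test_score = search.score(X_test, y_test)
print(test_score)                    # accuracy of the best model on the test data
```

The test data is kept out of the search itself, so the final score still reflects data the model has not seen.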
