Join us

We’re fixing an issue that may be impacting the tagging system and the links you shared in your timeline. Your previously shared links may not appear in your profile during this maintenance time but will come back.

Cross Validation is a concept which we can use to validate different models and find the best tuning parameters.

At times, in machine learning we always face the questions like :

- How do we decide which model to use?
- Which model will be more efficient?
- Which parameters are the best for a model?

Cross Validation is a concept which we can use to validate different models and find the best tuning parameters.

- First, we divide our dataset in k number of partitions called folds
- Then we use one of the folds as testing set and remaining folds as training set
- We calculate the testing accuracy and repeat the steps such that we have used each fold as testing set and others as training set
- After that we take the average of these accuracy and estimate the combined accuracy by finding the mean
- We do the same for multiple models and pick up the best model for our use case

(*Figure showing cross validation with 5 folds*)

- Importing the python modules

- Loading the iris dataset

Iris dataset is a dataset which contains various features like sepal length/width and petal length/width mapped with a target value of the flower. So basically, here we will be predicting the variety of flower based on it’s characteristics.

- Initializing the features X and target y

- The X is a feature matrix which contains a list of features our model needs to get trained. These are independent variables.

The y is the target variable which we want to predict and it is dependent on X. In this example, y can be either of (0,1,2) representing a variety of flower which our model will predict. - Creating the models and validating the accuracy

- Here our aim is to create two models Logistic Regression and K-Nearest Neighbors and validate them in order to chose the best among them.

We have used 10 fold cross validation(cv=10) to validate both the models.

The KNN algorithm (96.66%) is slightly better than logistic regression (94%) after the evaluation so we would be using that to build our machine learning model.

We can find the best tuning parameters using cross validation. Let’s try to find the most optimal value of k to tune the K-Nearest Neighbours algorithm.

- Firstly, we initialise a range of k values from which we want to chose the most optimal one.
- Then we find the mean accuracy for each run using a different k and chose the best value for k.

In the above example we want to chose the best value of k (1 to 20) and their performance accuracy are stored in k_scores list.

- This plot shows the relationship between k and accuracy score and how k=12 can be the most optimal value of k with 98% accuracy and k=1 can be least optimal value with 95% accuracy.

After validating our dataset with cross validation we would want to use KNN Algorithm to create our model with the best value of the tuning parameter k to be 12.

This is an introduction to machine leaning model validation and I’ll be posting many more content of machine learning and data science.

Thanks for reading , I’ll be posting more contents related to this. :)

Join other developers and claim your FAUN account now!

Influence

Total Hits

Posts

Only registered users can post comments. Please, login or signup.