Complete Machine Learning Project for Beginners

Muhammad Iqbal bazmi
Published in Analytics Vidhya
5 min read · Oct 15, 2019


Photo by Jeffrey Eisen on Unsplash

Iris Classification: A Multi-class Classification

Hi! In this blog, I am gonna show you how to build a complete Machine Learning (ML) project for beginners.

The problem we are gonna solve today is Iris flower classification, a multiclass classification problem that is often called the “Hello World” of Machine Learning.

The Agenda for today:

  • Machine Learning
  • Template of predictive Analytics project

Machine Learning

Machine Learning is a way to enable a computer to solve specific tasks without being explicitly programmed for them.

In reality, machines do not learn in the human sense. Machine Learning (ML) uses a statistical model, fitted to data, to make decisions (ref: The Hundred-Page Machine Learning Book).

I assume here that you are somewhat familiar with the basics of machine learning.

Let’s see the template to solve real-world Machine Learning(ML) projects.

Template of ML (Predictive Analytics) project

(Inspired by Jason Brownlee)

1. Prepare Problem

  • Load libraries
  • Load dataset

2. Summarize Data

  • Descriptive statistics
  • Data Visualizations

3. Prepare Data

  • Data Cleaning
  • Feature Selection
  • Data Transform

4. Evaluate Algorithms

  • Split-out validation dataset
  • Test options and evaluation metric
  • Spot Check Algorithms
  • Compare Algorithms

5. Improve Accuracy

  • Algorithm Tuning
  • Ensembles

6. Finalize Model

  • Predictions on the validation dataset
  • Create a standalone model on the entire training dataset
  • Save the model for later use

A Jupyter notebook is recommended, although you can use any IDE. I am gonna use Jupyter notebook code in this blog.

Prepare Problem

Load Libraries

Before moving on to model development, first load the libraries we need.

fig 1. Load important libraries ( Inspired by Jason Brownlee)

Load dataset

The dataset should be in the same folder as your Python file.

To understand the dataset in detail, just click here, then continue with this blog.

fig 2. load dataset
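
The two steps above (load libraries, load dataset) can be sketched roughly as follows. This is a minimal sketch, not the exact code from the figures: the CSV file name `Iris.csv` and the feature column names are assumptions, and to keep the example runnable without the file it rebuilds an equivalent 150 x 6 table from the copy of Iris that ships with scikit-learn.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: the post reads a CSV that sits next to the notebook, e.g.
#   dataset = pd.read_csv("Iris.csv")   # hypothetical file name
# To stay self-contained, this sketch rebuilds an equivalent table from
# scikit-learn's bundled copy of the Iris data instead.
iris = load_iris()
feature_names = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

dataset = pd.DataFrame(iris.data, columns=feature_names)
dataset["Species"] = iris.target_names[iris.target]         # map 0/1/2 to species names
dataset.insert(0, "Id", list(range(1, len(dataset) + 1)))   # 1-based row id column

print(dataset.columns.tolist())
```

If you do have the CSV in your folder, the commented `pd.read_csv` line is the only one you need in place of the rebuild.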

Summarize Data

The dimension of the dataset

fig 3. dimension of the dataset

Fig 3 above shows that the dataset has 150 rows and 6 columns.

Peek at the Data

fig 4. Peek at the Data

Bottom of the Data

fig 5. Bottom of the Data

Description

fig 6. Description of the data(Statistical Analysis)

Class Distribution

The class distribution shows how many classes there are in the dataset and how many instances belong to each class.

fig 7. group of Species
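
The summary steps above (dimension, peek, bottom, description, class distribution) map onto a handful of pandas calls. The sketch below is one way to write them; the column names and the rebuilt table are assumptions standing in for the CSV used in the post.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: stand-in for the CSV used in the post (Id + 4 features + Species).
iris = load_iris()
dataset = pd.DataFrame(
    iris.data,
    columns=["sepal_length", "sepal_width", "petal_length", "petal_width"],
)
dataset["Species"] = iris.target_names[iris.target]
dataset.insert(0, "Id", list(range(1, len(dataset) + 1)))

print(dataset.shape)       # dimension of the dataset: (rows, columns)
print(dataset.head(5))     # peek at the first rows
print(dataset.tail(5))     # bottom of the data
print(dataset.describe())  # count, mean, std, min, quartiles, max per numeric column

# Class distribution: number of rows for each species
class_counts = dataset.groupby("Species").size()
print(class_counts)
```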

Data Visualization

Pair Plot

fig 8. code for pair plot using Seaborn
fig 9. Pair Plot

The pair plot above makes it clear that two features, petal_length and petal_width, are the important ones: they separate the species much more cleanly than the sepal measurements do.

Evaluate Some Algorithms

Now let’s create some models of the given data and estimate their accuracy on unseen data.

Steps to Evaluate Algorithms

  1. Separate out a validation dataset.
  2. Setup the test harness to use 10-fold cross-validation.
  3. Build 5 different models to predict species from flower measurements.
  4. Select the best model.

Please drop the ‘Id’ column before going further, using dataset = dataset.drop(columns='Id'). Note that drop() returns a new DataFrame, so the result has to be assigned back.

fig 10. Code for Validation dataset.
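
Splitting out a validation set could look like the sketch below. It assumes the ‘Id’ column has already been dropped, an 80/20 split, and a fixed random seed of 7; those numbers and the variable names are illustrative choices, not values confirmed by the figure.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: stand-in for the post's dataset after 'Id' has been dropped.
iris = load_iris()
dataset = pd.DataFrame(
    iris.data,
    columns=["sepal_length", "sepal_width", "petal_length", "petal_width"],
)
dataset["Species"] = iris.target_names[iris.target]

# Work on the underlying array rather than the DataFrame.
array = dataset.values
X = array[:, 0:4].astype(float)  # the four flower measurements
y = array[:, 4].astype(str)      # the species label

# Hold back 20% of the rows as unseen validation data (assumed split size).
X_train, X_validation, y_train, y_validation = train_test_split(
    X, y, test_size=0.20, random_state=7
)
print(X_train.shape, X_validation.shape)
```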

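Question: Why did I use an array instead of a pandas DataFrame?

Answer: Because a plain NumPy array carries less overhead than a pandas DataFrame (no index or column labels to maintain), so slicing and passing it to the models is typically faster.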

Spot-Check Algorithms

fig 11. Spot-Check Algorithms
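
A spot-check with 10-fold cross-validation can be sketched as below. The post does not list which five models were used, so the five here (logistic regression, k-nearest neighbours, a decision tree, Gaussian naive Bayes, and an SVM) are an assumption, as are the seed and the hyperparameters. The scores you get may differ from the figure.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: stand-in data and the same 80/20 split as a validation hold-out.
iris = load_iris()
X, y = iris.data, iris.target_names[iris.target]
X_train, X_validation, y_train, y_validation = train_test_split(
    X, y, test_size=0.20, random_state=7
)

# Assumed set of five candidate models (not confirmed by the post).
models = {
    "LR": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "CART": DecisionTreeClassifier(random_state=7),
    "NB": GaussianNB(),
    "SVM": SVC(gamma="auto"),
}

# Test harness: 10-fold cross-validation on the training part only.
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=kfold, scoring="accuracy")
    results[name] = scores
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")

# Pick the model with the highest mean accuracy.
best_name = max(results, key=lambda n: results[n].mean())
print("Best:", best_name)
```

Keeping the validation rows out of the cross-validation loop means the final accuracy check is still done on data that no model has seen.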

The picture above shows that SVM is the best choice among the algorithms we tried, so we will use it to build our model.

Make predictions

SVM was the most accurate model that we tested, so I am gonna make predictions using the Support Vector Machine (SVM).

Let’s create a model

fig 12. Creating a model

Let’s fit the model

fig 13. fitting the model using fit() method

Let’s make predictions

fig 14. predicting result on unseen data
fig 15. predicted results

Let’s check the accuracy of this model

fig 16. accuracy of the model

Let’s see the confusion matrix of the predicted result

fig 17. confusion matrix

Let’s see the classification report

fig 18. classification report.
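
Put together, the steps from creating the model to the classification report could look like the following sketch. The `gamma="auto"` setting, the split, and the seed are assumptions, so the numbers printed may not match the figures.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Assumption: stand-in data and an 80/20 train/validation split.
iris = load_iris()
X, y = iris.data, iris.target_names[iris.target]
X_train, X_validation, y_train, y_validation = train_test_split(
    X, y, test_size=0.20, random_state=7
)

model = SVC(gamma="auto")                    # create the model
model.fit(X_train, y_train)                  # fit it on the training data
predictions = model.predict(X_validation)    # predict on unseen data

accuracy = accuracy_score(y_validation, predictions)
print("Accuracy:", accuracy)

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_validation, predictions, labels=model.classes_)
print(matrix)

# Precision, recall and F1-score per species.
print(classification_report(y_validation, predictions))
```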

Save the model for later use

Save the model to disk

Some time later, load the model back from disk
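
One common way to do this is with Python's built-in pickle module, sketched below. The file name `finalized_model.sav` and the temporary folder are hypothetical, and the model is trained on the bundled Iris data as a stand-in.

```python
import os
import pickle
import tempfile

from sklearn.datasets import load_iris
from sklearn.svm import SVC

# Assumption: a fitted SVM standing in for the finalized model.
iris = load_iris()
X, y = iris.data, iris.target_names[iris.target]
model = SVC(gamma="auto").fit(X, y)

# Save the model to disk (hypothetical file name in a temporary folder).
filename = os.path.join(tempfile.mkdtemp(), "finalized_model.sav")
with open(filename, "wb") as f:
    pickle.dump(model, f)

# Some time later: load the model back and use it for predictions.
with open(filename, "rb") as f:
    loaded_model = pickle.load(f)

print(loaded_model.predict(X[:3]))
```

Only unpickle files you trust, and load them with the same scikit-learn version that saved them where possible.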

Please check the below link.

You can use the above template to solve any real-world Classification problem.

Feel free to reach out with any queries or questions.

About the author: Muhammad Iqbal bazmi is a self-taught programmer, Data Scientist and Machine Learning Engineer.