Iris Dataset Analysis using Python | Classification | Machine Learning Project Tutorial

Hackers Realm
Mar 1, 2022

Updated: Jun 3, 2023

Unveil the secrets of the Iris dataset with Python! This comprehensive tutorial dives into classification techniques and machine learning algorithms to analyze and classify Iris flowers based on their features. Learn to preprocess data, train models, and evaluate their performance. Enhance your skills in data analysis and machine learning, and unlock the power of the Iris dataset. Join this project tutorial to unravel the patterns hidden within the flowers and master the art of classification with Python.


In this project tutorial, we are going to analyze the tabular data with various visualizations and build a robust machine learning model to predict the class of the flower.

You can watch the video-based tutorial with a step-by-step explanation below.

Dataset Information

The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.

Attribute Information:

sepal length in cm
sepal width in cm
petal length in cm
petal width in cm
species
- Iris Setosa
- Iris Versicolour
- Iris Virginica

Download the Iris Dataset here

Import modules

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

pandas - used to perform data manipulation and analysis
numpy - used to perform a wide variety of mathematical operations on arrays
matplotlib - used for data visualization and graphical plotting
seaborn - built on top of matplotlib, with a higher-level interface for statistical plots
warnings - used to control how warning messages are shown

filterwarnings('ignore') suppresses the warnings raised by the modules. This keeps the output clean, but it can also hide useful messages.
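
Silencing every warning is the bluntest option. As a hedged alternative, the sketch below suppresses only one warning category and lets the rest through. The function noisy() is a hypothetical stand-in for a library call and is not part of this project.

```python
import warnings

def noisy():
    # hypothetical function that emits a FutureWarning, as some library calls do
    warnings.warn("this API will change", FutureWarning)
    return 42

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")                            # record every warning by default
    warnings.filterwarnings("ignore", category=FutureWarning)  # but drop FutureWarning only
    result = noisy()
    warnings.warn("something looks off", UserWarning)

# only the UserWarning is left in the recorded list
print([w.category.__name__ for w in caught])
```

The filter added last is checked first, which is why the FutureWarning is dropped while the UserWarning is still recorded.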

Loading the Dataset

# load the csv data
df = pd.read_csv('Iris.csv')
df.head()

pd.read_csv() loads the csv (comma-separated values) data into a dataframe
df.head() displays the first 5 rows of the dataframe
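
If Iris.csv is not at hand, a dataframe with the same layout can be built from the copy of the dataset that ships with scikit-learn. This is only a sketch: the helper name load_iris_like_csv is made up for illustration, and renaming the columns to the names used in this tutorial (Id, SepalLengthCm, ..., Species) is an assumption about how the csv is laid out.

```python
import pandas as pd
from sklearn.datasets import load_iris

def load_iris_like_csv():
    # hypothetical helper: mimic the layout of Iris.csv using scikit-learn's bundled data
    bunch = load_iris(as_frame=True)
    frame = bunch.frame.copy()
    frame.columns = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm', 'Species']
    # map the integer targets 0/1/2 to the string labels used in this tutorial
    names = {i: 'Iris-' + str(name) for i, name in enumerate(bunch.target_names)}
    frame['Species'] = frame['Species'].map(names)
    # add a 1-based row counter as the first column
    frame.insert(0, 'Id', range(1, len(frame) + 1))
    return frame

df = load_iris_like_csv()
print(df.head())
```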

# delete the Id column (a row counter, not a feature)
df = df.drop(columns = ['Id'])
df.head()

# to display stats about data
df.describe()

Statistical Information about Iris Flower Dataset

# to get basic info about datatypes
df.info()

Data type Information about Iris Flower Dataset

All the input attributes (columns 0-3) have the float64 data type and the output attribute (column 4) has the object data type, i.e. it holds strings

# to display no. of samples on each class
df['Species'].value_counts()

value_counts() returns a pandas Series with the count of each unique value, sorted from most to least frequent.
We have 50 samples in each output class
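
A small sketch of what value_counts() returns, using a toy label column (the five values are made up for illustration and stand in for df['Species']):

```python
import pandas as pd

# toy label column standing in for df['Species']
labels = pd.Series(['Iris-setosa', 'Iris-setosa', 'Iris-versicolor',
                    'Iris-virginica', 'Iris-setosa'])

counts = labels.value_counts()                 # Series indexed by label, largest count first
as_dict = counts.to_dict()                     # plain dict, if one is needed
shares = labels.value_counts(normalize=True)   # relative frequencies instead of counts

print(counts)
```

With normalize=True the same call reports class proportions, which is a quick way to spot an imbalanced dataset.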

Preprocessing the Dataset

Let's check for NULL values in the dataset

# check for null values
df.isnull().sum()

There are no NULL values present in the dataset.
If any NULL values were present, we would have to fill them (or drop the affected rows) before proceeding to model training.
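
The Iris data needs no cleaning, so the following is only a sketch of what handling NULL values could look like. The toy dataframe and its values are made up for illustration, and filling with the column mean is just one common choice.

```python
import numpy as np
import pandas as pd

# toy frame with gaps; the values are illustrative, not taken from Iris.csv
toy = pd.DataFrame({
    'SepalLengthCm': [5.1, np.nan, 6.3, 5.8],
    'Species': ['Iris-setosa', 'Iris-setosa', None, 'Iris-virginica'],
})

print(toy.isnull().sum())   # one missing value in each column

# step 1: drop rows whose label is missing (a class label cannot be sensibly imputed)
cleaned = toy.dropna(subset=['Species']).copy()
# step 2: fill the remaining missing numeric values with the column mean
cleaned['SepalLengthCm'] = cleaned['SepalLengthCm'].fillna(cleaned['SepalLengthCm'].mean())

print(cleaned)
```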

Exploratory Data Analysis

In Exploratory Data Analysis (EDA), we visualize the data with different kinds of plots and draw inferences from them. It helps to find patterns and relations within the data.

# histograms
df['SepalLengthCm'].hist()

df['SepalWidthCm'].hist()

df['PetalLengthCm'].hist()

df['PetalWidthCm'].hist()

Sepal Length and Sepal Width are roughly bell-shaped, close to a normal distribution
Petal Length and Petal Width each show two separate bells, because the measurements of different species fall into different ranges
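
The two bells can also be checked numerically. The sketch below summarizes petal length per species. It uses scikit-learn's bundled copy of the data so that it runs without Iris.csv, which is an assumption; the species labels therefore appear without the 'Iris-' prefix.

```python
import pandas as pd
from sklearn.datasets import load_iris

bunch = load_iris(as_frame=True)
petal = pd.DataFrame({
    'PetalLengthCm': bunch.data['petal length (cm)'],
    'Species': bunch.target.map(dict(enumerate(bunch.target_names))),
})

# min, mean and max petal length for each species
summary = petal.groupby('Species')['PetalLengthCm'].agg(['min', 'mean', 'max'])
print(summary)

# distance between setosa's longest petal and the shortest petal of the other two species
gap = summary.loc[['versicolor', 'virginica'], 'min'].min() - summary.loc['setosa', 'max']
print("gap:", gap)
```

A positive gap means setosa's petal lengths do not overlap with those of the other two species, which is what produces the separate left-hand bell in the histogram.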

Let's create some scatter plots for inference

# create list of colors and class labels
colors = ['red', 'orange', 'blue']
species = ['Iris-virginica', 'Iris-versicolor', 'Iris-setosa']

df[df['Species'] == species[i]] - filters samples for each class label
plt.scatter() - generates a scatterplot for the data
plt.xlabel() - label for x-axis
plt.ylabel() - label for y-axis
plt.legend() - display the legend for the plot

for i in range(3):
    # filter data on each class
    x = df[df['Species'] == species[i]]
    # plot the scatter plot
    plt.scatter(x['SepalLengthCm'], x['SepalWidthCm'], c = colors[i], label=species[i])
plt.xlabel("Sepal Length")
plt.ylabel("Sepal Width")
plt.legend()

Scatter Plot on Sepal Length and Sepal Width

for i in range(3):
    # filter data on each class
    x = df[df['Species'] == species[i]]
    # plot the scatter plot
    plt.scatter(x['PetalLengthCm'], x['PetalWidthCm'], c = colors[i], label=species[i])
plt.xlabel("Petal Length")
plt.ylabel("Petal Width")
plt.legend()

Scatter Plot on Petal Length and Petal Width

for i in range(3):
    # filter data on each class
    x = df[df['Species'] == species[i]]
    # plot the scatter plot
    plt.scatter(x['SepalLengthCm'], x['PetalLengthCm'], c = colors[i], label=species[i])
plt.xlabel("Sepal Length")
plt.ylabel("Petal Length")
plt.legend()

Scatter Plot on Sepal Length and Petal Length

for i in range(3):
    # filter data on each class
    x = df[df['Species'] == species[i]]
    # plot the scatter plot
    plt.scatter(x['SepalWidthCm'], x['PetalWidthCm'], c = colors[i], label=species[i])
plt.xlabel("Sepal Width")
plt.ylabel("Petal Width")
plt.legend()

Scatter Plot on Sepal Width and Petal Width

Here we can see that iris-setosa is easily separable from the other 2 classes
In the petal length and petal width plot, the classes are separated most clearly, with only slight overlap between iris-versicolor and iris-virginica
In the other plots, more samples overlap with other classes
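
The four plotting loops above differ only in the two column names, so they could be folded into one function. The following is a sketch of that idea; the name scatter_by_species is hypothetical, it lets matplotlib pick default colors instead of the colors list, and the six-row sample is only there to keep the block self-contained.

```python
import matplotlib.pyplot as plt
import pandas as pd

def scatter_by_species(data, x_col, y_col, ax=None):
    # hypothetical helper: draw one scatter series per class label
    if ax is None:
        _, ax = plt.subplots()
    for label, group in data.groupby('Species'):
        ax.scatter(group[x_col], group[y_col], label=label)
    ax.set_xlabel(x_col)
    ax.set_ylabel(y_col)
    ax.legend()
    return ax

# tiny illustrative sample (two rows per species)
sample = pd.DataFrame({
    'PetalLengthCm': [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    'PetalWidthCm':  [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    'Species': ['Iris-setosa'] * 2 + ['Iris-versicolor'] * 2 + ['Iris-virginica'] * 2,
})
ax = scatter_by_species(sample, 'PetalLengthCm', 'PetalWidthCm')
```

With the full dataframe, a call such as scatter_by_species(df, 'SepalLengthCm', 'SepalWidthCm') would take the place of one loop.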

Correlation Matrix

A correlation matrix is a table showing correlation coefficients between variables. Each cell in the table shows the correlation between two variables. The value is in the range of -1 to 1. If two input variables are very highly correlated, they carry largely the same information, so we can consider dropping one of them.
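
To make the coefficient concrete, here is a minimal sketch that computes the Pearson correlation (the default method of the pandas corr() function) by hand with numpy. The function name pearson and the three short value lists are made up for illustration.

```python
import numpy as np

def pearson(x, y):
    # covariance of x and y divided by the product of their standard deviations
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# illustrative values: y grows with x, z shrinks as x grows
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.9]
z = [10.0, 8.0, 6.0, 4.0, 2.0]

r_xy = pearson(x, y)   # close to +1: the two rise together
r_xz = pearson(x, z)   # -1: z falls on a perfectly straight decreasing line
print(r_xy, r_xz)
```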

# display the correlation matrix (numeric columns only; 'Species' still holds text here)
df.drop(columns=['Species']).corr()

Correlation matrix of Iris Flower Dataset

corr = df.drop(columns=['Species']).corr()
# plot the heat map
fig, ax = plt.subplots(figsize=(5,4))
sns.heatmap(corr, annot=True, ax=ax, cmap = 'coolwarm')

Petal length and petal width have a high positive correlation of 0.96
When the petal length increases, the petal width tends to increase as well
Sepal length has a high positive correlation with petal length and petal width
Sepal width has a negative correlation with petal length and petal width

Label Encoder

In machine learning, we usually deal with datasets that contain labels in one or more columns. These labels can be in the form of words or numbers. Label encoding converts the labels into numeric form so that they are machine-readable.

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# transform the string labels to integer
df['Species'] = le.fit_transform(df['Species'])
df.head()
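
A small sketch of what the encoder does with the labels, using a short made-up list instead of the full column. LabelEncoder assigns the integer codes in sorted label order, and inverse_transform() maps the codes back to the strings.

```python
from sklearn.preprocessing import LabelEncoder

labels = ['Iris-virginica', 'Iris-setosa', 'Iris-versicolor', 'Iris-setosa']

le = LabelEncoder()
encoded = le.fit_transform(labels)

# classes_ holds the original labels in sorted order; the position is the integer code
mapping = {str(label): code for code, label in enumerate(le.classes_)}
decoded = le.inverse_transform(encoded)   # back to the string labels

print(mapping)
print(encoded)
```

Keeping the fitted encoder around is useful later, for example to turn a model's numeric predictions back into species names.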

Model Training and Testing

Now that the preprocessing is done, let's perform the model training and testing

from sklearn.model_selection import train_test_split
## train - 70%
## test - 30%

# input data
X = df.drop(columns=['Species'])
# output data
Y = df['Species']
# split the data for train and test
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.30)

X - contains input attributes
Y - contains the output attribute
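train_test_split() - shuffles the data and splits it for training and testing (here 70% of the data is used for training and 30% for testing). Since no random_state is set, the split and the resulting accuracies change from run to run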
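
For a repeatable split, train_test_split() accepts a random_state, and stratify keeps the class proportions equal in both parts. The sketch below shows both options; it loads the data from scikit-learn's bundled copy so that it runs on its own, and the seed 42 and the variable names are arbitrary choices, not part of the tutorial's code.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_all, y_all = load_iris(return_X_y=True)

# random_state fixes the shuffle; stratify keeps the class proportions in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.30, random_state=42, stratify=y_all
)

print(X_tr.shape, X_te.shape)
print(np.bincount(y_te))   # number of test samples per class
```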

Let's import some models and train

# logistic regression 
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

# model training
model.fit(x_train, y_train)

fit() - used for training the model with the data

# print metric to get performance
print("Accuracy: ",model.score(x_test, y_test) * 100)

Accuracy: 91.11111111111111

model.score() - gives the mean accuracy on the test data, i.e. the fraction of test samples that are classified correctly
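
To see what the score stands for, the sketch below recomputes it by hand and also prints a confusion matrix, which shows which classes get mixed up. It is self-contained and therefore reloads the data from scikit-learn; the seed, max_iter=1000 and the variable names are assumptions made for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X_all, y_all = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.30, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)

# accuracy by hand: fraction of test samples whose predicted class equals the true class
manual_accuracy = (pred == y_te).mean()
print(manual_accuracy, clf.score(X_te, y_te), accuracy_score(y_te, pred))

# rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, pred, labels=[0, 1, 2])
print(cm)
```

The diagonal of the matrix counts the correct predictions, so its sum divided by the number of test samples gives the same accuracy.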

# knn - k-nearest neighbours
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()

model.fit(x_train, y_train)

# print metric to get performance
print("Accuracy: ",model.score(x_test, y_test) * 100)

Accuracy: 100.0

# decision tree
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()

model.fit(x_train, y_train)

# print metric to get performance
print("Accuracy: ",model.score(x_test, y_test) * 100)

Accuracy: 91.11111111111111

Final Thoughts

KNN reached 100% accuracy on this particular test split. Because the split is random and the test set has only 45 samples, the numbers will vary from run to run
You can also try out other machine learning models in the same way
More EDA can be done with boxplots, violin plots, bar plots, etc.
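
Since a single random split gives a noisy estimate, one option is k-fold cross-validation. The sketch below compares the three models from this tutorial with 5 folds. The fold count, max_iter=1000 and random_state=0 are choices made here for illustration, and the data is loaded from scikit-learn so that the block runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X_all, y_all = load_iris(return_X_y=True)

models = {
    'logistic regression': LogisticRegression(max_iter=1000),
    'knn': KNeighborsClassifier(),
    'decision tree': DecisionTreeClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    # 5 folds: every sample is used for testing exactly once
    scores = cross_val_score(model, X_all, y_all, cv=5)
    results[name] = scores
    print(f"{name}: {scores.mean() * 100:.1f}% (+/- {scores.std() * 100:.1f})")
```

The mean over the folds is a steadier basis for comparing models than the accuracy of one 70/30 split.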

In this project tutorial, we have learned how to train machine learning classification models on the iris flower dataset. We also covered data analysis, visualization, data transformation, and model creation.

Get the project notebook from here

Thanks for reading the article!!!

Check out more project videos from the YouTube channel Hackers Realm