Titanic Dataset Analysis using Python (Kaggle) | Classification | Machine Learning Project Tutorial
Updated: Apr 9, 2022
The Titanic dataset is a popular classification dataset for beginners. It is a Kaggle project which uses machine learning to predict the survival of passengers on the Titanic. The objective of this project is to submit the prediction results with the best possible accuracy.
In this project tutorial, we are going to train a model using train.csv, which serves for both training and validation. Afterwards, we will use the trained model to predict the results for the test dataset and upload them to Kaggle. We will perform basic training without hyperparameter tuning.
The data has been split into two groups:
training set (train.csv)
test set (test.csv)
The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.
The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.
We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.
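To make the submission format concrete, here is a minimal sketch of the "all and only female passengers survive" baseline. The toy_test frame is a made-up stand-in for test.csv (an assumption for illustration, not real data), and only the two columns needed for the rule are included.

```python
import pandas as pd

# Tiny made-up stand-in for test.csv; the real file has more rows and columns.
toy_test = pd.DataFrame({
    "PassengerId": [892, 893, 894, 895],
    "Sex": ["male", "female", "male", "female"],
})

# Baseline rule: predict 1 (survived) for females and 0 for everyone else.
submission = pd.DataFrame({
    "PassengerId": toy_test["PassengerId"],
    "Survived": (toy_test["Sex"] == "female").astype(int),
})

# Without a path, to_csv returns the CSV text; index=False omits the row index.
csv_text = submission.to_csv(index=False)
print(csv_text)
```

A submission built from a trained model would have the same two columns, with the model's predictions in place of the gender rule.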
pclass: A proxy for socio-economic status (SES) 1st = Upper 2nd = Middle 3rd = Lower
age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5
sibsp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.
The output class is survival, where we have to predict 0 (No) or 1 (Yes).
Let us import all the basic modules we will be needing for this project.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
pandas - used to perform data manipulation and analysis
numpy - used to perform a wide variety of mathematical operations on arrays
matplotlib - used for data visualization and graphical plotting
seaborn - built on top of matplotlib with similar functionalities
%matplotlib inline - a Jupyter magic command that enables inline plotting
warnings - used to control how warnings are handled
filterwarnings('ignore') suppresses the warnings raised by the modules (gives cleaner output)
Load the Dataset
We will use Kaggle to load the data set.
train = pd.read_csv('/kaggle/input/titanic/train.csv')
test = pd.read_csv('/kaggle/input/titanic/test.csv')
train.head()
Later, we will combine the train and test data. It will allow us to preprocess the data all at once.
## statistical info
train.describe()
We will use these summary statistics (mean, minimum and maximum values) to decide how to fill the missing values.
## datatype info
train.info()
We will convert the string values into integers later.
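As a rough sketch of what converting string values into integers can look like, the snippet below uses plain pandas on a made-up frame. The sex_codes mapping and the toy data are assumptions for illustration; the tutorial may use a different encoder at that step.

```python
import pandas as pd

# Made-up rows with the two string columns we care about.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Option 1: an explicit, hypothetical mapping chosen by hand.
sex_codes = {"male": 0, "female": 1}
toy["Sex"] = toy["Sex"].map(sex_codes)

# Option 2: let pandas assign codes; categories are sorted (C=0, Q=1, S=2).
toy["Embarked"] = toy["Embarked"].astype("category").cat.codes

print(toy)
```

An explicit mapping keeps the meaning of each code visible, while category codes are convenient when a column has many distinct values.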
Exploratory Data Analysis
Before preprocessing let us explore the categorical columns.
## categorical attributes
# pass the column as x=; newer seaborn versions no longer accept it positionally
sns.countplot(x=train['Survived'])
For 'Survived', the distribution of the two classes is reasonable.
For 'Pclass', the distribution is uneven due to the large number of 3rd class passengers.
For 'Sex', we observe more males than females.
For the family counts, 0 indicates that the passenger is travelling solo.
Embarked contains the boarding port of each passenger.
There are three ports, with S having the largest number of passengers.
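The counts behind such a count plot can also be read directly with value_counts. This is a small sketch on made-up values, not the real column.

```python
import pandas as pd

# Made-up boarding ports, including one missing value.
toy = pd.DataFrame({"Embarked": ["S", "S", "C", "Q", "S", None]})

# value_counts skips missing values by default and sorts by frequency.
counts = toy["Embarked"].value_counts()

# normalize=True gives the share of each category instead of the raw count.
shares = toy["Embarked"].value_counts(normalize=True)

print(counts)
print(shares)
```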
Let us explore the numerical columns.
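## numerical attributes
# histplot with kde=True replaces distplot, which is deprecated since seaborn 0.11
sns.histplot(train['Age'], kde=True)
The graph shows a bell curve, indicating an approximately normal distribution.
Where a curve is right-skewed, we need to preprocess the data to bring it closer to a normal distribution.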
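Besides looking at the plot, skewness can be checked numerically with Series.skew. The two series below are made-up samples, one symmetric and one with a long right tail, purely to illustrate how the number behaves.

```python
import pandas as pd

# Made-up symmetric sample: values mirror each other around 30.
symmetric = pd.Series([20.0, 25.0, 30.0, 35.0, 40.0])

# Made-up right-skewed sample: most values are small, a few are very large.
right_skewed = pd.Series([7.25, 7.9, 8.05, 13.0, 26.0, 30.0, 71.3, 263.0])

skew_symmetric = symmetric.skew()   # close to 0 for a symmetric sample
skew_right = right_skewed.skew()    # positive for a long right tail

print(skew_symmetric, skew_right)
```

A value near 0 suggests a roughly symmetric column, while a clearly positive value points to a right-skewed one.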
Let us compare ticket classes by creating a new graph using a pivot table.
# pivot_table averages the values by default
class_fare = train.pivot_table(index='Pclass', values='Fare')
class_fare.plot(kind='bar')
plt.xlabel('Pclass')
plt.ylabel('Avg. Fare')
plt.xticks(rotation=0)
plt.show()
It will help us to make an assumption about the relationship between fares and the ticket class.
Let us now compare the total fare collected per Pclass using another pivot table.
# aggfunc='sum' totals the fares instead of averaging them
class_fare = train.pivot_table(index='Pclass', values='Fare', aggfunc='sum')
class_fare.plot(kind='bar')
plt.xlabel('Pclass')
plt.ylabel('Total Fare')
plt.xticks(rotation=0)
plt.show()
All these visualizations help in understanding the variation of the dataset depending on the attributes.
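The numbers behind the two fare charts can be inspected without plotting. In this sketch on a made-up frame, pivot_table is used with its default aggregation (the mean) and with a sum, and the mean is compared against the equivalent groupby.

```python
import pandas as pd

# Made-up fares for six passengers across three classes.
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Fare": [80.0, 60.0, 20.0, 8.0, 7.0, 9.0],
})

# Default aggregation is the mean of 'Fare' per 'Pclass'.
avg_fare = toy.pivot_table(index="Pclass", values="Fare")

# aggfunc="sum" gives the total fare per class instead.
total_fare = toy.pivot_table(index="Pclass", values="Fare", aggfunc="sum")

# The same averages, written as a groupby.
avg_by_group = toy.groupby("Pclass")["Fare"].mean()

print(avg_fare)
print(total_fare)
```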
Let us display the relationship between 'Pclass', 'Fare' and 'Survived' with the help of a barplot.
sns.barplot(data=train, x='Pclass', y='Fare', hue='Survived')
This plot compares the average fare of survivors and non-survivors within each passenger class.
Let's swap the roles of 'Pclass' and 'Survived' in the graph.
sns.barplot(data=train, x='Survived', y='Fare', hue='Pclass')
Similar to the previous graph, it compares the average fare, this time grouped by survival first and by passenger class second.
We now combine the train and test datasets.
train_len = len(train)
# combine two dataframes
df = pd.concat([train, test], axis=0)
df = df.reset_index(drop=True)
df.head()
train_len stores the number of rows in the train data, so that the combined data frame can be split back later.
axis=0 concatenates along the rows (the data frames are stacked on top of each other).
axis=1 concatenates along the columns (the data frames are placed side by side).
df.head() displays the first five rows of the data frame.
df.tail() displays the last five rows of the data frame.
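Here is a small sketch of the combine step and of how train_len lets us split the data back afterwards. The two toy frames are made up; as in the real files, the test frame has no 'Survived' column.

```python
import pandas as pd

# Made-up stand-ins for train.csv and test.csv.
toy_train = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Fare": [7.25, 71.3, 8.05],
})
toy_test = pd.DataFrame({
    "PassengerId": [4, 5],
    "Fare": [13.0, 26.0],
})

train_len = len(toy_train)

# Stack the rows; reset_index gives one continuous 0..n-1 index.
df = pd.concat([toy_train, toy_test], axis=0).reset_index(drop=True)

# The test rows have no label, so 'Survived' is NaN for them.
print(df)

# Split back by position once preprocessing is finished.
train_part = df.iloc[:train_len]
test_part = df.iloc[train_len:].drop(columns=["Survived"])
```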
Let us check for NULL values in the dataset.
## find the null values
df.isnull().sum()
The NULL values in 'Survived' belong to the test data, which has no labels. Hence, we can ignore them.
Since 'Cabin' has more than a thousand NULL values, we will drop the column.
For the other columns that show NULL values, we will fill the missing values using the mean (numerical columns) or the mode (categorical columns).
Let us remove the column 'Cabin'.
# drop or delete the column
df = df.drop(columns=['Cabin'])
We will use the mean values to fill the missing values for 'Age' and 'Fare'.
# fill missing values using mean of the numerical column
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Fare'] = df['Fare'].fillna(df['Fare'].mean())
For 'Embarked' we need the mode (the most frequent value) instead of the mean.
mode() returns a Series, because more than one value can be the most frequent, so we will use a subscript to get the first value.
Similarly, we will use the mode value to fill the missing values for 'Embarked'.
# fill missing values using mode of the categorical column
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
We use the mode to fill the missing values of a categorical column.
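The subscript on the mode matters more than it may seem. In this sketch on a made-up series, passing the whole mode result to fillna fills nothing, because pandas aligns the two series by index label, while passing the first value fills every gap.

```python
import pandas as pd

# Made-up boarding ports with two missing values.
embarked = pd.Series(["S", None, "C", "S", None])

# mode() returns a Series; here it holds the single value "S" at index 0.
mode_result = embarked.mode()

# Filling with the Series aligns on index labels, so rows 1 and 4 stay empty.
filled_with_series = embarked.fillna(mode_result)

# Filling with the scalar value fills every missing row.
filled_with_value = embarked.fillna(mode_result[0])

print(filled_with_series.isnull().sum())
print(filled_with_value.tolist())
```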
Log transformation for Normal data distribution
We have to normalize the column 'Fare'.
df['Fare'] = np.log(df['Fare']+1)
If 'Fare' contains a 0 value, the plain logarithm is undefined for it (NumPy returns -inf with a warning).
To resolve this issue, we add 1 to the value before taking the log, so that a fare of 0 maps to 0.
The result is not a perfectly normal distribution, but we can manage with this curve.
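To see the effect of the +1 on a zero fare, here is a small sketch with made-up values. It also shows np.log1p, NumPy's built-in for log(1 + x), and np.expm1, which reverses the transformation.

```python
import numpy as np
import pandas as pd

# Made-up fares, including a free ticket.
fares = pd.Series([0.0, 7.25, 71.28, 512.33])

# The transformation used above: a fare of 0 becomes log(1) = 0.
manual = np.log(fares + 1)

# np.log1p computes the same quantity in one call.
builtin = np.log1p(fares)

# np.expm1 is the inverse, recovering the original fares.
restored = np.expm1(builtin)

print(manual.round(3).tolist())
```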
A correlation matrix is a table showing correlation coefficients between variables. Each cell in the table shows the correlation between two variables. The value is in the range of -1 to 1. If two variables are highly correlated, they carry largely the same information, so we can drop one of them.
# numeric_only=True (pandas 1.5+) skips the string columns, which newer pandas versions refuse to correlate
corr = df.corr(numeric_only=True)
plt.figure(figsize=(15, 9))
sns.heatmap(corr, annot=True, cmap='coolwarm')
'Fare' shows a negative correlation with 'Pclass': the lower the class number, the higher the fare.
Additionally, 'Fare' has some level of correlation with the other attributes. Hence, the 'Fare' column is an essential attribute for this project.
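A single cell of the correlation matrix can be read off directly. The sketch below uses a made-up frame in which first class tickets cost the most, and assumes pandas 1.5 or newer for the numeric_only argument.

```python
import pandas as pd

# Made-up rows: lower class number, higher fare.
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Fare": [80.0, 60.0, 25.0, 20.0, 8.0, 7.0],
    "Sex": ["female", "male", "male", "female", "male", "male"],
})

# numeric_only=True leaves the string column 'Sex' out of the matrix.
corr = toy.corr(numeric_only=True)

# Look up one pair of variables by row and column label.
fare_vs_class = corr.loc["Fare", "Pclass"]

print(corr)
print(fare_vs_class)
```

The diagonal is always 1, since every variable is perfectly correlated with itself, and the matrix is symmetric.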
Let us display the dataset again.