Hackers Realm

# How to Fill Missing Values in Dataset | Machine Learning | Python

Filling missing values in a dataset is an essential step in data preprocessing to ensure the accuracy and reliability of your analysis or machine learning model. There are several techniques you can use to fill missing values based on the nature of your data and the characteristics of the missing values.

We will see basic to advanced techniques to fill the missing values in the dataset. It's important to understand the characteristics of the missing data and choose an appropriate approach that preserves the integrity and meaningfulness of your dataset.

You can watch the video-based tutorial with step by step explanation down below.
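The snippets below use **pd** and **sns** without showing the imports, so the following setup is assumed throughout (seaborn is only needed for the plotting cells):

```python
# Imports assumed throughout this tutorial (not shown in the snippets below)
import pandas as pd    # DataFrame operations: read_csv, fillna, dropna, groupby
import seaborn as sns  # distribution and count plots
```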

**Load the Dataset**

**We will read the data from the csv file.**

```
df = pd.read_csv('data/Loan Prediction Dataset.csv')
df.head()
```

The code snippet reads a CSV file named '**Loan Prediction Dataset.csv**' into a Pandas DataFrame object named '**df**' and then displays the first few rows of the DataFrame using the **head()** function.

**Check the missing values**

**First we will see the count of missing values in each column.**

```
# check null values
df.isnull().sum()
```

Calling the **isnull()** method on the DataFrame **df** returns a DataFrame of the same shape, with True where the data is missing (null) and False where it is present. The **sum()** function is then applied to the result, summing the True values along each column to give the number of missing values per column.

**Different Approaches to fill missing values**

**Let us see basic to advanced approaches to fill missing values in the dataset.**

**1) Fill with negative values**

Filling missing values with negative values can be suitable in certain scenarios: when the absence of data itself carries information, when zero is a valid value and therefore cannot serve as a sentinel, when you want to preserve the missingness pattern, or when downstream algorithms can handle a sentinel value.

**First let us copy the existing dataframe to a new dataframe.**

`new_df = df.copy()`

The code snippet creates a new DataFrame named **new_df** that is a copy of the existing DataFrame **df**. This is commonly done to create a separate copy of the original DataFrame, ensuring that any modifications made to **new_df** do not affect the original DataFrame **df**.

**Next let us fill the missing values with a negative value.**

```
new_df = df.fillna(-999)
new_df.isnull().sum()
```

The code snippet fills the missing values in **df** with a specific value, **-999**, and then checks the number of null values in each column of **new_df** using **new_df.isnull().sum()**.

We can see that the count of missing values has become zero after filling it with negative values.
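As a quick self-contained check on toy data (not the loan dataset), a sentinel such as **-999** removes every NaN while staying easy to spot later:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing numerical and one missing categorical value
toy = pd.DataFrame({'LoanAmount': [120.0, np.nan, 150.0],
                    'Gender': ['Male', np.nan, 'Female']})
filled = toy.fillna(-999)
print(filled.isnull().sum().sum())  # 0 -> no missing values remain
```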

**2) Consider NULL value as a new category**

Instead of filling null values with a specific value, we can treat them as a distinct category or label in the dataset by assigning them a unique value or string that represents the 'missing' category.

**First let us copy the existing dataframe to a new dataframe.**

`new_df = df.copy()`

**Next let us see the count of each category in a particular column.**

`df['Gender'].value_counts()`

**Next let us consider null values as a different category.**

```
# consider nan as category
new_df['Gender'] = df['Gender'].fillna('nan')
```

The **fillna()** method replaces the **NaN** values in the '**Gender**' column with the string '**nan**'. The modified column is assigned back to **new_df['Gender']**, so the original DataFrame **df** is left unchanged.

**Next let us see the count of each category after filling the null values.**

`new_df['Gender'].value_counts()`

We can see that there is a new category **nan** with a count of **13**, which matches the count of missing values in the **Gender** column.

**3) Drop rows which have NULL values**

If the missing values are sparse and don't significantly affect your dataset's overall integrity, you may choose to drop the rows or columns containing missing values. However, this approach should be used cautiously to avoid losing valuable information.

**First let us copy the existing dataframe with missing values to a new dataframe.**

`new_df = df.copy()`

**Next let us check the length of the dataframe with missing values.**

`len(new_df)`

614

**Next let us drop the rows which have null values.**

```
new_df = df.dropna(axis=0)
len(new_df)
```

480

**df.dropna(axis=0)** drops all rows from **df** that contain any **NaN** values (**axis=0** refers to rows). The resulting DataFrame is stored in **new_df**. The **len()** function then returns the number of rows in **new_df**. We can see that the new dataframe is shorter than the original, as the rows with null values have been removed.

**Next let us check the count of null values in each column after dropping the rows with missing values.**

`new_df.isnull().sum()`
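**dropna** can also drop columns instead of rows, or keep rows that still have a minimum number of non-null values via the **thresh** parameter; a small sketch on toy data (these variants are not used in the tutorial itself):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'a': [1, np.nan, 3],
                    'b': [np.nan, np.nan, 6],
                    'c': [7, 8, 9]})
rows_kept = toy.dropna(axis=0)        # drop rows containing any NaN
cols_kept = toy.dropna(axis=1)        # drop columns containing any NaN
at_least_two = toy.dropna(thresh=2)   # keep rows with at least 2 non-null values
print(len(rows_kept), cols_kept.shape[1], len(at_least_two))  # 1 1 2
```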

**4) Fill missing values with mean, median and mode**

For numerical data, you can replace missing values with the mean or median of the available values in that column; for categorical data, with the mode. This approach assumes that the missing values are most likely to follow the same distribution as the available data.

**First let us copy the existing dataframe to a new dataframe.**

`new_df = df.copy()`

**Next let us calculate the mean value for one of the columns. Here we have considered the LoanAmount column as it is a numerical column.**

`df['LoanAmount'].mean()`

**Next let us plot the distribution of the column.**

`sns.histplot(df['LoanAmount'], kde=True)`

This plots the distribution of the '**LoanAmount**' column from the DataFrame **df** as a histogram with a kernel density estimate overlaid.

**Next let us fill the null values in LoanAmount column with the calculated mean value.**

```
# fill missing value for numerical
new_df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].mean())
new_df.isnull().sum()
```

The code snippet fills the missing values in the '**LoanAmount**' column of **new_df** with the mean of the '**LoanAmount**' column from **df**. This approach assumes that the missing values follow the same distribution as the available data. Finally, **new_df.isnull().sum()** shows the count of remaining missing values in each column after filling the '**LoanAmount**' column with the mean.

We can see that the count of null values in the **LoanAmount** column is now zero, as we have filled them with the mean value of that column.

**Next let us fill another numerical column Loan_Amount_Term with median value.**

```
new_df['Loan_Amount_Term'] = df['Loan_Amount_Term'].fillna(df['Loan_Amount_Term'].median())
new_df.isnull().sum()
```

This fills the missing values in the '**Loan_Amount_Term**' column of **new_df** with the median of the '**Loan_Amount_Term**' column from **df**. Filling with the median is a common strategy for numerical data and is more robust to outliers than the mean. Finally, **new_df.isnull().sum()** shows the count of remaining missing values in each column.

We can see that the count of null values in the **Loan_Amount_Term** column is now zero, as we have filled them with the median value of that column.

**Next let us fill a categorical column with the mode value. Let us plot a count plot of the column Self_Employed.**

`sns.countplot(x=df['Self_Employed'])`

This plots the count of each unique value in the '**Self_Employed**' column from the DataFrame **df**. The resulting plot displays the number of occurrences of each category.

**Next let us find the mode of the Self_Employed column.**

`df['Self_Employed'].mode()[0]`

'No'

**Next let us fill the missing values in Self_Employed with the mode value.**

```
# fill missing value for categorical
new_df['Self_Employed'] = df['Self_Employed'].fillna(df['Self_Employed'].mode()[0])
new_df.isnull().sum()
```

This fills the missing values in the '**Self_Employed**' column of **new_df** with the mode of the '**Self_Employed**' column from **df**. Filling with the mode is a common strategy for categorical data, as it replaces missing values with the most frequently occurring category. Finally, **new_df.isnull().sum()** shows the count of remaining missing values in each column.

We can see that the count of null values in the **Self_Employed** column is now zero, as we have filled them with the mode value of that column.

**5) Fill missing values based on grouping category**

Filling missing values based on grouping categories is suitable when the missing values exhibit a relationship with or dependence on other categorical variables. By computing statistics within each group, you can use the information from similar rows to estimate and fill the missing values more accurately.

**First let us copy the existing dataframe to a new dataframe.**

`new_df = df.copy()`

**Next let us calculate the mean value by grouping on the category column.**

```
mean_df = df.groupby('Loan_Status')['LoanAmount'].mean()
mean_df
```

**df.groupby('Loan_Status')['LoanAmount'].mean()** groups the DataFrame **df** by the '**Loan_Status**' column and calculates the mean of the '**LoanAmount**' column for each group. The result, stored in **mean_df**, is a Series with the average '**LoanAmount**' for each '**Loan_Status**' category. By printing **mean_df**, you can see the mean value for each unique category.

The mean of the **LoanAmount** column is **151.220994** for the **Loan_Status** category **N** and **144.294404** for category **Y**.

**Next let us fill the missing values with the group mean values.**

```
# fill missing value for numerical column
new_df.loc[(new_df['Loan_Status']=='N'), 'LoanAmount'] = new_df.loc[(new_df['Loan_Status']=='N'), 'LoanAmount'].fillna(mean_df['N'])
new_df.loc[(new_df['Loan_Status']=='Y'), 'LoanAmount'] = new_df.loc[(new_df['Loan_Status']=='Y'), 'LoanAmount'].fillna(mean_df['Y'])
```

**new_df.loc[new_df['Loan_Status'] == 'N', 'LoanAmount']** selects the '**LoanAmount**' values where '**Loan_Status**' is **'N'**, and **fillna(mean_df['N'])** fills the corresponding missing values with the mean for the **'N'** category from **mean_df**. The second line does the same for the **'Y'** category.

**Next let us see the count of missing values in each column.**

`new_df.isnull().sum()`

We can see that the count of null values in the **LoanAmount** column is zero, as we have filled them with the group means from **mean_df**.
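A more compact equivalent of the two **.loc** lines above is the **groupby().transform** idiom, which scales to any number of categories without hard-coding them (a sketch on toy data, not from the original tutorial):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Loan_Status': ['N', 'Y', 'N', 'Y'],
                    'LoanAmount': [100.0, np.nan, np.nan, 140.0]})
# transform('mean') broadcasts each group's mean back to the original rows,
# so fillna replaces every NaN with the mean of its own Loan_Status group
toy['LoanAmount'] = toy['LoanAmount'].fillna(
    toy.groupby('Loan_Status')['LoanAmount'].transform('mean'))
print(toy['LoanAmount'].tolist())  # [100.0, 140.0, 100.0, 140.0]
```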

**Next let us print each key (category) in the mean_df Series. This will be helpful when we have multiple categories.**

```
for val in mean_df.keys():
    print(val)
```

N

Y

**mean_df.keys()** returns an iterable of the keys (categories) in the **mean_df** Series.

**Next let us get the mean value for Loan_Amount_Term column.**

```
mean_df = df.groupby('Loan_Status')['Loan_Amount_Term'].mean()
mean_df
```

**df.groupby('Loan_Status')['Loan_Amount_Term'].mean()** groups the DataFrame **df** by the '**Loan_Status**' column and calculates the mean of the '**Loan_Amount_Term**' column for each group. The resulting Series, **mean_df**, contains the average '**Loan_Amount_Term**' value for each unique '**Loan_Status**' category.

**Next let us fill the missing values for the Loan_Amount_Term column with the group mean values.**

```
for val in mean_df.keys():
    new_df.loc[(new_df['Loan_Status']==val), 'Loan_Amount_Term'] = new_df.loc[(new_df['Loan_Status']==val), 'Loan_Amount_Term'].fillna(mean_df[val])
```

The loop **for val in mean_df.keys()** iterates over each key (category) in **mean_df**. **new_df.loc[new_df['Loan_Status'] == val, 'Loan_Amount_Term']** selects the '**Loan_Amount_Term**' values where '**Loan_Status**' matches the current category (**val**), and **fillna(mean_df[val])** fills the corresponding missing values with the mean for that category from **mean_df**.

**Next let us see the count of missing values in each column.**

`new_df.isnull().sum()`

**Next let us fill the missing values for the categorical column Self_Employed. First let us find the mode of the column.**

```
# fill missing value for categorical
mode_df = df.groupby('Loan_Status')['Self_Employed'].agg(pd.Series.mode)
mode_df
```

**df.groupby('Loan_Status')['Self_Employed']** groups the DataFrame **df** by the '**Loan_Status**' column and selects the '**Self_Employed**' column. The **agg(pd.Series.mode)** part calculates the mode for each category of '**Loan_Status**'. The resulting Series, **mode_df**, contains the mode of '**Self_Employed**' within each '**Loan_Status**' category.

**Next let us fill the missing values for Self_Employed column.**

```
for val in mode_df.keys():
    new_df.loc[(new_df['Loan_Status']==val), 'Self_Employed'] = new_df.loc[(new_df['Loan_Status']==val), 'Self_Employed'].fillna(mode_df[val])
```

The loop **for val in mode_df.keys()** iterates over each key (category) in **mode_df**. **new_df.loc[new_df['Loan_Status'] == val, 'Self_Employed']** selects the '**Self_Employed**' values where '**Loan_Status**' matches the current category (**val**), and **fillna(mode_df[val])** fills the corresponding missing values with the mode for that category from **mode_df**.

**Next let us see the count of missing values in each column.**

`new_df.isnull().sum()`

**6) Fill missing values using an ML model**

You can use machine learning algorithms to predict missing values based on other features in your dataset. This approach involves treating missing values as the target variable and training a model to predict their values based on the available data. Regression algorithms or advanced techniques like K-nearest neighbors (KNN) can be used for this purpose.
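For the KNN route mentioned above, scikit-learn provides a ready-made **KNNImputer** for numerical columns; a minimal sketch (this is an alternative, not the LightGBM approach used in this section):

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 6.0],
              [4.0, 8.0]])
# Each NaN is replaced by the average of that feature over the
# n_neighbors rows closest in the remaining features
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(np.isnan(X_filled).sum())  # 0
```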

**First let us copy the existing dataframe to a new dataframe.**

`new_df = df.copy()`

**Next let us select a few specific columns.**

```
new_df = new_df[['LoanAmount', 'Loan_Amount_Term', 'ApplicantIncome', 'CoapplicantIncome']]
new_df.head()
```

The first line selects the specified columns from **new_df** and assigns the result back to **new_df**, effectively creating a DataFrame that contains only those columns. The **new_df.head()** statement then displays the first few rows, allowing you to inspect the selected columns.

**Next let us check the length of the new dataframe.**

`len(new_df)`

614

**Next let us first drop the rows with missing values.**

```
col = "LoanAmount"
# fill numerical values
new_df_temp = new_df.dropna(subset=[col], axis=0)
print(col, len(new_df_temp))
```

LoanAmount 592

**new_df.dropna(subset=[col], axis=0)** drops the rows with missing values in the specified column (**col**) from **new_df**; the **subset=[col]** parameter restricts the check to that column. The result is stored in **new_df_temp**. The **print(col, len(new_df_temp))** statement then prints the column name and the number of rows remaining after the drop.

**Next let us get the input and output split**

```
# input and output split
X = new_df_temp.drop(columns=[col], axis=1)
y = new_df_temp[col]
```

**new_df_temp.drop(columns=[col], axis=1)** drops the column specified by **col** from **new_df_temp** to form the input feature matrix **X**. The line **y = new_df_temp[col]** assigns that column as the target variable **y**.

**Next let us fit the ML model**

```
from lightgbm import LGBMRegressor
model = LGBMRegressor(use_missing=False)
model.fit(X, y)
```

**LGBMRegressor(use_missing=False)** creates an instance of the **LightGBM** regressor. The **use_missing=False** parameter tells LightGBM not to apply its special handling of missing values during training. The **model.fit(X, y)** line fits the model on the input features **X** and the target variable **y**.

**Next let us populate with the predicted values from the model.**

```
d = {}
temp = new_df.drop(columns=[col], axis=1)
d[col] = list(model.predict(temp))
```

**d = {}** initializes an empty dictionary to store the predictions. **temp = new_df.drop(columns=[col], axis=1)** creates a temporary DataFrame that excludes the column **col**; it is used as the input features for prediction. **d[col] = list(model.predict(temp))** stores the predictions for every row as a list under the key **col**.

**Next let us fill the missing values in the column with the corresponding predicted values.**

```
i = 0
for val, d_val in zip(new_df[col], d[col]):
    if pd.isna(val):
        new_df.at[i, col] = d_val
    i += 1
```

**i** is initialized to **0** to track the row index. The loop **for val, d_val in zip(new_df[col], d[col])** iterates over the values of the **col** column in **new_df** together with the corresponding predictions from **d[col]**. Whenever a missing value (NaN) is encountered, **new_df.at[i, col] = d_val** assigns the predicted value at that position. By the end of the loop, the missing values in the **col** column of **new_df** have been replaced with the model's predictions.

**Next let us see the count of missing values in each column.**

`new_df.isnull().sum()`

**Next let us see few of the rows in the dataframe.**

`new_df.head()`

**Final Thoughts**

It's crucial to analyze the missing data pattern in your dataset. Determine if the missing values are random or if there is a systematic reason behind them. This understanding can help guide your approach to filling missing values.

The method of filling missing values depends on the type of variable. Numerical variables can be filled using statistical measures like mean, median, or regression-based imputation. Categorical variables can be filled with mode or by creating a new category to represent missing values.
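These per-type strategies can also be expressed through scikit-learn's **SimpleImputer**, which is convenient inside ML pipelines (a sketch on toy data; the column names are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

num = pd.DataFrame({'LoanAmount': [100.0, np.nan, 140.0]})
cat = pd.DataFrame({'Gender': ['Male', np.nan, 'Male']})

# strategy='mean' / 'median' for numerical, 'most_frequent' for categorical
num_filled = SimpleImputer(strategy='mean').fit_transform(num)
cat_filled = SimpleImputer(strategy='most_frequent').fit_transform(cat)
print(num_filled[1, 0], cat_filled[1, 0])  # 120.0 Male
```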

When filling missing values, be aware that the imputed values can introduce bias to the data. Ensure that the imputation method aligns with the characteristics and distribution of the variable. Evaluate the impact of imputation on your analysis and consider sensitivity analysis to assess the robustness of the results.

Utilize your domain knowledge or subject matter expertise to guide the imputation process. Consider any relevant information that may help inform the filling of missing values. Domain knowledge can provide insights into potential relationships between variables and guide the choice of imputation method.

Assess the quality and reliability of the imputed values. Compare imputed values with known values, perform cross-validation, or conduct statistical tests to ensure the imputed values align with the overall data distribution.

If the missing values are not completely random, you may want to consider advanced techniques like multiple imputation. Multiple imputation generates multiple imputed datasets, capturing the uncertainty associated with the missing values and allowing for more accurate estimates and statistical inference.
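scikit-learn's experimental **IterativeImputer** works in this spirit: it models each feature as a function of the others and iterates (though it returns a single completed dataset by default, unlike full multiple imputation). A minimal sketch:

```python
import numpy as np
# IterativeImputer is experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data where the second column is roughly twice the first
X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 6.0],
              [4.0, 8.0]])
imputer = IterativeImputer(random_state=0)
X_filled = imputer.fit_transform(X)
print(np.isnan(X_filled).sum())  # 0
```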

In this article we have explored basic to advanced techniques to fill the missing values in the dataset. However, the choice of method depends on the specific dataset, the nature of the missing values, and the analysis goals. It is essential to carefully analyze your data, understand the underlying patterns, and make informed decisions when filling missing values.

Get the project notebook from *here*

Thanks for reading the article!!!

Check out more project videos from the YouTube channel *Hackers Realm*