Hackers Realm

# Feature Selection using Correlation Matrix (Numerical) | Machine Learning | Python

The correlation matrix measures the linear relationship between pairs of features in a dataset. Each entry indicates how strongly, and in which direction, two features are related. A correlation value ranges from -1 to 1, where -1 indicates a perfect negative linear correlation, 0 indicates no linear correlation, and 1 indicates a perfect positive linear correlation.
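To make the range concrete, here is a minimal sketch using made-up series (not the bike sharing data). It relies on pandas' `Series.corr`, which computes the Pearson coefficient by default.

```python
import pandas as pd

# Made-up illustrative data, not taken from the bike sharing dataset
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

perfect_positive = x.corr(2 * x + 1)     # y rises exactly linearly with x
perfect_negative = x.corr(-3 * x + 10)   # y falls exactly linearly with x

# A pattern with no linear trend: the values are symmetric around the middle of x
no_linear = x.corr(pd.Series([2.0, -1.0, -2.0, -1.0, 2.0]))

print(round(perfect_positive, 3), round(perfect_negative, 3), round(no_linear, 3))
```

The first two values sit at the two ends of the range, while the third is essentially zero even though the second series clearly depends on the first.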

Correlation-based feature selection is best suited to problems where the relationships between features are expected to be linear. If you suspect non-linear relationships, you may need to explore other methods, such as feature importance based on model performance or feature engineering techniques.
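A small sketch of why linearity matters, using made-up data: below, `y` is completely determined by `x`, yet the Pearson correlation is close to zero because the relationship is a parabola rather than a line.

```python
import numpy as np

# Made-up data: y is fully determined by x, but the relationship is a parabola
x = np.linspace(-1.0, 1.0, 101)
y = x ** 2

# Pearson correlation taken from the 2x2 correlation matrix returned by np.corrcoef
pearson_r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r between x and x^2: {pearson_r:.3f}")
```

A near-zero value here does not mean the feature is useless, only that a straight line does not describe the relationship.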


**Load the Dataset**

**First we will have to load the data**

```
import pandas as pd
# read the bike sharing data and preview the first rows
df = pd.read_csv('data/bike sharing dataset.csv')
df.head()
```

We read the CSV file **bike sharing dataset.csv**, located in the 'data' directory, into the DataFrame **df** using the **read_csv()** function.

The **head()** method is then called on **df** to display the first few rows of the DataFrame.
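Since this approach works on numerical features, it can help to check which columns were parsed as numbers right after loading. The sketch below uses a tiny in-memory CSV with hypothetical column names as a stand-in for the real file.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file; column names are for illustration only
csv_text = io.StringIO(
    "day,temp,cnt\n"
    "2011-01-01,0.24,16\n"
    "2011-01-02,0.22,40\n"
    "2011-01-03,0.26,32\n"
)
sample = pd.read_csv(csv_text)

print(sample.dtypes)  # shows which columns were parsed as numbers

# Keep only the numeric columns, which are the ones a correlation matrix can use
numeric_sample = sample.select_dtypes(include="number")
print(list(numeric_sample.columns))
```

The date-like column is read as text here, so it is left out of the numeric selection.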

**Finding Correlation Matrix**

**Next we will create a correlation matrix of the dataset**

```
# numeric_only=True skips non-numeric columns (available in pandas 1.5+)
corr = df.corr(numeric_only=True)
corr
```

The code snippet above calculates the correlation matrix for the numerical features in the DataFrame **df** and stores it in the variable **corr**. You can then display **corr** to see the correlation matrix, which shows the pairwise correlations between all of those features.
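Two properties make the matrix easier to read, and they can be checked on a small made-up DataFrame (the values below are invented for illustration): the diagonal is always 1, and the matrix is symmetric, so each pair appears twice.

```python
import numpy as np
import pandas as pd

# Made-up numeric data for illustration only
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.1, 9.8],
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
})
toy_corr = toy.corr()  # Pearson by default

# Every feature correlates perfectly with itself, so the diagonal is all ones
diagonal_is_one = np.allclose(np.diag(toy_corr.values), 1.0)

# corr(a, b) equals corr(b, a), so the matrix is symmetric
is_symmetric = np.allclose(toy_corr.values, toy_corr.values.T)

print(toy_corr.round(2))
print(diagonal_is_one, is_symmetric)
```

In practice this means only the cells on one side of the diagonal need to be inspected.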

**Display Correlation Matrix**

**Next we will display the correlation matrix as a heatmap, which makes it much easier to analyze**

```
import matplotlib.pyplot as plt
import seaborn as sns
# display correlation matrix in heatmap
corr = df.corr(numeric_only=True)
plt.figure(figsize=(14,9))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()
```

First we calculate the correlation matrix.

We set the figure size using **plt.figure(figsize=(14, 9))** to make the heatmap larger and easier to read.

Then we use **sns.heatmap()** to create the heatmap, passing the correlation matrix **corr** as the data. The **annot=True** argument adds the correlation values to the heatmap cells. The **cmap='coolwarm'** argument sets the color map for the heatmap.

Finally, we use **plt.show()** to display the heatmap.
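Because the matrix is symmetric, one optional variation is to hide the upper triangle so each pair is shown only once. This is a sketch on made-up random data, not part of the original tutorial; it uses the `mask` argument of `sns.heatmap`, and fixing `vmin`/`vmax` to -1 and 1 keeps the color scale centered on zero.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up numeric data for illustration only
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 4)), columns=["a", "b", "c", "d"])
toy_corr = toy.corr()

# True marks cells to hide: everything strictly above the diagonal
mask = np.triu(np.ones(toy_corr.shape, dtype=bool), k=1)

fig, ax = plt.subplots(figsize=(6, 4))
sns.heatmap(toy_corr, mask=mask, annot=True, fmt=".2f",
            cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Lower triangle of the correlation matrix")
```

In a script, calling `plt.show()` afterwards displays the figure; in a notebook it is rendered automatically.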

**cnt** is the target variable in this correlation matrix.

From the heatmap we can infer that the **casual** and **registered** attributes have a high correlation with the target variable.

Attributes with a high correlation are treated as important attributes, and with their help we can more easily predict the target variable.

Any attribute whose correlation with the target is above **+0.05** or below **-0.05** (that is, whose absolute value exceeds 0.05) has some importance for the target variable. Here you can see that the **temperature** attribute has a positive correlation of **0.4**.

Based on the hour attribute you can also predict how many vehicles will be rented by users in a particular hour.
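The threshold rule described above can be sketched in code. The data below is made up for illustration (only the column name `cnt` and the 0.05 cut-off come from the article), and the variable names are hypothetical.

```python
import pandas as pd

# Made-up data for illustration; only the name 'cnt' and the cut-off mirror the article
toy = pd.DataFrame({
    "temp":    [0.2, 0.4, 0.5, 0.7, 0.8, 0.9],
    "holiday": [0, 1, 0, 0, 1, 0],
    "hr":      [1, 5, 3, 8, 6, 9],
    "cnt":     [10, 25, 30, 48, 50, 62],
})
target = "cnt"
threshold = 0.05  # cut-off mentioned in the text

# Absolute correlation of every feature with the target, strongest first
target_corr = toy.corr()[target].drop(target).abs().sort_values(ascending=False)

# Keep features whose absolute correlation exceeds the threshold
selected = target_corr[target_corr > threshold].index.tolist()
print(target_corr.round(2))
print(selected)
```

Taking the absolute value matters because a strongly negative correlation is just as informative as a strongly positive one.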

You can also infer that the attributes **holiday**, **weekday** and **workingday** are not very important, as their correlations with the target lie between -0.05 and +0.05.

To eliminate some of the features among the input variables, you should look at the complete matrix, including the correlations between the features themselves. Here you can clearly see that **atemp** and **temp** have a correlation of **0.99**, which is a very high positive value. If two features have a correlation above **0.7** (in absolute value), you can drop either one of them, as both represent a similar pattern.

You can also see that **yr** (year) is highly correlated with **instant**. The **instant** attribute contains serial numbers, which are of little importance, so we can drop it.

**mnth** (month) is highly correlated with **season**, so you can drop either one of them.

You can study the correlation matrix more carefully and infer much more information from it.
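Dropping one feature from each highly correlated pair can be automated. The following is a minimal sketch, assuming a hypothetical helper named `drop_correlated_features` and made-up data; it scans the upper triangle of the absolute correlation matrix and flags any column that is too similar to an earlier one.

```python
import numpy as np
import pandas as pd

def drop_correlated_features(features: pd.DataFrame, threshold: float = 0.7) -> list:
    """Return column names to drop so that no remaining pair exceeds the threshold."""
    corr_abs = features.corr().abs()
    # Keep only the part above the diagonal so each pair is checked once
    upper = corr_abs.where(np.triu(np.ones(corr_abs.shape, dtype=bool), k=1))
    # A column is flagged if it is highly correlated with any earlier column
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# Made-up input features for illustration; 'atemp' closely tracks 'temp'
toy = pd.DataFrame({
    "temp":  [0.20, 0.40, 0.50, 0.70, 0.80, 0.90],
    "atemp": [0.22, 0.41, 0.48, 0.69, 0.77, 0.88],
    "hum":   [0.80, 0.30, 0.60, 0.40, 0.70, 0.50],
})
to_drop = drop_correlated_features(toy, threshold=0.7)
reduced = toy.drop(columns=to_drop)
print(to_drop)
print(list(reduced.columns))
```

This sketch always keeps the earlier column of a pair, so column order decides which one survives. As with **instant** above, domain knowledge is often a better guide to which of the two to keep.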

**Final Thoughts**

A correlation matrix allows you to quickly identify highly correlated features, which helps in spotting redundant or overlapping information.

By removing highly correlated features, you can reduce dimensionality, improve model interpretability, and potentially enhance model performance by reducing noise and overfitting.

Correlation matrix-based feature selection considers pairwise relationships, but it may not account for the combined influence of multiple features on the target variable.

Correlation analysis with the Pearson coefficient (the default of **corr()**) captures only linear relationships and can be distorted by outliers or heavily skewed data. If the relationship is not approximately linear, the correlation results may not be accurate or meaningful.

Features with a low correlation may still be important if they have non-linear or complex relationships with the target variable, which are not captured by correlation analysis alone.

It is important to consider domain knowledge, as well as the performance of the selected features in a chosen model, to ensure the most relevant and informative features are selected.
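The earlier point that pairwise correlations can miss the combined influence of several features can be illustrated with a tiny made-up example. Here `y` follows an XOR pattern: neither feature correlates with it on its own, yet the two together determine it exactly.

```python
import pandas as pd

# Made-up data: y is 1 exactly when x1 and x2 differ (an XOR pattern)
toy = pd.DataFrame({
    "x1": [0, 0, 1, 1],
    "x2": [0, 1, 0, 1],
})
toy["y"] = toy["x1"] ^ toy["x2"]

# Each feature alone shows no linear correlation with y
pairwise = toy.corr()["y"].drop("y")
print(pairwise)

# Yet the two features together determine y completely
jointly_determined = bool(((toy["x1"] != toy["x2"]) == (toy["y"] == 1)).all())
print(jointly_determined)
```

A correlation filter alone would discard both features here, which is why checking the selected features in an actual model is worthwhile.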

In summary, correlation matrix-based feature selection is a valuable technique to identify and remove highly correlated features. However, it should be used as part of a broader feature selection strategy, considering other methods and domain knowledge, to ensure a comprehensive and accurate selection of features for your specific machine learning problem.

In this article we explored how to perform feature selection using a correlation matrix. In future articles we will explore other methods of feature selection.

Get the project notebook from *here*

Thanks for reading the article!!!

Check out more project videos from the YouTube channel *Hackers Realm*