Normalizing data is a common preprocessing step in machine learning. It refers to transforming numerical data into a standardized format, typically within a specific range or distribution. The goal is to bring different features or variables onto a common scale, enabling fair comparisons and improving the performance of machine learning algorithms. Two commonly used methods for normalization are Max Absolute Scaling and Min-Max Scaling.
In this project tutorial we will explore how to normalize data using max absolute scaling, min-max scaling and log transformation in Python. Normalization is especially important for data with an uneven distribution, and normalized data helps simpler algorithms capture information better.
Import Modules
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
import numpy as np
warnings.filterwarnings('ignore')
%matplotlib inline
pandas - used to perform data manipulation and analysis
seaborn - provides a high-level interface for creating attractive and informative statistical graphics. Seaborn is particularly useful for visualizing relationships between variables, exploring distributions, and presenting complex statistical analyses
matplotlib.pyplot - used for data visualization and graphical plotting
warnings - used to control and suppress warning messages that may be generated by the Python interpreter or third-party libraries during the execution of a Python program
numpy - used to perform a wide variety of mathematical operations on arrays
Import Data
Next we will read the data from the csv file
df = pd.read_csv('data/winequality.csv')
df.head()
The code snippet reads the CSV file 'winequality.csv' into a Pandas DataFrame named 'df' and then displays the first few rows of the DataFrame using the head() function
Next we will see the statistical summary of the DataFrame
df.describe()
The describe() function in Pandas provides a statistical summary of the DataFrame, including various descriptive statistics such as count, mean, standard deviation, minimum value, 25th percentile (Q1), median (50th percentile or Q2), 75th percentile (Q3), and maximum value for each numerical column in the DataFrame
Next let us create a plot of the free sulfur dioxide column in the DataFrame
sns.distplot(df['free sulfur dioxide'])
This will generate a distribution plot that displays the distribution of values in the 'free sulfur dioxide' column. The plot will include a histogram to visualize the frequency of different values and a smooth curve representing the kernel density estimate
Next let us create a plot of the alcohol column in the DataFrame
sns.distplot(df['alcohol'])
This will generate a distribution plot that displays the distribution of values in the 'alcohol' column. The plot will include a histogram to visualize the frequency of different values and a smooth curve representing the kernel density estimate
Next we will normalize the data. First we will use Max Absolute scaling to normalize the data
Max absolute scaling
Max Absolute Scaling scales the data based on the maximum absolute value of each feature. The formula to normalize a value using Max Absolute Scaling is normalized_value = value / max_abs_value
In this method, the maximum absolute value of each feature is determined separately, and every value of that feature is divided by it. The resulting values will be between -1 and 1; for a feature with no negative values, they will be between 0 and 1
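As a minimal sketch of the formula above, the helper below scales a toy Series that contains negative values, so the full -1 to 1 range is visible. The function name max_abs_scale and the sample values are illustrative assumptions and are not part of the wine dataset.

```python
import pandas as pd


def max_abs_scale(values: pd.Series) -> pd.Series:
    """Divide every value by the largest absolute value in the Series."""
    max_abs = values.abs().max()
    if max_abs == 0:
        # An all-zero column has nothing to scale; return it unchanged.
        return values.astype(float)
    return values / max_abs


# Hypothetical toy data with both negative and positive values.
toy = pd.Series([-4.0, -1.0, 0.0, 2.0, 8.0])
scaled = max_abs_scale(toy)
print(scaled.tolist())
```

The value with the largest magnitude (8.0) maps to 1, and every other value keeps its sign and its ratio to that maximum.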
Let us see how we can normalize the data for the columns free sulfur dioxide and alcohol
First we will create a copy of a dataframe
df_temp = df.copy()
Here we are creating a copy of the DataFrame df and assigning it to a new DataFrame called df_temp. This allows you to work with a separate copy of the data without modifying the original DataFrame df.
By using the copy() method, you create a deep copy of the DataFrame, meaning that any changes made to df_temp will not affect the original df DataFrame. This can be useful when you want to perform operations on the data or make modifications without altering the original dataset
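To illustrate why the copy matters, here is a small sketch on a hypothetical two-row DataFrame (not the wine data). It contrasts copy() with plain assignment, which only creates a second name for the same object.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the wine data.
original = pd.DataFrame({"alcohol": [9.4, 12.8]})

alias = original          # plain assignment: both names refer to the same object
copied = original.copy()  # independent copy (deep=True is the default)

# Scale only the copy.
copied["alcohol"] = copied["alcohol"] / copied["alcohol"].abs().max()

print(alias is original)             # the alias is not a separate DataFrame
print(original["alcohol"].tolist())  # raw values are untouched
print(copied["alcohol"].tolist())    # scaled values live only in the copy
```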
Next we will be normalizing the 'free sulfur dioxide' column in the DataFrame df_temp using the Max Absolute Scaling method
df_temp['free sulfur dioxide'] = df_temp['free sulfur dioxide'] / df_temp['free sulfur dioxide'].abs().max()
df_temp['free sulfur dioxide'].abs().max() calculates the maximum absolute value of the 'free sulfur dioxide' column. The abs() function is used to get the absolute values of each element in the column, and max() returns the maximum value
Each value in the 'free sulfur dioxide' column is then divided by the maximum absolute value obtained in the previous step
The result is assigned back to the 'free sulfur dioxide' column in df_temp, replacing the original values with the normalized values
Next let us create a plot of the normalized free sulfur dioxide column in the DataFrame
sns.distplot(df_temp['free sulfur dioxide'])
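Now we can see the data ranges from roughly 0 to 1. The maximum is exactly 1, while the minimum is the original minimum divided by the maximum, so it is exactly 0 only if the column contains a 0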
Next we will be normalizing the 'alcohol' column in the DataFrame df_temp using the Max Absolute Scaling method
df_temp['alcohol'] = df_temp['alcohol'] / df_temp['alcohol'].abs().max()
This applies the same formula that we used for the free sulfur dioxide column, this time with the maximum absolute value of the 'alcohol' column
Next let us create a plot of the normalized alcohol column in the DataFrame
sns.distplot(df_temp['alcohol'])
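We can see the data now ranges from about 0.5 to 1, with the minimum value around 0.55 to 0.6. Max absolute scaling does not move the minimum to 0, because it only divides by the maximum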
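The two columns were scaled one at a time above. As a hedged sketch, both can also be scaled in a single step, because abs().max() on a DataFrame returns one maximum per column and the division aligns on column names. The sample rows below are made up for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical sample rows; the column names match the tutorial's dataset.
sample = pd.DataFrame({
    "free sulfur dioxide": [10.0, 25.0, 50.0],
    "alcohol": [8.0, 12.0, 16.0],
})

cols = ["free sulfur dioxide", "alcohol"]
scaled = sample.copy()
# One maximum per column; each column is divided by its own maximum.
scaled[cols] = scaled[cols] / scaled[cols].abs().max()

print(scaled)
```

In this toy data the smallest alcohol value is half of the largest, so the scaled column starts at 0.5, which mirrors the behaviour seen in the plot.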
Now let us see how we can use Min-Max scaling to normalize the data
Min-Max Scaling
Min-Max Scaling scales the data between a specified range, typically between 0 and 1. The formula to normalize a value using Min-Max Scaling is normalized_value = (value - min_value) / (max_value - min_value)
In this method, the minimum and maximum values of each feature are identified. The minimum value is subtracted from each value, and the result is divided by the range (max_value - min_value). The resulting values will be between 0 and 1
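The formula can be sketched as a small helper. The name min_max_scale, the toy values, and the choice to return zeros for a constant column (where the range is zero and the formula would divide by zero) are assumptions made for illustration.

```python
import pandas as pd


def min_max_scale(values: pd.Series) -> pd.Series:
    """Shift by the minimum and divide by the range so values span 0 to 1."""
    lo, hi = values.min(), values.max()
    if hi == lo:
        # A constant column has zero range; return zeros instead of dividing by zero.
        return pd.Series(0.0, index=values.index, name=values.name)
    return (values - lo) / (hi - lo)


# Hypothetical toy data.
toy = pd.Series([8.0, 10.0, 12.0, 16.0])
scaled = min_max_scale(toy)
print(scaled.tolist())
```

The smallest value maps to 0 and the largest to 1, with everything else placed proportionally in between.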
Let us see how we can normalize the data using this method
First we will create a copy of the DataFrame df and assign it to a new DataFrame called df_temp
df_temp = df.copy()
Next we will be normalizing the 'alcohol' column in the DataFrame df_temp using the Min Max Scaling method
df_temp['alcohol'] = (df_temp['alcohol'] - df_temp['alcohol'].min()) / (df_temp['alcohol'].max() - df_temp['alcohol'].min())
df_temp['alcohol'].min() calculates the minimum value of the 'alcohol' column
df_temp['alcohol'].max() calculates the maximum value of the 'alcohol' column
(df_temp['alcohol'] - df_temp['alcohol'].min()) subtracts the minimum value from each value in the 'alcohol' column, translating the range of values to start from zero
(df_temp['alcohol'].max() - df_temp['alcohol'].min()) calculates the range of values by subtracting the minimum value from the maximum value
Finally, the shifted values from the subtraction step are divided by the range of values obtained in the previous step
The result is assigned back to the 'alcohol' column in df_temp, replacing the original values with the normalized values
Next let us create a plot of the normalized alcohol column in the DataFrame
sns.distplot(df_temp['alcohol'])
We can see that with the min-max scaling method the data now ranges from exactly 0 to 1
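To make the difference between the two methods concrete, the sketch below applies both to the same hypothetical values (not taken from the dataset) and places the results side by side.

```python
import pandas as pd

# Hypothetical alcohol-like values.
alcohol = pd.Series([8.0, 10.0, 12.0, 16.0], name="alcohol")

comparison = pd.DataFrame({
    "raw": alcohol,
    # Max absolute scaling: divide by the largest magnitude only.
    "max_abs": alcohol / alcohol.abs().max(),
    # Min-max scaling: shift to start at zero, then divide by the range.
    "min_max": (alcohol - alcohol.min()) / (alcohol.max() - alcohol.min()),
})
print(comparison)
```

Max absolute scaling keeps the ratios between values, so the minimum stays well above 0, while min-max scaling stretches the same values over the whole 0 to 1 interval.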
Log Transformation
Log transformation is a data transformation technique commonly used to reduce the skewness of data or to stabilize variance. It involves applying the logarithm function to the data values, which compresses large values and expands small values. This transformation can be useful when the data has a long tail or when the relationship between variables is better represented on a logarithmic scale
Let us see an example to demonstrate its use
First we will plot the 'total sulfur dioxide' column
sns.distplot(df['total sulfur dioxide'])
We can see that the distribution is right-skewed, with a long tail towards larger values
Next we will create a copy of the DataFrame df and assign it to a new DataFrame called df_temp
df_temp = df.copy()
Next we will apply log transformation
df_temp['total sulfur dioxide'] = np.log(df_temp['total sulfur dioxide']+1)
First, 1 is added to each value in the 'total sulfur dioxide' column. This avoids taking the logarithm of zero, which is undefined. Note that it only helps for non-negative data; values of -1 or below would still give an undefined result
Then the natural logarithm (base e) is applied to each value in the shifted 'total sulfur dioxide' column
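NumPy also provides np.log1p, which computes log(1 + x) directly, and np.expm1, which reverses it. The sketch below, on hypothetical values, checks that np.log1p matches the manual version used above and that the original values can be recovered.

```python
import numpy as np
import pandas as pd

# Hypothetical non-negative values, including a zero.
values = pd.Series([0.0, 9.0, 99.0, 999.0])

manual = np.log(values + 1)   # the form used in this tutorial
builtin = np.log1p(values)    # same transformation in one call
restored = np.expm1(builtin)  # inverse: exp(x) - 1

print(builtin.round(3).tolist())
print(np.allclose(manual, builtin), np.allclose(restored, values))
```

Being able to invert the transformation is useful when transformed values need to be reported back on the original scale.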
Next we will display the log transformed column
sns.distplot(df_temp['total sulfur dioxide'])
We can see that it has reduced the data range and also transformed the curve by reducing the skewness when compared to the plot without log transformation
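The reduction in skewness can also be checked numerically with the pandas skew() method instead of relying on the plot alone. This is a sketch on synthetic log-normal data standing in for the real column, so the printed numbers will differ from those of the wine dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic right-skewed data; a stand-in for df['total sulfur dioxide'].
raw = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=1000))
transformed = np.log(raw + 1)

skew_before = raw.skew()
skew_after = transformed.skew()
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
print(f"max before: {raw.max():.1f}, after: {transformed.max():.2f}")
```

A skewness close to 0 indicates a roughly symmetric distribution, while a large positive value indicates a long right tail.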
Final Thoughts
Normalizing data is a crucial step in data preprocessing and analysis. It helps to standardize the scale and range of variables, making them comparable and ensuring that no variable dominates the analysis based on its magnitude.
Normalization also facilitates the convergence of certain machine learning algorithms that rely on scaled inputs
When normalizing data, it is important to consider the characteristics of the data and the specific requirements of your analysis. Some normalization techniques may work better for certain types of data or algorithms
Additionally, it is crucial to handle outliers, missing values, and zero or negative values appropriately to ensure the accuracy and validity of the normalization process
In this project tutorial we have seen how to normalize data using max absolute scaling, min-max scaling and log transformation. In the future we can extend this project by exploring other methods that are available to normalize data
Thanks for reading the article!!!
Check out more project videos from the YouTube channel Hackers Realm