  • Hackers Realm

Detect and Remove Outliers in the Data | Machine Learning | Python

Explore the process of how to detect and remove outliers in data using Python for machine learning tasks. Gain insights into outlier detection techniques, such as statistical methods and visualization tools. Learn how to handle outliers by applying robust statistical measures and preprocessing techniques. Enhance your understanding of outlier impact on machine learning models and improve the accuracy and reliability of your predictions.

Detect and Remove Outliers

Outlier handling depends on the specific context and goals of your analysis, and there is no one-size-fits-all solution. It's important to note that the decision to remove outliers should be made judiciously and should be based on a thorough understanding of the data and the specific goals of your analysis. Removing outliers can alter the distribution and characteristics of your data, so it's crucial to consider the potential implications and document the choices made during the outlier detection and removal process.



You can watch the video-based tutorial with step by step explanation down below.


Load the Dataset


We will import the required libraries and read the data from the CSV file

# pandas for data handling, seaborn for plotting
import pandas as pd
import seaborn as sns

df = pd.read_csv('data/winequality.csv')
df.head()
  • The code snippet reads a CSV file named 'winequality.csv' into a Pandas DataFrame object named 'df' and then displays the first few rows of the DataFrame using the head() function

First 5 rows of the dataframe


Next we will see the statistical summary of the DataFrame

df.describe()
  • The describe() function in Pandas provides a statistical summary of the DataFrame, including various descriptive statistics such as count, mean, standard deviation, minimum value, 25th percentile (Q1), median (50th percentile or Q2), 75th percentile (Q3), and maximum value for each numerical column in the DataFrame

Statistical summary of the DataFrame
  • We will use the residual sugar column to detect and remove the outliers



Visualize the Data


Next we will plot the data

# distplot is deprecated in recent seaborn versions; histplot with kde=True gives the same view
sns.histplot(df['residual sugar'], kde=True)
  • This generates a distribution plot of the values in the 'residual sugar' column. The plot includes a histogram showing the frequency of different values and a smooth curve representing the kernel density estimate

Distribution plot of Residual Sugar column
  • The distribution is heavily right-skewed, with a long tail of large values, which suggests that outliers are present


Next we will use boxplot to see the outliers clearly

# to see outliers clearly
sns.boxplot(x=df['residual sugar'])
  • This code uses the sns.boxplot() function from the Seaborn library to create a box plot for the 'residual sugar' column in the DataFrame df

Box Plot of Residual Sugar column
  • The box represents the interquartile range (IQR), with the line inside representing the median. The whiskers extend to the minimum and maximum values within 1.5 times the IQR from the first and third quartiles. Any points outside of the whiskers are considered potential outliers



Methods to remove Outliers


There are several methods for removing outliers. Let us look at a few of them


Z-Score Method

  • The z-score method is a statistical technique that detects outliers by measuring how many standard deviations a data point lies from the mean. A common rule of thumb, used below, treats points more than 3 standard deviations from the mean as outliers
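As a minimal sketch of the idea, the z-score of every value can be computed directly and compared against 3. The data here is synthetic (an assumption for illustration, not the wine dataset), with one deliberately extreme value appended.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the 'residual sugar' column (assumed data, not the wine dataset)
rng = np.random.default_rng(0)
values = pd.Series(np.append(rng.normal(5.0, 1.0, 200), [25.0]), name='residual sugar')

# z-score: distance from the mean, measured in standard deviations
z_scores = (values - values.mean()) / values.std()

# flag points that lie more than 3 standard deviations from the mean
outlier_mask = z_scores.abs() > 3
print(values[outlier_mask])
```

Checking abs(z) > 3 is equivalent to checking whether a value falls outside mean ± 3 * std, which is how the limits are expressed in the rest of this section.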

First we will get the upper and lower limits

# find the limits
upper_limit = df['residual sugar'].mean() + 3*df['residual sugar'].std()
lower_limit = df['residual sugar'].mean() - 3*df['residual sugar'].std()
print('upper limit:', upper_limit)
print('lower limit:', lower_limit)
  • This code snippet calculates the upper limit as the mean plus three times the standard deviation (mean + 3 * std) and the lower limit as the mean minus three times the standard deviation (mean - 3 * std)

  • These limits define a range beyond which data points are considered outliers based on the z-score method

Upper and lower limits for the Z-score method
  • These are the upper and lower limits that we will use to identify outliers



Next let us find outliers using the limits

# find the outliers
df.loc[(df['residual sugar'] > upper_limit) | (df['residual sugar'] < lower_limit)]
  • The code snippet uses the upper and lower limits calculated earlier to identify outliers in the 'residual sugar' column of the DataFrame df. It uses boolean indexing to filter the DataFrame and select rows where the 'residual sugar' values are outside the calculated limits

  • The .loc[] method is used to access the rows in df that meet the specified condition

  • The condition (df['residual sugar'] > upper_limit) | (df['residual sugar'] < lower_limit) checks whether the 'residual sugar' values are greater than the upper limit or less than the lower limit, indicating outliers

  • The resulting DataFrame contains only the rows where the 'residual sugar' value is an outlier

Dataframe with outliers


Next we will trim the outliers. Trimming is a data transformation technique where outliers are removed or "trimmed" from the dataset, rather than replacing or imputing their values. Trimming involves setting a threshold or cutoff value, and any data points exceeding this threshold are removed from the dataset

# trimming - delete the outlier data
new_df = df.loc[(df['residual sugar'] <= upper_limit) & (df['residual sugar'] >= lower_limit)]
print('before removing outliers:', len(df))
print('after removing outliers:',len(new_df))
print('outliers:', len(df)-len(new_df))
  • The code snippet performs trimming by removing the outlier data from the DataFrame df based on the upper and lower limits calculated earlier

  • It creates a new DataFrame named new_df that contains only the rows with 'residual sugar' values within the calculated limits

  • The code calculates the length of df before removing outliers using len(df)

  • It then calculates the length of new_df after removing outliers using len(new_df)

  • Finally, it calculates the number of outliers removed by subtracting the length of new_df from the length of df

  • By printing these values, you can see the number of rows in df before and after removing outliers, as well as the count of outliers that were removed
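The same trimming filter can also be written with pandas' Series.between(), which is inclusive on both ends by default and so matches the <= and >= comparisons. This is only a sketch: the tiny DataFrame and the limits below are made up for illustration.

```python
import pandas as pd

# Hypothetical data and limits, chosen only to illustrate the filter
df = pd.DataFrame({'residual sugar': [1.2, 2.5, 3.1, 40.0, 2.0]})
lower_limit, upper_limit = 0.0, 10.0

# keep only the rows whose value lies within [lower_limit, upper_limit]
new_df = df[df['residual sugar'].between(lower_limit, upper_limit)]

print('before removing outliers:', len(df))
print('after removing outliers:', len(new_df))
print('outliers:', len(df) - len(new_df))
```

Note that the trimmed DataFrame keeps the original index labels; calling reset_index(drop=True) renumbers the rows if that matters for later steps.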

Count of dataframe before and after removing outliers using Z-Score method


Next let us plot the data after trimming outliers

sns.boxplot(x=new_df['residual sugar'])
Boxplot of residual sugar column after removing outliers


Next we will perform capping. Capping, which is closely related to Winsorization, handles outliers by replacing any value beyond a threshold with the threshold value itself. Here the thresholds are the upper and lower limits calculated above. This reduces the impact of outliers on the dataset without removing any rows

# capping - change the outlier values to upper (or) lower limit values
new_df = df.copy()
new_df.loc[(new_df['residual sugar']>=upper_limit), 'residual sugar'] = upper_limit
new_df.loc[(new_df['residual sugar']<=lower_limit), 'residual sugar'] = lower_limit
  • This code performs capping by changing the outlier values in the 'residual sugar' column to the upper or lower limit values

  • new_df is created as a copy of the original DataFrame df, so df itself is left unchanged

  • The upper_limit and lower_limit values are the ones calculated earlier from the mean and standard deviation of the 'residual sugar' column

  • The .loc[] method is then used to identify the rows where the 'residual sugar' values exceed the upper limit or fall below the lower limit

  • The corresponding outlier values are replaced with the upper or lower limit values using the assignment statement

  • By performing capping in this way, the outlier values in the 'residual sugar' column are replaced with the specified upper or lower limit values, effectively bringing them within the desired range
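Capping can also be expressed in a single step with pandas' Series.clip(), which replaces values below the lower bound or above the upper bound with the bound itself. The data and limits in this sketch are hypothetical.

```python
import pandas as pd

# Hypothetical data and limits, chosen only to illustrate capping
df = pd.DataFrame({'residual sugar': [-5.0, 1.2, 2.5, 40.0]})
lower_limit, upper_limit = 0.0, 10.0

# work on a copy so the original DataFrame is not modified
new_df = df.copy()
new_df['residual sugar'] = new_df['residual sugar'].clip(lower=lower_limit, upper=upper_limit)

print(new_df['residual sugar'].tolist())
print(len(df), len(new_df))
```

Values already inside the limits are untouched, and the number of rows stays the same.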



Next let us plot the data after performing capping

sns.boxplot(x=new_df['residual sugar'])
Boxplot of residual sugar column after Capping
  • Here we have not deleted any data; we have only capped the outlier values. We can check this by printing the length of the data

len(new_df)

6497

  • The length of the new DataFrame is 6497, the same as the original DataFrame


Inter Quartile Range Method

  • The Interquartile Range (IQR) method is another statistical technique used to detect and handle outliers in a dataset. The IQR represents the range between the first quartile (Q1) and the third quartile (Q3) of a dataset


First let us calculate the first quartile (Q1), third quartile (Q3), and interquartile range (IQR) of the 'residual sugar' column in the DataFrame df

q1 = df['residual sugar'].quantile(0.25)
q3 = df['residual sugar'].quantile(0.75)
iqr = q3-q1
  • In this code, q1 is calculated as the value at the 25th percentile (first quartile) of the 'residual sugar' column using the .quantile() function with a parameter of 0.25

  • Similarly, q3 is calculated as the value at the 75th percentile (third quartile)

  • Finally, iqr is computed as the difference between q3 and q1, representing the interquartile range

  • By calculating the Q1, Q3, and IQR, you obtain important descriptive statistics that can help in understanding the spread and distribution of the 'residual sugar' data.

  • These values are commonly used in the Interquartile Range (IQR) method for outlier detection and other data analysis techniques


q1, q3, iqr

(1.8, 8.1, 6.3)

  • These are the values of Q1, Q3, and IQR for the 'residual sugar' data in your DataFrame


Next let us calculate the upper and lower limit using the Interquartile Range (IQR) method

upper_limit = q3 + (1.5 * iqr)
lower_limit = q1 - (1.5 * iqr)
lower_limit, upper_limit
  • upper_limit is computed by adding 1.5 times the IQR to Q3 (q3 + (1.5 * iqr)), while lower_limit is calculated by subtracting 1.5 times the IQR from Q1 (q1 - (1.5 * iqr))

  • By printing these values, you can obtain the specific lower and upper limits that define the range within which data points are considered non-outliers according to the IQR method

(-7.6499999999999995, 17.549999999999997)
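The lower limit here is negative, and a sugar concentration should not be below zero, so in practice only the upper limit is likely to flag rows for this column. The calculation can be wrapped in a small helper so it is easy to reuse on other columns. The function name iqr_limits and the sample values are hypothetical, introduced only for this sketch.

```python
import pandas as pd

def iqr_limits(series, k=1.5):
    """Return (lower, upper) limits placed k * IQR below Q1 and above Q3."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical sample with one extreme value
sample = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])
lower, upper = iqr_limits(sample)

# values outside the limits are the potential outliers
outliers = sample[(sample < lower) | (sample > upper)]
print(lower, upper)
print(outliers.tolist())
```

For this sample Q1 is 3.0 and Q3 is 7.0, so the IQR is 4.0 and the limits are -3.0 and 13.0; only the value 100.0 falls outside them.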



Next let us plot the data

sns.boxplot(x=df['residual sugar'])
Box plot of residual sugar column with outliers


Next we will find the outliers using upper and lower limit calculated earlier

# find the outliers
df.loc[(df['residual sugar'] > upper_limit) | (df['residual sugar'] < lower_limit)]
Dataframe with outliers


Next let us perform trimming of the outliers

# trimming - delete the outlier data
new_df = df.loc[(df['residual sugar'] <= upper_limit) & (df['residual sugar'] >= lower_limit)]
print('before removing outliers:', len(df))
print('after removing outliers:',len(new_df))
print('outliers:', len(df)-len(new_df))
  • The code snippet performs outlier removal using the calculated upper and lower limits based on the Interquartile Range (IQR) method. It creates a new DataFrame named new_df that includes only the rows with 'residual sugar' values within the calculated limits

  • By printing the lengths of df and new_df, you can see the number of rows in the DataFrame before and after removing outliers. Additionally, the difference in lengths (len(df) - len(new_df)) gives you the count of outliers that were removed

Count of dataframe before and after removing outliers using Inter Quartile Range Method


Next let us plot the data after trimming outliers

sns.boxplot(x=new_df['residual sugar'])
Boxplot of residual sugar column after trimming


Next let us perform capping of the outliers

# capping - change the outlier values to upper (or) lower limit values
new_df = df.copy()
new_df.loc[(new_df['residual sugar']>upper_limit), 'residual sugar'] = upper_limit
new_df.loc[(new_df['residual sugar']<lower_limit), 'residual sugar'] = lower_limit
  • The code snippet performs capping by replacing the outlier values in the 'residual sugar' column of the DataFrame df with the upper or lower limit values. The DataFrame new_df is created as a copy of the original DataFrame df, and the outlier values are modified accordingly


Next let us plot the data after performing capping

sns.boxplot(x=new_df['residual sugar'])
Boxplot of residual sugar column after capping


Percentile Method

  • The percentile method can be used to handle outliers in a dataset. The percentile method involves setting a threshold based on percentiles and capping or truncating the outlier values accordingly


First let us calculate the upper and lower limit

upper_limit = df['residual sugar'].quantile(0.99)
lower_limit = df['residual sugar'].quantile(0.01)
print('upper limit:', upper_limit)
print('lower limit:', lower_limit)
  • The quantile() function in pandas is used to calculate the desired percentiles of the 'residual sugar' column in the DataFrame df

  • upper_limit is calculated as the value at the 99th percentile (0.99) of the 'residual sugar' column, and lower_limit is calculated as the value at the 1st percentile (0.01)

  • By printing these values, you can obtain the specific upper and lower limits that define the range within which data points are considered non-outliers according to the percentile method. These limits are calculated based on the specified percentiles and can be used to handle outliers in the 'residual sugar' column of your dataset
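One property of the percentile method is worth keeping in mind: because the limits are defined by percentiles, roughly the same share of rows is flagged regardless of how the data is distributed. The sketch below illustrates this on synthetic, right-skewed data (an assumption for illustration, not the wine dataset).

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data (assumed, not the wine dataset)
rng = np.random.default_rng(42)
values = pd.Series(rng.exponential(scale=5.0, size=1000), name='residual sugar')

# limits at the 1st and 99th percentiles
lower_limit = values.quantile(0.01)
upper_limit = values.quantile(0.99)

# share of values that fall outside the limits
outside = (values < lower_limit) | (values > upper_limit)
print('share flagged:', outside.mean())
```

With 1st and 99th percentile limits, about 2% of the values end up outside the limits by construction, so the method always flags some rows even when no value is unusual.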

Upper and lower limits using the percentile method


Next let us plot the data

sns.boxplot(x=df['residual sugar'])
Boxplot of residual sugar column before trimming or capping outliers


Next we will find the outliers using upper and lower limit calculated earlier

# find the outliers
df.loc[(df['residual sugar'] > upper_limit) | (df['residual sugar'] < lower_limit)]
Dataframe with outliers


Next let us perform trimming of the outliers

# trimming - delete the outlier data
new_df = df.loc[(df['residual sugar'] <= upper_limit) & (df['residual sugar'] >= lower_limit)]
print('before removing outliers:', len(df))
print('after removing outliers:',len(new_df))
print('outliers:', len(df)-len(new_df))
  • The code snippet filters the DataFrame df based on the upper and lower limits calculated using the percentile method. It creates a new DataFrame named new_df that includes only the rows with 'residual sugar' values within the calculated limits

Count of dataframe before and after removing outliers using percentile method


Next let us plot the data after trimming outliers

sns.boxplot(x=new_df['residual sugar'])
Boxplot of residual sugar column after trimming using Percentile method


Next let us perform capping of the outliers

# capping - change the outlier values to upper (or) lower limit values
new_df = df.copy()
new_df.loc[(new_df['residual sugar']>upper_limit), 'residual sugar'] = upper_limit
new_df.loc[(new_df['residual sugar']<lower_limit), 'residual sugar'] = lower_limit
  • This code performs capping by replacing the outlier values in the 'residual sugar' column with the upper or lower limit values. The DataFrame new_df is created as a copy of the original DataFrame df, and the outlier values are modified in the copy


Next let us plot the data after performing capping

sns.boxplot(x=new_df['residual sugar'])
Boxplot of residual sugar column after capping using Percentile method


Next let us plot the distribution for both the original and the capped DataFrame

sns.histplot(df['residual sugar'], kde=True)
Distribution plot before removing outliers


sns.histplot(new_df['residual sugar'], kde=True)
Distribution plot after capping outliers



Final Thoughts

  • Outliers are data points that deviate significantly from the majority of the dataset, and they can have a significant impact on statistical measures and model performance

  • It's crucial to have a good understanding of the data and the domain in which it is collected. Outliers may arise due to various reasons, such as measurement errors, data entry mistakes, or genuinely rare events. Understanding the nature of the data helps in making informed decisions about whether to remove or retain outliers

  • There are several methods available for detecting outliers, including statistical techniques like z-score, modified z-score, and box plots. These methods help identify observations that fall outside a certain threshold. Additionally, domain-specific knowledge and visual exploration of the data can also aid in outlier detection.

  • Outliers can significantly influence statistical measures such as mean, variance, and correlation coefficients. Therefore, it is essential to assess the impact of outliers on the analysis and decide whether their presence distorts the results. Sometimes, outliers may contain valuable information, and removing them can lead to biased or inaccurate conclusions

  • Once outliers are detected, the next step is to decide how to handle them. Common approaches are removing them (trimming), replacing them with limit values (capping), and treating them separately

  • While outlier detection and removal can be valuable, it is essential to exercise caution and be aware of potential pitfalls such as overzealous removal, reduced sample size and statistical power, and the choice of outlier definition

  • In summary, detecting and removing outliers should be approached with careful consideration of the data, domain knowledge, and the goals of the analysis. It is a crucial step in data preprocessing, but it requires judgment and an understanding of the potential impact on subsequent analyses or models
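To make the point about statistical measures concrete, here is a small sketch with made-up numbers showing how a single extreme value shifts the mean and standard deviation while the median barely moves.

```python
import pandas as pd

# Made-up values for illustration
clean = pd.Series([2.0, 3.0, 4.0, 5.0, 6.0])
with_outlier = pd.concat([clean, pd.Series([100.0])], ignore_index=True)

# compare summary statistics with and without the extreme value
for name, s in [('clean', clean), ('with outlier', with_outlier)]:
    print(name, '| mean:', s.mean(), '| median:', s.median(), '| std:', round(s.std(), 2))
```

The mean moves from 4.0 to 20.0, while the median only moves from 4.0 to 4.5. This is also why limits based on the mean and standard deviation are themselves influenced by the outliers they are meant to detect.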

In this article we have explored how to detect and remove outliers using the Z-score method, the Inter Quartile Range method, and the Percentile method, and we have seen how to perform trimming and capping with each of these methods.



Get the project notebook from here


Thanks for reading the article!!!


Check out more project videos from the YouTube channel Hackers Realm
