Text preprocessing is a crucial step in Natural Language Processing (NLP) that involves cleaning and transforming raw text data into a format that is more suitable for analysis and machine learning tasks in Python. The goal of text preprocessing is to remove noise, inconsistencies, and irrelevant information from the text, making it easier for algorithms to understand and work with the data.
Text preprocessing can vary depending on the specific task and the nature of the data. It's essential to understand the requirements of your NLP task and tailor your preprocessing steps accordingly. After preprocessing, the clean and structured text can be used for tasks like text classification, sentiment analysis, machine translation, and more.
You can watch the video-based tutorial with a step-by-step explanation down below.
Load the Dataset
First we will have to load the dataset.
import pandas as pd
import string
df = pd.read_csv('data/Twitter Sentiments.csv')
# drop the columns we will not use
df = df.drop(columns=['id', 'label'])
df.head()
The code snippet reads a CSV file named 'Twitter Sentiments.csv' into a Pandas DataFrame object named df.
Then we drop the columns named 'id' and 'label' from the DataFrame, as we will not use these columns.
Then we display the first few rows of the DataFrame using the head() function.
Let us see the different preprocessing techniques.
1) Convert to Lowercase
Let us convert the text into lowercase.
df['clean_text'] = df['tweet'].str.lower()
df.head()
The assignment df['clean_text'] = df['tweet'].str.lower() adds a new column named 'clean_text' to the DataFrame df. The values in this column are derived from the 'tweet' column using the .str.lower() method, which converts all text in the 'tweet' column to lowercase. This is a common preprocessing step to ensure consistent comparison and analysis of text data.
Next we will display the first few rows of the DataFrame to inspect the changes.
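As a quick standalone illustration (a made-up two-row Series, not the actual dataset), .str.lower() works element-wise:
pd.Series(["Good Morning", "NLP Tutorial"]).str.lower()
# 0    good morning
# 1    nlp tutorial
# dtype: object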
2) Removal of Punctuations
Next we will remove all the punctuations from the text data. First we will display the punctuations that are available.
string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
This is a string that contains all ASCII punctuation characters. It's often used for text processing and cleaning tasks, such as removing punctuation from text.
The string.punctuation string contains characters like '!', '?', ',', '.', and many others, as we can see here.
Let us create a function to remove the punctuations.
def remove_punctuations(text):
    punctuations = string.punctuation
    return text.translate(str.maketrans('', '', punctuations))
Define a function remove_punctuations(text) that takes a text input and uses the string.punctuation constant to remove all punctuation characters from the text.
punctuations = string.punctuation: This line assigns the string containing all ASCII punctuation characters to the variable punctuations.
return text.translate(str.maketrans('', '', punctuations)): This line uses the str.maketrans() method to create a translation table where all characters from punctuations are mapped to None, effectively removing them. Then, the str.translate() method is applied to the input text using the translation table, resulting in the text with all punctuation characters removed.
This is a useful function for cleaning text data by removing punctuation, which can be beneficial for various natural language processing tasks.
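For a quick sanity check, we can call the function on a standalone string (a made-up example, not a tweet from the dataset):
remove_punctuations("Hello, world! This is great.")
# 'Hello world This is great'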
Next let us display the text after applying the above defined function.
df['clean_text'] = df['clean_text'].apply(lambda x: remove_punctuations(x))
df.head()
df['clean_text'] : This accesses the 'clean_text' column of your DataFrame.
.apply(lambda x : remove_punctuations(x)): This applies the remove_punctuations() function to each element in the 'clean_text' column. The lambda function here is used to pass each element (text) to the remove_punctuations() function.
Finally display the first few rows of the DataFrame to inspect the changes made to the 'clean_text' column.
We can clearly see that all the punctuation has been removed in the clean_text column.
3) Removal of Stopwords
First we will import the stopwords.
from nltk.corpus import stopwords
", ".join(stopwords.words('english'))
"i, me, my, myself, we, our, ours, ourselves, you, you're, you've, you'll, you'd, your, yours, yourself, yourselves, he, him, his, himself, she, she's, her, hers, herself, it, it's, its, itself, they, them, their, theirs, themselves, what, which, who, whom, this, that, that'll, these, those, am, is, are, was, were, be, been, being, have, has, had, having, do, does, did, doing, a, an, the, and, but, if, or, because, as, until, while, of, at, by, for, with, about, against, between, into, through, during, before, after, above, below, to, from, up, down, in, out, on, off, over, under, again, further, then, once, here, there, when, where, why, how, all, any, both, each, few, more, most, other, some, such, no, nor, not, only, own, same, so, than, too, very, s, t, can, will, just, don, don't, should, should've, now, d, ll, m, o, re, ve, y, ain, aren, aren't, couldn, couldn't, didn, didn't, doesn, doesn't, hadn, hadn't, hasn, hasn't, haven, haven't, isn, isn't, ma, mightn, mightn't, mustn, mustn't, needn, needn't, shan, shan't, shouldn, shouldn't, wasn, wasn't, weren, weren't, won, won't, wouldn, wouldn't"
Import the stopwords module from the nltk.corpus package, which contains lists of stopwords for various languages. (If the stopwords corpus is not already available locally, run nltk.download('stopwords') once.)
Next, retrieve the list of English stopwords using the stopwords.words('english') function from NLTK and then use the ", ".join() method to join these stopwords into a single comma-separated string.
The output will display a comma-separated list of English stopwords. This can be useful for understanding what words are typically considered stopwords and might be removed from your text during preprocessing for text analysis tasks.
Let us create a function to remove stop words.
STOPWORDS = set(stopwords.words('english'))
def remove_stopwords(text):
    return " ".join([word for word in text.split() if word not in STOPWORDS])
Define a function named remove_stopwords(text) that takes a text input, splits it into words, and then removes stopwords from the text using NLTK's stopwords list.
STOPWORDS = set(stopwords.words('english')): This line initializes a set called STOPWORDS with the English stopwords obtained from NLTK's stopwords corpus. Using a set makes the membership check fast.
return " ".join([word for word in text.split() if word not in STOPWORDS]): Within the function, this line splits the input text into words, iterates through each word, and checks if it's not in the set of stopwords (STOPWORDS). If the word is not a stopword, it's included in a list comprehension. Finally, the list of non-stopwords is joined back together using " ".join() to form a cleaned version of the text without stopwords.
Let us display the text after applying the above defined function.
df['clean_text'] = df['clean_text'].apply(lambda x: remove_stopwords(x))
df.head()
df['clean_text']: This selects the 'clean_text' column of the DataFrame df.
.apply(lambda x: remove_stopwords(x)): The .apply() function is used to apply a given function (in this case, a lambda function) to each element of the selected column. The lambda function takes each element x (a piece of text), and the remove_stopwords function is applied to it.
Finally display the first few rows of the modified DataFrame.
We can observe that stopwords like 'a', 'for', 'your', etc. are removed.
4) Removal of Frequent Words
Let us first count the frequency of words in the 'clean_text' column of the DataFrame.
from collections import Counter
word_count = Counter()
for text in df['clean_text']:
    for word in text.split():
        word_count[word] += 1
word_count.most_common(10)
[('user', 17473),
('love', 2647),
('day', 2198),
('happy', 1663),
('amp', 1582),
('im', 1139),
('u', 1136),
('time', 1110),
('life', 1086),
('like', 1042)]
from collections import Counter: This imports the Counter class from the collections module. The Counter class is used to count the occurrences of elements in a collection.
word_count = Counter(): This initializes an empty Counter object called word_count to store the word frequencies.
for text in df['clean_text']: This iterates through each element in the 'clean_text' column of the DataFrame df.
for word in text.split(): This splits the current text into words and iterates through each word.
word_count[word] += 1: This increments the count of the current word in the word_count Counter.
word_count.most_common(10): This returns a list of the 10 most common words along with their frequencies in descending order.
The result of word_count.most_common(10) will be a list of tuples, where each tuple contains a word and its corresponding frequency. It will show the top 10 most frequent words in the 'clean_text' column.
Let us create a function to remove frequent words.
FREQUENT_WORDS = set(word for (word, wc) in word_count.most_common(3))
def remove_freq_words(text):
    return " ".join([word for word in text.split() if word not in FREQUENT_WORDS])
Define a function remove_freq_words that takes a text as input and removes the words that appear in the set of frequent words.
FREQUENT_WORDS = set(word for (word, wc) in word_count.most_common(3)): This line creates a set called FREQUENT_WORDS containing the 3 most common words from the word_count Counter (only the words are kept; the counts are discarded).
def remove_freq_words(text): This line defines a function remove_freq_words that takes a text as input.
return " ".join([word for word in text.split() if word not in FREQUENT_WORDS]): This line splits the input text into words and then creates a list comprehension. The list comprehension iterates through each word in the split text and includes only those words that are not present in the FREQUENT_WORDS set. The filtered words are then joined back into a space-separated string using " ".join().
The purpose of the remove_freq_words function is to remove the most common words (as defined by FREQUENT_WORDS) from a given text. This could be useful for removing words that might not carry significant meaning due to their high frequency, such as stopwords.
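For this dataset, FREQUENT_WORDS works out to {'user', 'love', 'day'} based on the counts above, so a quick check on a made-up sentence behaves like this:
remove_freq_words("user had a lovely day full of love")
# 'had a lovely full of'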
Next let us display the text after applying the above defined function.
df['clean_text']=df['clean_text'].apply(lambda x: remove_freq_words(x))
df.head()
df['clean_text']: This selects the 'clean_text' column of the DataFrame df.
.apply(lambda x: remove_freq_words(x)): The .apply() function is used to apply the remove_freq_words function to each element in the 'clean_text' column. The lambda function takes each element x (a piece of text), and the remove_freq_words function is applied to it.
Finally display the first few rows of the modified DataFrame.
We can observe that the word 'user', which was repeated many times, has been removed.
5) Removal of Rare Words
Let us get the rare words in the text data.
RARE_WORDS = set(word for (word, wc) in word_count.most_common()[:-10:-1])
RARE_WORDS
{'airwaves',
'carnt',
'chisolm',
'ibizabringitonmallorcaholidayssummer',
'isz',
'mantle',
'shirley',
'youuuð\x9f\x98\x8dð\x9f\x98\x8dð\x9f\x98\x8dð\x9f\x98\x8dð\x9f\x98\x8dð\x9f\x98\x8dð\x9f\x98\x8dð\x9f\x98\x8dð\x9f\x98\x8dâ\x9d¤ï¸\x8f',
'ð\x9f\x99\x8fð\x9f\x8f¼ð\x9f\x8d¹ð\x9f\x98\x8eð\x9f\x8eµ'}
RARE_WORDS = set(word for (word, wc) in word_count.most_common()[:-10:-1]): This line creates a set called RARE_WORDS containing the 9 least common words from the word_count Counter (only the words, not their frequencies). The [:-10:-1] slice notation walks the list of most common words backwards from the end and stops before the 10th element from the end, effectively giving you the 9 least common words.
The purpose of creating the RARE_WORDS set is likely to identify and potentially handle words that are very infrequent in the text data. You might use these rare words for tasks like further analysis or special processing.
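The [:-10:-1] slice can look cryptic, so here is a tiny standalone demonstration of the same idea on a plain list:
nums = [1, 2, 3, 4, 5]
nums[:-3:-1]   # [5, 4] -> the last 2 elements, in reverse order
In the same way, word_count.most_common()[:-10:-1] walks backwards from the end of the frequency-sorted list and picks up the 9 least common words.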
Let us create a function to remove rare words.
def remove_rare_words(text):
    return " ".join([word for word in text.split() if word not in RARE_WORDS])
Define a function called remove_rare_words that takes a single argument text, which is assumed to be a string.
text.split(): This part of the code splits the input text into a list of words, using space as the delimiter. This is done by calling the split() method on the text string. The result is a list of words.
[word for word in text.split() if word not in RARE_WORDS]: This is a list comprehension that iterates over each word in the list of words obtained above. It checks whether the word is present in the RARE_WORDS set defined earlier. If the word is not rare, it is included in the new list being generated.
" ".join(...): This part of the code joins the filtered list of words back into a single string using a space as the separator. The join() method is called on the space string " ", with the list comprehension placed within the parentheses.
Next let us display the text after applying the above defined function.
df['clean_text']=df['clean_text'].apply(lambda x: remove_rare_words(x))
df.head()
Take the text in the 'clean_text' column.
Apply the remove_rare_words function to the text using the lambda function.
Update the value in the 'clean_text' column with the cleaned text.
As a result, the 'clean_text' column will be updated to contain the text with rare words removed.
6) Removal of Special characters
Let us create a function to remove special characters.
import re
def remove_spl_chars(text):
    # replace anything that is not a letter or digit with a space
    text = re.sub(r'[^a-zA-Z0-9]', ' ', text)
    # collapse multiple whitespace characters into a single space
    text = re.sub(r'\s+', ' ', text)
    return text
Define a function called remove_spl_chars that takes a single argument text, assumed to be a string. The purpose of this function is to remove special characters from the input text using regular expressions and then clean up any excessive whitespace.
re.sub('[^a-zA-Z0-9]', ' ', text): This line uses the re.sub() function from the re (regular expression) module to substitute any characters that are not alphabetic letters (both uppercase and lowercase) or digits with a space. The regular expression [a-zA-Z0-9] matches any alphanumeric character, and the caret ^ inside square brackets negates the character class. So, [^a-zA-Z0-9] matches any character that is not an alphanumeric character.
re.sub('\s+', ' ', text): This line uses the re.sub() function to substitute multiple consecutive whitespace characters with a single space. The regular expression \s+ matches one or more whitespace characters (including spaces, tabs, and line breaks) in succession.
return text: The function returns the modified text after the special characters have been removed and excessive whitespace has been cleaned up.
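A quick check on a standalone string (a made-up example) shows both substitutions in action:
remove_spl_chars("great day!!! #sunshine :) 100%")
# 'great day sunshine 100 '  (note the single trailing space left behind)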
Next let us display the text after applying the above defined function.
df['clean_text']=df['clean_text'].apply(lambda x: remove_spl_chars(x))
df.head()
The remove_spl_chars function is applied to the 'clean_text' column of the DataFrame using a lambda function and the apply() method.
This code will clean up the text in the 'clean_text' column by removing special characters and excessive whitespace.
7) Stemming
Stemming involves reducing words to their base or root form, which can help in text processing and analysis by simplifying variations of the same word. Let us create a function to stem words in the data.
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
def stem_words(text):
    return " ".join([ps.stem(word) for word in text.split()])
from nltk.stem.porter import PorterStemmer: This line imports the PorterStemmer class from the nltk.stem.porter module. The Porter stemming algorithm is used to transform words to their stemmed form.
ps = PorterStemmer(): This initializes an instance of the PorterStemmer class, which will be used to perform stemming on individual words.
def stem_words(text): The function stem_words takes a single argument text, which is assumed to be a string.
" ".join([ps.stem(word) for word in text.split()]): This line splits the input text into words using spaces as delimiters (text.split()), applies stemming to each word using the PorterStemmer instance (ps.stem(word)), and then joins the stemmed words back into a string using spaces as separators.
Next let us display the text after applying the above defined function.
df['stemmed_text'] = df['clean_text'].apply(lambda x: stem_words(x))
df.head()
This code will create a new column named 'stemmed_text' in the DataFrame df that contains the stemmed versions of the words in the 'clean_text' column.
Each row in the 'stemmed_text' column will correspond to the stemmed text of the corresponding row in the 'clean_text' column.
8) Lemmatization & POS Tagging
Lemmatization involves reducing words to their base or dictionary form, which can help in text processing and analysis by simplifying variations of the same word. Let us create a function to lemmatize the words in the text data.
from nltk import pos_tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
wordnet_map = {"N": wordnet.NOUN, "V": wordnet.VERB, "J": wordnet.ADJ, "R": wordnet.ADV}
def lemmatize_words(text):
    # find pos tags
    pos_text = pos_tag(text.split())
    return " ".join([lemmatizer.lemmatize(word, wordnet_map.get(pos[0], wordnet.NOUN)) for word, pos in pos_text])
from nltk import pos_tag: This line imports the pos_tag function from the NLTK library. This function is used to tag words with their part-of-speech (POS) information.
from nltk.corpus import wordnet: This line imports the wordnet module from the NLTK corpus. WordNet is a lexical database that provides information about words and their relationships.
from nltk.stem import WordNetLemmatizer: This line imports the WordNetLemmatizer class from the NLTK stem module. This class is used to perform lemmatization on words.
lemmatizer = WordNetLemmatizer(): This initializes an instance of the WordNetLemmatizer class, which will be used to perform lemmatization on individual words.
wordnet_map = {"N": wordnet.NOUN, "V": wordnet.VERB, "J": wordnet.ADJ, "R": wordnet.ADV}: This dictionary maps part-of-speech tags (N, V, J, R) from the pos_tag function to the corresponding constants from WordNet (wordnet.NOUN, wordnet.VERB, wordnet.ADJ, wordnet.ADV). These mappings are used to select the appropriate lemmatization rules based on the word's part of speech.
def lemmatize_words(text): The function lemmatize_words takes a single argument text, which is assumed to be a string.
pos_text = pos_tag(text.split()): This line applies the pos_tag function to the input text, splitting it into words first. It assigns part-of-speech tags to each word in the form of (word, pos) tuples.
" ".join(): This line joins the lemmatized words back into a string using spaces as separators.
Let us see the part_of_speech tag for Noun.
wordnet.NOUN
'n'
wordnet.NOUN is a constant provided by the NLTK WordNet module that represents the part of speech (POS) tag for a noun.
WordNet is a lexical database that organizes words into synonym sets (synsets) and provides various linguistic information, including part-of-speech tags.
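A quick standalone check of the tagger and the lemmatizer together (assuming the NLTK data for the POS tagger and WordNet has been downloaded, e.g. via nltk.download('averaged_perceptron_tagger') and nltk.download('wordnet')):
pos_tag("the children were running".split())
# [('the', 'DT'), ('children', 'NNS'), ('were', 'VBD'), ('running', 'VBG')]
lemmatize_words("the children were running")
# 'the child be run'
The exact tags may vary slightly between NLTK versions, but the idea is that the POS tag guides the lemmatizer to the correct base form.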
Next let us apply the lemmatize function that we have defined above.
df['lemmatized_text'] = df['clean_text'].apply(lambda x: lemmatize_words(x))
df.head()
This code will create a new column named 'lemmatized_text' in the DataFrame df that contains the lemmatized versions of the words in the 'clean_text' column.
Each row in the 'lemmatized_text' column will correspond to the lemmatized text of the corresponding row in the 'clean_text' column.
As we are not able to see much of the lemmatized text in the above output, we will just shuffle the DataFrame and display it again.
df.sample(frac=1).head(10)
The df.sample(frac=1) code is used to shuffle (randomly reorder) the rows of a DataFrame.
When you set frac to 1, it means you want to sample the entire DataFrame, and since the fraction is 1, you'll essentially get a shuffled version of the DataFrame.
Then display the first 10 rows of the shuffled DataFrame using .head(10).
We can see that words like 'feeling' and 'worried' are reduced to their base forms 'feel' and 'worry'.
9) Removal of URLs
First let us create a text with URL.
text = "https://www.hackersrealm.net is the URL of the channel Hackers Realm"
Next let us create a function to remove URLs in the text data.
def remove_url(text):
    return re.sub(r'https?://\S+|www\.\S+', '', text)
re.sub(r'https?://\S+|www\.\S+', '', text): This line uses the re.sub() function from the re module to substitute URLs with an empty string in the given text.
r'https?://\S+' matches URLs starting with either "http://" or "https://", followed by any non-whitespace characters (\S+).
| is the regex OR operator.
www\.\S+ matches URLs that start with "www." followed by any non-whitespace characters.
The regular expression pattern https?://\S+|www\.\S+ effectively matches and removes both "http://" and "https://" URLs, as well as URLs starting with "www."
The function returns the text with URLs removed.
Next let us apply the function that we have created above.
remove_url(text)
' is the URL of the channel Hackers Realm'
The remove_url function is applied directly to the text string defined above. The URL in the text has been removed, and only the remaining text is returned.
We can see that the URL https://www.hackersrealm.net has been removed from the text.
10) Removal of HTML Tags
Let us first create a text with html tags.
text = "<html><body> <h1>Hackers Realm</h1> <p>This is NLP text preprocessing tutorial</p> </body></html>"
Next let us define a function to remove HTML tags.
def remove_html_tags(text):
    return re.sub(r'<.*?>', '', text)
The remove_html_tags function uses regular expressions to remove HTML tags from a text. HTML tags are used in markup languages like HTML to define the structure and formatting of documents on the web.
re.sub(r'<.*?>', '', text): This line uses the re.sub() function from the re module to substitute HTML tags with an empty string in the given text.
r'<.*?>' is a regular expression that matches any sequence of characters between '<' and '>', effectively matching HTML tags.
The .*? inside the angle brackets is a non-greedy expression that matches the shortest possible sequence between the angle brackets.
The regular expression pattern r'<.*?>' will remove all HTML tags in the text.
The function returns the text with HTML tags removed.
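The effect of the non-greedy ? is easy to see on a small standalone string:
re.sub(r'<.*?>', '', '<b>bold</b> text')   # 'bold text' - each tag is matched separately
re.sub(r'<.*>', '', '<b>bold</b> text')    # ' text' - the greedy version matches from the first '<' to the last '>'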
Next let us apply the function that we have created above.
remove_html_tags(text)
' Hackers Realm This is NLP text preprocessing tutorial '
The remove_html_tags function is applied directly to the text string defined above. The HTML tags in the text have been removed, and only the plain text remains.
We can see that HTML tags like <html>, <body>, <h1>, etc. have been removed.
11) Spelling Correction
Let us create a sentence with wrong spellings.
text = 'natur is a beuty'
Next let us create a function to correct the spelling mistakes.
from spellchecker import SpellChecker
spell = SpellChecker()
def correct_spellings(text):
    corrected_text = []
    misspelled_text = spell.unknown(text.split())
    # print(misspelled_text)
    for word in text.split():
        if word in misspelled_text:
            corrected_text.append(spell.correction(word))
        else:
            corrected_text.append(word)
    return " ".join(corrected_text)
from spellchecker import SpellChecker: This line imports the SpellChecker class from the spellchecker library. This class provides methods to correct spelling errors.
spell = SpellChecker(): This initializes an instance of the SpellChecker class, which will be used to perform spelling corrections.
def correct_spellings(text): The function correct_spellings takes a single argument text, which is assumed to be a string.
misspelled_text = spell.unknown(text.split()): This line uses the unknown() method of the SpellChecker instance to find misspelled words in the text. The text.split() call splits the input text into a list of words.
The function then iterates through each word in the input text using a loop. If the word is in the set of misspelled words (if word in misspelled_text), it appends the corrected version of the word to the corrected_text list using spell.correction(word). If the word is not misspelled, it appends the original word to the corrected_text list.
The function returns the corrected_text list of words joined back into a string using spaces as separators.
Next let us apply the function that we have created above.
correct_spellings(text)
'nature is a beauty'
The correct_spellings function is applied directly to the text string defined above, and the corrected result is returned.
The spelling errors in the text have been corrected, and the corrected text is displayed.
Final Thoughts
Tokenization: Breaking down text into individual words or tokens is often the first step. This makes it easier to analyze and process text at a granular level.
Lowercasing: Converting all text to lowercase can help ensure consistent comparisons and reduce the vocabulary size.
Stopword Removal: Stopwords are common words (e.g., "and", "the", "is") that don't carry significant meaning and can be removed to reduce noise in the data.
Special Character Removal: Removing special characters, such as punctuation, can simplify the text and reduce noise. However, be cautious with some special characters that might carry meaning, like '@' in email addresses or '#' in hashtags.
Stemming and Lemmatization: Reducing words to their base form (stemming) or dictionary form (lemmatization) can help consolidate word variations and improve analysis. Stemming is faster but might not always produce actual words, whereas lemmatization preserves actual words but can be slower.
Removing URLs and HTML Tags: Cleaning up URLs and HTML tags is often necessary to focus on the textual content itself.
Spell Checking and Correction: Correcting spelling errors can enhance the quality of the text and downstream analyses.
Part-of-Speech Tagging: Identifying the part of speech of each word can help with tasks like lemmatization and understanding context.
Word Vectorization: Converting words into numerical representations (word vectors) is essential for many machine learning models. Techniques like TF-IDF and word embeddings are commonly used; a minimal TF-IDF sketch is shown just after this list.
Domain-specific Preprocessing: Depending on the application, you might need to consider domain-specific preprocessing steps, such as handling hashtags, mentions, or specific acronyms.
Order and Context: Be cautious about altering the order of words or sentences too much, as context is important in many NLP tasks.
Iterative Process: Text preprocessing is often an iterative process. After applying preprocessing, it's important to analyze the data again to ensure that no critical information has been lost or distorted.
Consider Data Size: Depending on your dataset's size, certain preprocessing steps might need to be adjusted. For very large datasets, some steps might need to be simplified for efficiency.
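As a pointer for the vectorization step mentioned in the list above, here is a minimal sketch using scikit-learn's TfidfVectorizer (scikit-learn is an assumption here; it is not used elsewhere in this tutorial):
from sklearn.feature_extraction.text import TfidfVectorizer
# fit a TF-IDF vectorizer on the cleaned tweets and build a sparse document-term matrix
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(df['clean_text'])
X.shape  # (number of tweets, number of terms kept)
The resulting matrix X can then be fed to a classifier for tasks like sentiment analysis.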
Text preprocessing decisions should be aligned with your specific task and goals. Different preprocessing steps might be more or less important depending on the nature of your data and the tasks you plan to perform. Always keep your end objectives in mind when deciding on the extent of preprocessing to apply.
Get the project notebook from here. Thanks for reading the article!!! Check out more project videos from the YouTube channel Hackers Realm.