
Word Embedding using GloVe | Feature Extraction | NLP | Python

GloVe, which stands for Global Vectors for Word Representation, is a popular word embedding technique that captures semantic relationships between words in a vector space. It is designed to address some limitations of traditional methods like Word2Vec. GloVe produces word embeddings by analyzing the global co-occurrence statistics of words in a large corpus. In this tutorial, we will use pre-trained GloVe vectors in Python to obtain word embeddings for any sentence.

Word Embedding using GloVe

GloVe is also an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.



You can watch the video-based tutorial with a step-by-step explanation below.


Pre-process the Data


In this example we use pre-trained GloVe embeddings with 100-dimensional vectors (glove.6B.100d.txt), which contain all the word embeddings we need. You can download the file from the link below.


First, we will pre-process the dataset.

import pandas as pd
import re
from nltk.corpus import stopwords  # requires the NLTK stopwords corpus: nltk.download('stopwords')

# load the Twitter sentiment dataset
df = pd.read_csv('data/Twitter Sentiments.csv')
# drop the columns that are not needed for feature extraction
df = df.drop(columns=['id', 'label'], axis=1)

# convert the tweets to lowercase
df['clean_text'] = df['tweet'].str.lower()

# remove common English stopwords
STOPWORDS = set(stopwords.words('english'))
def remove_stopwords(text):
    return " ".join([word for word in text.split() if word not in STOPWORDS])
df['clean_text'] = df['clean_text'].apply(lambda x: remove_stopwords(x))

# remove special characters and collapse extra whitespace
def remove_spl_chars(text):
    text = re.sub('[^a-zA-Z0-9]', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return text
df['clean_text'] = df['clean_text'].apply(lambda x: remove_spl_chars(x))

df.head()
First 5 rows of the DataFrame
  • First import the necessary libraries, including Pandas for data manipulation, NLTK for text processing, and regular expressions (re) for text cleaning.

  • Next read the CSV file containing Twitter sentiment data into a Pandas DataFrame. The columns 'id' and 'label' are dropped from the DataFrame using the drop() method, as they are not needed for the preprocessing steps.

  • Next convert the text in the 'tweet' column to lowercase using the .str.lower() method. This ensures that text is consistent and not case-sensitive.

  • Next define a function remove_stopwords() that takes a text input, splits it into words, and then removes words that are in the NLTK stopwords set. Stopwords are common words like "the", "and", "is", etc. that are often removed to reduce noise in text data. The apply() method is used to apply this function to the 'clean_text' column.

  • Next define a function remove_spl_chars() that uses regular expressions to remove any characters that are not alphanumeric. This includes punctuation and special characters. The function also replaces consecutive white spaces with a single space. Again, the apply() method is used to apply this function to the 'clean_text' column.

  • Next use the apply() method to apply both the remove_stopwords() and remove_spl_chars() functions to the 'clean_text' column, further preprocessing the text.

  • Finally use the .head() method to display the first few rows of the DataFrame after preprocessing.
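As a quick sanity check, you can run the two cleaning functions on a sample string. The sentence below is just an illustrative example, not a tweet from the dataset, and the exact output depends on your NLTK stopword list:

# illustrative example only; 'the' and 'so' are removed as stopwords
sample = "Loving the new phone!!! So fast & smooth :)"
cleaned = remove_spl_chars(remove_stopwords(sample.lower()))
print(cleaned)
# 'loving new phone fast smooth ' (the substitution leaves a trailing space)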


Import Modules

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import numpy as np
  • keras.preprocessing.text - This module provides utilities for preprocessing and handling text data specifically for deep learning tasks.

  • keras.preprocessing.sequence - This module provides various tools for working with sequences, especially for preparing data for input into deep learning models.

  • numpy - It provides support for working with arrays, matrices, and various mathematical operations, making it a powerful tool for numerical computations.


Tokenize the text


Next, we will tokenize the cleaned text.

# tokenize text
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df['clean_text'])

word_index = tokenizer.word_index
vocab_size = len(word_index)
vocab_size

39085

  • Tokenization is the process of splitting text into individual words or tokens. The Tokenizer class in Keras allows you to do this easily. You've created a Tokenizer object.

  • You then fit the tokenizer on the cleaned text data in your DataFrame. This step builds the vocabulary and assigns an index to each unique word in the text.

  • tokenizer.fit_on_texts(df['clean_text']) means that the tokenizer learns the mapping between words and indices based on the frequency of words in the dataset.

  • After fitting the tokenizer, you can access the word index, which is a dictionary that maps words to their corresponding indices. This dictionary allows you to convert words to their corresponding integer indices.

  • Finally, you calculate the size of the vocabulary by getting the length of the word_index dictionary.

  • The vocab_size variable now holds the number of unique words in your text data, which is essential for setting up the embedding layer in your deep learning model.

  • 39085 is the number of unique words in your text data.
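To get a feel for this mapping, you can peek at a few entries of word_index. The exact indices depend on the word frequencies in your copy of the dataset; Keras assigns lower indices to more frequent words:

# first few entries of the word-to-index mapping (most frequent words first)
list(word_index.items())[:10]

# index assigned to a specific word, if it appears in the corpus
word_index.get('love')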


Next, let us find the maximum length of the cleaned text.

max(len(data) for data in df['clean_text'])
  • len(data) for data in df['clean_text']: This is a generator expression that iterates over each entry in the 'clean_text' column of your DataFrame and calculates its length. Since the entries are strings, the length is measured in characters rather than tokens.

  • The max function then returns the largest of these lengths, giving you the length of the longest cleaned tweet in the 'clean_text' column.
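If you want the maximum length in tokens instead, which is what the maxlen argument of pad_sequences actually limits, a small sketch like the following would compute it from the tokenized sequences:

# maximum number of tokens in any cleaned tweet (an alternative way to choose maxlen)
token_lengths = [len(seq) for seq in tokenizer.texts_to_sequences(df['clean_text'])]
max(token_lengths)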


Padding the data


Next, we will pad the text data.

# padding text data
sequences = tokenizer.texts_to_sequences(df['clean_text'])
padded_seq = pad_sequences(sequences, maxlen=131, padding='post', truncating='post')
  • You've already fit the Tokenizer on the cleaned text data, and now you're using the tokenizer to convert the text data into sequences of integer indices.

  • The resulting sequences variable holds a list of sequences, where each sequence is a list of integer indices corresponding to the words in the original text.

  • The pad_sequences function ensures that all sequences have the same length. This is important because many machine learning models, especially neural networks, require fixed-length inputs.

  • padded_seq is a 2D NumPy array where each row corresponds to a padded sequence. The maxlen parameter specifies the desired length of the padded sequences.

  • You've chosen to pad and truncate sequences after the content, which is why both the padding and truncating parameters are set to 'post'.
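You can verify the result by checking the shape of the array; it should have one row per tweet and 131 columns:

padded_seq.shape   # (number of tweets in the dataset, 131)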


Next, let us access the first padded sequence in the padded_seq array.

padded_seq[0]

array([    1,    28, 15330,  2630,  6365,   184,  7786,   385,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0])

  • padded_seq is a 2D NumPy array containing your preprocessed text data with sequences padded to a common length. Each row of the array represents a padded sequence.


Word Embedding


First we will create the embedding index.

# create embedding index
embedding_index = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embedding_index[word] = coefs
  • embedding_index: This is a dictionary where keys are words, and values are their corresponding pre-trained word embeddings. This dictionary will be used to look up embeddings for words in your text data.

  • with open('glove.6B.100d.txt', encoding='utf-8') as f: This line opens the GloVe file for reading. Make sure that the file is in the correct directory or provide the correct path to it.

  • for line in f: This loop iterates through each line in the GloVe file.

  • values = line.split(): This splits the line into a list of values, where the first value is the word, and the rest are the components of the word's embedding.

  • word = values[0]: This extracts the word from the values list.

  • coefs = np.asarray(values[1:], dtype='float32'): This converts the remaining values into a NumPy array of float32 data type, which represents the word's embedding components.

  • embedding_index[word] = coefs: This adds the word as a key to the embedding_index dictionary and associates it with its corresponding embedding vector.
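After loading, embedding_index holds one vector per word in the GloVe vocabulary (400,000 words for the glove.6B files). It can be useful to check how much of your tweet vocabulary is covered by the pre-trained vectors; this is a small sketch, with the variable name covered chosen just for illustration:

# count how many words from the tweet vocabulary have a pre-trained GloVe vector
covered = sum(1 for word in word_index if word in embedding_index)
print(f"{covered} of {vocab_size} vocabulary words have GloVe vectors "
      f"({covered / vocab_size:.1%} coverage)")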


Next let us access the pre-trained GloVe word embedding vector for the word "good" from the embedding_index dictionary that you've created.

embedding_index['good']

array([-0.030769 ,  0.11993  ,  0.53909  , -0.43696  , -0.73937  ,
       -0.15345  ,  0.081126 , -0.38559  , -0.68797  , -0.41632  ,
       -0.13183  , -0.24922  ,  0.441    ,  0.085919 ,  0.20871  ,
       -0.063582 ,  0.062228 , -0.051234 , -0.13398  ,  1.1418   ,
        0.036526 ,  0.49029  , -0.24567  , -0.412    ,  0.12349  ,
        0.41336  , -0.48397  , -0.54243  , -0.27787  , -0.26015  ,
       -0.38485  ,  0.78656  ,  0.1023   , -0.20712  ,  0.40751  ,
        0.32026  , -0.51052  ,  0.48362  , -0.0099498, -0.38685  ,
        0.034975 , -0.167    ,  0.4237   , -0.54164  , -0.30323  ,
       -0.36983  ,  0.082836 , -0.52538  , -0.064531 , -1.398    ,
       -0.14873  , -0.35327  , -0.1118   ,  1.0912   ,  0.095864 ,
       -2.8129   ,  0.45238  ,  0.46213  ,  1.6012   , -0.20837  ,
       -0.27377  ,  0.71197  , -1.0754   , -0.046974 ,  0.67479  ,
       -0.065839 ,  0.75824  ,  0.39405  ,  0.15507  , -0.64719  ,
        0.32796  , -0.031748 ,  0.52899  , -0.43886  ,  0.67405  ,
        0.42136  , -0.11981  , -0.21777  , -0.29756  , -0.1351   ,
        0.59898  ,  0.46529  , -0.58258  , -0.02323  , -1.5442   ,
        0.01901  , -0.015877 ,  0.024499 , -0.58017  , -0.67659  ,
       -0.040379 , -0.44043  ,  0.083292 ,  0.20035  , -0.75499  ,
        0.16918  , -0.26573  , -0.52878  ,  0.17584  ,  1.065    ],
      dtype=float32)

  • This code retrieves the pre-trained embedding vector for the word "good" from the embedding_index dictionary. The result is a 100-dimensional NumPy array containing the vector representation of "good" from the GloVe embeddings you've loaded.
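Because similar words get similar vectors, a quick way to see this semantic structure is to compare a few embeddings with cosine similarity. This is a minimal sketch using the loaded embedding_index; the word choices are arbitrary examples:

# cosine similarity between two embedding vectors
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embedding_index['good'], embedding_index['great']))  # relatively high
print(cosine_similarity(embedding_index['good'], embedding_index['car']))    # noticeably lower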


Next, let us create the embedding matrix.

# create embedding matrix
embedding_matrix = np.zeros((vocab_size+1, 100))
for word, i in word_index.items():
    embedding_vector = embedding_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
  • embedding_matrix = np.zeros((vocab_size + 1, 100)): This initializes an embedding matrix with zeros. The dimensions of the matrix are (vocab_size + 1, 100), where vocab_size + 1 accounts for the fact that the word indices start from 1, not 0. The size of the word embeddings is 100, which matches the size of the pre-trained GloVe embeddings you're using.

  • for word, i in word_index.items(): This loop iterates through each word and its corresponding index in the word_index dictionary, which you obtained from tokenizing your text data.

  • embedding_vector = embedding_index.get(word): This line attempts to retrieve the pre-trained embedding vector for the current word from the embedding_index dictionary.

  • if embedding_vector is not None: If a pre-trained embedding vector is found for the current word, the code enters this conditional block.

  • embedding_matrix[i] = embedding_vector: The code assigns the pre-trained embedding vector to the corresponding row in the embedding matrix at index i.


Next let us see the embedding matrix shape.

embedding_matrix.shape

(39086, 100)

  • The shape of the embedding_matrix that you've created using the code provided will be (vocab_size + 1, 100), where vocab_size represents the number of unique words in your vocabulary, and 100 is the dimensionality of the word embeddings you're using (assuming you're using the GloVe embeddings with 100 dimensions). The +1 in the vocab_size + 1 accounts for the fact that word indices start from 1, not 0.

  • Here, vocab_size + 1 is 39086 and the dimensionality is 100 as we are using GloVe Embeddings with 100 dimensions.
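With the matrix built, a typical next step is to use it as the initial weights of an Embedding layer in a Keras model. This is a minimal sketch rather than part of this tutorial's code; the model structure is only illustrative, and on newer Keras versions you may need embeddings_initializer=Constant(embedding_matrix) instead of the weights argument:

from keras.models import Sequential
from keras.layers import Embedding

model = Sequential()
model.add(Embedding(input_dim=vocab_size + 1,    # number of rows in the embedding matrix
                    output_dim=100,              # GloVe vector size
                    weights=[embedding_matrix],  # initialize with the pre-trained vectors
                    input_length=131,            # length of the padded sequences
                    trainable=False))            # keep the GloVe weights frozen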


Final Thoughts

  • GloVe embeddings excel at capturing semantic relationships between words. Words with similar meanings or contexts are located closer to each other in the embedding space. This property is essential for understanding the meaning of words in the context of a language.

  • GloVe offers pre-trained word embeddings on large corpora of text data, which can save you time and resources. These embeddings carry general semantic knowledge that can be leveraged in your specific NLP tasks without the need for extensive training.

  • Using pre-trained GloVe embeddings as initial weights for your embedding layer is a form of transfer learning. This can lead to faster convergence and better performance, especially if your dataset is small or if the pre-trained embeddings are derived from a similar domain.

  • GloVe embeddings can effectively reduce the dimensionality of your text data, making it more manageable for downstream tasks. This can be crucial for computational efficiency and avoiding overfitting.

  • One limitation of pre-trained embeddings like GloVe is that they might not include embeddings for all words. Out-of-vocabulary words may need to be handled separately through techniques like subword embeddings or handling them as special cases.

  • While GloVe embeddings capture semantic meaning, they don't consider contextual information within sentences. More recent methods like BERT or GPT-3 capture contextual information more effectively, which can be essential for certain tasks.

  • Depending on your task, it might be beneficial to fine-tune the pre-trained embeddings on your specific dataset. This can help align the embeddings more closely with the nuances of your task.

  • When working with GloVe embeddings, you need to choose parameters such as the embedding dimensionality and, if you train your own vectors, settings like the context window size. Proper hyperparameter tuning can impact the quality of your results.

In summary, GloVe word embeddings are a valuable tool for NLP tasks, providing an efficient way to represent words and their semantic relationships. By integrating pre-trained GloVe embeddings into your models, you can enhance their performance and achieve better results across various text analysis tasks. However, it's important to understand the limitations and consider the specific needs of your project when deciding how to use word embeddings effectively.



Get the project notebook from here


Thanks for reading the article!!!


Check out more project videos from the YouTube channel Hackers Realm
