Feature Extraction using Term Frequency - Inverse Document Frequency (TF-IDF) | NLP | Python
Term Frequency-Inverse Document Frequency (TF-IDF) is a numerical representation widely used in natural language processing and information retrieval to evaluate the importance of a word in a document relative to a collection of documents (a corpus). In this tutorial we will compute it in Python.
It is a measure, used in the fields of information retrieval (IR) and machine learning, that quantifies the importance or relevance of string representations (words, phrases, lemmas, etc.) in a document within a collection of documents.
You can watch the video-based tutorial with step by step explanation down below.
Let us see how this technique works with the help of an example.
First, we will create some sample text data.
text_data = ['I am interested in NLP',
             'This is a good tutorial with good topic',
             'Feature extraction is very important topic']
Here, we have created a list of three sentences.
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english')
We import the TfidfVectorizer class from the sklearn.feature_extraction.text module of the scikit-learn library. This class generates TF-IDF feature vectors from a collection of text documents.
The TfidfVectorizer class accepts various parameters, and in this case it is configured with stop_words='english'. This setting tells the vectorizer to ignore a built-in list of common English words (such as "and," "the," "is," etc.) during the TF-IDF calculation, since they typically don't carry much specific meaning.
Fit the data
Next, we will fit the vectorizer on the data we created earlier.
# fit the data
tfidf.fit(text_data)
The fit method of TfidfVectorizer is called with the text_data variable as input. This step analyzes the text_data, learns the vocabulary (unique words) present in the data, and computes the Inverse Document Frequency (IDF) values for each term based on the entire corpus.
After fitting the data, you can use the trained TfidfVectorizer to transform new text data into TF-IDF vectors using the transform method.
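As a quick illustration of this, here is a minimal sketch of transforming an unseen sentence with the fitted vectorizer (the sentence 'NLP is a good topic' is an example of our own, not part of the tutorial's corpus):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

text_data = ['I am interested in NLP',
             'This is a good tutorial with good topic',
             'Feature extraction is very important topic']

tfidf = TfidfVectorizer(stop_words='english')
tfidf.fit(text_data)

# Transform unseen text using the vocabulary learned from text_data.
# Words that never appeared during fitting are silently ignored.
new_vectors = tfidf.transform(['NLP is a good topic'])
print(new_vectors.shape)  # one row, one column per learned term
```

Note that the new vector has the same number of columns as the fitted vocabulary, which keeps new documents comparable to the training corpus.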
Display the vocabulary list
Next we will get the vocabulary list.
# get the vocabulary list
tfidf.vocabulary_
You can access the learned vocabulary through the vocabulary_ attribute of the fitted TfidfVectorizer. It is a Python dictionary that maps each unique word (term) in the corpus to its corresponding column index (feature number) in the TF-IDF matrix; each key-value pair represents a unique word and its index.
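To see the mapping in index order, one way (a sketch of our own, not from the original notebook) is to sort the dictionary by its values; scikit-learn assigns indices to terms in alphabetical order, so for this corpus the indices should run from "extraction" up to "tutorial":

```python
from sklearn.feature_extraction.text import TfidfVectorizer

text_data = ['I am interested in NLP',
             'This is a good tutorial with good topic',
             'Feature extraction is very important topic']

tfidf = TfidfVectorizer(stop_words='english')
tfidf.fit(text_data)

# vocabulary_ maps each term to its column index;
# sorting by index shows the column layout of the TF-IDF matrix
for term, index in sorted(tfidf.vocabulary_.items(), key=lambda kv: kv[1]):
    print(index, term)
```

With stop words removed and text lowercased (the default), eight terms survive from the three sentences.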
Transform the data
Next we will transform the data.
tfidf_features = tfidf.transform(text_data)
tfidf_features
<3x8 sparse matrix of type '<class 'numpy.float64'>' with 9 stored elements in Compressed Sparse Row format>
The transform method of the TfidfVectorizer is called with the text_data variable as input. This step converts the input text data into TF-IDF feature vectors using the learned vocabulary and IDF values obtained during the fitting process.
The variable tfidf_features now holds the sparse matrix representation of the TF-IDF features for the input text_data.
The tfidf_features matrix is typically a sparse matrix since most text data will have a lot of zeros in their TF-IDF representations due to the sparsity of the vocabulary across the documents.
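When fitting and transforming the same corpus, the two steps can also be combined. This is a small sketch (not part of the original walkthrough) using fit_transform, which is equivalent to fit() followed by transform():

```python
from sklearn.feature_extraction.text import TfidfVectorizer

text_data = ['I am interested in NLP',
             'This is a good tutorial with good topic',
             'Feature extraction is very important topic']

tfidf = TfidfVectorizer(stop_words='english')

# fit_transform learns the vocabulary and IDF values, then transforms
# the same corpus in one call
tfidf_features = tfidf.fit_transform(text_data)
print(tfidf_features.shape)  # (3, 8)
print(tfidf_features.nnz)    # 9 stored (non-zero) elements
```

The shape and stored-element count match the sparse matrix shown above: three documents, eight vocabulary terms, nine non-zero scores.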
Visualize the TF-IDF features
To access the actual values of the tfidf_features matrix, you can convert it to a dense NumPy array using the toarray() method.
tfidf_feature_array = tfidf_features.toarray()
tfidf_feature_array
array([[0. , 0. , 0. , 0. , 0.70710678,
0.70710678, 0. , 0. ],
[0. , 0. , 0.84678897, 0. , 0. ,
0. , 0.32200242, 0.42339448],
[0.52863461, 0.52863461, 0. , 0.52863461, 0. ,
0. , 0.40204024, 0. ]])
The toarray() method is called on the sparse matrix tfidf_features to convert it into a dense NumPy array. This transformation converts the sparse matrix into a 2-dimensional array, where each row corresponds to a document, and each column represents the TF-IDF score of a specific term (word) in the vocabulary.
The variable tfidf_feature_array will hold the dense NumPy array representation of the TF-IDF features for the input text_data. Each element in this array represents the TF-IDF score of a term in a particular document.
Next let us print the values more clearly.
for sentence, feature in zip(text_data, tfidf_features):
    print(sentence)
    print(feature)
I am interested in NLP
(0, 5) 0.7071067811865476
(0, 4) 0.7071067811865476
This is a good tutorial with good topic
(0, 7) 0.42339448341195934
(0, 6) 0.3220024178194947
(0, 2) 0.8467889668239187
Feature extraction is very important topic
(0, 6) 0.4020402441612698
(0, 3) 0.5286346066596935
(0, 1) 0.5286346066596935
(0, 0) 0.5286346066596935
zip(text_data, tfidf_features): The zip function combines elements from text_data and tfidf_features into pairs. In this case, it pairs each document (sentence) from text_data with its corresponding TF-IDF feature vector from tfidf_features.
The for loop iterates over each pair (sentence, feature) obtained from zip(text_data, tfidf_features).
The output of this loop is each sentence from text_data followed by its TF-IDF feature vector in sparse (row, column) score notation. The row index is always 0 because iterating over the sparse matrix yields one-row matrices, one per document.
TF-IDF is an essential tool for text analysis tasks, such as document classification, information retrieval, sentiment analysis, text clustering, and keyword extraction. It allows us to represent text data in a meaningful and quantitative way, enabling machine learning algorithms to work effectively with textual information.
TF-IDF takes into account both the frequency of a term in a document (TF) and its rarity across the entire corpus (IDF). This approach assigns higher weights to terms that are frequent in a document but rare in the corpus, making them more discriminative and informative.
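The TF and IDF factors can be reproduced by hand. The following sketch (our own, using scikit-learn's default smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization, and the stopword-filtered, lowercased tokens of our three sentences) recovers the scores printed above:

```python
import math

# Tokens of the three sentences after lowercasing and stopword removal
corpus_tokens = [['interested', 'nlp'],
                 ['good', 'good', 'tutorial', 'topic'],
                 ['feature', 'extraction', 'important', 'topic']]

n_docs = len(corpus_tokens)
vocab = sorted({t for doc in corpus_tokens for t in doc})

# Smoothed IDF, as scikit-learn computes it by default:
# idf(t) = ln((1 + n) / (1 + df(t))) + 1
df_counts = {t: sum(t in doc for doc in corpus_tokens) for t in vocab}
idf = {t: math.log((1 + n_docs) / (1 + df_counts[t])) + 1 for t in vocab}

rows = []
for doc in corpus_tokens:
    # Raw term count times IDF, then L2-normalize the row
    weights = [doc.count(t) * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    rows.append([w / norm for w in weights])

print([round(w, 4) for w in rows[0]])  # matches the first row of the array
```

For the first sentence both surviving terms get the same weight, so after normalization each score is 1/sqrt(2) ≈ 0.7071, exactly as in the dense array.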
Text data often results in high-dimensional and sparse feature vectors due to the vast vocabulary and the presence of many rare terms in the corpus. As a result, TF-IDF representations are typically stored as sparse matrices to save memory and computational resources.
While TF-IDF is a valuable technique, it has some limitations. For example, it does not consider the semantic meaning of words, the order of terms within a document, or relationships between words. More advanced techniques, like word embeddings and deep learning models, are used to address these limitations in certain applications.
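Some local word order can still be captured within TF-IDF itself by using n-grams. This is a small sketch (our own addition) using the ngram_range parameter of TfidfVectorizer:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

text_data = ['I am interested in NLP',
             'This is a good tutorial with good topic',
             'Feature extraction is very important topic']

# ngram_range=(1, 2) adds word pairs such as "feature extraction"
# as features alongside single words, capturing some local word order
bigram_tfidf = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))
bigram_tfidf.fit(text_data)
print(sorted(bigram_tfidf.vocabulary_))
```

The vocabulary now contains both unigrams and bigrams, so a phrase like "feature extraction" becomes a feature in its own right.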
Preprocessing of text data is crucial before applying TF-IDF. Techniques such as tokenization, stemming, lemmatization, and lowercasing are commonly used to standardize and clean the text data.
Longer documents may have higher TF-IDF scores due to their higher term frequencies. To mitigate this, normalization techniques like dividing by the Euclidean norm are often used.
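In scikit-learn this normalization is already on by default via norm='l2', which scales every document vector to unit Euclidean length. A quick check (our own sketch) confirms it:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

text_data = ['I am interested in NLP',
             'This is a good tutorial with good topic',
             'Feature extraction is very important topic']

# norm='l2' (the default) scales each row to unit Euclidean length,
# so longer documents don't dominate purely by having more words
tfidf = TfidfVectorizer(stop_words='english', norm='l2')
dense = tfidf.fit_transform(text_data).toarray()
print(np.linalg.norm(dense, axis=1))  # each row norm is 1.0
```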
In summary, TF-IDF is a valuable and widely used technique for converting text data into numerical features while considering term importance and document relevance. However, it is just one of the many tools in the NLP toolbox, and its effectiveness depends on the specific use case and the complexity of the underlying text data. For more sophisticated tasks and to capture semantic relationships, more advanced models and techniques are often employed.
Get the project notebook from here
Thanks for reading the article!!!
Check out more project videos from the YouTube channel Hackers Realm