Hackers Realm
Image Caption Generator using Python | Flickr Dataset | Deep Learning Tutorial
Updated: Feb 8
Image caption generation is the process of recognizing the content of an image and describing it with a relevant caption, using deep learning and computer vision. This is an advanced deep learning project in which more than one model is needed to preprocess the data and produce the results.

In this project tutorial, we will build an image caption generator to load a random image and give some captions describing the image. We will use Convolutional Neural Network (CNN) for image feature extraction and Long Short-Term Memory Network (LSTM) for Natural Language Processing (NLP).
Dataset Information
The objective of the project is to predict a caption for an input image. The dataset consists of 8k images with 5 captions for each image. Features are extracted from both the image and the text captions and used as the model inputs.
The two sets of features are combined to predict the next word of the caption. A CNN is used for the image and an LSTM is used for the text. The BLEU score is used as the metric to evaluate the performance of the trained model.
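To make the BLEU metric mentioned above more concrete, here is a minimal pure-Python sketch of BLEU-1 (clipped unigram precision multiplied by a brevity penalty). The function names `ngrams`, `modified_precision` and `bleu1` are made up for this illustration; a real evaluation would normally rely on an established library implementation rather than this simplified version.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(references, candidate, n):
    """Clipped n-gram precision: a candidate n-gram is credited at most
    as many times as it appears in any single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

def bleu1(references, candidate):
    """BLEU-1 = brevity penalty * clipped unigram precision."""
    c = len(candidate)
    if c == 0:
        return 0.0
    # reference length closest to the candidate length
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * modified_precision(references, candidate, 1)

references = [
    'girl going into wooden building'.split(),
    'little girl climbing into wooden playhouse'.split(),
]
good = bleu1(references, 'girl going into wooden playhouse'.split())
weak = bleu1(references, 'dog going into building'.split())
print(good, weak)
```

A candidate whose words all appear in the references scores higher than one that introduces words the references do not contain, which is the behavior the metric is meant to reward.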
Download the Flickr dataset here
Import Modules
First, we have to import all the basic modules we will be needing for this project
import os
import pickle
import numpy as np
from tqdm.notebook import tqdm
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model
from tensorflow.keras.utils import to_categorical, plot_model
from tensorflow.keras.layers import Input, Dense, LSTM, Embedding, Dropout, add
os - used to handle files and paths using system commands
pickle - used to save the extracted numpy features to disk and load them back
numpy - used to perform a wide variety of mathematical operations on arrays
tqdm - progress bar decorator for iterators. Includes a default range iterator printing to stderr.
VGG16, preprocess_input - the pretrained VGG16 model and its matching input preprocessing function, used for feature extraction from the image data
load_img, img_to_array - used for loading the image and converting the image to a numpy array
Tokenizer - used for converting the text into sequences of integer tokens
pad_sequences - used to bring all sequences to the same length by filling the remaining positions with zeros
plot_model - used to visualize the architecture of the model as an image
Now we must set the directories to use the data
BASE_DIR = '/kaggle/input/flickr8k'
WORKING_DIR = '/kaggle/working'
Extract Image Features
We have to load and restructure the model
# load vgg16 model
model = VGG16()
# restructure the model
model = Model(inputs=model.inputs, outputs=model.layers[-2].output)
# summarize
print(model.summary())
Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/vgg16/vgg16_weights_tf_dim_ordering_tf_kernels.h5
553467904/553467096 [==============================] - 3s 0us/step
553476096/553467096 [==============================] - 3s 0us/step
Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         [(None, 224, 224, 3)]     0
_________________________________________________________________
block1_conv1 (Conv2D)        (None, 224, 224, 64)      1792
_________________________________________________________________
block1_conv2 (Conv2D)        (None, 224, 224, 64)      36928
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, 112, 112, 64)      0
_________________________________________________________________
block2_conv1 (Conv2D)        (None, 112, 112, 128)     73856
_________________________________________________________________
block2_conv2 (Conv2D)        (None, 112, 112, 128)     147584
_________________________________________________________________
block2_pool (MaxPooling2D)   (None, 56, 56, 128)       0
_________________________________________________________________
block3_conv1 (Conv2D)        (None, 56, 56, 256)       295168
_________________________________________________________________
block3_conv2 (Conv2D)        (None, 56, 56, 256)       590080
_________________________________________________________________
block3_conv3 (Conv2D)        (None, 56, 56, 256)       590080
_________________________________________________________________
block3_pool (MaxPooling2D)   (None, 28, 28, 256)       0
_________________________________________________________________
block4_conv1 (Conv2D)        (None, 28, 28, 512)       1180160
_________________________________________________________________
block4_conv2 (Conv2D)        (None, 28, 28, 512)       2359808
_________________________________________________________________
block4_conv3 (Conv2D)        (None, 28, 28, 512)       2359808
_________________________________________________________________
block4_pool (MaxPooling2D)   (None, 14, 14, 512)       0
_________________________________________________________________
block5_conv1 (Conv2D)        (None, 14, 14, 512)       2359808
_________________________________________________________________
block5_conv2 (Conv2D)        (None, 14, 14, 512)       2359808
_________________________________________________________________
block5_conv3 (Conv2D)        (None, 14, 14, 512)       2359808
_________________________________________________________________
block5_pool (MaxPooling2D)   (None, 7, 7, 512)         0
_________________________________________________________________
flatten (Flatten)            (None, 25088)             0
_________________________________________________________________
fc1 (Dense)                  (None, 4096)              102764544
_________________________________________________________________
fc2 (Dense)                  (None, 4096)              16781312
=================================================================
Total params: 134,260,544
Trainable params: 134,260,544
Non-trainable params: 0
_________________________________________________________________
None
The final prediction (classification) layer of the VGG16 model is not needed, so we take the output of the second-to-last layer (fc2), which gives a 4096-value feature vector for each image.
You may cut the model at a different layer if you prefer, but for quicker results avoid keeping layers you do not need.
Now we extract the image features and store them for the later steps
# extract features from image
features = {}
directory = os.path.join(BASE_DIR, 'Images')
for img_name in tqdm(os.listdir(directory)):
    # load the image from file
    img_path = directory + '/' + img_name
    image = load_img(img_path, target_size=(224, 224))
    # convert image pixels to numpy array
    image = img_to_array(image)
    # reshape data for model
    image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
    # preprocess image for vgg
    image = preprocess_input(image)
    # extract features
    feature = model.predict(image, verbose=0)
    # get image ID
    image_id = img_name.split('.')[0]
    # store feature
    features[image_id] = feature
The dictionary 'features' is created and filled with the extracted features, keyed by image ID
load_img(img_path, target_size=(224, 224)) - loads the image and resizes it to 224x224, the input size of the VGG16 model
image.reshape((1, image.shape[0], image.shape[1], image.shape[2])) - adds a leading batch dimension, so a single RGB image of shape (224, 224, 3) becomes a batch of shape (1, 224, 224, 3)
model.predict(image, verbose=0) - extraction of features from the image
img_name.split('.')[0] - strips the extension from the file name, leaving only the image ID
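As a side note on the image ID step: `img_name.split('.')[0]` keeps everything before the first dot, which is fine as long as file names contain a single dot. A minimal sketch of an alternative using the standard library's `os.path.splitext`, which strips only the final extension, is shown below. The helper name `image_id_from_name` and the file name `photo.v2.jpg` are hypothetical and used only for illustration.

```python
import os

def image_id_from_name(img_name):
    """Strip only the last extension from a file name."""
    return os.path.splitext(img_name)[0]

# A name with a single dot: both approaches agree.
single_dot = '1000268201_693b08cb0e.jpg'
print(image_id_from_name(single_dot), single_dot.split('.')[0])

# A hypothetical name with two dots: the approaches differ.
two_dots = 'photo.v2.jpg'
print(image_id_from_name(two_dots), two_dots.split('.')[0])
```

Whichever approach is used, the same one has to be applied when building the captions mapping, so that the keys of both dictionaries match.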
# store features in pickle
with open(os.path.join(WORKING_DIR, 'features.pkl'), 'wb') as f:
    pickle.dump(features, f)
The extracted features exist only in memory until they are saved, so without this step every new session has to repeat the slow extraction
Dumping the dictionary into a pickle file lets you reload it later and save time
# load features from pickle
with open(os.path.join(WORKING_DIR, 'features.pkl'), 'rb') as f:
    features = pickle.load(f)
Loading the stored features into your project is much quicker than extracting them again
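The save and load steps can be combined into a small caching pattern. The sketch below is only an illustration: `save_features`, `load_features` and `load_or_compute` are hypothetical helper names, and a tiny dictionary in a temporary directory stands in for the real feature dictionary and working directory.

```python
import os
import pickle
import tempfile

def save_features(features, path):
    """Write the features dictionary to disk; the file is closed on exit."""
    with open(path, 'wb') as f:
        pickle.dump(features, f)

def load_features(path):
    """Read a previously saved features dictionary."""
    with open(path, 'rb') as f:
        return pickle.load(f)

def load_or_compute(path, compute_fn):
    """Reuse the cached file if it exists, otherwise compute and cache."""
    if os.path.exists(path):
        return load_features(path)
    features = compute_fn()
    save_features(features, path)
    return features

# Small stand-in for the real {image_id: feature_array} dictionary.
demo_features = {'1000268201_693b08cb0e': [0.0, 1.5, 2.25]}
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'features.pkl')
    first = load_or_compute(demo_path, lambda: demo_features)  # computes and saves
    second = load_or_compute(demo_path, lambda: {})            # served from the file
    print(first == second)
```

Only unpickle files you created yourself, since loading a pickle from an untrusted source can execute arbitrary code.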
Load the Captions Data
Let us load the captions data from the text file
with open(os.path.join(BASE_DIR, 'captions.txt'), 'r') as f:
    # skip the header line
    next(f)
    captions_doc = f.read()
Now we split each line and attach the captions to their image
# create mapping of image to captions
mapping = {}
# process lines
for line in tqdm(captions_doc.split('\n')):
    # split the line by comma(,)
    tokens = line.split(',')
    if len(line) < 2:
        continue
    image_id, caption = tokens[0], tokens[1:]
    # remove extension from image ID
    image_id = image_id.split('.')[0]
    # convert caption list to string
    caption = " ".join(caption)
    # create list if needed
    if image_id not in mapping:
        mapping[image_id] = []
    # store the caption
    mapping[image_id].append(caption)
The dictionary 'mapping' is created with the image_id as key and the list of corresponding caption texts as value
The same image has multiple captions, so if image_id not in mapping: mapping[image_id] = [] creates an empty list the first time an image is seen, and each caption is then appended to it
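One detail of the loop above is that splitting on every comma and re-joining with spaces drops any commas that appear inside a caption. A minimal sketch of a variant that splits only on the first comma is shown below. The helper names `parse_caption_line` and `build_mapping` are made up for this illustration, and the second sample line is hypothetical rather than taken from the dataset.

```python
def parse_caption_line(line):
    """Split 'file.jpg,caption text' on the first comma only."""
    if ',' not in line:
        return None
    name, caption = line.split(',', 1)
    image_id = name.split('.')[0]
    return image_id, caption.strip()

def build_mapping(lines):
    """Group captions by image ID."""
    mapping = {}
    for line in lines:
        parsed = parse_caption_line(line)
        if parsed is None:
            continue  # skip blank or malformed lines
        image_id, caption = parsed
        mapping.setdefault(image_id, []).append(caption)
    return mapping

sample_lines = [
    '1000268201_693b08cb0e.jpg,A girl going into a wooden building .',
    '1000268201_693b08cb0e.jpg,A little girl climbing into a wooden playhouse .',
    'example_0001.jpg,A dog, a cat and a bird .',  # hypothetical line with commas
    '',
]
sample_mapping = build_mapping(sample_lines)
print(sample_mapping)
```

Here `setdefault` plays the same role as the explicit `if image_id not in mapping` check in the tutorial code.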
Now let us see the no. of images loaded
len(mapping)
8091
Preprocess Text Data
def clean(mapping):
    for key, captions in mapping.items():
        for i in range(len(captions)):
            # take one caption at a time
            caption = captions[i]
            # preprocessing steps
            # convert to lowercase
            caption = caption.lower()
            # intended to delete digits, special chars, etc.
            # note: str.replace() matches its first argument literally, not as a
            # regular expression, so this call and the next leave the caption unchanged
            caption = caption.replace('[^A-Za-z]', '')
            # intended to delete additional spaces
            caption = caption.replace(r'\s+', ' ')
            # drop one-character tokens (this removes 'a' and the trailing '.'),
            # then add start and end tags to the caption
            caption = 'startseq ' + " ".join([word for word in caption.split() if len(word)>1]) + ' endseq'
            captions[i] = caption
This function cleans and normalizes the caption text in place, for quicker processing and better results
Let us visualize the text before and after cleaning
# before preprocess of text
mapping['1000268201_693b08cb0e']
['A child in a pink dress is climbing up a set of stairs in an entry way .', 'A girl going into a wooden building .', 'A little girl climbing into a wooden playhouse .', 'A little girl climbing the stairs to her playhouse .', 'A little girl in a pink dress going into a wooden cabin .']
# preprocess the text
clean(mapping)
# after preprocess of text
mapping['1000268201_693b08cb0e']
['startseq child in pink dress is climbing up set of stairs in an entry way endseq', 'startseq girl going into wooden building endseq', 'startseq little girl climbing into wooden playhouse endseq', 'startseq little girl climbing the stairs to her playhouse endseq', 'startseq little girl in pink dress going into wooden cabin endseq']
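One-letter words (such as 'a') were deleted
Standalone punctuation tokens (such as the final '.') were deleted by the same one-character filter; characters inside longer words, such as the hyphen in 'tri-colored', remain
'startseq' and 'endseq' tags were added to indicate the start and end of a caption for easier processing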
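The `clean()` function calls `str.replace()` with regex-looking patterns, but `str.replace()` matches its argument literally, so those two calls have no effect. If regex-based cleaning is what you want, a minimal sketch using the standard `re` module could look like the following. The function name `clean_caption` is made up for this illustration, and this is a variation rather than the tutorial's code: it also strips hyphens and digits inside words, so the vocabulary size and outputs shown in this tutorial would change if you used it.

```python
import re

def clean_caption(caption):
    """Lowercase, keep only letters and whitespace, collapse spaces,
    drop one-character words, and add the start/end tags."""
    caption = caption.lower()
    # remove everything that is not a lowercase letter or whitespace
    caption = re.sub(r'[^a-z\s]', '', caption)
    # collapse runs of whitespace into a single space
    caption = re.sub(r'\s+', ' ', caption).strip()
    words = [word for word in caption.split() if len(word) > 1]
    return 'startseq ' + ' '.join(words) + ' endseq'

print(clean_caption('A girl going into a wooden building .'))
print(clean_caption('A black dog and a tri-colored dog playing .'))
```

Note that the pattern keeps whitespace (`\s`) on purpose: removing every non-letter character, as `[^A-Za-z]` would, also removes the spaces between words.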
Next we will store the preprocessed captions into a list
all_captions = []
for key in mapping:
    for caption in mapping[key]:
        all_captions.append(caption)
len(all_captions)
40455
Total no. of captions stored (8091 images x 5 captions each)
Let us see the first ten captions
all_captions[:10]
['startseq child in pink dress is climbing up set of stairs in an entry way endseq', 'startseq girl going into wooden building endseq', 'startseq little girl climbing into wooden playhouse endseq', 'startseq little girl climbing the stairs to her playhouse endseq', 'startseq little girl in pink dress going into wooden cabin endseq', 'startseq black dog and spotted dog are fighting endseq', 'startseq black dog and tri-colored dog playing with each other on the road endseq', 'startseq black dog and white dog with brown spots are staring at each other in the street endseq', 'startseq two dogs of different breeds looking at each other on the road endseq', 'startseq two dogs on pavement moving toward each other endseq']
Now we start processing the text data
# tokenize the text
tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_captions)
vocab_size = len(tokenizer.word_index) + 1
vocab_size
8485
Vocabulary size: the no. of unique words plus one, because index 0 is reserved for padding
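To show where the `+ 1` in `vocab_size` comes from, here is a simplified pure-Python sketch of how a word index can be built. It is not the Keras `Tokenizer` implementation (which also lowercases text and filters punctuation by default); `build_word_index` is a hypothetical helper that only mimics the idea of numbering words from 1 in order of frequency.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer starting at 1, most frequent first.
    Index 0 is left unused so it can serve as the padding value."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

sample_captions = [
    'startseq girl going into wooden building endseq',
    'startseq little girl climbing into wooden playhouse endseq',
]
word_index = build_word_index(sample_captions)
sample_vocab_size = len(word_index) + 1  # +1 for the reserved index 0

print(len(word_index), sample_vocab_size)
```

Because the largest index equals the number of unique words, an output layer that covers indices 0 through that maximum needs one more unit than there are words.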
# get maximum length of the caption available
max_length = max(len(caption.split()) for caption in all_captions)
max_length
35
The maximum caption length (in words) is used as the target length when padding the sequences.
Train Test Split
After preprocessing the data, we now split the image IDs into a training set and a test set
image_ids = list(mapping.keys())
split = int(len(image_ids) * 0.90)
train = image_ids[:split]
test = image_ids[split:]
Note: Depending on the data size, building all the training data at once can crash your session if you don't have enough memory on your system. Creating and loading the data in batches is very helpful if you have less than 16 GB of memory.
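The 90/10 split above takes the image IDs in the order they appear in the captions file. If you would rather randomize which images land in the test set, a minimal sketch with a fixed seed is shown below. This is a variation, not what the tutorial does: `split_ids`, the seed value and the placeholder IDs are all assumptions for illustration, and results obtained with a shuffled split will differ from the ones shown here.

```python
import random

def split_ids(image_ids, train_fraction=0.90, seed=42):
    """Shuffle a copy of the IDs reproducibly, then cut at the given fraction."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]

# Placeholder IDs standing in for the 8091 keys of the mapping dictionary.
demo_ids = ['img_%04d' % i for i in range(8091)]
demo_train, demo_test = split_ids(demo_ids)
print(len(demo_train), len(demo_test))
```

Using a dedicated `random.Random(seed)` instance keeps the split reproducible without touching the global random state.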
Explanatory example of the sequence split into input (X) and output (y) pairs
# startseq girl going into wooden building endseq
#        X                                  y
# startseq                                  girl
# startseq girl                             going
# startseq girl going                       into
# ...........
# startseq girl going into wooden building  endseq
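The same splitting can be written as a short function. This is a minimal sketch that works on words for readability; the training code applies the same idea to the integer sequence produced by the tokenizer. The name `make_pairs` is made up for this illustration.

```python
def make_pairs(tokens):
    """For each position i, the input is everything before i
    and the target is the token at i."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

caption_tokens = 'startseq girl going into wooden building endseq'.split()
pairs = make_pairs(caption_tokens)
for x, y in pairs:
    print(' '.join(x), '->', y)
```

A caption of n tokens yields n - 1 training pairs, which is why a single image contributes many samples to each batch.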
Now we will define a generator that builds the training data in batches and pads the input sequences
# create data generator to get data in batch (avoids session crash)
def data_generator(data_keys, mapping, features, tokenizer, max_length, vocab_size, batch_size):
    # loop over images
    X1, X2, y = list(), list(), list()
    n = 0
    while 1:
        for key in data_keys:
            n += 1
            captions = mapping[key]
            # process each caption
            for caption in captions:
                # encode the sequence
                seq = tokenizer.texts_to_sequences([caption])[0]
                # split the sequence into X, y pairs
                for i in range(1, len(seq)):
                    # split into input and output pairs
                    in_seq, out_seq = seq[:i], seq[i]
                    # pad input sequence
                    in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
                    # encode output sequence
                    out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
                    # store the sequences
                    X1.append(features[key][0])
                    X2.append(in_seq)
                    y.append(out_seq)
            # batch_size counts images, not individual (X, y) pairs
            if n == batch_size:
                X1, X2, y = np.array(X1), np.array(X2), np.array(y)
                yield [X1, X2], y
                X1, X2, y = list(), list(), list()
                n = 0
Padding brings every input sequence to the maximum caption length by filling the unused positions with zeros, so that all sequences in a batch have the same shape.
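To illustrate what the padding and one-hot encoding steps in the generator produce, here is a minimal pure-Python sketch. The helpers `pad_pre` and `one_hot` are hypothetical stand-ins that mimic the default behavior of `pad_sequences` (zeros added at the front, and overlong sequences cut from the front) and of `to_categorical`; they are not the Keras implementations.

```python
def pad_pre(seq, maxlen):
    """Left-pad with zeros up to maxlen; keep the last maxlen items if too long.
    Assumes maxlen is a positive integer."""
    seq = list(seq)[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

def one_hot(index, num_classes):
    """Vector of zeros with a single 1.0 at the given index."""
    vec = [0.0] * num_classes
    vec[index] = 1.0
    return vec

padded = pad_pre([4, 9, 2], 6)
target = one_hot(2, 5)
print(padded)
print(target)
```

Padding at the front keeps the most recent words next to the end of the sequence, and the zeros are ignored by the embedding layer when `mask_zero=True` is set.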
Model Creation
# encoder model
# image feature layers
inputs1 = Input(shape=(4096,))
fe1 = Dropout(0.4)(inputs1)
fe2 = Dense(256, activation='relu')(fe1)
# sequence feature layers
inputs2 = Input(shape=(max_length,))
se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
se2 = Dropout(0.4)(se1)
se3 = LSTM(256)(se2)
# decoder model
decoder1 = add([fe2, se3])
decoder2 = Dense(256, activation='relu')(decoder1)
outputs = Dense(vocab_size, activation='softmax')(decoder2)
model = Model(inputs=[inputs1, inputs2], outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam')
# plot the model
plot_model(model, show_shapes=True)

shape=(4096,) - length of the feature vector produced by the VGG16 model
Dense - fully connected layer
Dropout() - regularization that randomly sets a fraction of a layer's inputs to zero during training to reduce overfitting
model.compile() - configures the model for training with a loss function and an optimizer
loss='categorical_crossentropy' - loss function for one-hot encoded multi-class outputs
optimizer='adam' - optimizer that adapts the learning rate for each parameter during training
The model plot shows the image branch and the text branch being merged by add() into a single layer before the output
Feature extraction of the images was already done using VGG16, so no CNN layers are needed in this model.
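In the model code, `add([fe2, se3])` merges the two 256-value branches by element-wise addition, which is different from concatenating them. The following pure-Python sketch, with hypothetical helpers `add_merge` and `concat_merge` and 3-value vectors standing in for the 256-value ones, shows how the two options differ in output size.

```python
def add_merge(a, b):
    """Element-wise sum; both vectors must have the same length."""
    if len(a) != len(b):
        raise ValueError('add requires vectors of equal length')
    return [x + y for x, y in zip(a, b)]

def concat_merge(a, b):
    """Concatenation; the output length is the sum of both lengths."""
    return list(a) + list(b)

image_branch = [0.5, 1.0, 0.0]   # stand-in for the fe2 output
text_branch = [1.5, 2.0, 3.0]    # stand-in for the se3 output

added = add_merge(image_branch, text_branch)
joined = concat_merge(image_branch, text_branch)
print(len(added), len(joined))
```

Addition keeps the merged vector at the same size as each branch, which is why both branches end in layers with the same number of units.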
Now let us train the model
# train the model
epochs = 20
batch_size = 32
steps = len(train) // batch_size
for i in range(epochs):
    # create data generator
    generator = data_generator(train, mapping, features, tokenizer, max_length, vocab_size, batch_size)
    # fit for one epoch
    model.fit(generator, epochs=1, steps_per_epoch=steps, verbose=1)
227/227 [==============================] - 68s 285ms/step - loss: 5.2210
227/227 [==============================] - 66s 291ms/step - loss: 4.0199
227/227 [==============================] - 66s 292ms/step - loss: 3.5781
227/227 [==============================] - 65s 287ms/step - loss: 3.3090
227/227 [==============================] - 66s 292ms/step - loss: 3.1080
227/227 [==============================] - 65s 286ms/step - loss: 2.9619
227/227 [==============================] - 63s 276ms/step - loss: 2.8491
227/227 [==============================] - 64s 282ms/step - loss: 2.7516
227/227 [==============================] - 64s 282ms/step - loss: 2.6670
227/227 [==============================] - 65s 286ms/step - loss: 2.5966
227/227 [==============================] - 66s 290ms/step - loss: 2.5327
227/227 [==============================] - 61s 270ms/step - loss: 2.4774
227/227 [==============================] - 65s 288ms/step - loss: 2.4307
227/227 [==============================] - 66s 289ms/step - loss: 2.3873
227/227 [==============================] - 62s 274ms/step - loss: 2.3451
227/227 [==============================] - 65s 285ms/step - loss: 2.3081
227/227 [==============================] - 65s 288ms/step - loss: 2.2678
227/227 [==============================] - 66s 292ms/step - loss: 2.2323
227/227 [==============================] - 65s 285ms/step - loss: 2.1992
227/227 [==============================] - 66s 291ms/step - loss: 2.1702
steps = len(train) // batch_size - no. of batches drawn from the generator in one epoch (here 7281 // 32 = 227)
Loss decreases gradually over the epochs
Increasing the no. of epochs may give better results
Choose the no. of epochs and the batch size to balance training time against the quality of the results
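The 227 steps shown in the training log follow directly from the numbers used in this tutorial. The short worked example below reproduces the arithmetic; the variable names are chosen for illustration only.

```python
import math

num_images = 8091                         # len(mapping)
train_size = int(num_images * 0.90)       # images in the training split
batch_size = 32                           # images per generator batch

steps_floor = train_size // batch_size    # what the training loop uses
leftover = train_size % batch_size        # images not reached by those steps
steps_ceil = math.ceil(train_size / batch_size)

print(train_size, steps_floor, leftover, steps_ceil)
```

With integer division, the images left over after the last full batch are not used in an epoch. Since the generator is recreated at the start of every epoch and walks the training IDs in the same order, those are the same images each time.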
You can save the model in the working directory for reuse
# save the model
model.save(WORKING_DIR+'/best_model.h5')
Generate Captions for the Image
def idx_to_word(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None
Convert the predicted index from the model into a word
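`idx_to_word` scans the whole vocabulary on every call. Since the lookup happens once per generated word, building a reverse dictionary once can be a convenient alternative. The sketch below uses a plain dictionary as a stand-in for `tokenizer.word_index`; `build_index_word` and the sample indices are hypothetical.

```python
def build_index_word(word_index):
    """Invert a {word: index} dictionary into {index: word}."""
    return {index: word for word, index in word_index.items()}

# Stand-in for tokenizer.word_index with made-up indices.
sample_word_index = {'startseq': 1, 'endseq': 2, 'girl': 3, 'dog': 4}
index_word = build_index_word(sample_word_index)

print(index_word.get(3))    # lookup of a known index
print(index_word.get(99))   # unknown index gives None, like idx_to_word
```

The Keras `Tokenizer` also keeps an `index_word` dictionary after fitting, which can be used for the same purpose.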
# generate caption for an image
def predict_caption(model, image, tokenizer, max_length):
    # add start tag for generation process
    in_text = 'startseq'
    # iterate over the max length