Image Captioning with Attention Mechanism
CONTENTS
1 INTRODUCTION
2 PROBLEM DEFINITION
3 LITERATURE REVIEW
4 DATASET
5 DATA PREPROCESSING
6 METHODOLOGY
7 RESULTS AND ANALYSIS
8 CONCLUSION
9 REFERENCES
10 PROGRAM (CODE)
FIGURE 1 ENCODER
FIGURE 3 DECODER
FIGURE 6 RESULT
Abstract
This research introduces an image captioning model built on a convolutional neural network
(CNN) encoder and a recurrent neural network (RNN) decoder. The model is implemented with
TensorFlow and TensorFlow Hub and uses a pre-trained InceptionResNetV2 as the feature
extractor. The dataset comprises COCO image-caption pairs, and the preprocessing involves
resizing images and tokenizing captions. The decoder includes an attention mechanism so that
each generated word can draw on the relevant image regions. Training employs a custom loss
function and the Adam optimizer. A probabilistic prediction component uses the trained model
to generate diverse and contextually relevant captions for unseen images. The research
contributes to the field of computer vision by illustrating the potential of attention-based
image captioning models.
CHAPTER 1:
INTRODUCTION
Image captioning, the task of automatically generating a textual description of an image,
has gained prominence due to its potential applications in diverse domains, such as
assistive technologies, content retrieval, and human-machine interaction. The challenge lies
in developing algorithms that not only recognize objects and scenes within images but also
understand their contextual relationships and nuances. Over the years, various approaches
have emerged, ranging from traditional rule-based methods to state-of-the-art deep learning
techniques.
Among these, attention mechanisms have played a pivotal role, allowing models to
selectively focus on different regions of an image while generating descriptive captions. This
introduction sets the stage for exploring the evolution, challenges, and advancements in
image captioning.
CHAPTER 2:
PROBLEM DEFINITION
Image captioning with attention mechanisms addresses the need for automated systems that
generate accurate and contextually relevant textual descriptions of visual content.
Traditional image captioning methods often struggle to capture intricate details and
contextual relationships in complex scenes.
The overarching goal is to enhance the synergy between visual understanding and natural
language processing, creating robust and interpretable systems capable of providing
meaningful descriptions for diverse visual content.
CHAPTER 3:
LITERATURE REVIEW
CHAPTER 4:
DATASET
The Microsoft Common Objects in Context (MS COCO) dataset is a widely used benchmark
in computer vision, particularly for image and scene understanding tasks. Created and
maintained by Microsoft, the MS COCO dataset is designed to address the limitations of
previous datasets by offering a more comprehensive and diverse collection of images with
rich annotations.
Image Collection and Diversity: The dataset consists of a vast collection of images,
currently containing over 200,000 images covering a wide range of object categories. These
images are sourced from everyday scenes and capture diverse contexts, including indoor and
outdoor environments. The diversity of the dataset is a key strength, making it suitable for
training models that need to recognize objects and scenes in a variety of real-world scenarios.
Annotation Types: One of the distinctive features of MS COCO is its detailed and extensive
annotation schema. Each image in the dataset is annotated with multiple captions, providing
textual descriptions that describe different aspects of the scene. This multimodal annotation
approach goes beyond traditional datasets, allowing models not only to recognize objects but
also to understand their relationships and interactions within a scene. The annotations are
created by human annotators, ensuring high-quality and contextually rich descriptions.
Object Categories: MS COCO is labeled with 80 different object categories, covering a broad
spectrum of items such as people, animals, vehicles, household items, and outdoor objects.
This diversity helps models trained on MS COCO generalize across various domains and object
types.
CHAPTER 5:
DATA PREPROCESSING
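Images are resized to a fixed resolution and their pixel values are scaled to the range
[0, 1]. Captions are wrapped in `<start>` and `<end>` tokens, stripped of punctuation, and
tokenized into fixed-length integer sequences. The tokenizer is adapted to the training
captions to build the model's vocabulary, ensuring compatibility during training and
inference. This preprocessing ensures that both image and caption inputs are suitably
prepared for the subsequent stages of the model, facilitating effective learning and
generation of meaningful captions.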
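As a minimal, framework-free sketch of the caption side of this preprocessing (the pipeline in Chapter 10 uses the Keras `TextVectorization` layer instead), the following shows how a vocabulary can be built from the most frequent words and how captions can be mapped to padded integer sequences. The function names, the toy captions, and the reserved indices 0 (padding) and 1 (unknown word) are assumptions made for illustration.

```python
import re
from collections import Counter

PAD, UNK = 0, 1  # assumed reserved indices: padding and out-of-vocabulary


def standardize(caption):
    """Lowercase and strip punctuation, keeping the <start>/<end> markers."""
    return re.sub(r"[^\w\s<>]", "", caption.lower())


def build_vocab(captions, vocab_size):
    """Map the most frequent words to indices 2, 3, ... (0 and 1 are reserved)."""
    counts = Counter(w for c in captions for w in standardize(c).split())
    words = [w for w, _ in counts.most_common(vocab_size - 2)]
    return {w: i + 2 for i, w in enumerate(words)}


def tokenize(caption, vocab, max_len):
    """Convert a caption to a fixed-length list of word indices."""
    ids = [vocab.get(w, UNK) for w in standardize(caption).split()]
    ids = ids[:max_len]  # truncate long captions
    return ids + [PAD] * (max_len - len(ids))  # pad short captions


captions = [
    "<start> A cat sits on a mat. <end>",
    "<start> A dog runs on the grass! <end>",
]
vocab = build_vocab(captions, vocab_size=50)
print(tokenize(captions[0], vocab, max_len=12))
```

Words outside the vocabulary map to the unknown index, and captions longer than `max_len` are truncated, which is why the maximum length should be chosen with the whole dataset in mind.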
CHAPTER 6:
METHODOLOGY
The methodology encompasses dataset loading, image and caption preprocessing, model
architecture design, training configuration, and probabilistic caption generation. The use of a
pre-trained InceptionResNetV2 as a feature extractor ensures the model captures rich image
representations.
The incorporation of attention mechanisms in the GRU-based decoder enhances the model's
ability to attend to relevant image regions. Training involves the utilization of a custom loss
function that considers sentence lengths, optimizing the model for coherent caption
generation. The final section explores the probabilistic nature of the trained model,
demonstrating its ability to generate diverse captions for a given image.
CHAPTER 6.1:
MODEL ARCHITECTURE
The attention mechanism fosters a dynamic relationship between visual and textual
information, allowing the model to adaptively focus on different parts of the image during
caption generation. The inclusion of embedding layers further enriches the model's
understanding of semantic relationships within the captions. Overall, the model architecture
reflects a thoughtful integration of state-of-the-art components tailored to address the
complexities of image captioning.
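To make the idea of adaptively focusing on image regions concrete, a single attention step can be sketched in plain Python as dot-product attention: the decoder state is scored against each region feature, the scores are normalized with a softmax, and the context is the weighted sum of the region features. The function names and the toy feature values are assumptions for illustration, not the trained model's internals.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_product_attention(query, keys):
    """Return (weights, context) for one query over a list of key vectors.

    The keys double as values, as when a decoder state attends to encoder outputs.
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(keys[0])
    context = [
        sum(w * key[d] for w, key in zip(weights, keys)) for d in range(dim)
    ]
    return weights, context


# Toy example: three image regions with 2-dimensional features (made-up values).
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = dot_product_attention(decoder_state, regions)
print(weights, context)
```

Regions whose features align with the decoder state receive larger weights, so they contribute more to the context vector used to predict the next word.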
Figure 1 Encoder
Figure 3 Decoder
CHAPTER 6.2:
MODEL TRAINING
The model is trained using the Adam optimizer and a sparse categorical cross-entropy loss
function. The training process optimizes the model's parameters to minimize the
discrepancy between predicted and actual captions.
The research emphasizes the importance of custom loss functions and sequence-aware
padding to handle variable-length captions. The training loop iterates through the dataset,
updating the model weights to enhance its ability to generate accurate and contextually
relevant captions.
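One common way to make a loss aware of variable caption lengths is to ignore padded positions when averaging. The sketch below illustrates that idea in plain Python; it assumes padding index 0 and per-position probability distributions, and it is not necessarily the exact loss used in Chapter 10.

```python
import math

PAD = 0  # assumed padding index


def masked_cross_entropy(target_ids, probs):
    """Mean negative log-likelihood over the non-padding positions of one caption.

    target_ids: list of word indices, padded with PAD.
    probs: one probability distribution over the vocabulary per position.
    """
    losses = [
        -math.log(dist[t])
        for t, dist in zip(target_ids, probs)
        if t != PAD
    ]
    return sum(losses) / len(losses) if losses else 0.0


# Toy example: vocabulary of 3 words, 2 real tokens followed by 2 padding slots.
targets = [1, 2, PAD, PAD]
predicted = [
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
    [0.9, 0.05, 0.05],
    [0.3, 0.3, 0.4],
]
print(masked_cross_entropy(targets, predicted))
```

Because padded positions are skipped, short captions are not rewarded or penalized for what the model predicts after the `<end>` token.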
By leveraging GPU acceleration and batching techniques, the code achieves an efficient
training process. The model's performance is evaluated using a probabilistic prediction
mechanism, demonstrating its capability to generate diverse and meaningful captions for
input images.
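A probabilistic prediction step of this kind can be sketched as top-k sampling: instead of always choosing the most likely next word, the decoder samples among the k highest-scoring candidates. The function name, the value of k, the toy logits, and the fixed random seed below are assumptions for illustration.

```python
import math
import random


def sample_top_k(logits, k, rng=random):
    """Sample an index from the k highest logits, weighted by softmax probability."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


# Toy logits over a 5-word vocabulary (made-up values).
logits = [2.0, 1.5, -1.0, 0.3, -3.0]
rng = random.Random(0)
samples = [sample_top_k(logits, k=3, rng=rng) for _ in range(1000)]
print({i: samples.count(i) for i in sorted(set(samples))})
```

Sampling this way trades some determinism for variety: repeated runs on the same image can yield different captions, while low-scoring words outside the top k are never chosen.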
CHAPTER 7:
RESULTS AND ANALYSIS
The model is demonstrated by generating captions for sample images. The probabilistic
prediction mechanism can produce a different caption on each run, which yields diverse
descriptions of the same image. The attention mechanism lets the decoder draw on specific
image regions while generating each word, which helps it capture fine-grained details and
produce captions that read like human descriptions.
The analysis highlights the model's potential for real-world applications, such as content
retrieval and assistive technologies. The efficient training process and integration of state-of-
the-art components contribute to the model's robustness, paving the way for further
advancements in image captioning research.
Figure 6 Result
CHAPTER 8:
CONCLUSION
In conclusion, the presented image captioning model showcases the fusion of cutting-edge
computer vision and natural language processing techniques. The custom attention-enhanced
GRU architecture, coupled with InceptionResNetV2 for feature extraction, contributes to the
model's ability to generate descriptive and contextually relevant captions for diverse images.
CHAPTER 9:
REFERENCES
1. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
- Authors: Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan
Salakhutdinov, Richard Zemel, Yoshua Bengio
2. Attention Is All You Need
- Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N.
Gomez, Lukasz Kaiser, Illia Polosukhin
3. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
- Authors: Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen
Gould, Lei Zhang
- Conference: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018
4. Microsoft COCO: Common Objects in Context
- Authors: Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James
Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, Piotr Dollár
7. Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning
- Conference: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017
8. Self-Critical Sequence Training for Image Captioning
- Authors: Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, Vaibhava Goel
- Conference: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017
CHAPTER 10:
PROGRAM (CODE)
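import time
from textwrap import wrap

import tensorflow as tf
from tensorflow.keras.layers import GRU, Attention, Dense, LayerNormalization
from tensorflow.keras.layers import StringLookup, TextVectorization

print(tf.version.VERSION)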
We will use the TensorFlow Datasets capability to read the COCO captions
dataset.
This version contains images, bounding boxes, labels, and captions from COCO
2014, split into the subsets defined by Karpathy and Li (2015), and takes
care of some data quality issues with the original dataset (for example, some
of the images in the original dataset did not have captions).
GCS_DIR = "gs://asl-public/data/tensorflow_datasets/"
BUFFER_SIZE = 1000
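def get_image_label(example):
    # Keep only the first caption per image.
    caption = example["captions"]["text"][0]
    img = example["image"]
    # Resize to the model's input size and scale pixel values to [0, 1].
    img = tf.image.resize(img, (IMG_HEIGHT, IMG_WIDTH))
    img = img / 255
    return {"image_tensor": img, "caption": caption}


# `trainds` is assumed to already hold the COCO captions training split.
trainds = trainds.map(
    get_image_label, num_parallel_calls=tf.data.AUTOTUNE
).shuffle(BUFFER_SIZE)
trainds = trainds.prefetch(buffer_size=tf.data.AUTOTUNE)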
"""### Visualize
Let's take a look at images and sample captions in the dataset.
"""
We add special tokens to represent the start (`<start>`) and the end
(`<end>`) of each sentence.
Start and end tokens are added here because we are using an encoder-decoder
model: during prediction, `<start>` gets the captioning started, and since
captions are of variable length, we terminate the prediction when we see the
`<end>` token.
def add_start_end_token(data):
    start = tf.convert_to_tensor("<start>")
    end = tf.convert_to_tensor("<end>")
    # Wrap the caption as "<start> caption <end>".
    data["caption"] = tf.strings.join(
        [start, data["caption"], end], separator=" "
    )
    return data


trainds = trainds.map(add_start_end_token)
You will transform the text captions into integer sequences using the
`TextVectorization` layer, with the following steps:
* Use `adapt` to iterate over all captions, split the captions into words, and
compute a vocabulary of the top `VOCAB_SIZE` words.
* Tokenize all captions by mapping each word to its index in the vocabulary.
All output sequences will be padded to the length `MAX_CAPTION_LEN`. Here we
directly specify `64`, which is sufficient for this dataset, but please note
that this value should be computed by processing the entire dataset if you
don't want to truncate very long sentences.
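MAX_CAPTION_LEN = 64


def standardize(inputs):
    # Strip punctuation; "<" and ">" are kept so the special tokens survive.
    return tf.strings.regex_replace(
        inputs, r"[!\"#$%&\(\)\*\+.,-/:;=?@\[\\\]^_`{|}~]?", ""
    )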
# Choose the most frequent words from the vocabulary and remove punctuation.
tokenizer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    standardize=standardize,
    output_sequence_length=MAX_CAPTION_LEN,
)
tokenizer.adapt(trainds.map(lambda x: x["caption"]))
"""
Let's try to tokenize a sample text"""
sample_captions = []
for d in trainds.take(5):
    sample_captions.append(d["caption"].numpy())
sample_captions
print(tokenizer(sample_captions))
"""Please note that all the sentences start and end with the same tokens
(e.g. '3' and '4'). These values represent the start and end tokens
respectively.
"""
# Lookup table: word -> index
word_to_index = StringLookup(
    mask_token="", vocabulary=tokenizer.get_vocabulary()
)
"""Here note that we are also creating labels by shifting texts from feature
captions.
If we have an input caption `"<start> I love cats <end>"`, its label should be
`"I love cats <end> <padding>"`.
With that, our model can try to learn to predict `I` from `<start>`.
The dataset should return tuples, where the first elements are features
(`image_tensor` and `caption`) and the second elements are labels (target).
"""
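BATCH_SIZE = 32


def create_ds_fn(data):
    img_tensor = data["image_tensor"]
    caption = tokenizer(data["caption"])
    # Label: the caption shifted left by one token, padded with 0 at the end.
    target = tf.concat([caption[1:], tf.zeros([1], dtype=caption.dtype)], axis=0)
    return (img_tensor, caption), target


batched_ds = (
    trainds.map(create_ds_fn)
    .batch(BATCH_SIZE, drop_remainder=True)
    .prefetch(buffer_size=tf.data.AUTOTUNE)
)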
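The label shift described above (`"<start> I love cats <end>"` becoming `"I love cats <end> <padding>"`) can be illustrated on plain Python lists. This is only a sketch: the token indices are made up and the padding index 0 is an assumption.

```python
PAD = 0  # assumed padding index


def make_input_and_label(token_ids):
    """Return (decoder input, label), where the label is the input shifted left by one."""
    label = token_ids[1:] + [PAD]
    return token_ids, label


# Toy indices for "<start> I love cats <end>" followed by padding (values are made up).
caption = [3, 7, 12, 9, 4, PAD, PAD]
decoder_input, label = make_input_and_label(caption)
for x, y in zip(decoder_input, label):
    print(x, "->", y)
```

Each printed pair shows the token the decoder sees at a time step and the token it is trained to predict next.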
"""## Model
Now let's design an image captioning model.
It consists of an image encoder, followed by a caption decoder.
"""
FEATURE_EXTRACTOR.trainable = False
[Image: attention equation 1]
[Image: attention equation 2]
Last is the $score$ function. Its job is to calculate a scalar logit-score for
each key-query pair. There are two common approaches:
[Image: attention equation 4]
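For reference, attention weights and the context vector in encoder-decoder models are commonly written as shown here. This is the standard textbook formulation, assumed to correspond to the equations referenced in this section. The symbols are also assumptions: $h_t$ is the decoder state (query) at step $t$, $\bar{h}_s$ is the encoder output at position $s$ (key and value), $\alpha_{ts}$ is the attention weight, $c_t$ is the context vector, and $W$, $W_1$, $W_2$, $v_a$ are learned parameters.

```latex
\alpha_{ts} = \frac{\exp\big(\mathrm{score}(h_t, \bar{h}_s)\big)}
                   {\sum_{s'} \exp\big(\mathrm{score}(h_t, \bar{h}_{s'})\big)}
\qquad
c_t = \sum_{s} \alpha_{ts}\, \bar{h}_s

\mathrm{score}(h_t, \bar{h}_s) =
\begin{cases}
h_t^{\top} W \bar{h}_s & \text{(multiplicative, Luong-style)} \\
v_a^{\top} \tanh\big(W_1 h_t + W_2 \bar{h}_s\big) & \text{(additive, Bahdanau-style)}
\end{cases}
```

In Keras, the `Attention` layer is the dot-product (Luong-style) variant and `AdditiveAttention` is the Bahdanau-style variant.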
The decoder's job is to generate predictions for the next output token.
We can define all the steps with the Keras Functional API, but please note that
here we instantiate the layers that have trainable parameters separately, so
that we can reuse the layers and their weights in the inference phase.
"""
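decoder_gru = GRU(
    ATTENTION_DIM,
    return_sequences=True,
    return_state=True,
)
gru_output, gru_state = decoder_gru(embed_x)
# Attend over the encoder output, using the GRU output as the query.
decoder_attention = Attention()
context_vector = decoder_attention([gru_output, encoder_output])
# Assumed: `addition` is the element-wise sum of the GRU output and the context.
addition = tf.keras.layers.Add()([gru_output, context_vector])
layer_norm = LayerNormalization(axis=-1)
layer_norm_out = layer_norm(addition)
# Project to vocabulary-sized logits.
decoder_output_dense = Dense(VOCAB_SIZE)
decoder_output = decoder_output_dense(layer_norm_out)
decoder = tf.keras.Model(
    inputs=[word_input, encoder_output], outputs=decoder_output
)
tf.keras.utils.plot_model(decoder)
decoder.summary()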
Now we have defined the encoder and the decoder. Let's combine them into an
image captioning model for training.
It has two inputs (`image_input` and `word_input`) and one output
(`decoder_output`). This definition should correspond to the definition of the
dataset pipeline.
"""
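image_caption_train_model = tf.keras.Model(
    inputs=[image_input, word_input], outputs=decoder_output
)
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction="none"
)


def loss_function(real, pred):
    # Per-token loss of shape (batch, sequence length); `real` and `pred` are
    # assumed parameter names for the labels and the predicted logits.
    loss_ = loss_object(real, pred)
    return tf.reduce_mean(loss_, 1)


image_caption_train_model.compile(
    optimizer="adam",
    loss=loss_function,
)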
Now we can train the model using the standard `model.fit` API.
It takes around 15-20 minutes with an NVIDIA T4 GPU to train 1 epoch.
"""
"""## Caption!
The predict step is different from the training, since we need to keep track
of the GRU state during the caption generation, and pass a predicted word to
the decoder as an input at the next time step.
In order to do so, let's define another model for prediction while using the
trained weights, so that it can keep and update the GRU state during the
caption generation.
"""
decoder_output = decoder_output_dense(layer_norm_output)
"""
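MINIMUM_SENTENCE_LENGTH = 5


def predict_caption(filename):
    # Load the image and preprocess it the same way as the training images.
    img = tf.image.decode_jpeg(
        tf.io.read_file(filename), channels=IMG_CHANNELS
    )
    img = tf.image.resize(img, (IMG_HEIGHT, IMG_WIDTH))
    img = img / 255
    # The lines below belong to the decoding loop, which computes
    # `predicted_id` at each step (the loop itself is not part of this listing).
    result.append(tokenizer.get_vocabulary()[predicted_id])
    if predicted_id == word_to_index("<end>"):
        return img, result
    dec_input = tf.expand_dims([predicted_id], 1)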
"""Let's caption!"""
for i in range(5):
    image, caption = predict_caption(filename)
    print(" ".join(caption[:-1]) + ".")