IMAGE CAPTIONING WITH ATTENTION MECHANISM

CONTENTS

1 INTRODUCTION

2 PROBLEM DEFINITION

3 LITERATURE REVIEW

4 DATASET

5 DATA PREPROCESSING

6 METHODOLOGY

6.1 MODEL ARCHITECTURE

6.2 MODEL TRAINING

7 RESULTS AND ANALYSIS

DEpartment of cse(ai-ml), soe, dsu 1

8 CONCLUSION

9 REFERENCES

10 PROGRAM (CODE)


LIST OF FIGURES AND TABLES

FIGURE 1 ENCODER

FIGURE 2 ATTENTION MECHANISM

FIGURE 3 DECODER

FIGURE 4 DECODER PARAMETERS

FIGURE 5 CROSS ENTROPY LOSS

FIGURE 6 RESULT


Abstract

This research introduces a novel image captioning model leveraging a convolutional neural
network (CNN) and recurrent neural network (RNN) architecture. The model, implemented
using TensorFlow and TensorFlow Hub, utilizes the powerful InceptionResNetV2 as a
feature extractor. The dataset comprises COCO image-caption pairs, and the preprocessing
involves resizing images and tokenizing captions. The model's architecture includes an
attention mechanism for enhanced context understanding. Training employs a custom loss
function and the Adam optimizer, demonstrating impressive results in generating captions for
unseen images. The developed probabilistic prediction component utilizes a trained model to
generate diverse and contextually relevant captions for a given image. The research
contributes to the field of computer vision, showcasing the potential of attention-based image
captioning models.


CHAPTER 1:
INTRODUCTION

Image captioning is a multidisciplinary field at the intersection of computer vision and
natural language processing that aims to give machines the ability to generate human-like
textual descriptions of visual content. The overarching goal is to bridge the semantic gap
between images and natural language, enabling machines to comprehend and communicate
the intricacies of visual scenes.

This dynamic discipline has gained prominence due to its potential applications in diverse
domains, such as assistive technologies, content retrieval, and human-machine interaction.
The challenge lies in developing algorithms that not only recognize objects and scenes within
images but also understand their contextual relationships and nuances. Over the years,
various approaches have emerged, ranging from traditional rule-based methods to state-of-
the-art deep learning techniques.

Among these, attention mechanisms have played a pivotal role, allowing models to
selectively focus on different regions of an image while generating descriptive captions. This
introduction sets the stage for exploring the evolution, challenges, and advancements in the
captivating realm of image captioning.


CHAPTER 2:
PROBLEM DEFINITION

The problem of image captioning with attention mechanisms using machine learning (ML)
algorithms revolves around the need for automated systems to generate accurate and
contextually relevant textual descriptions for visual content. Traditional image captioning
methods often struggle to capture intricate details and contextual relationships in complex
scenes.

Attention mechanisms, a key component of modern ML algorithms, aim to address this
limitation by dynamically focusing on different regions of an image while generating
captions. However, challenges persist in optimizing these attention mechanisms to strike a
balance between capturing salient features and maintaining coherence in the generated
captions. Additionally, scalability and computational efficiency are crucial considerations in
deploying attention-based image captioning models in real-world applications.

The overarching goal is to enhance the synergy between visual understanding and natural
language processing, creating robust and interpretable systems capable of providing
meaningful descriptions for diverse visual content.


CHAPTER 3:
LITERATURE REVIEW

1. Understanding How Encoder-Decoder Architectures Attend

- Kyle Aitken, Vinay V Ramasesh (NeurIPS 2021)
The research paper investigates encoder-decoder architectures and their attention
mechanisms. It delves into how these models effectively capture and weigh input information
during encoding and decoding processes. By understanding the attention mechanisms, the
paper aims to enhance the overall performance of encoder-decoder architectures in various
applications.

2. Encoder-Decoder Recurrent Neural Network Models for Neural Machine Translation

- Jason Brownlee (Deep Learning for NLP, 2019)
The article explores Encoder-Decoder Recurrent Neural Network models in the
context of Neural Machine Translation. It investigates how these models, consisting of an
encoder to understand the source language and a decoder to generate the target language,
contribute to the advancement of machine translation systems, enhancing accuracy and
efficiency.

3. Attention Is All You Need

- Ashish Vaswani, Noam Shazeer (2017)
"Attention Is All You Need" is a seminal research paper in machine learning that introduced
the Transformer model, revolutionizing natural language processing and various AI tasks.


Published in 2017 by Vaswani et al., it emphasized self-attention mechanisms, enabling
parallelization and improved performance, becoming foundational in modern deep learning
architectures.

4. Deep Residual Learning for Image Recognition

- Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun (2015)
The research paper "Deep Residual Learning for Image Recognition" introduces a
groundbreaking neural network architecture known as ResNet. Developed by Microsoft
Research in 2015, ResNet employs residual blocks with shortcut connections to address the
degradation problem, in which accuracy saturates and then drops as plain networks get deeper,
enabling the training of extremely deep convolutional neural networks for improved image
recognition accuracy.

5. CSPNet: A New Backbone that can Enhance Learning Capability of CNN

- Chien-Yao Wang, Hong-Yuan Mark Liao
The CSPNet proposes a novel convolutional neural network (CNN) backbone, designed to
boost learning capabilities. This innovative architecture enhances information flow, fostering
improved feature extraction. Through comprehensive experiments, CSPNet demonstrates
superior performance, offering a promising advancement in CNNs for diverse applications,
marking a significant stride in the field of deep learning.


CHAPTER 4:
DATASET

The Microsoft Common Objects in Context (MS COCO) dataset is a widely used benchmark
in the field of computer vision and specifically in tasks related to image understanding and
scene understanding. Created and maintained by Microsoft, the MS COCO dataset is
designed to address the limitations of previous datasets by offering a more comprehensive
and diverse collection of images with rich annotations.

Image Collection and Diversity: The dataset consists of a vast collection of images,
currently containing over 200,000 images covering a wide range of object categories. These
images are sourced from everyday scenes and capture diverse contexts, including indoor and
outdoor environments. The diversity of the dataset is a key strength, making it suitable for
training models that need to recognize objects and scenes in a variety of real-world scenarios.


Annotation Types: One of the distinctive features of MS COCO is its detailed and extensive
annotation schema. Each image in the dataset is annotated with multiple captions, providing
textual descriptions that describe different aspects of the scene. This multimodal annotation
approach goes beyond traditional datasets, allowing models not only to recognize objects but
also to understand their relationships and interactions within a scene. The annotations are
created by human annotators, ensuring high-quality and contextually rich descriptions.

Object Categories: MS COCO is labeled with a wide range of object categories, spanning
from common everyday objects to more complex scenes. The dataset includes 80 different
object categories, covering a broad spectrum of items such as people, animals, vehicles,
household items, and outdoor scenes. This diversity ensures that models trained on MS
COCO can generalize well across various domains and object types.


CHAPTER 5:
DATA PREPROCESSING

The preprocessing pipeline involves resizing images to a standardized format and
normalizing pixel values. Captions are tokenized using a TextVectorization layer, and special
tokens ("<start>" and "<end>") are added to mark the beginning and end of each sequence.
The standardization of captions involves lowercasing and removal of punctuation, enhancing
the model's robustness to variations in input text.

Tokenized captions are adapted to the model's vocabulary, ensuring compatibility during
training and inference. This meticulous preprocessing ensures that both image and caption
inputs are suitably prepared for the subsequent stages of the model, facilitating effective
learning and generation of meaningful captions.
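The caption side of this pipeline can be illustrated without TensorFlow. The following is a minimal pure-Python sketch, not the project's actual implementation: the helper names (`standardize`, `build_vocab`, `encode`), the tiny `MAX_LEN`, and the two sample captions are assumptions made for illustration, and the id layout (0 for padding, 1 for out-of-vocabulary words) is chosen to mirror what a TextVectorization layer typically produces.

```python
import re
from collections import Counter

MAX_LEN = 8  # hypothetical short length for the demo; the report's code pads to 64


def standardize(caption):
    """Lowercase and drop punctuation, keeping '<' and '>' so special tokens survive."""
    return re.sub(r"[^\w\s<>]", "", caption.lower())


def build_vocab(captions, max_tokens):
    """Map words to ids: 0 is padding, 1 is out-of-vocabulary, then most frequent first."""
    counts = Counter(word for caption in captions for word in caption.split())
    words = [word for word, _ in counts.most_common(max_tokens - 2)]
    return {word: index + 2 for index, word in enumerate(words)}


def encode(caption, vocab, max_len=MAX_LEN):
    """Convert a standardized caption to a fixed-length list of ids, padded with 0."""
    ids = [vocab.get(word, 1) for word in caption.split()][:max_len]
    return ids + [0] * (max_len - len(ids))


raw = ["A dog runs, fast!", "A cat sits."]
captions = [standardize(f"<start> {text} <end>") for text in raw]
vocab = build_vocab(captions, max_tokens=20)
encoded = [encode(caption, vocab) for caption in captions]
print(captions)
print(encoded)
```

Every encoded caption starts with the same id (the `<start>` token) and is right-padded with zeros, which is the property the loss function later relies on to ignore padding.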


CHAPTER 6:
METHODOLOGY

The methodology encompasses dataset loading, image and caption preprocessing, model
architecture design, training configuration, and probabilistic caption generation. The use of a
pre-trained InceptionResNetV2 as a feature extractor ensures the model captures rich image
representations.

The incorporation of attention mechanisms in the GRU-based decoder enhances the model's
ability to attend to relevant image regions. Training involves the utilization of a custom loss
function that considers sentence lengths, optimizing the model for coherent caption
generation. The final section explores the probabilistic nature of the trained model,
demonstrating its ability to generate diverse captions for a given image.

The methodology serves as a comprehensive guide to the processes involved in developing
and training the image captioning model.


CHAPTER 6.1 :
MODEL ARCHITECTURE

The model architecture comprises an InceptionResNetV2-based feature extractor and a
custom attention-enhanced GRU network for caption generation. The feature extractor
transforms each input image into an 8 x 8 grid of 1536-dimensional feature vectors (64 image
regions), and the attention mechanism weights these regions by relevance at each step of
caption generation.

The GRU network processes tokenized captions, incorporating contextual information
through attention. This architecture encourages the model to capture fine-grained details in
images and generate coherent and contextually relevant captions.

The attention mechanism fosters a dynamic relationship between visual and textual
information, allowing the model to adaptively focus on different parts of the image during
caption generation. The inclusion of embedding layers further enriches the model's
understanding of semantic relationships within the captions. Overall, the model architecture
reflects a thoughtful integration of state-of-the-art components tailored to address the
complexities of image captioning.
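To make the attention step concrete, here is a minimal pure-Python sketch of dot-product (Luong-style) attention for a single decoder step. It is an illustration under simplifying assumptions rather than the model's code: the function names are hypothetical, the vectors are 2-dimensional instead of 512-dimensional, and there are 3 regions instead of 64.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    largest = max(scores)
    exps = [math.exp(score - largest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def dot_product_attention(query, regions):
    """Score each region against the query, then return the weighted sum of regions."""
    scores = [sum(q * r for q, r in zip(query, region)) for region in regions]
    weights = softmax(scores)
    context = [
        sum(weight * region[i] for weight, region in zip(weights, regions))
        for i in range(len(query))
    ]
    return context, weights


# Toy example: three image regions and one decoder state (the query).
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
context, weights = dot_product_attention(query, regions)
print(weights)
print(context)
```

Regions whose features align with the decoder state receive larger weights, so the context vector passed on to the output layer is dominated by the image regions most relevant to the word being generated.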

Figure 1 Encoder

Figure 2 Attention Mechanism

Figure 3 Decoder

Figure 4 Decoder Parameters

CHAPTER 6.2:
MODEL TRAINING

The model is trained using an Adam optimizer and sparse categorical cross-entropy loss
function. The training process involves optimizing the model's parameters to minimize the
discrepancy between predicted and actual captions.
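The "actual" caption used as the training target is the input caption shifted one position to the left, so that at each step the decoder learns to predict the next word. A minimal sketch of that shift, assuming padding id 0 and made-up token ids (`make_target` is a hypothetical helper, not part of the project's code):

```python
PAD = 0  # assumed padding id


def make_target(caption_ids):
    """Shift the caption left by one position and pad the freed last slot."""
    return caption_ids[1:] + [PAD]


# Hypothetical ids: 3 = <start>, 4 = <end>, 0 = padding.
caption_ids = [3, 7, 12, 9, 4, 0, 0, 0]
target = make_target(caption_ids)
print(target)
```

With this layout the decoder input at position 0 is `<start>` and its target is the first real word, while the position holding `<end>` in the input has padding as its target.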

The research emphasizes the importance of custom loss functions and sequence-aware
padding to handle variable-length captions. The training loop iterates through the dataset,
updating the model weights to enhance its ability to generate accurate and contextually
relevant captions.
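The idea behind a padding-aware loss can be sketched in plain Python for a single caption. This is an illustrative sketch under assumptions (padding id 0, a 3-word toy vocabulary, a hypothetical function name), not the TensorFlow loss used in the project: cross-entropy is computed per position and averaged only over positions whose target is not padding.

```python
import math


def masked_cross_entropy(target_ids, logits, pad_id=0):
    """Mean negative log-likelihood over the non-padding positions of one caption."""
    total, count = 0.0, 0
    for target, step_logits in zip(target_ids, logits):
        if target == pad_id:
            continue  # padded positions contribute nothing to the loss
        largest = max(step_logits)
        log_norm = largest + math.log(sum(math.exp(x - largest) for x in step_logits))
        total += log_norm - step_logits[target]
        count += 1
    return total / max(count, 1)


# Two real target tokens followed by two padded positions.
target_ids = [2, 1, 0, 0]
logits = [
    [0.0, 0.0, 5.0],
    [0.0, 5.0, 0.0],
    [9.0, 0.0, 0.0],
    [9.0, 0.0, 0.0],
]
loss = masked_cross_entropy(target_ids, logits)
print(loss)
```

Because padded positions are skipped, short captions are not rewarded simply for containing many easy-to-predict padding tokens.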

By leveraging GPU acceleration and batching techniques, the code achieves an efficient
training process. The model's performance is evaluated using a probabilistic prediction
mechanism, demonstrating its capability to generate diverse and meaningful captions for
input images.
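The probabilistic prediction step can be illustrated with top-k sampling: keep only the k highest-scoring words and draw one of them with probability proportional to its softmax weight. The sketch below is a stdlib-only illustration; the function name, the value of k, and the toy logits are assumptions rather than the project's settings.

```python
import math
import random


def sample_top_k(logits, k, rng):
    """Draw one token id from the k highest logits, weighted by softmax over those k."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    largest = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - largest) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded so the demo is repeatable
logits = [0.1, 2.0, -1.0, 1.5, 0.3]
samples = [sample_top_k(logits, k=2, rng=rng) for _ in range(20)]
print(samples)
```

Restricting the draw to the top k candidates keeps captions plausible while still letting repeated runs on the same image produce different wording.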


CHAPTER 7:
RESULTS AND ANALYSIS
The model's efficacy is demonstrated through the generation of captions for sample images.
The probabilistic prediction mechanism allows for diverse and contextually rich captions,
showcasing the model's versatility. By utilizing attention mechanisms, the model excels in
capturing fine-grained details in images, producing captions that align with human-like
understanding.

The analysis highlights the model's potential for real-world applications, such as content
retrieval and assistive technologies. The efficient training process and integration of state-of-
the-art components contribute to the model's robustness, paving the way for further
advancements in image captioning research.

Figure 5 Cross-entropy loss

The model reached a cross-entropy loss of 0.5714.


Figure 6 Result


CHAPTER 8:
CONCLUSION

In conclusion, the presented image captioning model showcases the fusion of cutting-edge
computer vision and natural language processing techniques. The custom attention-enhanced
GRU architecture, coupled with InceptionResNetV2 for feature extraction, contributes to the
model's ability to generate descriptive and contextually relevant captions for diverse images.

The research underscores the significance of attention mechanisms in refining feature
representations, emphasizing their role in capturing intricate visual relationships. The code's
modular design and integration of preprocessing steps make it a valuable resource for
researchers and practitioners interested in advancing image captioning capabilities.

CHAPTER 9:
REFERENCES

1. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

- Authors: Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan
Salakhudinov, Richard Zemel, Yoshua Bengio

- Conference: International Conference on Machine Learning (ICML), 2015

2. Attention is All You Need

- Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N.
Gomez, Lukasz Kaiser, Illia Polosukhin

- Conference: Advances in Neural Information Processing Systems (NeurIPS), 2017

3. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

- Authors: Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen
Gould, Lei Zhang

- Conference: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018

4. Neural Machine Translation by Jointly Learning to Align and Translate

- Authors: Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio

- Journal: arXiv preprint arXiv:1409.0473, 2014

5. Adam: A Method for Stochastic Optimization


- Authors: Diederik P. Kingma, Jimmy Ba

- Journal: arXiv preprint arXiv:1412.6980, 2014

6. Microsoft COCO: Common Objects in Context

- Authors: Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James
Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, Piotr Dollár

- Conference: European Conference on Computer Vision (ECCV), 2014

7. Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning

- Authors: Jiasen Lu, Caiming Xiong, Devi Parikh, Richard Socher

- Conference: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017

8. Self-critical Sequence Training for Image Captioning

- Authors: Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, Vaibhava Goel

- Conference: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017


CHAPTER 10:
PROGRAM(CODE)

import time
from textwrap import wrap

import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow.keras import Input
from tensorflow.keras.layers import (
    GRU,
    Add,
    Attention,
    Dense,
    Embedding,
    LayerNormalization,
    Reshape,
    StringLookup,
    TextVectorization,
)

print(tf.__version__)

"""## Read and prepare dataset

We will use the TensorFlow Datasets capability to read the COCO captions dataset.
This version contains images, bounding boxes, labels, and captions from COCO
2014, split into the subsets defined by Karpathy and Li (2015), and takes
care of some data quality issues with the original dataset (for example, some
of the images in the original dataset did not have captions).

First, let's define some constants.

In this lab, we will use a pretrained InceptionResNetV2 model from
`tf.keras.applications` as a feature extractor, so some constants come
from the InceptionResNetV2 model definition.
If you want to use another type of base model, please make sure to change
these constants as well.

`tf.keras.applications` is a pretrained model repository like TensorFlow Hub,
but while TensorFlow Hub hosts models for different
modalities including image, text, audio, and so on, `tf.keras.applications`
only hosts popular and stable models for images.
However, `tf.keras.applications` is more flexible as it contains model
metadata and allows us to access and control the model behavior, while most
of the TensorFlow Hub based models only contain compiled SavedModels.
So, for example, we can get output not only from the final layer of the model
(e.g. flattened 1D Tensor output of CNN models), but also from intermediate
layers (e.g. intermediate 3D Tensor) by accessing layer metadata.
"""

# Change these to control the accuracy/speed
VOCAB_SIZE = 20000  # use fewer words to speed up convergence
ATTENTION_DIM = 512  # size of dense layer in Attention
WORD_EMBEDDING_DIM = 128

# InceptionResNetV2 takes (299, 299, 3) images as input
# and returns features in (8, 8, 1536) shape
FEATURE_EXTRACTOR = tf.keras.applications.inception_resnet_v2.InceptionResNetV2(
    include_top=False, weights="imagenet"
)
IMG_HEIGHT = 299
IMG_WIDTH = 299
IMG_CHANNELS = 3
FEATURES_SHAPE = (8, 8, 1536)

"""### Filter and Preprocess

Here we preprocess the dataset. The function below:
- resizes the image to (`IMG_HEIGHT`, `IMG_WIDTH`) shape
- rescales pixel values from [0, 255] to [0, 1]
- returns a dictionary with the image (`image_tensor`) and the caption (`caption`).

Note: This dataset is too large to store in a local environment.
Therefore, it is stored in a public GCS bucket located in us-central1.
So if you access it from a Notebook outside the US, it will be (a) slow and
(b) subject to a network charge.
"""

GCS_DIR = "gs://asl-public/data/tensorflow_datasets/"
BUFFER_SIZE = 1000


def get_image_label(example):
    # Keep only the first caption per image
    caption = example["captions"]["text"][0]
    img = example["image"]
    img = tf.image.resize(img, (IMG_HEIGHT, IMG_WIDTH))
    img = img / 255  # rescale pixel values to [0, 1]
    return {"image_tensor": img, "caption": caption}


trainds = tfds.load("coco_captions", split="train", data_dir=GCS_DIR)

trainds = trainds.map(
    get_image_label, num_parallel_calls=tf.data.AUTOTUNE
).shuffle(BUFFER_SIZE)
trainds = trainds.prefetch(buffer_size=tf.data.AUTOTUNE)

"""### Visualize
Let's take a look at images and sample captions in the dataset.
"""

f, ax = plt.subplots(1, 4, figsize=(20, 5))

for idx, data in enumerate(trainds.take(4)):
    ax[idx].imshow(data["image_tensor"].numpy())
    caption = "\n".join(wrap(data["caption"].numpy().decode("utf-8"), 30))
    ax[idx].set_title(caption)
    ax[idx].axis("off")

"""## Text Preprocessing

We add special tokens to represent the starts (`<start>`) and the ends
(`<end>`) of sentences.
Start and end tokens are added here because we are using an encoder-decoder
model and during prediction, to get the captioning started we use `<start>`,
and since captions are of variable length, we terminate the prediction when we
see the `<end>` token.

Then create a full list of the captions for further preprocessing.

"""

def add_start_end_token(data):
    start = tf.convert_to_tensor("<start>")
    end = tf.convert_to_tensor("<end>")
    # Join "<start>", the caption, and "<end>" with single spaces
    data["caption"] = tf.strings.join(
        [start, data["caption"], end], separator=" "
    )
    return data


trainds = trainds.map(add_start_end_token)

"""## Preprocess and tokenize the captions

You will transform the text captions into integer sequences using the
TextVectorization layer, with the following steps:

* Use `adapt` to iterate over all captions, split the captions into
words, and compute a vocabulary of the top `VOCAB_SIZE` words.
* Tokenize all captions by mapping each word to its index in the vocabulary.
All output sequences will be padded to the length `MAX_CAPTION_LEN`. Here we
directly specify `64`, which is sufficient for this dataset, but please
note that this value should be computed by processing the entire dataset if
you don't want to truncate very long sentences.

Note: This process takes around 5 minutes.

"""

MAX_CAPTION_LEN = 64


# We will override the default standardization of TextVectorization to preserve
# "<>" characters, so we preserve the tokens for the <start> and <end>.
def standardize(inputs):
    inputs = tf.strings.lower(inputs)
    return tf.strings.regex_replace(
        inputs, r"[!\"#$%&\*\+.,-/:;=?@\[\\\]^_`{|}~]?", ""
    )


# Choose the most frequent words from the vocabulary & remove punctuation etc.
tokenizer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    standardize=standardize,
    output_sequence_length=MAX_CAPTION_LEN,
)

tokenizer.adapt(trainds.map(lambda x: x["caption"]))

"""
Let's try to tokenize a sample text"""

tokenizer(["<start> This is a sentence <end>"])

sample_captions = []
for d in trainds.take(5):
    sample_captions.append(d["caption"].numpy())

sample_captions

print(tokenizer(sample_captions))

"""Please note that all the sentences start and end with the same tokens
(e.g. '3' and '4'). These values represent start tokens and end tokens
respectively.

You can also convert ids to original text.

"""

for wordid in tokenizer([sample_captions[0]])[0]:
    print(tokenizer.get_vocabulary()[wordid], end=" ")

"""Also, we can create Word <-> Index converters using `StringLookup`
layer."""

# Lookup table: Word -> Index
word_to_index = StringLookup(
    mask_token="", vocabulary=tokenizer.get_vocabulary()
)

# Lookup table: Index -> Word
index_to_word = StringLookup(
    mask_token="", vocabulary=tokenizer.get_vocabulary(), invert=True
)

"""### Create a tf.data dataset for training

Now let's apply the adapted tokenization to all the examples and create
a tf.data Dataset for training.

Here note that we are also creating labels by shifting texts from feature
captions.
If we have an input caption `"<start> I love cats <end>"`, its label should be
`"I love cats <end> <padding>"`.
With that, our model can try to learn to predict `I` from `<start>`.

The dataset should return tuples, where the first elements are features
(`image_tensor` and `caption`) and the second elements are labels (target).
"""

BATCH_SIZE = 32


def create_ds_fn(data):
    img_tensor = data["image_tensor"]
    caption = tokenizer(data["caption"])

    # Shift the caption left by one token and pad the last position with 0
    target = tf.roll(caption, -1, 0)
    zeros = tf.zeros([1], dtype=tf.int64)
    target = tf.concat((target[:-1], zeros), axis=-1)
    return (img_tensor, caption), target


batched_ds = (
    trainds.map(create_ds_fn)
    .batch(BATCH_SIZE, drop_remainder=True)
    .prefetch(buffer_size=tf.data.AUTOTUNE)
)

"""Let's take a look at some examples."""

for (img, caption), label in batched_ds.take(2):
    print(f"Image shape: {img.shape}")
    print(f"Caption shape: {caption.shape}")
    print(f"Label shape: {label.shape}")
    print(caption[0])
    print(label[0])

"""## Model
Now let's design an image captioning model.
It consists of an image encoder, followed by a caption decoder.

### Image Encoder

The image encoder model is very simple. It extracts features through a
pre-trained model and passes them to a fully connected layer.

1. In this example, we extract the features from convolutional layers of
InceptionResNetV2, which gives us a tensor of shape (Batch Size, 8, 8, 1536).
2. We reshape the tensor to (Batch Size, 64, 1536).
3. We project it to `ATTENTION_DIM` with a Dense layer and return
(Batch Size, 64, ATTENTION_DIM).
4. Later, the Attention layer attends over the image to predict the next word.

"""

# Freeze the pretrained feature extractor
FEATURE_EXTRACTOR.trainable = False

image_input = Input(shape=(IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS))
image_features = FEATURE_EXTRACTOR(image_input)

# Flatten the 8 x 8 spatial grid into a sequence of 64 region vectors
x = Reshape((FEATURES_SHAPE[0] * FEATURES_SHAPE[1], FEATURES_SHAPE[2]))(
    image_features
)
encoder_output = Dense(ATTENTION_DIM, activation="relu")(x)

encoder = tf.keras.Model(inputs=image_input, outputs=encoder_output)
encoder.summary()

"""### Caption Decoder

The caption decoder incorporates an attention mechanism that focuses on
different parts of the input image.

#### The attention head

The decoder uses attention to selectively focus on parts of the input
sequence.
The attention takes a sequence of vectors as input for each example and
returns an "attention" vector for each example.

The following notation is used:

* $s$ is the encoder index.

* $t$ is the decoder index.
* $\alpha_{ts}$ is the attention weights.
* $h_s$ is the sequence of encoder outputs being attended to (the attention
"key" and "value" in transformer terminology).
* $h_t$ is the decoder state attending to the sequence (the attention "query"
in transformer terminology).
* $c_t$ is the resulting context vector.
* $a_t$ is the final output combining the "context" and "query".

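The equations:

1. Calculates the attention weights, $\alpha_{ts}$, as a softmax across the
encoder's output sequence:
$$\alpha_{ts} = \frac{\exp(score(h_t, h_s))}{\sum_{s'} \exp(score(h_t, h_{s'}))}$$
2. Calculates the context vector as the weighted sum of the encoder outputs:
$$c_t = \sum_s \alpha_{ts} h_s$$

Last is the $score$ function. Its job is to calculate a scalar logit-score for
each key-query pair. There are two common approaches: Luong's multiplicative
style, $score(h_t, h_s) = h_t^\top W h_s$, and Bahdanau's additive style,
$score(h_t, h_s) = v_a^\top \tanh(W_1 h_t + W_2 h_s)$.

This notebook implements Luong-style attention using the pre-defined
`tf.keras.layers.Attention`.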
#### Decoder Steps

The decoder's job is to generate predictions for the next output token.

1. The decoder receives current word tokens as a batch.
2. It embeds the word tokens to `ATTENTION_DIM` dimension.
3. GRU layer keeps track of the word embeddings, and returns GRU outputs and
states.
4. Luong-style attention attends over the encoder's output feature by using
GRU outputs as a query.
5. The attention outputs and GRU outputs are added (skip connection), and
normalized in a layer normalization layer.
6. It generates logit predictions for the next token based on the normalized
output.

We can define all the steps in Keras Functional API, but please note that here
we instantiate layers that have trainable parameters so that we reuse the
layers and the weights in inference phase.
"""

word_input = Input(shape=(MAX_CAPTION_LEN,), name="words")
embed_x = Embedding(VOCAB_SIZE, ATTENTION_DIM)(word_input)

decoder_gru = GRU(
    ATTENTION_DIM,
    return_sequences=True,
    return_state=True,
)
gru_output, gru_state = decoder_gru(embed_x)

# Query: GRU outputs; value (and key): encoder image features
decoder_attention = Attention()
context_vector = decoder_attention([gru_output, encoder_output])

# Skip connection followed by layer normalization
addition = Add()([gru_output, context_vector])

layer_norm = LayerNormalization(axis=-1)
layer_norm_out = layer_norm(addition)

decoder_output_dense = Dense(VOCAB_SIZE)
decoder_output = decoder_output_dense(layer_norm_out)

decoder = tf.keras.Model(
    inputs=[word_input, encoder_output], outputs=decoder_output
)

tf.keras.utils.plot_model(decoder)

decoder.summary()

"""### Training Model

Now that we have defined the encoder and the decoder, let's combine them into an
image captioning model for training.
It has two inputs (`image_input` and `word_input`) and an output
(`decoder_output`). This definition should correspond to the definition of the
dataset pipeline.
"""

image_caption_train_model = tf.keras.Model(
    inputs=[image_input, word_input], outputs=decoder_output
)

"""### Loss Function

The loss function is a simple cross-entropy, but we need to remove padding
(`0`) when calculating it. 
So here we extract the length of the sentence (non-0 part), and compute the
average of the loss only over the valid sentence part.
"""

loss_object = [Link](
from_logits=True, reduction="none"
)

def loss_function(real, pred):

loss_ = loss_object(real, pred)

# returns 1 to word index and 0 to padding (e.g.

[1,1,1,1,1,0,0,0,0,...,0])
mask = [Link].logical_not([Link](real, 0))
mask = [Link](mask, dtype=tf.int32)
sentence_len = tf.reduce_sum(mask)
loss_ = loss_[:sentence_len]

return tf.reduce_mean(loss_, 1)

image_caption_train_model.compile(
    optimizer="adam",
    loss=loss_function,
)
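
"""The padding-aware averaging described in the loss section can be checked
by hand on a tiny example. The following is a framework-free sketch for a
single sentence, assuming padding is encoded as token id `0`; the helper
names (`token_cross_entropy`, `masked_sentence_loss`) and the toy values are
illustrative and not part of the original program.
"""

```python
import math


def token_cross_entropy(logits, target):
    """Cross-entropy of one token from raw logits (log-sum-exp form)."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]


def masked_sentence_loss(sentence_logits, targets, pad_id=0):
    """Mean cross-entropy over the non-padding positions of one sentence."""
    losses = [
        token_cross_entropy(logits, target)
        for logits, target in zip(sentence_logits, targets)
        if target != pad_id
    ]
    # Guard against a sentence that is entirely padding.
    return sum(losses) / len(losses) if losses else 0.0


# Toy sentence: two real tokens followed by two padding tokens.
uniform = [0.0, 0.0, 0.0, 0.0]  # uniform prediction over a 4-word vocabulary
toy_loss = masked_sentence_loss([uniform] * 4, [2, 1, 0, 0])
print(toy_loss)
```

"""Only the two real tokens contribute to the average, so the padded tail
neither dilutes nor inflates the loss of a short caption.
"""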

"""## Training loop

Now we can train the model using the standard `model.fit` API.
Training one epoch takes around 15-20 minutes on an NVIDIA T4 GPU.
"""

# Commented out IPython magic to ensure Python compatibility.

# %%time
# history = image_caption_train_model.fit(batched_ds, epochs=1)

"""## Caption!

The predict step is different from the training, since we need to keep track
of the GRU state during the caption generation, and pass a predicted word to
the decoder as an input at the next time step.

In order to do so, let's define another model for prediction while using the
trained weights, so that it can keep and update the GRU state during the
caption generation.
"""

gru_state_input = Input(shape=(ATTENTION_DIM,), name="gru_state_input")

# Reuse trained GRU, but update it so that it can receive states.

gru_output, gru_state = decoder_gru(embed_x, initial_state=gru_state_input)

# Reuse other layers as well

context_vector = decoder_attention([gru_output, encoder_output])
addition_output = Add()([gru_output, context_vector])
layer_norm_output = layer_norm(addition_output)

decoder_output = decoder_output_dense(layer_norm_output)

# Define prediction Model with state input and output

decoder_pred_model = tf.keras.Model(
    inputs=[word_input, gru_state_input, encoder_output],
    outputs=[decoder_output, gru_state],
)
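
"""Why the prediction model needs an explicit state input can be seen with a
toy recurrence. The sketch below is not a GRU: `toy_cell` is a hypothetical
scalar cell with made-up weights, used only to show that feeding one token
per call reproduces the full-sequence pass as long as the state returned by
each call is passed into the next one.
"""

```python
import math


def toy_cell(x, h, w_in=0.5, w_rec=0.8):
    """One step of a toy scalar recurrent cell (not a real GRU)."""
    return math.tanh(w_in * x + w_rec * h)


def run_full_sequence(inputs, h0=0.0):
    """Training-style pass: the whole sequence is fed in one call."""
    h, outputs = h0, []
    for x in inputs:
        h = toy_cell(x, h)
        outputs.append(h)
    return outputs, h


def run_step_by_step(inputs, h0=0.0):
    """Inference-style pass: one call per token, state passed explicitly."""
    state, outputs = h0, []
    for x in inputs:
        step_out, state = run_full_sequence([x], h0=state)
        outputs.append(step_out[0])
    return outputs, state


toy_inputs = [1.0, -0.5, 0.25]
full_outputs, full_state = run_full_sequence(toy_inputs)
step_outputs, step_state = run_step_by_step(toy_inputs)
print(full_outputs == step_outputs, full_state == step_state)
```

"""If the state were reset to zero on every call, each step would forget the
words generated so far, which is the situation `gru_state_input` avoids.
"""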

"""
1. Initialize the GRU states as zero vectors.
2. Preprocess an input image, pass it to the encoder, and extract image
   features.
3. Set up the `<start>` word token to start captioning.
4. In the for loop, we
   - pass word tokens (`dec_input`), GRU states (`gru_state`) and image
     features (`features`) to the prediction decoder and get predictions
     (`predictions`) and the updated GRU states.
   - select the Top-K words from the logits and choose a word
     probabilistically, so that we avoid computing a softmax over a
     VOCAB_SIZE-sized vector.
   - stop predicting when the model predicts the `<end>` token.
   - replace the input word token with the predicted word token for the
     next step."""

MINIMUM_SENTENCE_LENGTH = 5

## Probabilistic prediction using the trained model

def predict_caption(filename):
    gru_state = tf.zeros((1, ATTENTION_DIM))

    img = tf.io.decode_jpeg(tf.io.read_file(filename), channels=IMG_CHANNELS)
    img = tf.image.resize(img, (IMG_HEIGHT, IMG_WIDTH))
    img = img / 255

    features = encoder(tf.expand_dims(img, axis=0))

    dec_input = tf.expand_dims([word_to_index("<start>")], 1)
    result = []
    for i in range(MAX_CAPTION_LEN):
        predictions, gru_state = decoder_pred_model(
            [dec_input, gru_state, features]
        )

        # draws from log distribution given by predictions
        top_probs, top_idxs = tf.math.top_k(
            input=predictions[0][0], k=10, sorted=False
        )
        chosen_id = tf.random.categorical([top_probs], 1)[0].numpy()
        predicted_id = top_idxs.numpy()[chosen_id][0]

        result.append(tokenizer.get_vocabulary()[predicted_id])

        if predicted_id == word_to_index("<end>"):
            return img, result

        dec_input = tf.expand_dims([predicted_id], 1)

    return img, result
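
"""The Top-K sampling step used in the loop can be isolated in a few lines.
This is a framework-free sketch, assuming raw logits as input: the softmax is
taken over the `k` candidates only, never over the full vocabulary. The
function name `sample_top_k` and the toy logits are illustrative and not part
of the original program.
"""

```python
import math
import random


def sample_top_k(logits, k, rng):
    """Sample a token id from the k highest-scoring logits."""
    # Indices of the k largest logits.
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Unnormalized softmax weights restricted to the k candidates.
    peak = max(logits[i] for i in top_ids)
    weights = [math.exp(logits[i] - peak) for i in top_ids]
    return rng.choices(top_ids, weights=weights, k=1)[0]


toy_logits = [0.1, 2.0, -1.0, 1.5, 0.3]
rng = random.Random(0)
samples = [sample_top_k(toy_logits, k=2, rng=rng) for _ in range(20)]
print(samples)
```

"""With `k=1` this reduces to greedy decoding; larger `k` trades determinism
for more varied captions, which is why repeated calls on the same image can
return different sentences.
"""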

"""Let's caption!"""

filename ="/content/[Link]" # you can also try [Link]

for i in range(5):
    image, caption = predict_caption(filename)
    print(" ".join(caption[:-1]) + ".")

img = tf.io.decode_jpeg(tf.io.read_file(filename), channels=IMG_CHANNELS)

plt.imshow(img)
plt.axis("off")
