This example shows how to train a deep learning model for image captioning using attention.
Most pretrained deep learning networks are configured for single-label classification. For example, given an image of a typical office desk, the network might predict the single class "keyboard" or "mouse". In contrast, an image captioning model combines convolutional and recurrent operations to produce a textual description of what is in the image, rather than a single label.
The model trained in this example uses an encoder-decoder architecture. The encoder is a pretrained Inception-v3 network used as a feature extractor. The decoder is a recurrent neural network (RNN) that takes the extracted features as input and generates a caption. The decoder incorporates an
attention mechanism
that allows the decoder to focus on parts of the encoded input while generating the caption.
The encoder model is a pretrained Inception-v3 model that extracts features from the
"mixed10"
layer, followed by fully connected and ReLU operations.
The decoder model consists of a word embedding, an attention mechanism, a gated recurrent unit (GRU), and two fully connected operations.
Load Pretrained Network
Load a pretrained Inception-v3 network. This step requires the Deep Learning Toolbox™ Model
for Inception-v3 Network
support package. If you do not have the required support package installed, then the software provides a download link.
Remove the last three layers, leaving the
"mixed10"
layer as the last layer.
View the input layer of the network. The Inception-v3 network uses symmetric-rescale normalization with a minimum value of 0 and a maximum value of 255.
ans =
ImageInputLayer with properties:
Name: 'input_1'
InputSize: [299 299 3]
SplitComplexInputs: 0
Hyperparameters
DataAugmentation: 'none'
Normalization: 'rescale-symmetric'
NormalizationDimension: 'auto'
Max: 255
Min: 0
Custom training does not support this normalization, so you must disable normalization in the network and perform the normalization in the custom training loop instead. Save the minimum and maximum values as doubles in variables named
inputMin
and
inputMax
, respectively, and replace the input layer with an image input layer without normalization.
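The manual normalization can be sketched as follows. This is a minimal illustration that assumes the limits 0 and 255 shown in the layer display above and uses a random stand-in image; the base MATLAB `rescale` function maps the range [inputMin, inputMax] onto [-1, 1], which is what symmetric rescaling does.

```matlab
% Limits copied from the Min and Max properties of the original input layer.
inputMin = 0;
inputMax = 255;

% Stand-in for a resized RGB image (illustrative random data, not COCO data).
img = uint8(randi([0 255],[299 299 3]));

% Map [inputMin, inputMax] onto [-1, 1].
X = rescale(single(img),-1,1,"InputMin",inputMin,"InputMax",inputMax);
```

Inside a custom training loop, the same call would be applied to each mini-batch of images before passing it to the network.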
Initialize the network.
Determine the output size of the network. Use the
analyzeNetwork
function to see the activation sizes of the last layer.
Create a variable named
outputSizeNet
containing the network output size.
Import COCO Data Set
Download images and annotations from the data sets "2014 Train images" and "2014 Train/val annotations," respectively, from
https://cocodataset.org/#download
. Extract the images and annotations into a folder named
"coco"
. The COCO 2014 data set was collected by
Coco Consortium
.
Extract the captions from the file
"captions_train2014.json"
using the
jsondecode
function.
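As a rough sketch of the decoding step, the snippet below applies `jsondecode` to a small inline JSON string laid out like the annotations file. The values are made up for illustration; for the real file, the text would first be read with `fileread`, where `filename` is a hypothetical variable pointing at the extracted file.

```matlab
% Illustrative JSON with the same layout as the annotations file (made-up values).
str = ['{"annotations":[' ...
    '{"image_id":7,"id":1,"caption":"a cat on a sofa"},' ...
    '{"image_id":3,"id":2,"caption":"a dog in a field"},' ...
    '{"image_id":7,"id":3,"caption":"a sleeping cat"}]}'];

% For the real file, read the text first, for example:
%   str = fileread(filename);   % filename is a placeholder for the JSON file path
data = jsondecode(str);

% JSON objects with identical fields decode to a struct array.
numAnnotations = numel(data.annotations);
firstCaption = data.annotations(1).caption;
```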
data = struct with fields:
info: [1×1 struct]
images: [82783×1 struct]
licenses: [8×1 struct]
annotations: [414113×1 struct]
The
annotations
field of the struct contains the data required for image captioning.
ans=414113×1 struct array with fields:
image_id
caption
The data set contains multiple captions for each image. To ensure the same images do not appear in both training and validation sets, identify the unique images in the data set using the
unique
function by using the IDs in the
image_id
field of the annotations field of the data, then view the number of unique images.
numObservationsAll = 414113
Each image has at least five captions. Create a struct
annotationsAll
with these fields:
-
ImageID
— Image ID
-
Filename
— File name of the image
-
Captions
— String array of raw captions
-
CaptionIDs
— Vector of indices of the corresponding captions in
data.annotations
To make merging easier, sort the annotations by the image IDs.
Loop over the annotations and merge multiple annotations when necessary.
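One way to sketch the sort-and-merge step in base MATLAB is shown below. The small `annotations` struct array is made-up stand-in data, and the file name pattern follows the COCO convention of a 12-digit zero-padded image ID; treat the details as assumptions rather than the example's exact code.

```matlab
% Stand-in for data.annotations (illustrative values only).
annotations = struct( ...
    'image_id',{7; 3; 7; 3; 9}, ...
    'caption',{'a cat'; 'a dog'; 'a sleeping cat'; 'a brown dog'; 'a bus'});

% Sort by image ID so that captions of the same image are adjacent.
imageIDs = [annotations.image_id]';
[sortedIDs,order] = sort(imageIDs);
[uniqueIDs,~,group] = unique(sortedIDs);

% Merge all captions that share an image ID into one entry.
annotationsAll = struct('ImageID',{},'Filename',{},'Captions',{},'CaptionIDs',{});
for i = 1:numel(uniqueIDs)
    idx = order(group == i);   % indices into the original annotations
    annotationsAll(i).ImageID = uniqueIDs(i);
    annotationsAll(i).Filename = "COCO_train2014_" + compose("%012d",uniqueIDs(i)) + ".jpg";
    annotationsAll(i).Captions = string({annotations(idx).caption})';
    annotationsAll(i).CaptionIDs = idx;
end
```

After the loop, `annotationsAll` has one element per unique image, and `CaptionIDs` points back to the rows of the original annotations.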
Partition the data into training and test sets. Hold out 5% of the observations for testing.
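A hold-out split can be sketched with `randperm`, as below. This is an assumption-laden illustration, not necessarily the partitioning function the example uses; rounding the hold-out count down gives set sizes that agree with the datastore displays later in the example.

```matlab
numObservations = 82783;   % number of images in the decoded data
holdOutFraction = 0.05;

% Randomly order the images, then cut off the first 5% for testing.
idxShuffle = randperm(numObservations);
numTest = floor(holdOutFraction*numObservations);

idxTest = sort(idxShuffle(1:numTest));
idxTrain = sort(idxShuffle(numTest+1:end));
```

Because the split is made over images rather than captions, all captions of a given image end up in the same set.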
Each element of the data.annotations struct array contains three fields:
-
id
— Unique identifier for the caption
-
caption
— Image caption, specified as a character vector
-
image_id
— Unique identifier of the image corresponding to the caption
To view the image and the corresponding caption, locate the image file with file name
"train2014\COCO_train2014_XXXXXXXXXXXX.jpg"
, where
"XXXXXXXXXXXX"
corresponds to the image ID left-padded with zeros to have length 12.
To view the image, use the
imread
and
imshow
functions.
img = imread(filename);
figure
imshow(img)
title(captions)
Prepare Data for Training
Prepare the captions for training and testing. Extract the text from the
Captions
field of the struct containing both the training and test data (
annotationsAll
), erase the punctuation, and convert the text to lowercase.
In order to generate captions, the RNN decoder requires special start and stop tokens to indicate when to start and stop generating text, respectively. Add the custom tokens
"<start>"
and
"<stop>"
to the beginnings and ends of the captions, respectively.
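The cleaning steps can be sketched in base MATLAB as follows. The two captions are made up, and `regexprep` stands in for a dedicated punctuation-removal function; the spaces around the tokens are an assumption that makes the result easy to split on whitespace.

```matlab
% Illustrative raw captions (not taken from the data set).
captions = ["A man riding a wave on top of a surfboard."; ...
            "Two dogs, playing in the snow!"];

% Remove everything that is not a word character or whitespace, then lowercase.
captions = regexprep(captions,"[^\w\s]","");
captions = lower(captions);

% Add the start and stop tokens.
captions = "<start> " + captions + " <stop>";
```

The punctuation must be removed before the tokens are added; otherwise the angle brackets of the tokens would be stripped as well.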
Tokenize the captions using the
tokenizedDocument
function and specify the start and stop tokens using the
CustomTokens
option.
Create a
wordEncoding
object that maps words to numeric indices and back. Reduce the memory requirements by specifying a vocabulary size of 5000 corresponding to the most frequently observed words in the training data. To avoid bias, use only the documents corresponding to the training set.
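To illustrate what a frequency-limited vocabulary does, here is a base MATLAB sketch that counts words, keeps the most frequent ones, and maps a document to indices. It is a conceptual stand-in and not the `wordEncoding` implementation; the documents and the tiny vocabulary limit are made up.

```matlab
% Illustrative training documents with start and stop tokens.
documents = ["<start> a dog on a field <stop>"; ...
             "<start> a cat on a mat <stop>"];

% Count how often each word occurs across all training documents.
words = split(join(documents," "));
[vocab,~,ic] = unique(words);
counts = accumarray(ic,1);

% Keep only the most frequent words (5000 in the example, 4 in this sketch).
maxVocabularySize = 4;
[~,order] = sort(counts,"descend");
vocab = vocab(order(1:min(maxVocabularySize,numel(vocab))));

% Map the words of one document to indices; 0 marks out-of-vocabulary words.
[~,idx] = ismember(split(documents(1)),vocab);
```

Building the vocabulary from the training documents only keeps word statistics of the held-out data from leaking into the model.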
Create an augmented image datastore containing the images corresponding to the captions. Set the output size to match the input size of the convolutional network. To keep the images synchronized with the captions, specify a table of file names for the datastore by reconstructing the file names using the image ID. To return grayscale images as 3-channel RGB images, set the
ColorPreprocessing
option to
"gray2rgb"
.
augimdsTrain =
augmentedImageDatastore with properties:
NumObservations: 78644
MiniBatchSize: 1
DataAugmentation: 'none'
ColorPreprocessing: 'gray2rgb'
OutputSize: [299 299]
OutputSizeMode: 'resize'
DispatchInBackground: 0
Initialize Model Parameters
Initialize the model parameters. Specify 512 hidden units with a word embedding dimension of 256.
Initialize a struct containing the parameters for the encoder model.
-
Initialize the weights of the fully connected operations using the Glorot initializer, specified by the
initializeGlorot
function, listed at the end of the example. Specify the output size to match the embedding dimension of the decoder (256) and an input size to match the number of output channels of the pretrained network. The
'mixed10'
layer of the Inception-v3 network outputs data with 2048 channels.
Initialize a struct containing parameters for the decoder model.
-
Initialize the word embedding weights with the size given by the embedding dimension and the vocabulary size plus one, where the extra entry corresponds to the padding value.
-
Initialize the weights and biases for the Bahdanau attention mechanism with sizes corresponding to the number of hidden units of the GRU operation.
-
Initialize the weights and bias of the GRU operation.
-
Initialize the weights and biases of two fully connected operations.
For the decoder model parameters, initialize each of the weights and biases with the Glorot initializer and zeros, respectively.
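A partial sketch of the initialization is shown below. It writes the Glorot uniform initializer as an anonymous function and sets up only the encoder weights and the decoder word embedding, whose sizes are stated above. The struct field names (`fc`, `emb`) and the vocabulary size of 5000 are assumptions for illustration.

```matlab
% Glorot (Xavier) uniform initializer: samples from U(-b, b) with
% b = sqrt(6/(numIn + numOut)).
initializeGlorot = @(numOut,numIn) ...
    sqrt(6/(numIn + numOut)) * (2*rand([numOut numIn],"single") - 1);

embeddingDimension = 256;
numFeatureChannels = 2048;   % channels output by the "mixed10" layer
vocabularySize = 5000;       % assumed; use the size of the word encoding in practice

% Encoder: fully connected operation from feature channels to the embedding dimension.
parametersEncoder.fc.Weights = initializeGlorot(embeddingDimension,numFeatureChannels);
parametersEncoder.fc.Bias = zeros([embeddingDimension 1],"single");

% Decoder word embedding: one column per word plus one for the padding value.
parametersDecoder.emb.Weights = initializeGlorot(embeddingDimension,vocabularySize + 1);
```

The attention, GRU, and fully connected parameters of the decoder would be created the same way, with sizes determined by the number of hidden units.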
Define Model Functions
Create the functions
modelEncoder
and
modelDecoder
, listed at the end of the example, which compute the outputs of the encoder and decoder models, respectively.
The
modelEncoder
function, listed in the
Encoder Model Function
section of the example, takes as input an array of activations
X
from the output of the pretrained network and passes it through a fully connected operation and a ReLU operation. Because the pretrained network does not need to be traced for automatic differentiation, extracting the features outside the encoder model function is more computationally efficient.
The
modelDecoder
function, listed in the
Decoder Model Function
section of the example, takes as input a single input time-step corresponding to an input word, the decoder model parameters, the features from the encoder, and the network state, and returns the predictions for the next time step, the updated network state, and the attention weights.
Specify Training Options
Specify the options for training. Train for 30 epochs with a mini-batch size of 128 and display the training progress in a plot.
Train on a GPU if one is available. Using a GPU requires Parallel Computing Toolbox™ and a supported GPU device. For information on supported devices, see
GPU Computing Requirements
(Parallel Computing Toolbox)
.
Check whether a GPU is available for training.
NVIDIA RTX A5000 GPU detected and available for training.
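A minimal way to record the outcome of this check for later use might look like the following. The variable name `executionEnvironment` is an assumption; `canUseGPU` returns false when no supported GPU is available.

```matlab
% Choose where to run the computations based on GPU availability.
if canUseGPU
    executionEnvironment = "gpu";
else
    executionEnvironment = "cpu";
end
```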
Train Network
Train the network using a custom training loop.
At the beginning of each epoch, shuffle the input data. To keep the images in the augmented image datastore and the captions synchronized, create an array of shuffled indices that indexes into both data sets.
For each mini-batch:
-
Rescale the images to the size that the pretrained network expects.
-
For each image, select a random caption.
-
Convert the captions to sequences of word indices. Specify right-padding of the sequences with the padding value corresponding to the index of the padding token.
-
Convert the data to
dlarray
objects. For the images, specify dimension labels
"SSCB"
(spatial, spatial, channel, batch).
-
For GPU training, convert the data to
gpuArray
objects.
-
Extract the image features using the pretrained network and reshape them to the size the encoder expects.
-
Evaluate the model loss and gradients using the
dlfeval
and
modelLoss
functions.
-
Update the encoder and decoder model parameters using the
adamupdate
function.
-
Display the training progress in a plot.
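The index bookkeeping and padding in the steps above can be sketched in base MATLAB as follows. The image reading, feature extraction, and parameter updates are omitted, and the caption sequences and the padding index (vocabulary size plus one) are illustrative assumptions.

```matlab
miniBatchSize = 128;
numObservationsTrain = 78644;
numIterationsPerEpoch = floor(numObservationsTrain/miniBatchSize);

% Shuffle once per epoch and reuse the same indices for images and captions.
idxShuffle = randperm(numObservationsTrain);
for iteration = 1:numIterationsPerEpoch
    idx = (iteration-1)*miniBatchSize+1 : iteration*miniBatchSize;
    idxMiniBatch = idxShuffle(idx);
    % ... read the images and captions with indices idxMiniBatch here ...
end

% Stand-in captions as sequences of word indices (illustrative values).
sequences = {[1 5 9 2], [1 7 2], [1 4 8 6 3 2]};
paddingValue = 5001;   % assumed: vocabulary size plus one

% Right-pad the sequences to a common length.
maxLength = max(cellfun(@numel,sequences));
T = repmat(paddingValue,numel(sequences),maxLength);
for n = 1:numel(sequences)
    T(n,1:numel(sequences{n})) = sequences{n};
end
mask = T ~= paddingValue;   % true where a real word is present
```

The mask lets the loss function ignore the padded positions.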
Initialize the parameters for the Adam optimizer.
Initialize the
TrainingProgressMonitor
object. Because the timer starts when you create the monitor object, make sure that you create the object close to the training loop.
Train the model.
Predict New Captions
The caption generation process is different from the process for training. During training, at each time step, the decoder uses the true value of the previous time step as input. This is known as "teacher forcing". When making predictions on new data, the decoder uses the previous predicted values instead of the true values.
Predicting the most likely word for each step in the sequence can lead to suboptimal results. For example, if the decoder predicts that the first word of a caption is "a" when given an image of an elephant, then the probability of predicting "elephant" as the next word becomes very low, because the phrase "a elephant" almost never appears in English text.
To address this issue, you can use the beam search algorithm: instead of taking the most likely prediction for each step in the sequence, take the top
k
predictions (the beam index) and for each following step, keep the top
k
predicted sequences so far according to the overall score.
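The difference between greedy decoding and beam search can be illustrated with a toy model. In the sketch below, a hand-made transition table `T` stands in for the decoder (it is not the model from this example), and all probabilities are invented so that the greedy choice "a" leads to a worse overall sequence than "an elephant".

```matlab
vocab = ["<start>" "a" "an" "elephant" "dog" "<stop>"];
% T(i,j): probability that word j follows word i (made-up values).
T = [0 0.6 0.4 0    0    0
     0 0   0   0.05 0.55 0.4
     0 0   0   0.95 0    0.05
     0 0   0   0    0    1
     0 0   0   0    0    1
     0 0   0   0    0    1];
logT = log(T);   % log(0) = -Inf rules out impossible transitions
stopIdx = 6;

% Greedy decoding: always take the single most likely next word.
w = 1;
greedy = strings(1,0);
while true
    [~,w] = max(T(w,:));
    if w == stopIdx, break, end
    greedy(end+1) = vocab(w);
end

% Beam search: keep the beamIndex best partial sequences at every step.
beamIndex = 2;
maxLength = 5;
beams = 1;     % each row is one candidate sequence of word indices
scores = 0;    % log-probability of each candidate
for t = 1:maxLength
    candidates = zeros(0,t+1);
    candScores = zeros(0,1);
    for b = 1:size(beams,1)
        last = beams(b,end);
        if last == stopIdx
            % Finished sequences are carried forward unchanged.
            candidates(end+1,:) = [beams(b,:) stopIdx];
            candScores(end+1,1) = scores(b);
            continue
        end
        for next = 1:numel(vocab)
            candidates(end+1,:) = [beams(b,:) next];
            candScores(end+1,1) = scores(b) + logT(last,next);
        end
    end
    [candScores,order] = sort(candScores,"descend");
    keep = order(1:min(beamIndex,numel(order)));
    beams = candidates(keep,:);
    scores = candScores(1:numel(keep));
    if all(beams(:,end) == stopIdx), break, end
end
best = beams(1,:);
caption = join(vocab(best(best ~= 1 & best ~= stopIdx))," ");
```

With these values, greedy decoding yields "a dog" (probability 0.33), while beam search with a beam index of 2 recovers "an elephant" (probability 0.38).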
Generate a caption for a new image by extracting the image features, inputting them into the encoder, and then using the
beamSearch
function, listed in the
Beam Search Function
section of the example.
caption =
"a small white dog standing on a lush green grass covered field"
Display the image with the caption.
Predict Captions for Data Set
To predict captions for a collection of images, loop over mini-batches of data in the datastore and extract the features from the images using the
extractImageFeatures
function. Then, loop over the images in the mini-batch and generate captions using the
beamSearch
function.
Create an augmented image datastore and set the output size to match the input size of the convolutional network. To output grayscale images as 3-channel RGB images, set the
ColorPreprocessing
option to
"gray2rgb"
.
augimdsTest =
augmentedImageDatastore with properties:
NumObservations: 4139
MiniBatchSize: 1
DataAugmentation: 'none'
ColorPreprocessing: 'gray2rgb'
OutputSize: [299 299]
OutputSizeMode: 'resize'
DispatchInBackground: 0
Generate captions for the test data. Predicting captions on a large data set can take some time. If you have Parallel Computing Toolbox™, then you can make predictions in parallel by generating captions inside a
parfor
loop. If you do not have Parallel Computing Toolbox, then the
parfor
loop runs in serial.
To view a test image with the corresponding caption, use the
imshow
function and set the title to the predicted caption.
idx = 1;
tbl = readByIndex(augimdsTest,idx);
img = tbl.input{1};
figure
imshow(img)
title(captionsTestPred(idx))
Evaluate Model Accuracy
To evaluate the accuracy of the captions using the BLEU score, calculate the BLEU score for each caption (the candidate) against the corresponding captions in the test set (the references) using the
bleuEvaluationScore
function. Using the
bleuEvaluationScore
function, you can compare a single candidate document to multiple reference documents.
The
bleuEvaluationScore
function, by default, scores similarity using n-grams of length one through four. Because the captions are short, this default can produce uninformative results, with most scores close to zero. Restrict the n-gram lengths to one and two by setting the
NgramWeights
option to a two-element vector with equal weights.
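For reference, the standard BLEU definition with n-gram weights is given below, where $p_n$ is the modified n-gram precision, $w_n$ is the weight for n-grams of length $n$, $c$ is the candidate length, and $r$ is the reference length. Whether the function follows this definition in every detail (for example, how $r$ is chosen among multiple references) is an assumption here.

```latex
\mathrm{BLEU} = \mathrm{BP}\cdot\exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right),
\qquad
\mathrm{BP} =
\begin{cases}
1, & c > r \\
e^{\,1 - r/c}, & c \le r
\end{cases}
```

With two equal weights, $N = 2$ and $w_1 = w_2 = 0.5$, the score reduces to $\mathrm{BP}\cdot\sqrt{p_1 p_2}$, so a caption with no matching 3-grams or 4-grams is no longer driven to zero.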
View the mean BLEU score.
Visualize the scores in a histogram.
Attention Function
The
attention
function calculates the context vector and the attention weights using Bahdanau attention.
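The core computation of additive (Bahdanau) attention for a single observation can be sketched with plain arrays, as below. Biases are omitted, the weights are random stand-ins, and the 64 feature locations assume an 8-by-8 feature map; the listed function in the example may differ in these details.

```matlab
embeddingDimension = 256;
numHiddenUnits = 512;
numLocations = 64;   % assumed: an 8-by-8 feature map flattened to 64 locations

features = randn(embeddingDimension,numLocations,"single");   % encoder output
hidden = randn(numHiddenUnits,1,"single");                    % decoder state

% Random stand-ins for the learned attention weights.
W1 = 0.01*randn(numHiddenUnits,embeddingDimension,"single");
W2 = 0.01*randn(numHiddenUnits,numHiddenUnits,"single");
V  = 0.01*randn(1,numHiddenUnits,"single");

% Additive score for each location (the hidden term expands across columns).
score = V*tanh(W1*features + W2*hidden);   % 1-by-numLocations

% Softmax over locations, shifted by the maximum for numerical stability.
attentionWeights = exp(score - max(score));
attentionWeights = attentionWeights/sum(attentionWeights);

% Context vector: attention-weighted sum of the location features.
contextVector = features*attentionWeights';   % embeddingDimension-by-1
```

The attention weights form a distribution over image locations, which is what makes it possible to visualize where the decoder "looks" for each generated word.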
Embedding Function
The
embedding
function maps an array of indices to a sequence of embedding vectors.
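Conceptually, the embedding is a column lookup into a weight matrix. The sketch below uses small made-up sizes and assumes the indices are arranged as observations by time steps.

```matlab
embeddingDimension = 4;
vocabularySize = 6;
weights = rand(embeddingDimension,vocabularySize + 1);   % last column is for padding

% Word indices for 2 sequences and 3 time steps (illustrative values).
X = [1 3 2; 1 5 7];
[N,T] = size(X);

% Look up one embedding vector per index and restore the sequence layout.
Z = reshape(weights(:,X),embeddingDimension,N,T);
```

Here `Z(:,n,t)` is the embedding vector of the word at time step `t` of sequence `n`.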
Feature Extraction Function
The
extractImageFeatures
function takes as input a trained
dlnetwork
object, an input image, statistics for image rescaling, and the execution environment, and returns a
dlarray
containing the features extracted from the pretrained network.
Batch Creation Function
The
createBatch
function takes as input a mini-batch of data, tokenized captions, a pretrained network, statistics for image rescaling, a word encoding, and the execution environment, and returns a mini-batch of data corresponding to the extracted image features and captions for training.
Encoder Model Function
The
modelEncoder
function takes as input an array of activations
X
and passes it through a fully connected operation and a ReLU operation. To apply the fully connected operation across the channel dimension only, flatten the other dimensions into a single dimension and specify this dimension as the batch dimension using the
DataFormat
option of the
fullyconnect
function.
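The flattening trick can be illustrated with plain arrays instead of `dlarray` objects. In this sketch, the feature array is assumed to be arranged as channels by locations by observations, and the weights are random stand-ins.

```matlab
numChannels = 2048;
embeddingDimension = 256;
numLocations = 64;    % assumed sizes for illustration
miniBatchSize = 2;

features = rand(numChannels,numLocations,miniBatchSize,"single");
weights = 0.01*randn(embeddingDimension,numChannels,"single");
bias = zeros(embeddingDimension,1,"single");

% Treat every location of every observation as a separate "batch" column.
X = reshape(features,numChannels,[]);

% Fully connected operation over the channels, followed by ReLU.
Z = max(weights*X + bias,0);

% Restore the location and observation dimensions.
Z = reshape(Z,embeddingDimension,numLocations,miniBatchSize);
```

Each location is transformed by the same weights, so the operation mixes channels but never mixes information between locations.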
Decoder Model Function
The
modelDecoder
function takes as input a single time-step
X
, the decoder model parameters, the features from the encoder, and the network state, and returns the predictions for the next time step, the updated network state, and the attention weights.
Model Loss
The
modelLoss
function takes as input the encoder and decoder parameters, the encoder features
X
, and the target caption
T
, and returns the loss, the gradients of the loss with respect to the encoder and decoder parameters, and the predictions.
Sparse Cross Entropy and Softmax Loss Function
The
sparseCrossEntropyAndSoftmax
function takes as input the predictions
Y
, corresponding targets
T
, and sequence padding mask, and applies the
softmax
function and returns the cross-entropy loss.
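A masked softmax cross-entropy for one time step can be sketched with plain arrays as follows. The scores, targets, and mask are made up, and dividing by the mini-batch size is an assumed normalization.

```matlab
numClasses = 4;
Y = zeros(numClasses,3);   % raw scores for 3 observations at one time step
T = [2 4 1];               % target word indices
mask = [1 1 0];            % the third entry is padding

% Softmax over the class dimension.
P = exp(Y - max(Y,[],1));
P = P./sum(P,1);

% Probability assigned to each target word.
idx = sub2ind(size(P),T,1:numel(T));
pTarget = max(P(idx),1e-8);   % bound away from zero before taking the log

% Masked cross-entropy, averaged over the mini-batch size.
loss = -sum(mask.*log(pTarget))/numel(T);
```

With uniform scores, every target has probability 1/4, so the two unmasked entries give a loss of 2*log(4)/3.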
Beam Search Function
The
beamSearch
function takes as input the image features
X
, a beam index, the parameters for the encoder and decoder networks, a word encoding, and a maximum sequence length, and returns the caption words for the image using the beam search algorithm.
Glorot Weight Initialization Function
The
initializeGlorot
function generates an array of weights according to Glorot initialization.