Term Project Proposal — Intro to Deep Learning

Course: Intro to Deep Learning (Spring 2026)
Instructor: Heeyoul “Henry” Choi
Student(s)

Jungwook Kim (22000168) — Handong Global University

Title

Reproducing Show, Attend and Tell: Soft-Attention Image Captioning on Flickr8k

Summary

People effortlessly describe a photograph in natural language, but teaching a machine to do the same requires solving three problems at once: extracting visual structure, generating fluent text, and aligning the two modalities word by word. Image captioning sits at this intersection and remains a foundational testbed for multimodal learning.

This project reproduces Show, Attend and Tell (Xu et al., 2015), the encoder–decoder architecture that introduced visual soft attention. A pre-trained ResNet-101 encoder converts each image into a 14×14 grid of feature vectors, and an LSTM decoder generates one word at a time. At every step, an attention module computes a weighted sum over the grid, letting the decoder focus on different regions for different words.
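The soft-attention step described above can be sketched as follows. This is a minimal illustration in PyTorch, not the project's actual implementation; the dimensions (2048-d ResNet features, 512-d LSTM state) and the additive scoring form are assumptions chosen for concreteness.

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """Additive soft attention over a grid of image feature vectors.

    Dimensions are illustrative: feat_dim=2048 for ResNet-101 features,
    hidden_dim=512 for the LSTM decoder state.
    """
    def __init__(self, feat_dim=2048, hidden_dim=512, attn_dim=512):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)   # project image features
        self.hid_proj = nn.Linear(hidden_dim, attn_dim)  # project decoder state
        self.score = nn.Linear(attn_dim, 1)              # scalar score per region

    def forward(self, feats, hidden):
        # feats:  (batch, num_regions, feat_dim), e.g. num_regions = 14*14 = 196
        # hidden: (batch, hidden_dim), the decoder's current hidden state
        e = self.score(torch.tanh(self.feat_proj(feats)
                                  + self.hid_proj(hidden).unsqueeze(1)))
        alpha = torch.softmax(e.squeeze(-1), dim=1)          # attention weights
        context = (alpha.unsqueeze(-1) * feats).sum(dim=1)   # weighted sum of regions
        return context, alpha
```

At each decoding step the decoder consumes `context` alongside the previous word embedding, and the weights `alpha` are exactly what the attention heatmaps visualize.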

Experiments are run on Flickr8k (roughly 8,000 images, each paired with five reference captions). The encoder is frozen; only the attention module, decoder, and word embeddings are trained. Beam search is used at inference time. The report will quantify performance with BLEU-1 through BLEU-4 and analyze attention heatmaps qualitatively, highlighting the alignment between generated words and image regions. A beam-size ablation is included.
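Beam search generalizes greedy decoding by keeping the top-k highest-scoring partial captions at each step. A minimal sketch, independent of any particular decoder: `step_fn` is a hypothetical stand-in for the LSTM decoder's next-word log-probabilities.

```python
import math

def beam_search(step_fn, start_token, end_token, beam_size=3, max_len=20):
    """Generic beam search over a next-token scoring function.

    step_fn(seq) -> list of (token, log_prob) candidates for the next token.
    In the project, step_fn would wrap the attention LSTM decoder.
    """
    beams = [([start_token], 0.0)]  # (sequence, cumulative log-probability)
    completed = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_token:          # finished caption: set aside
                completed.append((seq, score))
                continue
            for tok, logp in step_fn(seq):
                candidates.append((seq + [tok], score + logp))
        if not candidates:
            break
        # keep only the beam_size best partial captions
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    completed.extend(b for b in beams if b[0][-1] == end_token)
    if not completed:
        completed = beams
    return max(completed, key=lambda c: c[1])[0]
```

With beam_size=1 this reduces to greedy decoding, which is what the beam-size ablation will compare against.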

This work integrates CNN feature extraction (Ch. 9), LSTM sequence modeling (Ch. 10), and the attention mechanism foundational to modern multimodal LLMs such as GPT-4V and Gemini Vision.