Please use this identifier to cite or link to this item:
http://dspace.dtu.ac.in:8080/jspui/handle/repository/22144
Title: PARAGRAPH IMAGE CAPTIONING USING DEEP LEARNING
Authors: GUPTA, SUYASH
Keywords: PARAGRAPH IMAGE CAPTIONING; DEEP LEARNING; FLICKR8K
Issue Date: May-2025
Series/Report no.: TD-8126;
Abstract: In recent years, automatic image captioning has attracted considerable interest for its potential to connect visual understanding with natural language generation. By merging recent advances in computer vision and natural language processing, image captioning systems aim to generate descriptive, contextually relevant sentences that reflect the content of an image. This interdisciplinary challenge underpins a range of applications, including assistance for the visually impaired, image indexing, social media content moderation, and improved human-computer interaction. This thesis presents a thorough comparative analysis of image captioning models evaluated on three popular datasets: Flickr8k, Flickr30k, and the Stanford Paragraph Captioning dataset. Each dataset poses its own challenges and linguistic structures: while Flickr8k and Flickr30k provide short, single-sentence captions per image, the Stanford Paragraph dataset contains paragraph-level annotations that demand deeper semantic understanding and continuity in language generation. We examine a range of state-of-the-art models and systematically compare their performance using standard evaluation metrics: BLEU-1, BLEU-2, BLEU-3, BLEU-4, and METEOR. These metrics quantify the quality of generated captions by comparing them against human-written references. Our analysis considers not only the final scores but also the training behavior of these models, charting trends in training and validation accuracy/loss over 50 epochs to give a well-rounded view of model convergence. In its final section, the thesis addresses several open challenges: the scarcity of paragraph-level training data, the risk of overfitting in smaller models, and the shortcomings of traditional n-gram metrics for assessing generative diversity and fluency.
By examining learning curves, score summaries, and example image-caption pairs, this thesis offers a deeper insight into what these models can do and where they may fall short.
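The BLEU-n metrics the abstract relies on can be illustrated with a minimal sketch: modified n-gram precision (with reference clipping) combined under uniform weights, multiplied by a brevity penalty. This is a simplified, stdlib-only illustration, not the thesis's actual evaluation code; the function name `bleu` and the example sentences are illustrative only.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, references, max_n=4):
    """Corpus-of-one BLEU sketch: clipped n-gram precision + brevity penalty.

    candidate: list of tokens; references: list of token lists.
    Uniform weights 1/max_n, as in standard BLEU-1..BLEU-4.
    """
    weights = [1.0 / max_n] * max_n
    log_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        # Clip each candidate n-gram count by its max count in any reference.
        max_ref = Counter()
        for ref in references:
            for g, c in Counter(ngrams(ref, n)).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        p = clipped / total
        if p == 0:
            return 0.0  # no smoothing in this sketch
        log_sum += weights[n - 1] * math.log(p)
    # Brevity penalty against the closest reference length.
    c_len = len(candidate)
    r_len = min((len(r) for r in references),
                key=lambda rl: (abs(rl - c_len), rl))
    bp = 1.0 if c_len > r_len else math.exp(1 - r_len / max(c_len, 1))
    return bp * math.exp(log_sum)
```

A perfect match scores 1.0, e.g. `bleu("a dog runs".split(), ["a dog runs".split()], max_n=1)`; with `max_n=1` the score reduces to BLEU-1 (unigram precision times the brevity penalty). The thesis's point about n-gram metrics missing generative diversity is visible here: only surface overlap is counted, so a fluent paraphrase with different wording scores poorly.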
URI: http://dspace.dtu.ac.in:8080/jspui/handle/repository/22144
Appears in Collections: M.E./M.Tech. Electronics & Communication Engineering
Files in This Item:
File | Description | Size | Format
---|---|---|---
SUYASH GUPTA M.Tech.pdf | | 1.5 MB | Adobe PDF
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.