Please use this identifier to cite or link to this item:
http://dspace.dtu.ac.in:8080/jspui/handle/repository/22144
Title: PARAGRAPH IMAGE CAPTIONING USING DEEP LEARNING
Authors: GUPTA, SUYASH
Keywords: PARAGRAPH IMAGE CAPTIONING; DEEP LEARNING; FLICKR8K
Issue Date: May-2025
Series/Report no.: TD-8126;
Abstract: In recent years, automatic image captioning has attracted considerable interest for its potential to connect visual understanding with natural language generation. By merging recent advances in computer vision and natural language processing, image captioning systems aim to generate descriptive, contextually relevant sentences that reflect the content of an image. This interdisciplinary challenge underpins a range of applications, including assistance for the visually impaired, image indexing, social media content moderation, and improved human-computer interaction. This thesis presents a thorough comparative analysis of image captioning models evaluated on three popular datasets: Flickr8k, Flickr30k, and the Stanford Paragraph Captioning dataset. Each dataset poses its own challenges and linguistic structures: while Flickr8k and Flickr30k provide short, single-sentence captions per image, the Stanford Paragraph dataset contains paragraph-level annotations that demand deeper semantic understanding and continuity in language generation. We examine a range of state-of-the-art models and systematically compare their performance using standard evaluation metrics: BLEU-1, BLEU-2, BLEU-3, BLEU-4, and METEOR. These metrics quantify the quality of generated captions by comparing them against human-written references. Our analysis considers not only the final scores but also the training behavior of these models, charting trends in training and validation accuracy/loss over 50 epochs to give a well-rounded view of model convergence. In its final section, the thesis addresses several open challenges: the scarcity of paragraph-level training data, the risk of overfitting in smaller models, and the shortcomings of traditional n-gram metrics for assessing generative diversity and fluency.
By examining learning curves, score summaries, and example image-caption pairs, this thesis offers a deeper insight into what these models can do and where they may fall short.
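The BLEU-n metrics the abstract relies on can be illustrated with a minimal sketch: modified n-gram precision (with reference clipping) combined under uniform weights, multiplied by a brevity penalty. This is a simplified, stdlib-only illustration, not the thesis's actual evaluation code; the function name `bleu` and the example sentences are illustrative only.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, references, max_n=4):
    """Corpus-of-one BLEU sketch: clipped n-gram precision + brevity penalty.

    candidate: list of tokens; references: list of token lists.
    Uniform weights 1/max_n, as in standard BLEU-1..BLEU-4.
    """
    weights = [1.0 / max_n] * max_n
    log_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        # Clip each candidate n-gram count by its max count in any reference.
        max_ref = Counter()
        for ref in references:
            for g, c in Counter(ngrams(ref, n)).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        p = clipped / total
        if p == 0:
            return 0.0  # no smoothing in this sketch
        log_sum += weights[n - 1] * math.log(p)
    # Brevity penalty against the closest reference length.
    c_len = len(candidate)
    r_len = min((len(r) for r in references),
                key=lambda rl: (abs(rl - c_len), rl))
    bp = 1.0 if c_len > r_len else math.exp(1 - r_len / max(c_len, 1))
    return bp * math.exp(log_sum)
```

A perfect match scores 1.0, e.g. `bleu("a dog runs".split(), ["a dog runs".split()], max_n=1)`; with `max_n=1` the score reduces to BLEU-1 (unigram precision times the brevity penalty). The thesis's point about n-gram metrics missing generative diversity is visible here: only surface overlap is counted, so a fluent paraphrase with different wording scores poorly.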
URI: http://dspace.dtu.ac.in:8080/jspui/handle/repository/22144
Appears in Collections: M.E./M.Tech. Electronics & Communication Engineering
Files in This Item:
File | Description | Size | Format
---|---|---|---
SUYASH GUPTA M.Tech.pdf | | 1.5 MB | Adobe PDF
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.