ADVANCING IMAGE CAPTIONING WITH TRANSFORMER BASED TECHNIQUES

RAWAT, AMAN

Please use this identifier to cite or link to this item: http://dspace.dtu.ac.in:8080/jspui/handle/repository/21763

Title:	ADVANCING IMAGE CAPTIONING WITH TRANSFORMER BASED TECHNIQUES
Authors:	RAWAT, AMAN
Keywords:	ADVANCING IMAGE CAPTIONING; TRANSFORMER; VISION TRANSFORMER (ViTs); BERT; ViBERT
Issue Date:	May-2025
Series/Report no.:	TD-8037;
Abstract:	Image captioning is the automatic generation of natural language descriptions for visual content, and it has seen major progress through the adoption of deep learning methods. This thesis critically examines the evolution of image captioning methods, with a particular focus on the impact of Vision Transformers (ViTs). While conventional methods employing CNNs and RNNs provided the initial advances, they are generally poor at capturing global context and relationships across the entire image. Vision Transformers overcome this deficit by employing self-attention, which allows a thorough understanding of both the fine detail and the overall context of the image. This study compares ViT-based models with traditional techniques across a variety of architectures and benchmark datasets, particularly MS COCO. The findings indicate that ViT-based approaches significantly outperform conventional models in generating semantically rich and contextually accurate captions. Additionally, this thesis introduces a novel image captioning framework, ViBERT, which merges the advantages of the Vision Transformer and Bidirectional Encoder Representations from Transformers (BERT) in an encoder-decoder architecture. Whereas traditional models often fail to capture long-range semantic dependencies and the global visual setting, ViBERT leverages ViT’s visual attention and BERT’s deep contextual understanding to generate more robust and semantically correct descriptions. The performance of the proposed model is evaluated using standard performance measures.
URI:	http://dspace.dtu.ac.in:8080/jspui/handle/repository/21763
Appears in Collections:	MTech Data Science

Files in This Item:

File	Description	Size	Format
AMAN RAWAT M.Tech..pdf		3.31 MB	Adobe PDF
