ADVANCING IMAGE CAPTIONING WITH TRANSFORMER BASED TECHNIQUES

RAWAT, AMAN

Please use this identifier to cite or link to this item: http://dspace.dtu.ac.in:8080/jspui/handle/repository/21763

Full metadata record

DC Field	Value	Language
dc.contributor.author	RAWAT, AMAN	-
dc.date.accessioned	2025-07-01T06:33:47Z	-
dc.date.available	2025-07-01T06:33:47Z	-
dc.date.issued	2025-05	-
dc.identifier.uri	http://dspace.dtu.ac.in:8080/jspui/handle/repository/21763	-
dc.description.abstract	Image captioning is the automatic generation of natural language descriptions for visual content, and it has seen major progress through the adoption of deep learning methods. This thesis critically examines the evolution of image captioning methods, with a particular focus on the transformative impact of Vision Transformers (ViTs). While conventional methods employing CNNs and RNNs provided the initial advances, they are generally poor at capturing global context and relationships across the entire image. Vision Transformers overcome this deficit by employing self-attention, which allows a thorough understanding of fine detail as well as the overall context of the image. This study compares ViT-based models with traditional techniques across a variety of architectures and benchmark datasets, particularly MS COCO. The findings indicate that ViT-based approaches significantly outperform conventional models in generating semantically rich and contextually accurate captions. Additionally, this thesis introduces a novel image captioning framework, ViBERT, which merges the advantages of both the Vision Transformer and Bidirectional Encoder Representations from Transformers (BERT) in an encoder-decoder architecture. Whereas traditional models often fail to capture long-range semantic dependencies and the global visual setting, ViBERT effectively leverages ViT’s visual attention and BERT’s deep contextual understanding to generate more robust and semantically correct descriptions. The performance of the proposed model is evaluated using standard performance measures.	en_US
dc.language.iso	en	en_US
dc.relation.ispartofseries	TD-8037;	-
dc.subject	ADVANCING IMAGE CAPTIONING	en_US
dc.subject	TRANSFORMER	en_US
dc.subject	VISION TRANSFORMER (ViTs)	en_US
dc.subject	BERT	en_US
dc.subject	ViBERT	en_US
dc.title	ADVANCING IMAGE CAPTIONING WITH TRANSFORMER BASED TECHNIQUES	en_US
dc.type	Thesis	en_US
Appears in Collections:	MTech Data Science

Files in This Item:

File	Description	Size	Format
AMAN RAWAT M.Tech..pdf		3.31 MB	Adobe PDF
