Please use this identifier to cite or link to this item: http://dspace.dtu.ac.in:8080/jspui/handle/repository/21763
Full metadata record
DC Field | Value | Language
dc.contributor.author | RAWAT, AMAN | -
dc.date.accessioned | 2025-07-01T06:33:47Z | -
dc.date.available | 2025-07-01T06:33:47Z | -
dc.date.issued | 2025-05 | -
dc.identifier.uri | http://dspace.dtu.ac.in:8080/jspui/handle/repository/21763 | -
dc.description.abstract | Image captioning is the automatic generation of natural language descriptions for visual content, and it has seen major progress through the adoption of deep learning methods. This thesis critically explores the evolution of image captioning methods, with a particular focus on the transformative impact of Vision Transformers (ViTs). While conventional methods employing CNNs and RNNs provided the initial advances, they are generally poor at understanding global context and relationships across the entire image. Vision Transformers overcome this deficit by employing self-attention, which allows a thorough understanding of fine detail as well as the overall context of the image. This study compares ViT-based models with traditional techniques across a variety of architectures and benchmark datasets, particularly MS COCO. The findings indicate that ViT-based approaches significantly outperform conventional models in generating semantically rich and contextually accurate captions. Additionally, this thesis introduces a novel image captioning framework, ViBERT, which merges the advantages of the Vision Transformer and Bidirectional Encoder Representations from Transformers (BERT) in an encoder-decoder architecture. Whereas traditional models often fail to capture long-range semantic dependencies and the global visual setting, ViBERT leverages ViT’s visual attention and BERT’s deep contextual understanding to generate more robust and semantically correct descriptions. The performance of the proposed model is evaluated using standard performance measures. | en_US
dc.language.iso | en | en_US
dc.relation.ispartofseries | TD-8037; | -
dc.subject | ADVANCING IMAGE CAPTIONING | en_US
dc.subject | TRANSFORMER | en_US
dc.subject | VISION TRANSFORMER (ViTs) | en_US
dc.subject | BERT | en_US
dc.subject | ViBERT | en_US
dc.title | ADVANCING IMAGE CAPTIONING WITH TRANSFORMER BASED TECHNIQUES | en_US
dc.type | Thesis | en_US
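
The abstract credits self-attention with giving Vision Transformers access to the global context of an image. The following is a minimal sketch of that idea only, not the thesis's ViBERT implementation: it assumes a single attention head with identity query/key/value projections (real ViTs use learned projections, multiple heads, and positional embeddings), and the patch embeddings and function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(patches):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the patch embeddings
    themselves (identity projections), to keep the sketch short.
    Returns (outputs, weights), where weights[i][j] is how strongly
    patch i attends to patch j.
    """
    dim = len(patches[0])
    scale = math.sqrt(dim)
    weights = []
    for query in patches:
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale
            for key in patches
        ]
        weights.append(softmax(scores))
    # Each output is a weighted mix of ALL patch embeddings.
    outputs = [
        [sum(w * value[j] for w, value in zip(row, patches)) for j in range(dim)]
        for row in weights
    ]
    return outputs, weights


# Hypothetical embeddings for a 2x2 grid of image patches.
patches = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
outputs, weights = self_attention(patches)
```

Every row of `weights` assigns a nonzero weight to every patch, so each output vector mixes information from the whole image in a single layer. A convolutional layer, by contrast, only sees a local neighborhood at each position.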
Appears in Collections: MTech Data Science

Files in This Item:
File | Description | Size | Format
AMAN RAWAT M.Tech..pdf | | 3.31 MB | Adobe PDF