Please use this identifier to cite or link to this item: http://dspace.dtu.ac.in:8080/jspui/handle/repository/21822
Full metadata record
DC Field | Value | Language
dc.contributor.author | MISHRA, ANKITA | -
dc.date.accessioned | 2025-07-08T08:45:56Z | -
dc.date.available | 2025-07-08T08:45:56Z | -
dc.date.issued | 2025-05 | -
dc.identifier.uri | http://dspace.dtu.ac.in:8080/jspui/handle/repository/21822 | -
dc.description.abstract | Image captioning is an important task that combines computer vision and natural language processing to generate meaningful, coherent descriptions of images. Earlier approaches based on convolutional and recurrent neural networks have shown strong results but struggle with complex scenes and long-range dependencies. To overcome these challenges, we propose the Multi-Stage Vision-Language Transformer (MVLT), which combines state-of-the-art deep learning architectures for improved image captioning. The model uses ViT-G and CLIP to extract high-resolution visual features, a Flamingo-style Perceiver Resampler for efficient vision-language fusion, and LLaVA (Large Language and Vision Assistant) for context-aware caption generation. The model is trained on the MS COCO and Conceptual Captions datasets and evaluated on Flickr30k and Visual Genome, showing promising performance across multiple benchmarks. The proposed MVLT model outperforms previous state-of-the-art models on BLEU, CIDEr, and METEOR scores and generates more accurate, relevant, coherent, and semantically rich captions. This work lays a foundation for advanced vision-language understanding, with potential applications in assistive technology, content creation, and AI-driven media annotation. | en_US
dc.language.iso | en | en_US
dc.relation.ispartofseries | TD-8039; | -
dc.subject | IMAGE CAPTIONING | en_US
dc.subject | VISION-LANGUAGE MODELS | en_US
dc.subject | TRANSFORMERS | en_US
dc.subject | DEEP LEARNING | en_US
dc.subject | MULTIMODAL LEARNING | en_US
dc.subject | NATURAL LANGUAGE PROCESSING | en_US
dc.subject | FLAMINGO | en_US
dc.subject | LLAVA | en_US
dc.subject | VIT-G | en_US
dc.subject | CLIP | en_US
dc.title | MULTI-STAGE VISION-LANGUAGE TRANSFORMER (MVLT) FOR ENHANCED IMAGE CAPTIONING | en_US
dc.type | Thesis | en_US
Appears in Collections: M.E./M.Tech. Computer Engineering

Files in This Item:
File | Description | Size | Format
Ankita Mishra M.Tech.pdf |  | 977.47 kB | Adobe PDF
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.