Please use this identifier to cite or link to this item: http://dspace.dtu.ac.in:8080/jspui/handle/repository/21822
Title: MULTI-STAGE VISION-LANGUAGE TRANSFORMER (MVLT) FOR ENHANCED IMAGE CAPTIONING
Authors: MISHRA, ANKITA
Keywords: IMAGE CAPTIONING
VISION-LANGUAGE MODELS
TRANSFORMERS
DEEP LEARNING
MULTIMODAL LEARNING
NATURAL LANGUAGE PROCESSING
FLAMINGO
LLAVA
VIT-G
CLIP
Issue Date: May-2025
Series/Report no.: TD-8039
Abstract: Image captioning is an important task that combines computer vision and natural language processing to generate meaningful, coherent descriptions of images. Earlier approaches based on convolutional and recurrent neural networks produced notable results but struggled with complex scenes and long-range dependencies. To overcome these challenges, we propose the Multi-Stage Vision-Language Transformer (MVLT), which combines state-of-the-art deep learning architectures for improved image captioning. The model uses ViT-G and CLIP to extract high-resolution visual features, a Flamingo-style Perceiver Resampler for efficient vision-language fusion, and LLaVA (Large Language and Vision Assistant) for context-aware caption generation. It is trained on the MS COCO and Conceptual Captions datasets and evaluated on Flickr30k and Visual Genome, showing promising performance across multiple benchmarks. The proposed MVLT model outperforms previous state-of-the-art models on BLEU, CIDEr, and METEOR scores and generates more accurate, relevant, coherent, and semantically rich captions. This work lays a foundation for advanced vision-language understanding, with potential applications in assistive technology, content creation, and AI-driven media annotation.
URI: http://dspace.dtu.ac.in:8080/jspui/handle/repository/21822
Appears in Collections: M.E./M.Tech. Computer Engineering
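
Illustrative note on the fusion stage named in the abstract: a Perceiver Resampler, as introduced in Flamingo, uses a fixed set of latent query vectors that cross-attend to however many visual tokens the image encoder emits, so the language model always receives the same number of fused tokens. The sketch below is not taken from the thesis. It is a minimal single-head version in plain Python that omits the learned projections, feed-forward layers, and stacking a real resampler would have. The names `resample`, `DIM`, and `NUM_LATENTS`, and all sizes, are assumptions chosen for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def resample(latents, visual_tokens):
    """Single-head cross-attention: every latent query attends over all visual tokens.

    Simplification (assumption): keys and values are the raw visual tokens,
    with no learned projection matrices.
    """
    dim = len(latents[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in latents:
        # Scaled dot-product score between this query and each visual token.
        scores = [scale * sum(q * k for q, k in zip(query, token))
                  for token in visual_tokens]
        weights = softmax(scores)
        # Output is the attention-weighted average of the visual tokens.
        fused.append([sum(w * token[d] for w, token in zip(weights, visual_tokens))
                      for d in range(dim)])
    return fused


# Hypothetical sizes: 4 latent queries of width 8; random values stand in
# for learned latents and for encoder patch features.
rng = random.Random(0)
DIM, NUM_LATENTS = 8, 4
latents = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_LATENTS)]

for num_patches in (16, 49):
    patches = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(num_patches)]
    fused = resample(latents, patches)
    print(f"{num_patches} patches -> {len(fused)} x {len(fused[0])} fused tokens")
```

Both inputs, 16 and 49 patch vectors, come out as 4 vectors of width 8. That fixed-size output is the property that makes this kind of fusion step inexpensive for the downstream caption generator.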

Files in This Item:
File: Ankita Mishra M.Tech.pdf
Size: 977.47 kB
Format: Adobe PDF

