MULTI-STAGE VISION-LANGUAGE TRANSFORMER (MVLT) FOR ENHANCED IMAGE CAPTIONING

MISHRA, ANKITA

Please use this identifier to cite or link to this item: http://dspace.dtu.ac.in:8080/jspui/handle/repository/21822

Title:	MULTI-STAGE VISION-LANGUAGE TRANSFORMER (MVLT) FOR ENHANCED IMAGE CAPTIONING
Authors:	MISHRA, ANKITA
Keywords:	IMAGE CAPTIONING; VISION-LANGUAGE MODELS; TRANSFORMERS; DEEP LEARNING; MULTIMODAL LEARNING; NATURAL LANGUAGE PROCESSING; FLAMINGO; LLAVA; VIT-G; CLIP
Issue Date:	May-2025
Series/Report no.:	TD-8039
Abstract:	Image captioning combines computer vision and natural language processing to generate meaningful, coherent descriptions of images. Earlier approaches based on convolutional and recurrent neural networks produced notable results but struggled with complex scenes and long-range dependencies. To address these challenges, we propose the Multi-Stage Vision-Language Transformer (MVLT), which combines state-of-the-art deep learning architectures for improved image captioning. The model uses ViT-G and CLIP to extract high-resolution visual features, a Flamingo-style Perceiver Resampler for efficient vision-language fusion, and LLaVA (Large Language and Vision Assistant) for context-aware caption generation. It is trained on the MS COCO and Conceptual Captions datasets and evaluated on Flickr30k and Visual Genome, showing promising performance across multiple benchmarks. MVLT outperforms previous state-of-the-art models on BLEU, CIDEr and METEOR scores and produces captions that are more accurate, relevant, coherent and semantically rich. This work lays a foundation for advanced vision-language understanding, with potential applications in assistive technology, content creation and AI-driven media annotation.
URI:	http://dspace.dtu.ac.in:8080/jspui/handle/repository/21822
Appears in Collections:	M.E./M.Tech. Computer Engineering
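
Illustrative note on the fusion stage: the abstract names a Flamingo-style Perceiver Resampler but this record gives no implementation details. The sketch below is only a rough illustration of the general idea, assuming a single attention head with no learned projections: a fixed set of latent query vectors attends over however many visual tokens the encoder produces, so the language model always receives the same number of fused tokens. All names and sizes (`resample`, `random_vectors`, 4 latents of width 8) are hypothetical, and random values stand in for learned parameters and encoder features. It is not the thesis code.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def resample(visual_tokens, latents):
    """Single-head cross-attention: each latent query pools over all visual tokens.

    Simplification (assumption): queries, keys and values are used as-is,
    without the learned projections a trained resampler would have.
    """
    dim = len(latents[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in latents:
        # Scaled dot-product score between this latent and every visual token.
        scores = [scale * sum(q * k for q, k in zip(query, token))
                  for token in visual_tokens]
        weights = softmax(scores)
        # Attention-weighted average of the visual tokens.
        pooled = [sum(w * token[d] for w, token in zip(weights, visual_tokens))
                  for d in range(dim)]
        outputs.append(pooled)
    return outputs


def random_vectors(count, dim, rng):
    """Hypothetical stand-in for learned latents or encoder features."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
latents = random_vectors(4, 8, rng)        # 4 latent queries of width 8
small_image = random_vectors(16, 8, rng)   # e.g. 16 visual tokens
large_image = random_vectors(64, 8, rng)   # e.g. 64 visual tokens

fused_small = resample(small_image, latents)
fused_large = resample(large_image, latents)

# Both inputs are reduced to the same number of fused tokens.
print(len(fused_small), len(fused_large))
```

The output length depends only on the number of latents, not on the number of visual tokens, which is what makes this kind of module convenient between a vision encoder and a language model. A trained resampler would additionally use learned projections, multiple heads and feed-forward layers.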

Files in This Item:

File	Description	Size	Format
Ankita Mishra M.Tech.pdf		977.47 kB	Adobe PDF
