MULTI-STAGE VISION-LANGUAGE TRANSFORMER (MVLT) FOR ENHANCED IMAGE CAPTIONING

MISHRA, ANKITA

Please use this identifier to cite or link to this item: http://dspace.dtu.ac.in:8080/jspui/handle/repository/21822

Full metadata record

DC Field	Value	Language
dc.contributor.author	MISHRA, ANKITA	-
dc.date.accessioned	2025-07-08T08:45:56Z	-
dc.date.available	2025-07-08T08:45:56Z	-
dc.date.issued	2025-05	-
dc.identifier.uri	http://dspace.dtu.ac.in:8080/jspui/handle/repository/21822	-
dc.description.abstract	Image captioning is an important task that combines computer vision and natural language processing to generate meaningful, coherent descriptions of images. Earlier approaches based on convolutional and recurrent neural networks have shown notable results but struggle with complex scenes and long-range dependencies. To overcome these challenges, we propose the Multi-Stage Vision-Language Transformer (MVLT), which combines state-of-the-art deep learning architectures for improved image captioning. Our model uses ViT-G and CLIP to extract high-resolution visual features, a Flamingo-style Perceiver Resampler for efficient vision-language fusion, and LLaVA (Large Language and Vision Assistant) for context-aware caption generation. The model is trained on the MS COCO and Conceptual Captions datasets and evaluated on Flickr30k and Visual Genome, showing promising performance across multiple benchmarks. The proposed MVLT model outperforms previous state-of-the-art models on BLEU, CIDEr and METEOR scores and generates more accurate, relevant, coherent and semantically rich captions. This work lays a foundation for advanced vision-language understanding, with potential applications in assistive technology, content creation and AI-driven media annotation.	en_US
dc.language.iso	en	en_US
dc.relation.ispartofseries	TD-8039;	-
dc.subject	IMAGE CAPTIONING	en_US
dc.subject	VISION-LANGUAGE MODELS	en_US
dc.subject	TRANSFORMERS	en_US
dc.subject	DEEP LEARNING	en_US
dc.subject	MULTIMODAL LEARNING	en_US
dc.subject	NATURAL LANGUAGE PROCESSING	en_US
dc.subject	FLAMINGO	en_US
dc.subject	LLAVA	en_US
dc.subject	VIT-G	en_US
dc.subject	CLIP	en_US
dc.title	MULTI-STAGE VISION-LANGUAGE TRANSFORMER (MVLT) FOR ENHANCED IMAGE CAPTIONING	en_US
dc.type	Thesis	en_US
Appears in Collections:	M.E./M.Tech. Computer Engineering

Files in This Item:

File	Description	Size	Format
Ankita Mishra M.Tech.pdf		977.47 kB	Adobe PDF
