Please use this identifier to cite or link to this item:
http://dspace.dtu.ac.in:8080/jspui/handle/repository/21822
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.author | MISHRA, ANKITA | - |
dc.date.accessioned | 2025-07-08T08:45:56Z | - |
dc.date.available | 2025-07-08T08:45:56Z | - |
dc.date.issued | 2025-05 | - |
dc.identifier.uri | http://dspace.dtu.ac.in:8080/jspui/handle/repository/21822 | - |
dc.description.abstract | Image captioning is an important task that combines computer vision and natural language processing to generate meaningful, coherent descriptions of images. Earlier approaches based on convolutional and recurrent neural networks have shown notable results but struggle with complex scenes and long-range dependencies. To overcome these challenges, we propose the Multi-Stage Vision-Language Transformer (MVLT), which combines state-of-the-art deep learning architectures for improved image captioning. Our model uses ViT-G and CLIP to extract high-resolution visual features, a Flamingo-style Perceiver Resampler for efficient vision-language fusion, and LLaVA (Large Language and Vision Assistant) for context-aware caption generation. The model is trained on the MS COCO and Conceptual Captions datasets and evaluated on Flickr30k and Visual Genome, showing promising performance across multiple benchmarks. The proposed MVLT model outperforms previous state-of-the-art models on BLEU, CIDEr, and METEOR scores and generates captions that are more accurate, relevant, coherent, and semantically rich. This work lays a foundation for advanced vision-language understanding, with potential applications in assistive technology, content creation, and AI-driven media annotation. | en_US |
dc.language.iso | en | en_US |
dc.relation.ispartofseries | TD-8039 | - |
dc.subject | IMAGE CAPTIONING | en_US |
dc.subject | VISION-LANGUAGE MODELS | en_US |
dc.subject | TRANSFORMERS | en_US |
dc.subject | DEEP LEARNING | en_US |
dc.subject | MULTIMODAL LEARNING | en_US |
dc.subject | NATURAL LANGUAGE PROCESSING | en_US |
dc.subject | FLAMINGO | en_US |
dc.subject | LLAVA | en_US |
dc.subject | VIT-G | en_US |
dc.subject | CLIP | en_US |
dc.title | MULTI-STAGE VISION-LANGUAGE TRANSFORMER (MVLT) FOR ENHANCED IMAGE CAPTIONING | en_US |
dc.type | Thesis | en_US |
Appears in Collections: | M.E./M.Tech. Computer Engineering |
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
Ankita Mishra M.Tech.pdf | | 977.47 kB | Adobe PDF | View/Open |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.