Please use this identifier to cite or link to this item: http://dspace.dtu.ac.in:8080/jspui/handle/repository/22190
Title: DESIGN A FRAMEWORK FOR GENERATION OF IMAGE DESCRIPTION USING DEEP LEARNING
Authors: AGARWAL, LAKSHITA
Keywords: DESIGN A FRAMEWORK
GENERATION OF IMAGE
DEEP LEARNING
VGG16
Issue Date: Jun-2025
Series/Report no.: TD-8216;
Abstract: Image description generation, an intricate cross-disciplinary task spanning computer vision and natural language processing (NLP), aims to produce contextually precise and semantically rich textual descriptions of visual information. The proposed work addresses significant research gaps in automated image captioning by proposing sophisticated deep-learning architectures that improve contextual knowledge, semantic density, and generality across multimedia applications. The research is organised into three main tasks: (1) creating an automatic system for producing contextually and semantically rich image descriptions; (2) constructing a deep learning system to enhance description accuracy and prediction scores; and (3) designing image description models specific to multimedia uses. To fulfil these objectives, the thesis proposes several novel models. The VGG16-SceneGraph-BiGRU model integrates VGG16 for visual feature extraction, scene graphs for capturing object relationships, and a BiGRU network for sequential language modelling, yielding coherent and contextually enriched descriptions. Additionally, the Tri-FusionNet model combines a Vision Transformer (ViT) encoder, two attention mechanisms, a RoBERTa decoder, and a CLIP module to support improved feature extraction and multimodal alignment, enhancing description accuracy. Domain-specific use cases, such as medical imaging and autonomous driving, are also examined with purpose-built models, including a ViT-GPT4 framework for chest X-ray analysis and a ResNet50 with GPT2-based system for describing video-based behaviour. The proposed models are evaluated on benchmark datasets including MS COCO, Flickr8k, Flickr30k, IU Chest X-ray, NIH Chest X-ray, MSVD, and the BDD-X vehicular dataset, using metrics such as BLEU (1-4), CIDEr, METEOR, and ROUGE-L.
The results show significant gains in description quality, semantic completeness, and contextual accuracy, establishing new state-of-the-art benchmarks for image description generation.
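Of the metrics listed in the abstract, BLEU (1-4) is the most common for captioning evaluation. As an illustration only (not code from the thesis), the following is a minimal sentence-level BLEU sketch in plain Python: clipped n-gram precisions up to 4-grams, a geometric mean, and a brevity penalty. The function name `bleu` and the simple smoothing of zero-count precisions are assumptions made here for a self-contained example; production work would use an established implementation such as NLTK's.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-grams of a token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU: clipped n-gram precision (n = 1..max_n),
    geometric mean, and brevity penalty against a single reference."""
    cand, ref = candidate.split(), reference.split()
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        # Simple smoothing so a zero precision does not zero out the score.
        prec = clipped / total if clipped > 0 else 1.0 / (2 * total)
        log_prec_sum += math.log(prec)
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_prec_sum / max_n)
```

A candidate identical to its reference scores 1.0, while shorter or divergent captions are penalised by the brevity penalty and lower n-gram overlap; thesis-scale evaluation would average such scores over a whole test set (e.g. MS COCO captions) rather than a single pair.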
URI: http://dspace.dtu.ac.in:8080/jspui/handle/repository/22190
Appears in Collections:Ph.D. Information Technology

Files in This Item:
File: LAKSHITA AGARWAL Ph.D..pdf (2.54 MB, Adobe PDF)


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.