Please use this identifier to cite or link to this item: http://dspace.dtu.ac.in:8080/jspui/handle/repository/22190
Full metadata record
DC Field | Value | Language
dc.contributor.author | AGARWAL, LAKSHITA | -
dc.date.accessioned | 2025-09-02T06:41:02Z | -
dc.date.available | 2025-09-02T06:41:02Z | -
dc.date.issued | 2025-06 | -
dc.identifier.uri | http://dspace.dtu.ac.in:8080/jspui/handle/repository/22190 | -
dc.description.abstract | Image description generation, an intricate cross-disciplinary task spanning computer vision and natural language processing (NLP), aims to produce contextually precise and semantically rich textual descriptions of visual information. The proposed work addresses significant research gaps in automated image captioning by proposing sophisticated deep-learning architectures that improve contextual understanding, semantic density, and generality across multimedia applications. The research is organised into three main tasks: (1) creating an automatic system for producing contextually and semantically rich image descriptions; (2) constructing a deep learning system to enhance description accuracy and prediction scores; and (3) designing image description models specific to multimedia uses. To fulfil these objectives, the thesis proposes several novel models. The VGG16-SceneGraph-BiGRU model integrates VGG16 for visual feature extraction, scene graphs for capturing object relationships, and a BiGRU network for sequential language modelling, resulting in coherent and contextually enriched descriptions. Additionally, the Tri-FusionNet model combines a Vision Transformer (ViT) encoder, two attention mechanisms, a RoBERTa decoder, and a CLIP module to support improved feature extraction and multimodal alignment, enhancing description accuracy. Domain-specific use cases, such as medical imaging and autonomous driving, are also examined with application-specific models, including a ViT-GPT4 framework for chest X-ray analysis and a ResNet50 and GPT2-based system for describing video-based behaviour. The proposed models are evaluated on benchmark datasets including MS COCO, Flickr8k, Flickr30k, IU Chest X-ray, NIH Chest X-ray, MSVD, and the BDD-X vehicular dataset, using the BLEU (1-4), CIDEr, METEOR, and ROUGE-L metrics. The results show significant gains in description quality, semantic completeness, and contextual accuracy, setting new state-of-the-art benchmarks for image description generation. | en_US
dc.language.iso | en | en_US
dc.relation.ispartofseries | TD-8216; | -
dc.subject | DESIGN A FRAMEWORK | en_US
dc.subject | GENERATION OF IMAGE | en_US
dc.subject | DEEP LEARNING | en_US
dc.subject | VGG16 | en_US
dc.title | DESIGN A FRAMEWORK FOR GENERATION OF IMAGE DESCRIPTION USING DEEP LEARNING | en_US
dc.type | Thesis | en_US
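
The abstract above describes encoder-decoder captioning architectures (e.g. a ViT encoder paired with a GPT-style decoder) evaluated with BLEU and related metrics. The following is a minimal, illustrative sketch of that general pattern using the public Hugging Face VisionEncoderDecoder API; it is not the thesis code, and the checkpoint name (nlpconnect/vit-gpt2-image-captioning), the input file example.jpg, the reference caption, and the decoding hyperparameters are all assumptions for demonstration.

    from PIL import Image
    from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    # Public ViT-encoder / GPT-2-decoder captioning checkpoint (assumed available).
    model_name = "nlpconnect/vit-gpt2-image-captioning"
    model = VisionEncoderDecoderModel.from_pretrained(model_name)
    processor = ViTImageProcessor.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Preprocess the image into patch tensors for the ViT encoder.
    image = Image.open("example.jpg").convert("RGB")  # hypothetical input image
    pixel_values = processor(images=image, return_tensors="pt").pixel_values

    # Autoregressive decoding; beam size and length are illustrative, not from the thesis.
    output_ids = model.generate(pixel_values, max_length=32, num_beams=4)
    caption = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    print("caption:", caption)

    # BLEU-1..4 against a hypothetical reference caption, mirroring the reported metrics.
    reference = ["a dog runs across a grassy park".split()]
    candidate = caption.lower().split()
    smooth = SmoothingFunction().method1
    for n in range(1, 5):
        weights = tuple(1.0 / n for _ in range(n))
        score = sentence_bleu(reference, candidate, weights=weights, smoothing_function=smooth)
        print(f"BLEU-{n}: {score:.3f}")

This sketch pairs a pretrained encoder-decoder with standard beam-search decoding; the thesis models (Tri-FusionNet, VGG16-SceneGraph-BiGRU, etc.) add components such as scene graphs, dual attention, and CLIP-based alignment that are not reproduced here.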
Appears in Collections: Ph.D. Information Technology

Files in This Item:
File | Description | Size | Format
LAKSHITA AGARWAL Ph.D..pdf | | 2.54 MB | Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.