Please use this identifier to cite or link to this item: http://dspace.dtu.ac.in:8080/jspui/handle/repository/22190
Full metadata record
DC Field | Value | Language
dc.contributor.author | AGARWAL, LAKSHITA | -
dc.date.accessioned | 2025-09-02T06:41:02Z | -
dc.date.available | 2025-09-02T06:41:02Z | -
dc.date.issued | 2025-06 | -
dc.identifier.uri | http://dspace.dtu.ac.in:8080/jspui/handle/repository/22190 | -
dc.description.abstract | Image description generation, an intricate cross-disciplinary task spanning computer vision and natural language processing (NLP), aims to produce contextually precise and semantically rich textual descriptions of visual information. The proposed work addresses significant research gaps in automated image captioning by proposing sophisticated deep-learning architectures that improve contextual understanding, semantic density, and generality across multimedia applications. The research is organised into three main tasks: (1) creating an automatic system for producing contextually and semantically rich image descriptions; (2) constructing a deep learning system to enhance description accuracy and prediction scores; and (3) designing image description models specific to multimedia uses. To fulfil these objectives, the thesis proposes several novel models. The VGG16-SceneGraph-BiGRU model integrates VGG16 for visual feature extraction, scene graphs for capturing object relationships, and a BiGRU network for sequential language modelling, resulting in coherent and contextually enriched descriptions. Additionally, the Tri-FusionNet model combines a Vision Transformer (ViT) encoder, two attention mechanisms, a RoBERTa decoder, and a CLIP module to support improved feature extraction and multimodal alignment, enhancing description accuracy. Domain-specific use cases, such as medical imaging and autonomous driving, are also examined with application-specific models, including a ViT-GPT4 framework for chest X-ray analysis and a ResNet50 and GPT2-based system for describing video-based behaviour. The proposed models are evaluated on benchmark datasets including MS COCO, Flickr8k, Flickr30k, IU Chest X-ray, NIH Chest X-ray, MSVD, and the BDD-X vehicular dataset, using the BLEU (1-4), CIDEr, METEOR, and ROUGE-L metrics. The results show significant gains in description quality, semantic completeness, and contextual accuracy, setting new state-of-the-art benchmarks for image description generation. | en_US
dc.language.iso | en | en_US
dc.relation.ispartofseries | TD-8216; | -
dc.subject | DESIGN A FRAMEWORK | en_US
dc.subject | GENERATION OF IMAGE | en_US
dc.subject | DEEP LEARNING | en_US
dc.subject | VGG16 | en_US
dc.title | DESIGN A FRAMEWORK FOR GENERATION OF IMAGE DESCRIPTION USING DEEP LEARNING | en_US
dc.type | Thesis | en_US
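
The abstract above describes encoder-decoder captioning architectures (e.g. a ViT encoder paired with a GPT-style decoder) evaluated with BLEU and related metrics. The following is a minimal, illustrative sketch of that general pattern using the public Hugging Face VisionEncoderDecoder API; it is not the thesis code, and the checkpoint name (nlpconnect/vit-gpt2-image-captioning), the input file example.jpg, the reference caption, and the decoding hyperparameters are all assumptions for demonstration.

    from PIL import Image
    from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    # Public ViT-encoder / GPT-2-decoder captioning checkpoint (assumed available).
    model_name = "nlpconnect/vit-gpt2-image-captioning"
    model = VisionEncoderDecoderModel.from_pretrained(model_name)
    processor = ViTImageProcessor.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Preprocess the image into patch tensors for the ViT encoder.
    image = Image.open("example.jpg").convert("RGB")  # hypothetical input image
    pixel_values = processor(images=image, return_tensors="pt").pixel_values

    # Autoregressive decoding; beam size and length are illustrative, not from the thesis.
    output_ids = model.generate(pixel_values, max_length=32, num_beams=4)
    caption = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    print("caption:", caption)

    # BLEU-1..4 against a hypothetical reference caption, mirroring the reported metrics.
    reference = ["a dog runs across a grassy park".split()]
    candidate = caption.lower().split()
    smooth = SmoothingFunction().method1
    for n in range(1, 5):
        weights = tuple(1.0 / n for _ in range(n))
        score = sentence_bleu(reference, candidate, weights=weights, smoothing_function=smooth)
        print(f"BLEU-{n}: {score:.3f}")

This sketch pairs a pretrained encoder-decoder with standard beam-search decoding; the thesis models (Tri-FusionNet, VGG16-SceneGraph-BiGRU, etc.) add components such as scene graphs, dual attention, and CLIP-based alignment that are not reproduced here.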
Appears in Collections: Ph.D. Information Technology

Files in This Item:
File | Description | Size | Format
LAKSHITA AGARWAL Ph.D..pdf | | 2.54 MB | Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.