Please use this identifier to cite or link to this item:
http://dspace.dtu.ac.in:8080/jspui/handle/repository/21439
Title: | IMAGE CAPTIONING USING DEEP LEARNING TECHNIQUES |
Authors: | SHARMA, DHRUV |
Keywords: | IMAGE CAPTIONING; DEEP LEARNING TECHNIQUES; NATURAL LANGUAGE PROCESSING (NLP); MSCOCO; LSTM |
Issue Date: | Dec-2024 |
Series/Report no.: | TD-7746; |
Abstract: | Image caption generation is the description of the contents of an image in the form of natural language sentences. It is evolving as an active research area at the intersection of Computer Vision (CV) and Natural Language Processing (NLP). It generates syntactically and semantically correct sentences by describing important objects, attributes, and their relationships with each other. This very nature makes it suitable for applications such as image retrieval [1] [2], human-robot interaction [3] [4], aid to the blind [5], and visual question answering [6]. With the advent of technology and the growing demands of society, automatic and intelligent image captioning systems have become the need of the hour. Many Convolutional Neural Network (CNN)-based architectures are utilized at the encoder for efficient extraction of image features, while a Long Short-Term Memory (LSTM)-based decoder is utilized for the generation of captions. Traditional methods utilize different variants of LSTMs and CNNs with attention mechanisms to generate meaningful and accurate descriptions of images. Although the captions generated by traditional methods are simple, they sometimes suffer from limitations, mainly repetitive words or inaccurate descriptions of the scene, which do not reflect natural language usage. Traditional image captioning techniques also fail to capture the relationships between objects and their surroundings, thereby neglecting fine-grained details and diverse scenes and leading to ambiguities in the generated language. To overcome the challenges faced by traditional captioning models, this work focuses on the generation of image captions using advanced deep-learning techniques. These techniques help to better understand the visual content and context of images by providing contextually relevant information rather than just listing objects, leading to models capable of generating meaningful, nuanced, and detailed descriptions for different real-world applications. The work in this thesis thus investigates different deep-learning-based models and frameworks for factual, stylized, and paragraph-based descriptions of images. For the generation of factual descriptions of images, this thesis first discusses a Lightweight Transformer with a GRU-integrated decoder for image captioning. The proposed Lightweight Transformer exploits a single encoder-decoder-based transformer model for the generation of factual captions. Extensive experiments on the MSCOCO dataset demonstrate that the proposed approach achieves competitive scores on all evaluation metrics. For efficient description of images, it becomes necessary to learn higher-order interactions among detected objects and the relationships between them. Most existing models take into account first-order interactions while ignoring higher-order ones, and it is challenging to extract discriminative higher-order semantic visual features in images densely populated with objects. In this direction, an efficient higher-order interaction learning framework is proposed in this study using encoder-decoder-based image captioning. To leverage higher-order interactions among multiple objects, an efficient XGL Transformer (XGL-T) model is introduced that exploits both spatial and channel-wise attention. The proposed XGL-T model captures rich semantic concepts from objects, attributes, and their relationships.
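To make the attention idea above concrete, the following is a minimal sketch of combining spatial attention over a grid of CNN features with a channel-wise gate, in the general spirit of the spatial and channel-wise attention exploited by XGL-T. It is not the thesis's actual XGL-T architecture; the layer sizes and module names are illustrative assumptions.

```python
# Minimal sketch: spatial attention over region features plus a channel-wise
# gate on the attended context. Dimensions are assumptions, not the XGL-T spec.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialChannelAttention(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.spatial = nn.Linear(feat_dim + hidden_dim, 1)         # one weight per region
        self.channel = nn.Linear(feat_dim + hidden_dim, feat_dim)  # one gate per channel

    def forward(self, regions, hidden):
        # regions: (B, R, feat_dim) grid features; hidden: (B, hidden_dim) decoder state
        h = hidden.unsqueeze(1).expand(-1, regions.size(1), -1)
        joint = torch.cat([regions, h], dim=-1)                    # (B, R, feat+hidden)
        alpha = F.softmax(self.spatial(joint), dim=1)              # spatial weights (B, R, 1)
        ctx = (alpha * regions).sum(dim=1)                         # attended context (B, feat_dim)
        gate = torch.sigmoid(self.channel(torch.cat([ctx, hidden], dim=-1)))
        return gate * ctx                                          # channel-reweighted context
```

The spatial weights choose which regions to focus on, while the sigmoid gate re-weights the feature channels of the attended context before it is passed on to the language decoder.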
Extensive experiments are conducted on the publicly available MSCOCO Karpathy test split, and the best performance of the work is observed as 81.5 BLEU@1, 67.1 BLEU@2, 51.6 BLEU@3, 39.9 BLEU@4, 134 CIDEr, 59.9 ROUGE-L, 29.8 METEOR, and 23.8 SPICE using the CIDEr-D Score Optimization Strategy. Methods developed in the recent past focused mainly on the description of factual content in images, thereby ignoring the different emotions and styles (romantic, humorous, angry, etc.) associated with the image. To overcome this, a few works incorporated style-based caption generation that captures the variability in the generated descriptions. This thesis presents a Style Embedding-based Variational Autoencoder for Controlled Stylized Caption Generation framework (RFCG+SE-VAE-CSCG). It generates controlled text-based stylized descriptions of images and works in two phases, i.e., (i) Refined Factual Caption Generation (RFCG) and (ii) SE-VAE-CSCG. The former defines an encoder-decoder model for the generation of refined factual captions, whereas the latter presents a style embedding-based variational autoencoder for controlled stylized caption generation. Moreover, with the use of a controlled text generation model, the proposed work efficiently learns disentangled representations and generates realistic stylized descriptions of images. Experiments on MSCOCO, Flickr30K, and FlickrStyle10K provide state-of-the-art results for both refined and style-based caption generation. Further, a multi-level Variational Autoencoder Transformer (VAT)-based framework, MrA2VAT, is also proposed in this work for the generation of descriptions of images in the form of a paragraph. The proposed framework utilizes a combination of visual and spatial features, which are further attended by the proposed multi-resolution multi-head attention (M2A) to capture the relationships between the query representation and different attention granularities. To increase language diversity and to remove redundant sentences from the generated paragraph, the proposed framework also leverages a language discriminator. Extensive experiments conducted on the Stanford Paragraph Dataset provide superior results on all evaluation metrics, with or without the language discriminator. Different deep-learning techniques are thus devised for the development of factual and stylized image captioning models. Previous models focused on the generation of factual and stylized captions separately, providing more than one caption for a single image. To address this issue, a novel Unified Attention and Multi-Head Attention-driven Caption Summarization Transformer (UnMA-CapSumT)-based captioning framework is discussed in this thesis, which integrates different captioning methods to describe the contents of an image with factual and stylized (romantic and humorous) elements. The proposed framework exploits the factual captions and stylized captions generated by the Modified Adaptive Attention-based factual image captioning model (MAA-FIC) and the Style Factored Bi-LSTM with attention (SF-Bi-ALSTM)-driven stylized image captioning model, respectively. Further, the summarization transformer UnMHA-ST combines both factual and stylized descriptions of an input image to generate style-rich, coherent summarized captions. Extensive experiments are conducted on Flickr8K and a subset of FlickrStyle10K, with supporting ablation studies to prove the efficiency and efficacy of the proposed framework.
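As an illustration of how a style embedding can condition a variational autoencoder's latent space for controlled stylized caption generation, the idea behind the SE-VAE-CSCG phase described above, here is a minimal sketch. The dimensions, the three-style vocabulary, and the module names are illustrative assumptions rather than the thesis implementation.

```python
# Sketch of a style-conditioned VAE latent step: the sentence representation
# is fused with a learned style embedding before the reparameterization trick.
import torch
import torch.nn as nn

class StyleVAELatent(nn.Module):
    def __init__(self, hidden_dim=512, latent_dim=64, num_styles=3, style_dim=32):
        super().__init__()
        self.style_embed = nn.Embedding(num_styles, style_dim)   # e.g. factual/romantic/humorous
        self.to_mu = nn.Linear(hidden_dim + style_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim + style_dim, latent_dim)

    def forward(self, sent_repr, style_id):
        # sent_repr: (B, hidden_dim) encoded caption; style_id: (B,) style indices
        s = self.style_embed(style_id)                           # (B, style_dim)
        h = torch.cat([sent_repr, s], dim=-1)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        # KL divergence against a standard normal prior, averaged over the batch.
        kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu**2 - logvar.exp(), dim=-1))
        return z, kl
```

At generation time, swapping style_id while keeping the same sentence representation is what would let a downstream decoder (not shown) produce factual, romantic, or humorous variants of the same content.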
Also, this work presents two main application areas of image captioning, namely medical image captioning and aid to the blind. For medical image captioning, the FDT-Dr2T framework is proposed, which in the first stage leverages the fusion of texture features and deep features by incorporating an ISCM-LBP + PCA-HOG feature extraction algorithm and a Convolutional Triple Attention-based Efficient XceptionNet (C-TaXNet). Further, the fused features from the FDT module are utilized by the Dense Radiology Report Generation Transformer (Dr2T) model with modified multi-head attention, generating dense radiology reports that highlight specific crucial abnormalities. For aid to the blind, an automated image caption generation framework based on an adaptive attention mechanism and a Bi-LSTM is proposed in this study. The proposed model exploits Inception-V3 to extract global spatial features, and the adaptive attention module helps to decide whether to attend to the image (and if so, to which regions) or to the visual sentinel maps. Further, at the decoding end, a Bi-LSTM network refines the text description. The performance of the proposed model is evaluated in terms of BLEU scores on Flickr8K and the Visual Assistance Dataset. |
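The decision of whether to attend to image regions or to a visual sentinel, as used in the aid-to-the-blind framework above, can be sketched as follows in the general adaptive-attention formulation. The feature dimensions, the assumption that region features and the decoder state share a dimension, and the omission of the Inception-V3 encoder and Bi-LSTM refinement stage are all simplifications for illustration.

```python
# Sketch of one adaptive-attention step with a visual sentinel: a learned gate
# decides how much to rely on the sentinel versus the attended image regions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveAttention(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=512, attn_dim=256):
        super().__init__()
        # Assumes feat_dim == hidden_dim so sentinel and visual context can be mixed.
        self.w_v = nn.Linear(feat_dim, attn_dim)
        self.w_h = nn.Linear(hidden_dim, attn_dim)
        self.w_s = nn.Linear(hidden_dim, attn_dim)
        self.w_a = nn.Linear(attn_dim, 1)

    def forward(self, regions, hidden, sentinel):
        # regions: (B, R, feat_dim) spatial features; hidden, sentinel: (B, hidden_dim)
        att_v = torch.tanh(self.w_v(regions) + self.w_h(hidden).unsqueeze(1))
        att_s = torch.tanh(self.w_s(sentinel) + self.w_h(hidden))
        logits = torch.cat([self.w_a(att_v).squeeze(-1),        # (B, R) region scores
                            self.w_a(att_s)], dim=1)            # (B, R+1) with sentinel score
        alpha = F.softmax(logits, dim=1)
        beta = alpha[:, -1:]                                     # sentinel gate in [0, 1]
        ctx = (alpha[:, :-1].unsqueeze(-1) * regions).sum(1)     # attended visual context
        return beta * sentinel + (1 - beta) * ctx, beta
```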
URI: | http://dspace.dtu.ac.in:8080/jspui/handle/repository/21439 |
Appears in Collections: | Ph.D. Electronics & Communication Engineering |
Files in This Item:
File | Description | Size | Format |
---|---|---|---|---
DHRUV SHARMA Ph.D..pdf | | 4.29 MB | Adobe PDF | View/Open