Please use this identifier to cite or link to this item: http://dspace.dtu.ac.in:8080/jspui/handle/repository/20722
Title: BEYOND PIXELS: THE SYNERGY OF VISION AND LANGUAGE IN IMAGE CAPTIONING
Authors: GANDHI, NIRDOSH
Keywords: BEYOND PIXELS
IMAGE CAPTIONING
SYNERGY OF VISION
LSTM
CNN
Issue Date: May-2024
Series/Report no.: TD-7223;
Abstract: By integrating computer vision and natural language processing, the field of image captioning has witnessed remarkable advancements driven by deep learning techniques such as Convolutional Neural Networks (CNNs) and Long Short-Term Memory networks (LSTMs). This paper examines the complexities of teaching machines to interpret visual input and generate meaningful captions, mirroring the human ability to describe images. Bridging the gap between visual understanding and linguistic expression in artificial intelligence requires the integration of cutting-edge technology and methodologies. Computer vision and deep learning have advanced significantly thanks to improvements in deep learning algorithms, the availability of large datasets such as Flickr8k, and enhanced computing power. These developments have enabled sophisticated models that accurately analyze and understand images, leading to applications such as image captioning. The paper focuses on the architecture of CNN-RNN models, in which CNNs perform image feature extraction and LSTMs generate coherent, contextually relevant captions. The synergistic combination of these techniques enables image captioning systems to capture both visual semantics and linguistic nuances, producing accurate and meaningful descriptions. The key technologies and libraries used are TensorFlow and Keras for model development, NLTK for natural language processing tasks, and PIL for image preprocessing. The proposed methodology involves data preprocessing, feature extraction using VGG16, text preprocessing, and model training using an encoder-decoder framework. Evaluation of the image captioning model demonstrates its effectiveness in generating precise, natural-sounding, and contextually appropriate captions for diverse images. The model achieves promising BLEU scores, indicating a high degree of similarity between generated captions and human-authored reference captions. This study contributes to the ongoing advancements in computer vision, natural language processing, and multimedia analytics by elucidating the workings of image captioning systems and showcasing their practical applications.
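To illustrate the VGG16 feature-extraction step named in the abstract, the following is a minimal sketch (not the thesis code) of obtaining a fixed-length image vector with Keras and PIL. The abstract only names VGG16; the choice of the penultimate fc2 layer and the function name extract_features are assumptions made here for illustration.

import numpy as np
from PIL import Image
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.models import Model

# Drop VGG16's classification head; keep the 4096-d fc2 activations
# as the image representation (an assumed but common choice).
base = VGG16(weights="imagenet")
extractor = Model(inputs=base.input, outputs=base.get_layer("fc2").output)

def extract_features(image_path):
    # PIL handles the resize to VGG16's expected 224x224 RGB input.
    img = Image.open(image_path).convert("RGB").resize((224, 224))
    x = np.expand_dims(np.asarray(img, dtype="float32"), axis=0)
    x = preprocess_input(x)          # ImageNet channel-mean normalization
    return extractor.predict(x)[0]   # shape: (4096,)

Features would typically be precomputed once per Flickr8k image and cached, since the decoder is trained over many caption prefixes per image.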
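The encoder-decoder training step could follow the commonly used "merge" captioning architecture sketched below, in which the cached VGG16 feature and an LSTM encoding of the partial caption are combined to predict the next word. This is a plausible reading of the abstract's CNN-RNN design, not the thesis's exact model; the vocabulary size, caption length, and layer widths are illustrative placeholders.

from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size, max_len = 8000, 34   # placeholders, not values from the thesis

# Image branch: project the precomputed 4096-d VGG16 feature to 256 dims.
img_in = Input(shape=(4096,))
img_vec = Dense(256, activation="relu")(Dropout(0.5)(img_in))

# Text branch: embed the padded partial caption and encode it with an LSTM.
txt_in = Input(shape=(max_len,))
txt_emb = Embedding(vocab_size, 256, mask_zero=True)(txt_in)
txt_vec = LSTM(256)(Dropout(0.5)(txt_emb))

# Merge both branches and predict the next word over the vocabulary.
merged = Dense(256, activation="relu")(add([img_vec, txt_vec]))
out = Dense(vocab_size, activation="softmax")(merged)

model = Model(inputs=[img_in, txt_in], outputs=out)
model.compile(loss="categorical_crossentropy", optimizer="adam")

At inference time, captions are generated word by word: starting from a start token, the model is queried repeatedly and the predicted word is appended to the input sequence until an end token or max_len is reached.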
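For the BLEU evaluation the abstract reports, NLTK's corpus_bleu can compare generated captions against human-authored references. The token lists below are purely illustrative; smoothing is added only so short examples do not zero out the higher-order n-gram terms.

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One generated caption against one human reference, pre-tokenized.
references = [[["a", "dog", "runs", "through", "the", "grass"]]]
candidates = [["a", "dog", "is", "running", "in", "the", "grass"]]

smooth = SmoothingFunction().method1
bleu1 = corpus_bleu(references, candidates, weights=(1.0, 0, 0, 0))
bleu4 = corpus_bleu(references, candidates,
                    weights=(0.25, 0.25, 0.25, 0.25),
                    smoothing_function=smooth)
print(f"BLEU-1: {bleu1:.3f}  BLEU-4: {bleu4:.3f}")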
URI: http://dspace.dtu.ac.in:8080/jspui/handle/repository/20722
Appears in Collections:M.E./M.Tech. Computer Engineering

Files in This Item:
File                          Size     Format
NIRDOSH GANDHI M.Tech..pdf    2.4 MB   Adobe PDF

