Please use this identifier to cite or link to this item: http://dspace.dtu.ac.in:8080/jspui/handle/repository/20722
Full metadata record
DC Field | Value | Language
dc.contributor.author | GANDHI, NIRDOSH | -
dc.date.accessioned | 2024-08-05T08:40:44Z | -
dc.date.available | 2024-08-05T08:40:44Z | -
dc.date.issued | 2024-05 | -
dc.identifier.uri | http://dspace.dtu.ac.in:8080/jspui/handle/repository/20722 | -
dc.description.abstract | By integrating computer vision and natural language processing, the field of image captioning has witnessed remarkable advancements driven by deep learning techniques such as Convolutional Neural Networks (CNNs) and Long Short-Term Memory networks (LSTMs). This paper delves into the complexities of teaching machines to interpret visual input and generate meaningful captions, mirroring the human ability to describe images. The integration of cutting-edge technology and methodologies is essential for bridging the gap between visual understanding and linguistic expression in artificial intelligence. Computer vision and deep learning have advanced significantly, thanks to improvements in deep learning algorithms, the availability of large datasets such as the Flickr8k dataset, and enhanced computing power. These developments have facilitated the creation of sophisticated models capable of accurately analyzing and understanding images, leading to applications such as image captioning. The paper focuses on the architecture of CNN-RNN models, particularly CNNs for image feature extraction and LSTMs for generating coherent and contextually relevant captions. The synergistic combination of these techniques enables image captioning systems to capture both visual semantics and linguistic nuances, resulting in accurate and meaningful descriptions. The key technologies and libraries used are TensorFlow and Keras for model development, NLTK for natural language processing tasks, and PIL for image preprocessing. The proposed methodology involves data preprocessing, feature extraction using VGG16, text preprocessing, and model training using an encoder-decoder framework. The evaluation of the image captioning model demonstrates its effectiveness in generating precise, natural-sounding, and appropriate captions for diverse images. The model achieves promising BLEU scores, indicating a high degree of similarity between generated captions and human-authored reference captions. This study contributes to the ongoing advancements in computer vision, natural language processing, and multimedia analytics by elucidating the intricate workings of image captioning systems and showcasing their practical applications. | en_US
dc.language.iso | en | en_US
dc.relation.ispartofseries | TD-7223; | -
dc.subject | BEYOND PIXELS | en_US
dc.subject | IMAGE CAPTIONING | en_US
dc.subject | SYNERGY OF VISION | en_US
dc.subject | LSTM | en_US
dc.subject | CNN | en_US
dc.title | BEYOND PIXELS: THE SYNERGY OF VISION AND LANGUAGE IN IMAGE CAPTIONING | en_US
dc.type | Thesis | en_US
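
The abstract describes a VGG16 + LSTM encoder-decoder pipeline built with TensorFlow and Keras. Below is a minimal, illustrative sketch of such a model, not the thesis's actual code: the merge-style decoder, the extract_feature helper, and the vocab_size and max_length values are assumptions chosen for the example.

```python
# Minimal sketch of a CNN-LSTM image captioning model, assuming a merge-style
# encoder-decoder: VGG16 image features + LSTM-encoded partial captions.
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add

vocab_size = 8000   # assumed caption vocabulary size
max_length = 34     # assumed maximum caption length in tokens

# Encoder: VGG16 with its classification layer removed yields a 4096-d image feature.
vgg = VGG16()
vgg = Model(inputs=vgg.inputs, outputs=vgg.layers[-2].output)

def extract_feature(path):
    """Load an image, preprocess it for VGG16, and return its 4096-d feature vector."""
    image = img_to_array(load_img(path, target_size=(224, 224)))
    image = preprocess_input(np.expand_dims(image, axis=0))
    return vgg.predict(image, verbose=0)

# Decoder: project the image feature and the partial caption into a shared space,
# merge them, and predict the next word over the vocabulary.
image_input = Input(shape=(4096,))
img_dense = Dense(256, activation="relu")(Dropout(0.5)(image_input))

caption_input = Input(shape=(max_length,))
seq = Embedding(vocab_size, 256, mask_zero=True)(caption_input)
seq = LSTM(256)(Dropout(0.5)(seq))

merged = Dense(256, activation="relu")(add([img_dense, seq]))
output = Dense(vocab_size, activation="softmax")(merged)

model = Model(inputs=[image_input, caption_input], outputs=output)
model.compile(loss="categorical_crossentropy", optimizer="adam")
```

In this kind of setup, captions are generated word by word from a start token, and quality is typically scored against human reference captions with BLEU (e.g. NLTK's corpus_bleu), which matches the evaluation described in the abstract.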
Appears in Collections:M.E./M.Tech. Computer Engineering

Files in This Item:
File | Description | Size | Format
NIRDOSH GANDHI M.Tech..pdf | | 2.4 MB | Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.