Please use this identifier to cite or link to this item: http://dspace.dtu.ac.in:8080/jspui/handle/repository/20722
Full metadata record
DC Field | Value | Language
dc.contributor.author | GANDHI, NIRDOSH | -
dc.date.accessioned | 2024-08-05T08:40:44Z | -
dc.date.available | 2024-08-05T08:40:44Z | -
dc.date.issued | 2024-05 | -
dc.identifier.uri | http://dspace.dtu.ac.in:8080/jspui/handle/repository/20722 | -
dc.description.abstract | By integrating computer vision and natural language processing, the field of image captioning has witnessed remarkable advancements driven by deep learning techniques such as Convolutional Neural Networks (CNNs) and Long Short-Term Memory networks (LSTMs). This paper delves into the complexities of teaching machines to interpret visual input and generate meaningful captions, mirroring the human ability to describe images. The integration of cutting-edge technology and methodologies is essential for bridging the gap between visual understanding and linguistic expression in artificial intelligence. Computer vision and deep learning have advanced significantly, thanks to improvements in deep learning algorithms, the availability of large datasets such as the Flickr8k dataset, and enhanced computing power. These developments have facilitated the creation of sophisticated models capable of accurately analyzing and understanding images, leading to applications such as image captioning. The paper focuses on the architecture of CNN-RNN models, particularly CNNs for image feature extraction and LSTMs for generating coherent and contextually relevant captions. The synergistic combination of these techniques enables image captioning systems to capture both visual semantics and linguistic nuances, resulting in accurate and meaningful descriptions. The key technologies and libraries used are TensorFlow and Keras for model development, NLTK for natural language processing tasks, and PIL for image preprocessing. The proposed methodology involves data preprocessing, feature extraction using VGG16, text preprocessing, and model training using an encoder-decoder framework. The evaluation of the image captioning model demonstrates its effectiveness in generating precise, natural-sounding, and appropriate captions for diverse images. The model achieves promising BLEU scores, indicating a high degree of similarity between generated captions and human-authored reference captions. This study contributes to the ongoing advancements in computer vision, natural language processing, and multimedia analytics by elucidating the intricate workings of image captioning systems and showcasing their practical applications. | en_US
dc.language.iso | en | en_US
dc.relation.ispartofseries | TD-7223; | -
dc.subject | BEYOND PIXELS | en_US
dc.subject | IMAGE CAPTIONING | en_US
dc.subject | SYNERGY OF VISION | en_US
dc.subject | LSTM | en_US
dc.subject | CNN | en_US
dc.title | BEYOND PIXELS: THE SYNERGY OF VISION AND LANGUAGE IN IMAGE CAPTIONING | en_US
dc.type | Thesis | en_US
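
The abstract describes a VGG16 + LSTM encoder-decoder pipeline built with TensorFlow and Keras. Below is a minimal, illustrative sketch of such a model, not the thesis's actual code: the merge-style decoder, the extract_feature helper, and the vocab_size and max_length values are assumptions chosen for the example.

```python
# Minimal sketch of a CNN-LSTM image captioning model, assuming a merge-style
# encoder-decoder: VGG16 image features + LSTM-encoded partial captions.
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add

vocab_size = 8000   # assumed caption vocabulary size
max_length = 34     # assumed maximum caption length in tokens

# Encoder: VGG16 with its classification layer removed yields a 4096-d image feature.
vgg = VGG16()
vgg = Model(inputs=vgg.inputs, outputs=vgg.layers[-2].output)

def extract_feature(path):
    """Load an image, preprocess it for VGG16, and return its 4096-d feature vector."""
    image = img_to_array(load_img(path, target_size=(224, 224)))
    image = preprocess_input(np.expand_dims(image, axis=0))
    return vgg.predict(image, verbose=0)

# Decoder: project the image feature and the partial caption into a shared space,
# merge them, and predict the next word over the vocabulary.
image_input = Input(shape=(4096,))
img_dense = Dense(256, activation="relu")(Dropout(0.5)(image_input))

caption_input = Input(shape=(max_length,))
seq = Embedding(vocab_size, 256, mask_zero=True)(caption_input)
seq = LSTM(256)(Dropout(0.5)(seq))

merged = Dense(256, activation="relu")(add([img_dense, seq]))
output = Dense(vocab_size, activation="softmax")(merged)

model = Model(inputs=[image_input, caption_input], outputs=output)
model.compile(loss="categorical_crossentropy", optimizer="adam")
```

In this kind of setup, captions are generated word by word from a start token, and quality is typically scored against human reference captions with BLEU (e.g. NLTK's corpus_bleu), which matches the evaluation described in the abstract.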
Appears in Collections:M.E./M.Tech. Computer Engineering

Files in This Item:
File | Description | Size | Format
NIRDOSH GANDHI M.Tech..pdf | | 2.4 MB | Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.