Please use this identifier to cite or link to this item:
http://dspace.dtu.ac.in:8080/jspui/handle/repository/20834
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.author | BISHT, HIMANSHU | - |
dc.date.accessioned | 2024-08-05T09:02:03Z | - |
dc.date.available | 2024-08-05T09:02:03Z | - |
dc.date.issued | 2024-05 | - |
dc.identifier.uri | http://dspace.dtu.ac.in:8080/jspui/handle/repository/20834 | - |
dc.description.abstract | Image captioning is a complex task at the intersection of computer vision and natural language processing, with the goal of producing human-like descriptive text for visual input. This study investigates the combination of an EfficientNetB2 image encoder with a Transformer-based language model for image captioning: EfficientNetB2 captures intricate visual features, while the Transformer generates well-structured, contextually appropriate captions. The Flickr8k dataset, a diverse collection of images paired with reference captions, is used for training and evaluation. Both images and captions undergo extensive preprocessing to make them compatible with the chosen architecture and to optimize overall performance and accuracy. The captioning model integrates the EfficientNetB2 encoder with a customized Transformer-based language model and is trained on the prepared dataset with careful tuning of hyperparameters such as batch size, learning rate, and number of training epochs. Results from the training and evaluation phases demonstrate the model's ability to produce captions that accurately reflect the visual content; training and validation metrics, together with caption quality scores, provide a thorough assessment of the model's efficacy. The study thereby demonstrates the effectiveness of integrating EfficientNetB2 and Transformer models for image captioning and identifies avenues for further research and optimization at the interface of computer vision and natural language processing. (A minimal illustrative sketch of this encoder-decoder pipeline is given after the metadata record below.) | en_US |
dc.language.iso | en | en_US |
dc.relation.ispartofseries | TD-7363; | - |
dc.subject | VISUAL TRANSFORMERS | en_US |
dc.subject | IMAGE UNDERSTANDING | en_US |
dc.subject | EFFICIENTNETB2 | en_US |
dc.title | VISUAL TRANSFORMERS FOR IMAGE UNDERSTANDING | en_US |
dc.type | Thesis | en_US |
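For readers who want a concrete picture of the pipeline described in the abstract, the following is a minimal sketch in tf.keras. It assumes a frozen EfficientNetB2 backbone and a single Transformer decoder block; the vocabulary size, caption length, embedding width, number of attention heads, and learning rate are placeholder values, not figures taken from the thesis.

```python
# Minimal illustrative sketch of an EfficientNetB2 + Transformer captioning model,
# written against tf.keras. This is NOT the thesis implementation; vocabulary size,
# caption length, embedding width, head count, and learning rate are assumed values.
import tensorflow as tf

VOCAB_SIZE = 10000   # assumed vocabulary size after Flickr8k text preprocessing
MAX_LEN = 30         # assumed maximum caption length in tokens
EMBED_DIM = 256      # assumed model / embedding width
NUM_HEADS = 4        # assumed number of attention heads


class PositionalEmbedding(tf.keras.layers.Layer):
    """Learned token + position embeddings (a common choice, assumed here)."""
    def __init__(self, vocab_size, max_len, embed_dim, **kwargs):
        super().__init__(**kwargs)
        self.tok = tf.keras.layers.Embedding(vocab_size, embed_dim)
        self.pos = tf.keras.layers.Embedding(max_len, embed_dim)

    def call(self, tokens):
        positions = tf.range(tf.shape(tokens)[1])
        return self.tok(tokens) + self.pos(positions)


# --- Image encoder: EfficientNetB2 kept frozen, its feature map flattened to a sequence ---
backbone = tf.keras.applications.EfficientNetB2(
    include_top=False, weights="imagenet", input_shape=(260, 260, 3))
backbone.trainable = False

image_in = tf.keras.Input(shape=(260, 260, 3), name="image")
feat = backbone(image_in)                                    # (9, 9, 1408) feature map
feat = tf.keras.layers.Reshape((-1, feat.shape[-1]))(feat)   # -> (81, 1408) token sequence
feat = tf.keras.layers.Dense(EMBED_DIM, activation="relu")(feat)

# --- Caption decoder: one Transformer decoder block (the thesis model may stack several) ---
tokens_in = tf.keras.Input(shape=(MAX_LEN,), dtype="int32", name="tokens")
x = PositionalEmbedding(VOCAB_SIZE, MAX_LEN, EMBED_DIM)(tokens_in)

# Masked self-attention over the partial caption.
self_att = tf.keras.layers.MultiHeadAttention(NUM_HEADS, key_dim=EMBED_DIM)(
    x, x, use_causal_mask=True)
x = tf.keras.layers.LayerNormalization()(x + self_att)

# Cross-attention from caption tokens to the image feature sequence.
cross_att = tf.keras.layers.MultiHeadAttention(NUM_HEADS, key_dim=EMBED_DIM)(x, feat)
x = tf.keras.layers.LayerNormalization()(x + cross_att)

# Position-wise feed-forward network with a residual connection.
ffn = tf.keras.layers.Dense(EMBED_DIM * 2, activation="relu")(x)
ffn = tf.keras.layers.Dense(EMBED_DIM)(ffn)
x = tf.keras.layers.LayerNormalization()(x + ffn)

logits = tf.keras.layers.Dense(VOCAB_SIZE, name="next_token_logits")(x)

model = tf.keras.Model(inputs=[image_in, tokens_in], outputs=logits)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # assumed learning rate
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
model.summary()
```

In this sketch the model is trained to predict the next caption token given the image features and the tokens seen so far, so the training target is simply each caption shifted by one position; freezing the backbone and using a single decoder block keep the example short and are not claims about the thesis configuration.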
Appears in Collections: | M.E./M.Tech. Electronics & Communication Engineering |
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
Himanshu M.Tech.pdf | | 11.86 MB | Adobe PDF | View/Open |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.