IMAGE CAPTIONING USING YOLO'S OUTPUT AS INPUT TO ENCODER-DECODER LSTM

KUMAR, NAVNEET

Please use this identifier to cite or link to this item: http://dspace.dtu.ac.in:8080/jspui/handle/repository/16596

Title:	IMAGE CAPTIONING USING YOLO'S OUTPUT AS INPUT TO ENCODER-DECODER LSTM
Authors:	KUMAR, NAVNEET
Keywords:	IMAGE CAPTIONING ENCODER-DECODER LSTM RECURRENT NEURAL NETWORK
Issue Date:	May-2018
Series/Report no.:	TD-4462
Abstract:	Caption generation for images has recently received considerable attention. This thesis examines how a computer can look at an image and output a description of it. The process has many potential real-world applications. A noteworthy one is storing an image's caption so that the image can later be retrieved on the basis of that description alone. With a few modifications, the system could also assist visually impaired persons with their daily chores. The task itself is simple to state: given an input image, the algorithm is expected to describe what the image contains. By description we mean that the system reports the objects present in the image and the actions those objects are performing. Such tasks are trivial for humans but non-trivial for computers. Thanks to advances in deep learning, computers are now reaching human-level performance. This thesis introduces a generic, end-to-end trainable Convolutional Neural Network (CNN)-Recurrent Neural Network (RNN) fusion-based technique for image captioning. In particular, an image is fed into a CNN, and the CNN's output is then fed to an Encoder-Decoder network. The task of the CNN is to output a set of objects and their locations. The Encoder-Decoder network takes the CNN's output as the input to its Encoder. The Encoder uses a two-stream RNN to encode the information coming from the CNN, and the encoded information is then passed to the Decoder. The Decoder uses a standard LSTM neural network to generate text. The standard MS-COCO captioning dataset is used for this task.
URI:	http://dspace.dtu.ac.in:8080/jspui/handle/repository/16596
Appears in Collections:	M.E./M.Tech. Electronics & Communication Engineering
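
The pipeline described in the abstract (detector output, two-stream RNN encoder, LSTM decoder) can be illustrated with a small NumPy sketch. This is not the thesis implementation. The abstract does not say what the two streams carry, so the split into a class-label stream and a box-coordinate stream is an assumption, as are the toy vocabularies, the hidden size, the random untrained weights, and every function name below. With untrained weights the generated tokens are meaningless; the sketch only shows how data would flow through such a model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabularies; the thesis uses MS-COCO, not these lists.
CLASSES = ["person", "dog", "frisbee"]
VOCAB = ["<start>", "<end>", "a", "person", "dog", "frisbee", "with", "and"]
HIDDEN = 8  # assumed size of each encoder stream


def rnn_encode(inputs, w_in, w_h):
    """Run a plain tanh RNN over a sequence and return the final hidden state."""
    h = np.zeros(w_h.shape[0])
    for x in inputs:
        h = np.tanh(w_in @ x + w_h @ h)
    return h


def two_stream_encode(detections, params):
    """Encode labels and boxes in separate RNN streams (assumed split), then concatenate."""
    labels = [np.eye(len(CLASSES))[CLASSES.index(d[0])] for d in detections]
    boxes = [np.array(d[1:], dtype=float) for d in detections]
    h_label = rnn_encode(labels, params["label_in"], params["label_h"])
    h_box = rnn_encode(boxes, params["box_in"], params["box_h"])
    return np.concatenate([h_label, h_box])


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x, h, c, w):
    """One standard LSTM step; w maps [x, h] to the four stacked gate pre-activations."""
    i, f, o, g = np.split(w @ np.concatenate([x, h]), 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c


def greedy_decode(code, params, max_len=10):
    """Start the LSTM from the encoder output and emit tokens greedily until <end>."""
    h, c = code.copy(), np.zeros_like(code)
    token, caption = VOCAB.index("<start>"), []
    for _ in range(max_len):
        x = np.eye(len(VOCAB))[token]
        h, c = lstm_step(x, h, c, params["lstm"])
        token = int(np.argmax(params["out"] @ h))
        if VOCAB[token] == "<end>":
            break
        caption.append(VOCAB[token])
    return caption


def init_params():
    """Random, untrained weights; a real model would learn these from captions."""
    d = 2 * HIDDEN  # decoder state size equals the concatenated encoder streams
    shapes = {
        "label_in": (HIDDEN, len(CLASSES)), "label_h": (HIDDEN, HIDDEN),
        "box_in": (HIDDEN, 4), "box_h": (HIDDEN, HIDDEN),
        "lstm": (4 * d, len(VOCAB) + d), "out": (len(VOCAB), d),
    }
    return {name: rng.normal(scale=0.1, size=shape) for name, shape in shapes.items()}


# Hypothetical detector output: (class label, x, y, width, height), normalised to [0, 1].
detections = [("person", 0.2, 0.3, 0.4, 0.6), ("dog", 0.6, 0.7, 0.2, 0.2)]
params = init_params()
code = two_stream_encode(detections, params)
caption = greedy_decode(code, params)
print(caption)
```

Concatenating the two final hidden states is only one way to fuse the streams. It is chosen here because the fused vector can then initialise the decoder state directly.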

Files in This Item:

File	Description	Size	Format
navneet_thesis_final.pdf		4.85 MB	Adobe PDF
