Please use this identifier to cite or link to this item:
http://dspace.dtu.ac.in:8080/jspui/handle/repository/18777
Title: | HUMAN ACTION AND ACTIVITY RECOGNITION USING VIDEO SEQUENCES |
Authors: | SINGH, TEJ |
Keywords: | VIDEO SEQUENCES; VISION-BASED UNDERSTANDING; ARTIFICIAL INTELLIGENCE SOCIETY; SPATIAL EDGE DISTRIBUTION OF GRADIENTS (SEDGs) |
Issue Date: | Aug-2020 |
Publisher: | DELHI TECHNOLOGICAL UNIVERSITY |
Series/Report no.: | TD - 5274; |
Abstract: | Vision-based understanding of video sequences underpins numerous real-life applications such as gaming, robotics, patient monitoring, content-based retrieval, video surveillance, and security. One of the ultimate aims of the artificial intelligence community is to develop an automatic system that can accurately recognize and understand human behaviour and activities in video sequences. Over the past decade many efforts have been made to recognize human activity in videos, but it remains a challenging task due to intra-class action similarity, occlusions, view variations, and environmental conditions. To analyse and address the issues involved in recognizing human activity in video sequences, we first review, compare, and present the most popular and prominent state-of-the-art solutions. Based on this literature survey, the solutions are categorized into handcrafted feature-based descriptors and features learned automatically by deep architectures. Accordingly, the action recognition framework proposed in this thesis is divided into handcrafted and deep learning-based architectures, and new activity recognition algorithms are embedded in both the handcrafted and the automatically learned feature domains.

First, a novel handcrafted feature-based descriptor is presented. It addresses major challenges such as abrupt scene changes, cluttered backgrounds, and viewpoint variations through a visual-cognizance-based multi-resolution descriptor for action recognition using key pose frames. The descriptor is constructed by computing textural and spatial cues at multiple resolutions in still images obtained from video sequences. A fuzzy inference model selects a single key pose image from each action video using the maximum histogram distance between stacks of frames. To represent these key pose images, textural traits at various orientations and scales are extracted with Gabor wavelets, while shape traits are computed through a multilevel approach called Spatial Edge Distribution of Gradients (SEDGs). Finally, a hybrid action descriptor combining the shape and textural evidence is developed, known as the Extended Multi-Resolution Features (EMRFs) model. Action classification is carried out with two widely used and efficient discriminative classifiers, SVM and k-NN. The performance of EMRF is evaluated on four publicly available datasets, where it shows outstanding accuracy compared with earlier state-of-the-art approaches, demonstrating its suitability for real-time applications.

Second, two deep learning-based ConvNet architectures are presented to overcome the limitations of handcrafted solutions. These ConvNet frameworks are based on transfer learning, utilizing a pre-trained deep model for feature extraction to identify human actions in video sequences. It is experimentally observed that a deep model pre-trained on a large annotated dataset transfers well to an action recognition task with a smaller training dataset. In the first work, a deeply coupled ConvNet for human activity recognition is proposed that processes RGB frames in the top stream with a Bi-directional Long Short-Term Memory (Bi-LSTM), while in the bottom stream a CNN is trained on a single Dynamic Motion Image (DMI). For the RGB frames, the CNN-Bi-LSTM model is trained end to end to refine the features of the pre-trained CNN, while the dynamic image stream fine-tunes the top layers of the pre-trained model to capture temporal information in videos. The features from both streams are fused at the decision level, after the softmax layer, using different late fusion techniques. The proposed model achieves the highest classification accuracies by a significant margin on four human action datasets: SBU Interaction, MIVIA Action, MSR Action Pair, and MSR Daily Activity, outperforming similar state-of-the-art methods.
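The two-stream, decision-level late fusion described above can be illustrated with a minimal PyTorch sketch. The ResNet-18 backbone, hidden sizes, clip-level temporal pooling, and equal-weight score averaging are illustrative assumptions, not the configuration used in the thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class RGBStream(nn.Module):
    """CNN-Bi-LSTM over a clip of RGB frames (illustrative sizes)."""
    def __init__(self, num_classes, hidden=256):
        super().__init__()
        backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # 512-d per frame
        self.bilstm = nn.LSTM(512, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, clip):                         # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1)).flatten(1)  # (B*T, 512)
        out, _ = self.bilstm(feats.view(b, t, -1))       # (B, T, 2*hidden)
        return self.fc(out.mean(dim=1))                  # clip-level logits

class DMIStream(nn.Module):
    """Pre-trained CNN fine-tuned on a single Dynamic Motion Image."""
    def __init__(self, num_classes):
        super().__init__()
        self.cnn = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.cnn.fc = nn.Linear(self.cnn.fc.in_features, num_classes)

    def forward(self, dmi):                          # dmi: (B, 3, H, W)
        return self.cnn(dmi)

def late_fuse(rgb_logits, dmi_logits, w=0.5):
    """Decision-level fusion: weighted average of the two softmax score streams."""
    return w * F.softmax(rgb_logits, dim=1) + (1 - w) * F.softmax(dmi_logits, dim=1)
```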
In the second proposed framework, a deep bottleneck multimodal feature fusion (D-BMFF) technique is presented that uses three modalities, RGB, depth (RGB-D), and 3D skeleton coordinates, for activity classification, since jointly exploiting all the information available from a depth sensor improves recognition. During training, RGB and depth frames are fed at regular intervals from each activity video, while the 3D coordinates are first converted into a single RGB skeleton motion history image (RGB-SklMHI). The multimodal features obtained from the bottleneck layers, just before the top layer, are fused using multiset discriminant correlation analysis (M-DCA), which supports robust visual action modelling. Finally, the fused features are classified with a linear multiclass support vector machine (SVM). The proposed approach is evaluated on four standard RGB-D datasets: UT-Kinect, CAD-60, Florence 3D, and SBU Interaction. Our method exhibits excellent results and outperforms state-of-the-art approaches.

Finally, the thesis concludes with its significant findings and future research directions in the field of human action recognition. |
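The final stage of the D-BMFF framework, fusing per-video bottleneck features from the three modality streams and classifying them with a linear SVM, can be sketched as follows. Plain concatenation stands in here for the M-DCA fusion step, and the function and array names are hypothetical.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def fuse_modalities(rgb_feats, depth_feats, sklmhi_feats):
    """Combine per-video bottleneck features from the RGB, depth, and
    RGB-SklMHI streams. Simple concatenation replaces M-DCA in this sketch."""
    return np.concatenate([rgb_feats, depth_feats, sklmhi_feats], axis=1)

def train_action_classifier(rgb_feats, depth_feats, sklmhi_feats, labels):
    """Fit a linear multiclass SVM on the fused multimodal features.

    rgb_feats, depth_feats, sklmhi_feats: (n_videos, d_i) arrays taken from the
    penultimate (bottleneck) layer of each fine-tuned network; labels: (n_videos,).
    """
    X = fuse_modalities(rgb_feats, depth_feats, sklmhi_feats)
    clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0))
    clf.fit(X, labels)
    return clf
```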
URI: | http://dspace.dtu.ac.in:8080/jspui/handle/repository/18777 |
Appears in Collections: | Ph.D. Electronics & Communication Engineering |
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
Thesis_TejSingh.pdf | | 5.75 MB | Adobe PDF | View/Open |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.