Please use this identifier to cite or link to this item: http://dspace.dtu.ac.in:8080/jspui/handle/repository/22473
Full metadata record
DC Field | Value | Language
dc.contributor.author | KURCHANIYA, DIKSHA | -
dc.date.accessioned | 2025-12-29T08:36:55Z | -
dc.date.available | 2025-12-29T08:36:55Z | -
dc.date.issued | 2025-09 | -
dc.identifier.uri | http://dspace.dtu.ac.in:8080/jspui/handle/repository/22473 | -
dc.description.abstract (en_US):

Human Activity Recognition (HAR) has emerged as a key area of research due to its broad spectrum of real-world applications in surveillance, healthcare monitoring, smart homes, and human-computer interaction. With the rise of vision-based systems and the increasing availability of data acquisition devices such as CCTV cameras and RGB-D sensors, there has been significant progress in developing intelligent systems capable of understanding human behavior. However, the deployment of HAR systems in unconstrained environments still faces numerous challenges, including occlusion of body parts, variations in viewpoint, incomplete or noisy skeleton data, and dynamic environmental conditions. These challenges limit the performance and generalizability of many existing models, particularly when operating in real time or under limited data conditions. To overcome these limitations, this thesis proposes a series of deep learning-based frameworks designed to enhance the robustness, accuracy, and adaptability of vision-based HAR.

The first work introduces a two-stream deep neural network architecture that independently extracts spatial and temporal features from RGB video frames using the Xception model and Bidirectional Long Short-Term Memory (Bi-LSTM) networks, respectively. This design enables the model to learn fine-grained contextual information along with motion dynamics, thereby improving recognition of complex and subtle human activities.
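As a rough illustration of this frame-level CNN plus Bi-LSTM pattern, the Keras sketch below wires an Xception feature extractor to a bidirectional LSTM classifier. It is a minimal sketch, not the thesis implementation: the thesis runs the spatial and temporal streams independently, whereas this sketch chains them for brevity, and the frame count, input resolution, LSTM width, and class count are assumed values.

```python
# Minimal sketch of a per-frame Xception + Bi-LSTM activity classifier
# (illustrative only; all hyperparameters are assumptions, not thesis settings).
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_FRAMES, HEIGHT, WIDTH, NUM_CLASSES = 16, 224, 224, 60  # assumed values

# Spatial features: Xception applied to every frame via TimeDistributed.
backbone = tf.keras.applications.Xception(
    include_top=False, weights=None, pooling="avg",
    input_shape=(HEIGHT, WIDTH, 3))

clip = layers.Input(shape=(NUM_FRAMES, HEIGHT, WIDTH, 3))
frame_feats = layers.TimeDistributed(backbone)(clip)  # (batch, frames, 2048)

# Temporal features: a Bi-LSTM summarizes motion dynamics across frames.
temporal = layers.Bidirectional(layers.LSTM(256))(frame_feats)

outputs = layers.Dense(NUM_CLASSES, activation="softmax")(temporal)
model = models.Model(clip, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```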
In our second work, we propose a dual-stream HAR model that integrates channel-wise attention mechanisms with Motion of Oriented Gradients (MOG) and Appearance Information (AI). The model processes RGB and Gradient Motion Information (GMI) frames in parallel using the Xception network and fuses them via a point-wise ConvBi-LSTM. This dual-stream approach effectively emphasizes critical features using channel-wise attention and captures spatiotemporal dependencies for improved classification, and is validated on several benchmark datasets. Further, to investigate performance, existing state-of-the-art HAR models are systematically evaluated on a synthesized dataset tailored to represent real-world complexity, providing insights into their generalization capabilities.

In our third work, we address the challenge of occluded or missing skeleton data, a common problem in surveillance scenarios. A skeleton-based HAR framework is proposed that utilizes Generative Adversarial Imputation Networks (GAIN) to recover missing keypoints and integrates Bi-LSTM with spatial and temporal attention mechanisms. This approach not only reconstructs occluded skeleton data but also enhances the overall feature representation, leading to higher classification accuracy in both occluded and non-occluded environments.

To further improve occlusion handling, the fourth framework introduces a multi-stream Spatial Temporal Graph Convolutional Network (STGCN). This model is designed to dynamically redirect under-activated or weakly visible joints to additional processing streams using an activation-based selection mechanism. This multi-stream setup allows for a more complete and discriminative feature extraction process, even when some joints are partially or completely hidden from view.

The fifth work targets the problem of viewpoint variation, which can significantly impact the recognition performance of vision-based HAR systems. A hybrid model is proposed that combines handcrafted Uniform Rotation-Invariant Local Binary Patterns (URI-LBP) with deep features extracted from a fine-tuned VGG16 network. These fused features are passed through a Spatio-Temporal LSTM to capture the temporal evolution of actions across different viewpoints. The resulting model achieves view-invariant activity recognition and demonstrates strong performance across multiple camera perspectives.

In addition, a synthesized occlusion dataset was developed to simulate realistic occlusion scenarios and enable rigorous evaluation of the proposed methods under controlled conditions. The experimental validation of all models was conducted on multiple benchmark datasets, including UCF Crime, NTU-RGB+D, PKU-MMD, IXMAS, and CASIA, as well as the synthesized occlusion dataset. The results show consistent improvements over state-of-the-art models in terms of accuracy, robustness to occlusion, and viewpoint generalization.

This work highlights the capabilities of adaptive, AI-driven HAR systems in enabling next-generation intelligent environments and lays a foundation for future exploration in personalized and context-aware activity modeling. The proposed solutions bridge the gap between controlled research settings and practical deployment by enabling activity recognition systems to function effectively under occlusion, viewpoint variation, and data limitations. These contributions pave the way for future work in real-time, multimodal, and context-aware HAR systems capable of supporting intelligent applications in diverse domains.
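For concreteness, the handcrafted half of the fifth framework can be sketched with scikit-image, whose local_binary_pattern function implements the rotation-invariant uniform LBP. This is an illustration only, not the thesis implementation; the neighborhood parameters P and R and the histogram size are assumed values.

```python
# Illustrative URI-LBP frame descriptor using scikit-image
# (P, R, and the histogram size are assumptions, not thesis values).
import numpy as np
from skimage.feature import local_binary_pattern

P, R = 8, 1  # sampling points and radius of the LBP neighborhood

def uri_lbp_histogram(gray_frame: np.ndarray) -> np.ndarray:
    """Return a normalized uniform rotation-invariant LBP histogram."""
    # method="uniform" yields the rotation-invariant uniform LBP, which
    # produces P + 2 distinct codes (P uniform patterns, flat, non-uniform).
    codes = local_binary_pattern(gray_frame, P, R, method="uniform")
    hist, _ = np.histogram(codes, bins=P + 2, range=(0, P + 2), density=True)
    return hist.astype(np.float32)

if __name__ == "__main__":
    frame = np.random.randint(0, 256, (224, 224), dtype=np.uint8)  # stand-in frame
    print(uri_lbp_histogram(frame))  # 10-dimensional descriptor
```

In the hybrid design described above, such per-frame histograms would be fused with VGG16 deep features before the Spatio-Temporal LSTM; that fusion step is omitted here.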
dc.language.iso | en | en_US
dc.relation.ispartofseries | TD-8301; | -
dc.subject | HUMAN ACTIVITY RECOGNITION | en_US
dc.subject | DEEP LEARNING | en_US
dc.subject | ACTIVITY RECOGNITION | en_US
dc.subject | Bi-LSTM | en_US
dc.subject | HAR MODEL | en_US
dc.title | DEVELOPMENT OF DEEP LEARNING BASED FRAMEWORK FOR ACTIVITY RECOGNITION | en_US
dc.type | Thesis | en_US
Appears in Collections: Ph.D. Computer Engineering

Files in This Item:
File | Description | Size | Format
HARSH MEHRA M.Des..pdf | - | 3.38 MB | Adobe PDF
HARSH MEHRA Plag..pdf | - | 3.49 MB | Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.