Please use this identifier to cite or link to this item:
http://dspace.dtu.ac.in:8080/jspui/handle/repository/21847
Title: A STUDY ON DEEP LEARNING AND TRANSFORMER BASED MODELS FOR HAND GESTURE AND ACTION RECOGNITION
Authors: SUTTY, SAHIL
Keywords: DEEP LEARNING; TRANSFORMER; HAND GESTURE; ACTION RECOGNITION
Issue Date: Jun-2025
Series/Report no.: TD-8070;
Abstract: Hand gesture and human action recognition are fundamental technologies in the evolution of human-computer interaction (HCI), enabling more natural, intuitive, and accessible interfaces across sectors including assistive technologies, robotics, virtual reality, and surveillance. Using the MSRA Hand Gesture Dataset and the UCF101 dataset, this paper presents a thorough comparative analysis of state-of-the-art deep learning and transformer-based models for hand gesture recognition and human action recognition. Comprising 76,500 depth images distributed over 17 gesture classes, the MSRA Hand Gesture Dataset offers a strong basis for spatial feature extraction. Among all architectures, ResNet101 obtained the highest F1-score (0.9978), closely followed by DenseNet169 (0.9919) and DenseNet201 (0.9901). MobileNetV2 demonstrated a good balance between computational efficiency and accuracy with an F1-score of 0.9847, while the VGG variants lagged because they lack more sophisticated architectural elements. Human action recognition used the UCF101 dataset, which consists of over 13,000 video clips across 101 action categories; experiments focused on the 50 most frequent classes to guarantee computational feasibility and class balance. With an F1-score of 0.9997, transformer-based models, especially ViT Tiny Patch, surpassed even the deepest CNNs. While MobileNetV2 again demonstrated efficiency in resource-constrained settings, VGG16bn's performance revealed the limits of older CNN architectures for demanding tasks. The results underline how architectural innovations, including residual connections, dense connectivity, and attention mechanisms, raise both recognition accuracy and computational efficiency. The paper argues that transformer-based models are redefining benchmarks even as deep CNNs remain strong candidates. In particular, hybrid CNN-transformer designs, explicit temporal modeling, and advanced augmentation techniques could further increase recognition capabilities in practical settings.
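As an illustration of the kind of evaluation pipeline the abstract describes, the sketch below swaps classification heads onto pretrained CNN and transformer backbones and computes an F1-score. This is a minimal sketch, not the thesis code: the `timm` model name `vit_tiny_patch16_224`, the macro averaging, and the `macro_f1` helper are assumptions made for illustration; only the class counts (17 MSRA gesture classes, 50 UCF101 classes) come from the abstract.

```python
# Minimal sketch (not the thesis code): fine-tuning-ready backbones plus an
# F1-score evaluation loop, assuming torch, torchvision, timm, and
# scikit-learn are installed.
import timm
import torch
import torch.nn as nn
from sklearn.metrics import f1_score
from torchvision import models

GESTURE_CLASSES = 17  # MSRA depth-image gesture classes (per the abstract)
ACTION_CLASSES = 50   # UCF101 subset used in the study (per the abstract)

# CNN baseline: ImageNet-pretrained ResNet101 with a fresh classification head.
resnet = models.resnet101(weights=models.ResNet101_Weights.DEFAULT)
resnet.fc = nn.Linear(resnet.fc.in_features, GESTURE_CLASSES)

# Transformer baseline: a ViT-Tiny from timm (model name is an assumption).
vit = timm.create_model("vit_tiny_patch16_224", pretrained=True,
                        num_classes=ACTION_CLASSES)

@torch.no_grad()
def macro_f1(model, loader, device="cpu"):
    """Run inference over a DataLoader of (image, label) batches and return
    the macro-averaged F1-score; the averaging mode is an assumption, since
    the abstract reports F1-scores without specifying it."""
    model.eval().to(device)
    preds, labels = [], []
    for x, y in loader:
        logits = model(x.to(device))
        preds.extend(logits.argmax(dim=1).cpu().tolist())
        labels.extend(y.tolist())
    return f1_score(labels, preds, average="macro")
```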
URI: http://dspace.dtu.ac.in:8080/jspui/handle/repository/21847
Appears in Collections: M.E./M.Tech. Information Technology
Files in This Item:
| File | Description | Size | Format | |
| --- | --- | --- | --- | --- |
| SAHIL SUTTY M.Tech.pdf | | 5.59 MB | Adobe PDF | View/Open |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.