Please use this identifier to cite or link to this item: http://dspace.dtu.ac.in:8080/jspui/handle/repository/21847
Full metadata record
DC Field                     Value                                                         Language
dc.contributor.author        SUTTY, SAHIL                                                  -
dc.date.accessioned          2025-07-08T08:48:56Z                                          -
dc.date.available            2025-07-08T08:48:56Z                                          -
dc.date.issued               2025-06                                                       -
dc.identifier.uri            http://dspace.dtu.ac.in:8080/jspui/handle/repository/21847    -
dc.description.abstract      As fundamental technologies in the evolution of human-computer interaction (HCI), hand gesture recognition and human action recognition enable more natural, intuitive, and accessible interfaces across sectors including assistive technologies, robotics, virtual reality, and surveillance. Using the MSRA Hand Gesture Dataset and the UCF101 dataset, this thesis presents a thorough comparative analysis of state-of-the-art deep learning and transformer-based models for hand gesture recognition and human action recognition. Comprising 76,500 depth images distributed over 17 gesture classes, the MSRA Hand Gesture Dataset offers a strong basis for spatial feature extraction. ResNet101 obtained the highest F1-score (0.9978) among all architectures, closely followed by DenseNet169 (0.9919) and DenseNet201 (0.9901). MobileNetV2 demonstrated a good balance between computational efficiency and accuracy with an F1-score of 0.9847, while the VGG variants lagged behind for lack of more sophisticated architectural elements. Human action recognition on the UCF101 dataset, which comprises over 13,000 video clips across 101 action categories, was restricted to the 50 most frequent classes to ensure computational feasibility and class balance. With an F1-score of 0.9997, transformer-based models, especially ViT Tiny Patch, surpassed even the deepest CNNs. While MobileNetV2 again demonstrated efficiency in resource-constrained settings, the performance of VGG16bn revealed the limits of older CNN architectures on demanding tasks. The results underline how architectural innovations, including residual connections, dense connectivity, and attention mechanisms, raise both recognition accuracy and computational efficiency. The thesis argues that transformer-based models are redefining benchmarks even as deep CNNs remain strong candidates, and identifies hybrid CNN-transformer designs, explicit temporal modeling, and advanced augmentation techniques as promising directions for improving recognition in practical settings.   en_US
dc.language.iso              en                                                            en_US
dc.relation.ispartofseries   TD-8070;                                                      -
dc.subject                   DEEP LEARNING                                                 en_US
dc.subject                   TRANSFORMER                                                   en_US
dc.subject                   HAND GESTURE                                                  en_US
dc.subject                   ACTION RECOGNITION                                            en_US
dc.title                     A STUDY ON DEEP LEARNING AND TRANSFORMER BASED MODELS FOR HAND GESTURE AND ACTION RECOGNITION   en_US
dc.type                      Thesis                                                        en_US
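
The abstract above describes a comparative fine-tuning study: pretrained CNN and transformer backbones are adapted to the gesture/action classes and ranked by F1-score. Below is a minimal illustrative sketch of such an evaluation pipeline, assuming PyTorch, torchvision, and scikit-learn; the dataset paths, ImageFolder layout, and hyperparameters are placeholders, not the thesis's actual configuration.

# Minimal sketch (not the thesis's code): fine-tune a pretrained torchvision
# backbone on an image-per-class dataset and report the macro F1-score used
# as the comparison metric in the abstract. Assumes data is laid out as
# ImageFolder directories (data/gestures/train/<class>/*.png) and that depth
# maps are replicated to 3 channels to match ImageNet-pretrained inputs.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms
from sklearn.metrics import f1_score

device = "cuda" if torch.cuda.is_available() else "cpu"

tfm = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),  # depth map -> 3 channels
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
train_ds = datasets.ImageFolder("data/gestures/train", transform=tfm)  # placeholder path
val_ds = datasets.ImageFolder("data/gestures/val", transform=tfm)      # placeholder path
train_dl = DataLoader(train_ds, batch_size=32, shuffle=True)
val_dl = DataLoader(val_ds, batch_size=32)

# Swapping in densenet169, mobilenet_v2, vgg16_bn, or a ViT backbone (with
# the matching classifier head) reproduces the comparative axis of the study.
model = models.resnet101(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, len(train_ds.classes))
model = model.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # placeholder hyperparameters
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(5):  # placeholder epoch count
    for images, labels in train_dl:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        criterion(model(images), labels).backward()
        optimizer.step()

# Macro F1 over the held-out split, matching the metric quoted above.
model.eval()
preds, truth = [], []
with torch.no_grad():
    for images, labels in val_dl:
        preds += model(images.to(device)).argmax(dim=1).cpu().tolist()
        truth += labels.tolist()
print("macro F1:", f1_score(truth, preds, average="macro"))
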
Appears in Collections: M.E./M.Tech. Information Technology

Files in This Item:
File                       Description    Size       Format
SAHIL SUTTY M.Tech.pdf                    5.59 MB    Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.