Please use this identifier to cite or link to this item: http://dspace.dtu.ac.in:8080/jspui/handle/repository/21847
Full metadata record
DC Field                     Value                                                         Language
dc.contributor.author        SUTTY, SAHIL                                                  -
dc.date.accessioned          2025-07-08T08:48:56Z                                          -
dc.date.available            2025-07-08T08:48:56Z                                          -
dc.date.issued               2025-06                                                       -
dc.identifier.uri            http://dspace.dtu.ac.in:8080/jspui/handle/repository/21847    -
dc.description.abstract      As fundamental technologies in the evolution of human-computer interaction (HCI), hand gesture recognition and human action recognition enable more natural, intuitive, and accessible interfaces across sectors including assistive technologies, robotics, virtual reality, and surveillance. Using the MSRA Hand Gesture Dataset and the UCF101 dataset, this thesis presents a thorough comparative analysis of state-of-the-art deep learning and transformer-based models for hand gesture recognition and human action recognition. Comprising 76,500 depth images distributed over 17 gesture classes, the MSRA Hand Gesture Dataset offers a strong basis for spatial feature extraction. ResNet101 obtained the highest F1-score (0.9978) among all architectures, closely followed by DenseNet169 (0.9919) and DenseNet201 (0.9901). MobileNetV2 demonstrated a good balance between computational efficiency and accuracy with an F1-score of 0.9847, while the VGG variants lagged behind for lack of more sophisticated architectural elements. Human action recognition on the UCF101 dataset, which comprises over 13,000 video clips across 101 action categories, was restricted to the 50 most frequent classes to ensure computational feasibility and class balance. With an F1-score of 0.9997, transformer-based models, especially ViT Tiny Patch, surpassed even the deepest CNNs. While MobileNetV2 again demonstrated efficiency in resource-constrained settings, the performance of VGG16bn revealed the limits of older CNN architectures on demanding tasks. The results underline how architectural innovations, including residual connections, dense connectivity, and attention mechanisms, raise both recognition accuracy and computational efficiency. The thesis argues that transformer-based models are redefining benchmarks even as deep CNNs remain strong candidates, and identifies hybrid CNN-transformer designs, explicit temporal modeling, and advanced augmentation techniques as promising directions for improving recognition in practical settings.   en_US
dc.language.iso              en                                                            en_US
dc.relation.ispartofseries   TD-8070;                                                      -
dc.subject                   DEEP LEARNING                                                 en_US
dc.subject                   TRANSFORMER                                                   en_US
dc.subject                   HAND GESTURE                                                  en_US
dc.subject                   ACTION RECOGNITION                                            en_US
dc.title                     A STUDY ON DEEP LEARNING AND TRANSFORMER BASED MODELS FOR HAND GESTURE AND ACTION RECOGNITION   en_US
dc.type                      Thesis                                                        en_US
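
The abstract above describes a comparative fine-tuning study: pretrained CNN and transformer backbones are adapted to the gesture/action classes and ranked by F1-score. Below is a minimal illustrative sketch of such an evaluation pipeline, assuming PyTorch, torchvision, and scikit-learn; the dataset paths, ImageFolder layout, and hyperparameters are placeholders, not the thesis's actual configuration.

# Minimal sketch (not the thesis's code): fine-tune a pretrained torchvision
# backbone on an image-per-class dataset and report the macro F1-score used
# as the comparison metric in the abstract. Assumes data is laid out as
# ImageFolder directories (data/gestures/train/<class>/*.png) and that depth
# maps are replicated to 3 channels to match ImageNet-pretrained inputs.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms
from sklearn.metrics import f1_score

device = "cuda" if torch.cuda.is_available() else "cpu"

tfm = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),  # depth map -> 3 channels
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
train_ds = datasets.ImageFolder("data/gestures/train", transform=tfm)  # placeholder path
val_ds = datasets.ImageFolder("data/gestures/val", transform=tfm)      # placeholder path
train_dl = DataLoader(train_ds, batch_size=32, shuffle=True)
val_dl = DataLoader(val_ds, batch_size=32)

# Swapping in densenet169, mobilenet_v2, vgg16_bn, or a ViT backbone (with
# the matching classifier head) reproduces the comparative axis of the study.
model = models.resnet101(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, len(train_ds.classes))
model = model.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # placeholder hyperparameters
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(5):  # placeholder epoch count
    for images, labels in train_dl:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        criterion(model(images), labels).backward()
        optimizer.step()

# Macro F1 over the held-out split, matching the metric quoted above.
model.eval()
preds, truth = [], []
with torch.no_grad():
    for images, labels in val_dl:
        preds += model(images.to(device)).argmax(dim=1).cpu().tolist()
        truth += labels.tolist()
print("macro F1:", f1_score(truth, preds, average="macro"))
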
Appears in Collections: M.E./M.Tech. Information Technology

Files in This Item:
File                       Description    Size       Format
SAHIL SUTTY M.Tech.pdf                    5.59 MB    Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.