DSpace Community:

DSpace Community: http://dspace.dtu.ac.in:8080/jspui/handle/123456789/96 2026-07-25T11:35:28Z SEMANTIC-GUIDED DEEP LEARNING FRAMEWORKS FOR SCENE RECOGNITION: A COMPARATIVE STUDY OF CNN AND TRANSFORMER MODELS http://dspace.dtu.ac.in:8080/jspui/handle/repository/23006 Title: SEMANTIC-GUIDED DEEP LEARNING FRAMEWORKS FOR SCENE RECOGNITION: A COMPARATIVE STUDY OF CNN AND TRANSFORMER MODELS Authors: ALAM, MUHEET; Susan, Seba (SUPERVISOR) Abstract: Indoor scene recognition remains a challenging problem in computer vision due to large intra-class variation, strong inter-class similarity, and the complex contextual relationships that exist between objects and spatial layouts within indoor environments. Unlike object recognition, scene understanding requires the model to interpret not only the presence of semantic entities but also their spatial organization and contextual interactions. Conventional visual recognition approaches based solely on appearance features often struggle to capture these higher-level semantic relationships, particularly in scenes where multiple categories share similar visual structures. Semantic guidance has therefore emerged as an effective strategy for improving scene understanding by incorporating object-level contextual information into the recognition process. This thesis investigates how semantic supervision interacts with different deep neural representation architectures for indoor scene recognition. Rather than focusing solely on improving classification accuracy through larger models or architectural complexity, the study examines how the underlying representation structure of a backbone influences the effectiveness of semantic-guided feature learning. The work is structured as a progressive investigation across convolutional and transformer-based architectures under a consistent semantic-aware learning framework. The first phase of the study explores semantic-guided scene recognition using convolutional neural networks. 
A dual-branch framework consisting of an RGB branch and a semantic branch is employed, where semantic features derived from segmentation maps are integrated with visual representations through attention-based fusion. Within this framework, the effect of backbone architecture is analyzed by comparing ResNet-50 and ResNeXt-50 (32×4d) under identical training and fusion conditions. Experimental observations show that ResNeXt produces stronger scene representations and achieves improved recognition performance on the MIT Indoor-67 dataset. The results suggest that aggregated residual transformations and increased representational diversity enable more effective semantic-guided feature interaction than standard residual learning. Building upon these observations, the second phase extends the investigation to transformer-based architectures in order to analyze how different representation formats respond to semantic supervision. The study evaluates Vision Transformers and hierarchical Swin Transformers within a representation-aligned semantic learning framework. Since transformer architectures organize visual information differently, semantic representations are adapted to match the native structure of each backbone. Semantic maps are converted into token representations for Vision Transformers to enable token-level cross-attention, while hierarchical spatial semantic features are used for Swin Transformers to preserve locality and spatial alignment during fusion. Experimental results indicate that hierarchical transformer representations achieve more effective semantic-guided scene understanding than token-only representations. In particular, Swin-Tiny demonstrates stronger performance and more stable semantic interaction behavior compared to ViT-based models despite lower model complexity. 
Collectively, the findings of this thesis suggest that the effectiveness of semantic-aware scene recognition depends not only on the availability of semantic information, but also on how naturally the representation structure of the architecture supports semantic integration. Architectures that preserve spatial hierarchy and contextual locality appear to align more effectively with semantic scene cues than architectures relying purely on global token interactions. The study further highlights the importance of representation-aware semantic encoding when designing multimodal scene understanding systems. Overall, this thesis presents a structured empirical investigation into semantic-guided representation learning across modern deep neural architectures for indoor scene recognition. The work establishes that semantic supervision becomes more effective when aligned with the native representation structure of the underlying backbone, and it provides insights that may guide future research in semantic-aware visual representation learning, multimodal scene understanding, and architecture-aware fusion design. 2026-05-01T00:00:00Z

PARAKH: A COMPREHENSIVE FRAMEWORK FOR EVALUATING SOCIAL BIAS IN HINDI-LANGUAGE LARGE LANGUAGE MODELS http://dspace.dtu.ac.in:8080/jspui/handle/repository/23001 Title: PARAKH: A COMPREHENSIVE FRAMEWORK FOR EVALUATING SOCIAL BIAS IN HINDI-LANGUAGE LARGE LANGUAGE MODELS Authors: WEIKER, ASHWINI; Sharma, KAPIL (SUPERVISOR) Abstract: Large Language Models (LLMs) are increasingly deployed across India, yet infrastructure for evaluating their social biases in Indian languages remains absent. Existing benchmarks (BBQ, CrowS-Pairs, WinoBias, BOLD) are English-centric and miss India-specific bias axes such as caste discrimination, religious communalism, and regional prejudice. This thesis presents PARAKH (Probing AI Responses Against Known Hindustani-societal biases), the first comprehensive Hindi-language LLM bias benchmark. 
PARAKH comprises 1,000 expert-crafted Hindi prompts spanning eight bias categories (Caste, Religious, Gender, Regional & Linguistic, Colorism, Class & Economic, LGBTQ+, Age & Disability), four difficulty levels, and five prompt types. Five LLMs are evaluated (Llama 3.1 8B, Qwen3 8B, Gemma 2 9B, Gemini 2.5 Flash-Lite, and Sarvam-1 2B) using a novel five-dimensional composite scoring rubric (Harm, Stereotype, Sycophancy, Refusal Quality, Counterfactual Fairness) with automated dual-judge validation. Evaluation of 1,048 judgments reveals significant inter-model variation. Gemma 2 9B performs best (mean composite 1.55, 76.8% proper refusal rate), while Qwen3 8B performs worst (mean 3.10, 30% failed-refusal rate). Sarvam-1 2B, despite only 2B parameters, matches Llama 3.1 8B (2.26 vs. 2.23), suggesting India-focused training partially compensates for size. Gender Bias is the hardest category for 3 of 5 models, and Role Play prompts most effectively bypass safety mechanisms (mean 2.92 vs. 1.57 for Opinion Seeking). Inter-judge agreement (κ = 0.384) is consistent with human annotator levels in bias literature. Notably, one model produced a narrative justifying a Dalit engineer’s dismissal because “एक नीची जाति के को ऊंची जाति के ठेकेदारों को निर्देश देने का अधिकार नहीं” (a lower-caste person has no right to give orders to upper-caste contractors), with no refusal mechanism activating. PARAKH establishes the first reproducible infrastructure for Hindi-language LLM bias evaluation. 
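The inter-judge agreement figure above is reported as a κ value. As a minimal sketch of how such a statistic can be computed, assuming κ denotes unweighted Cohen's kappa over categorical labels (the abstract does not say which variant was used), the function name `cohens_kappa` and the label lists below are hypothetical illustrations and not PARAKH data:

```python
from collections import Counter


def cohens_kappa(judge_a, judge_b):
    """Unweighted Cohen's kappa for two equal-length lists of labels."""
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(judge_a)
    # Observed agreement: fraction of items on which both judges agree.
    p_o = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    # Chance agreement: product of the judges' marginal label frequencies.
    freq_a, freq_b = Counter(judge_a), Counter(judge_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both judges used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Made-up labels for illustration only (not taken from the benchmark).
a = ["refuse", "refuse", "comply", "comply", "refuse", "comply"]
b = ["refuse", "comply", "comply", "comply", "refuse", "refuse"]
print(round(cohens_kappa(a, b), 3))
```

Kappa corrects raw agreement for the agreement expected by chance, which is why a value such as 0.384 can coexist with a much higher raw agreement rate.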
2026-05-01T00:00:00Z

DATA-DRIVEN FASHION: ENHANCING CONSUMER DECISIONS THROUGH TREND, PRICE, AND RATING ANALYSIS http://dspace.dtu.ac.in:8080/jspui/handle/repository/22994 Title: DATA-DRIVEN FASHION: ENHANCING CONSUMER DECISIONS THROUGH TREND, PRICE, AND RATING ANALYSIS Authors: LONARE, SAMEER; SHARMA, KAPIL (SUPERVISOR) Abstract: The swift growth of e-commerce calls for accurate, context-aware recommendation systems, especially in the fashion sector, where users’ preferences depend on both distinctive visual traits and categorical features. Most fashion retrieval approaches rely on either visual similarity or text-based metadata alone, and so do not account for the multi-dimensionality of fashion objects. This thesis introduces a novel Hybrid Recommendation System Architecture to bridge the semantic gap between visual appearance and contextual information. The proposed method uses a two-pass extraction approach. In the first pass, a pre-trained ResNet50 deep learning backbone extracts visual features, to which global max pooling and L2 normalization are applied to obtain robust image embeddings. At the same time, a categorical metadata pipeline performs sparse one-hot encoding of explicit item attributes. These two feature sets are fused with a customisable weighted fusion algorithm, which can be tuned to balance visual and textual significance. The system architecture also features an optimized offline serialization process so that the system is usable in the real world while maintaining low-latency retrieval. Evidence shows that the proposed hybrid approach outperforms unimodal baseline approaches. In a comparative ablation in which only the hybrid model was enabled, it achieved a Precision@5 score of 94%, outperforming both visual-only and metadata-only retrieval. 
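The weighted fusion and Precision@k evaluation described above can be illustrated with a small, dependency-free sketch. It uses score-level fusion (a weighted sum of two cosine similarities), which may differ from the thesis's own fusion algorithm; the names `hybrid_score` and `precision_at_k`, the weight of 0.7, and the toy catalogue are all assumptions made for illustration:

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine(u, v):
    """Cosine similarity, computed as the dot product of unit vectors."""
    u, v = l2_normalize(u), l2_normalize(v)
    return sum(a * b for a, b in zip(u, v))


def hybrid_score(query, item, visual_weight=0.7):
    """Weighted sum of visual and metadata similarity (weight is assumed)."""
    visual = cosine(query["image"], item["image"])
    meta = cosine(query["onehot"], item["onehot"])
    return visual_weight * visual + (1.0 - visual_weight) * meta


def precision_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of the top-k retrieved items that are relevant."""
    return sum(1 for i in ranked_ids[:k] if i in relevant_ids) / k


# Toy data: 2-D stand-ins for image embeddings, 3-way one-hot categories.
query = {"image": [1.0, 0.0], "onehot": [1, 0, 0]}
catalog = {
    "shirt_a": {"image": [0.9, 0.1], "onehot": [1, 0, 0]},
    "shoe_b": {"image": [0.9, 0.1], "onehot": [0, 1, 0]},
    "shirt_c": {"image": [0.0, 1.0], "onehot": [1, 0, 0]},
}
ranked = sorted(catalog, key=lambda k: hybrid_score(query, catalog[k]), reverse=True)
print(ranked, precision_at_k(ranked, {"shirt_a", "shirt_c"}, k=2))
```

Raising `visual_weight` favours look-alike items regardless of category, while lowering it favours items that share explicit attributes with the query.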
Overall, this study offers a scalable and efficient platform suited to modern web infrastructure, enhancing product discovery and automated fashion curation for users. 2026-05-01T00:00:00Z

EARLY PREDICTION OF PUBLIC OPINION TRENDS IN THE 2024 U.S. PRESIDENTIAL ELECTION USING TOPIC MODELING, DENDROGRAM CLUSTERING, AND SENTIMENT ANALYSIS http://dspace.dtu.ac.in:8080/jspui/handle/repository/22988 Title: EARLY PREDICTION OF PUBLIC OPINION TRENDS IN THE 2024 U.S. PRESIDENTIAL ELECTION USING TOPIC MODELING, DENDROGRAM CLUSTERING, AND SENTIMENT ANALYSIS Authors: SRIVASTAVA, SHREYA Abstract: Traditional forecasting techniques, including polls, focus groups and media commentary, are slow, expensive and unable to determine accurately what the average voter is really thinking. Twitter (now X) offers something different: an enormous, real-time record of political opinion written spontaneously by millions of ordinary people, in their own words, without any filter. This thesis explores whether that stream of thought, specifically conversations on Twitter between May and July 2024, three months prior to the US Presidential Election, can provide an advance look at public sentiment. Two entirely independent methods were applied to a dataset of approximately 50,000 tweets drawn from that window. First, Latent Dirichlet Allocation was used to extract underlying themes from the corpus. Three topics emerged, with the one centred on Donald Trump and the MAGA movement proving the most coherent and internally consistent. Hierarchical clustering confirmed this distinctiveness, with "MAGA" forming its own separate cluster sitting apart even from closely associated terms like "GOP" and "republican". The second method analyzed sentiment using four lexicon-based tools: VADER, AFINN, TextBlob and SentiWordNet. Tweets mentioning Trump and tweets mentioning Biden were scored separately, then normalized for fair comparison. 
Across all four tools, the data consistently showed a more positive tone in Trump-related tweets than in Biden-related ones. Critically, these two analyses never interacted or shared information, yet they arrived at the same conclusion. Well ahead of polling day, Twitter discourse surrounding Trump demonstrated both greater thematic coherence and a more favorable emotional tone than discourse surrounding Biden, and this independent convergence represents the central finding of this thesis. 2026-05-01T00:00:00Z
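The abstract above says that scores from different lexicon tools were normalized before comparison but does not specify how. One common option, sketched here purely as an assumption, is to rescale each tool's raw scores linearly onto [-1, 1] and then compare group means; the function names, the raw score range, and the numbers are hypothetical and not taken from the thesis:

```python
def rescale(scores, lo, hi):
    """Linearly map scores from the range [lo, hi] onto [-1, 1]."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return [2.0 * (s - lo) / (hi - lo) - 1.0 for s in scores]


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Hypothetical raw scores from a tool whose output is assumed to span [-5, 5].
group_one_raw = [3, 1, -2, 4]
group_two_raw = [1, -3, 0, -2]

group_one = mean(rescale(group_one_raw, -5, 5))
group_two = mean(rescale(group_two_raw, -5, 5))
print(group_one > group_two)
```

Putting every tool on the same scale lets the per-group means from tools with different native ranges be read side by side.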