DSpace Collection:

STRUCTURED LATENT SPACE EXPLORATION WITH TRANSFORMER ENCODERS FOR DIVERSIFIED AND PERSONALIZED MULTIRATER MEDICAL IMAGE SEGMENTATION

2026-06-25T04:57:11Z

Title: STRUCTURED LATENT SPACE EXPLORATION WITH TRANSFORMER ENCODERS FOR DIVERSIFIED AND PERSONALIZED MULTIRATER MEDICAL IMAGE SEGMENTATION
Authors: SHUKLA, KESHU; Verma, Bindu (SUPERVISOR)
Abstract: Multi-rater medical image segmentation requires models that capture inter-annotator disagreement, not average it away. Standard probabilistic models process all rater annotations through one shared encoder: when four radiologists label the same nodule differently, their reconstruction gradients partially cancel inside that encoder and the latent code ends up as a gradient-weighted compromise across all four boundary decisions. This is why prior samples from such a model cluster near the mean annotation rather than spanning the actual range of what qualified radiologists drew. We address this by replacing the shared posterior with N independent per-rater posterior encoders q_i(z_i | x, y_i), one per annotator. Each receives a 2-channel input: the image and a single rater mask. Gradient isolation follows from the per-rater ELBO decomposition, not from any regularisation: because L_i depends only on z_i, the chain rule gives ∂L_i/∂z_j = 0 for i ≠ j, so rater i's reconstruction gradient cannot reach rater j's encoder.
On LIDC-IDRI (1,018 CT scans, 4 radiologists, 1,609 nodule patches, 4-fold cross-validation), the per-rater model (Stage 1 only) achieves GED 0.1444 ± 0.0141 (−4.2%) and Dice_match 0.9112 ± 0.0061 (+2.28% relative), both measured against the full D-Persona two-stage pipeline. A systematic ablation tests transformer-based encoders (MiT-B2), orthogonality regularisation, a discretised prior bank (k = 100), a dual diversity loss, and Stage 2 style vectors against the D-Persona baseline. Per-rater posteriors are the only modification that consistently improves both metrics at once. Transformer encoder capacity, tested as a direct competing architectural hypothesis, does not resolve the training-level gradient conflict.
Dice_soft is unchanged at 0.9015: the gain comes from improved diversity and per-rater accuracy, not from higher average prediction quality.
We test the model's behaviour when not all annotators label every training image (the common clinical situation in multi-rater datasets). Under full sparsity (one annotator), the shared baseline undergoes gradient collapse: mean pairwise cosine similarity of reconstruction gradients rises from 0.167 (full annotation) to 0.976, and within-fold standard deviation shrinks approximately 19-fold (0.439 → 0.023). Per-rater posteriors maintain zero alignment by construction at all sparsity levels. The GED advantage grows with sparsity: +11.5% with three annotators, +17.8% with two, and +21.4% with one. All 12 per-fold comparisons favour the per-rater model (sign-test p < 0.001). At full annotation, the two models are statistically equivalent (0.5% gap, within noise); the advantage is tied to sparsity, not general accuracy. On NPC-170 (170 nasopharyngeal carcinoma MRI cases, 4 annotators), the GED difference is 0.0011, within the seed variance of ±0.0085, so the method carries over to a different anatomy and dataset.
A third contribution analyses inter-rater annotation disagreement on 1,603 LIDC-IDRI cases using the nine per-rater clinical attribute ratings. Nodule margin clarity is the strongest predictor of inter-rater mask variance (Pearson r = 0.318, p < 0.001, confirmed across all four folds independently), followed by lobulation (r = 0.243) and texture (r = 0.210). Malignancy is negatively correlated with mask variance (r = −0.202, p < 0.001); a nodule rated highly suspicious need not have an ambiguous boundary, and one with an unclear margin need not look malignant. These findings point to where uncertainty-aware segmentation matters most: ill-defined, lobulated, part-solid nodules.
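The GED figures above can be made concrete with a small sketch. It assumes the squared generalized energy distance commonly used in multi-rater segmentation, with 1 − IoU as the distance between masks; the abstract does not state the exact variant used in the thesis, and the function names and toy masks below are hypothetical.

```python
from itertools import product

def iou_distance(a, b):
    """Return 1 - IoU for two binary masks given as flat 0/1 lists."""
    inter = sum(1 for p, q in zip(a, b) if p and q)
    union = sum(1 for p, q in zip(a, b) if p or q)
    # Two empty masks are treated as identical (assumption).
    return 0.0 if union == 0 else 1.0 - inter / union

def mean_distance(xs, ys):
    """Average distance over all pairs drawn from xs and ys."""
    total = sum(iou_distance(x, y) for x, y in product(xs, ys))
    return total / (len(xs) * len(ys))

def ged_squared(samples, annotations):
    """Squared GED: 2*E[d(S, Y)] - E[d(S, S')] - E[d(Y, Y')]."""
    return (2.0 * mean_distance(samples, annotations)
            - mean_distance(samples, samples)
            - mean_distance(annotations, annotations))

# Four toy rater masks over four pixels (illustrative values only).
raters = [[1, 1, 0, 0], [1, 1, 1, 0], [1, 0, 0, 0], [1, 1, 1, 1]]
# A model whose samples all collapse onto one mask.
collapsed = [[1, 1, 0, 0]] * 4

ged_matching = ged_squared(raters, raters)      # samples reproduce the raters
ged_collapsed = ged_squared(collapsed, raters)  # samples ignore the spread
```

In this toy case the matching sample set scores zero and the collapsed set scores higher, which is the behaviour the metric is meant to penalise: samples that cluster on one annotation instead of spanning the raters' range.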

EFFICIENCY-DRIVEN SINGLE IMAGE SUPER-RESOLUTION USING ATTENTION-ENHANCED RESIDUAL FEATURE DISTILLATION NETWORKS

2026-06-25T04:55:57Z

Title: EFFICIENCY-DRIVEN SINGLE IMAGE SUPER-RESOLUTION USING ATTENTION-ENHANCED RESIDUAL FEATURE DISTILLATION NETWORKS
Authors: MANGAL, ISHAN; Verma, Bindu (SUPERVISOR)
Abstract: Reconstructing an image from a low-resolution input appears to be a straightforward task, but it is fundamentally ill-posed: many high-resolution images could correspond to the same observed downsampled version. Existing deep learning approaches produce outstanding results; however, almost all of them require large amounts of computational power, which is incompatible with real-time inference on mobile devices, edge computing units, and camera hardware.
ARFD-ESPCN is an image super-resolution architecture that builds upon the ESPCN sub-pixel upsampling structure but adds Feature Distillation Blocks (FDBs), Squeeze-and-Excitation channel-wise attention, and a Global Feature Fusion layer. The key feature of the proposed architecture is that it preserves the speed of ESPCN by using only low-resolution convolutions and applying PixelShuffle once, at the final layer. This design replaces ESPCN's shallow encoder-decoder with a deeper and more competitive architecture that uses six FDBs to split and merge convolution features, directing more attention to aspects not captured previously.
The final model has 597,904 parameters, needs 4.87 GFLOPs for a 640×360 input, and runs in 4.78 ms on a mid-range GPU. Training used the DF2K dataset (DIV2K plus Flickr2K, roughly 3,450 images) with L1 loss for 720 epochs and Charbonnier loss for the remaining 80, combined with flip and rotation augmentation. The optimiser is Adam with a cosine-annealing schedule that warms up over the first 50 epochs and decays toward 10^−6 by the end. On standard benchmarks the model scores 31.57 dB / 0.8861 SSIM on Set5, 27.21 dB / 0.7447 on Set14, 26.22 dB / 0.7029 on BSD100, and 25.37 dB / 0.7606 on Urban100.
This beats the original ESPCN by over 2 dB on Set5 and matches VDSR while using roughly 125× fewer floating-point operations. Compared with heavier efficient models like IMDN and RFDN, the quality gap is about 0.6–0.7 dB, but ARFD-ESPCN runs 2–3× faster.
A step-by-step ablation decomposes the total 2.11 dB gain into individual contributions: the training schedule accounts for 1.04 dB (the largest single factor), DF2K data adds 0.45 dB, SE attention with residual connections contributes 0.38 dB, augmentation plus L1 loss gives 0.15 dB, and the distillation block structure adds 0.07 dB. The practical lesson is that for models under one million parameters, getting the training pipeline right matters at least as much as architectural design.
The model is small enough to fit on mobile accelerators and fast enough for 60 fps video pipelines, making it relevant for medical imaging, satellite photo enhancement, and streaming scenarios where latency budgets are tight. Possible extensions include real-world degradation handling via BSRGAN-style pipelines, perceptual and adversarial losses for sharper visual textures, window-based spatial attention for periodic patterns, and INT8 quantisation for hardware without floating-point units.
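The sub-pixel upsampling step that ARFD-ESPCN inherits from ESPCN can be illustrated in a few lines. The sketch below is a plain-Python rearrangement that follows the usual PixelShuffle channel ordering, turning a (C·r², H, W) tensor into (C, H·r, W·r); it is not the thesis code, the function name and toy input are hypothetical, and the scale factor is not given in the abstract.

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) nested list into (C, H*r, W*r)."""
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = c_in // (r * r)
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):          # vertical offset inside each r x r cell
            for j in range(r):      # horizontal offset inside each r x r cell
                src = x[c * r * r + i * r + j]
                for row in range(h):
                    for col in range(w):
                        out[c][row * r + i][col * r + j] = src[row][col]
    return out

# Four low-resolution channels of size 1x2 become one 2x4 channel (r = 2).
low_res = [[[0, 1]], [[10, 11]], [[20, 21]], [[30, 31]]]
high_res = pixel_shuffle(low_res, 2)
# high_res == [[[0, 10, 1, 11], [20, 30, 21, 31]]]
```

Because the rearrangement only moves values, every convolution before it operates on the low-resolution grid, which is the property the abstract credits for the model's speed.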

ENHANCEMENT OF REVERSIBLE IMAGE STEGANOGRAPHY AND OPTIMIZATION OF QUANTUM IMAGE REPRESENTATION USING THE NEQR MODEL

2025-09-02T06:36:46Z

Title: ENHANCEMENT OF REVERSIBLE IMAGE STEGANOGRAPHY AND OPTIMIZATION OF QUANTUM IMAGE REPRESENTATION USING THE NEQR MODEL
Authors: SINGH, SUMITRA
Abstract: Reversible steganography allows exact reconstruction of the cover media after hidden data extraction, making it vital for applications such as content authentication, medical imaging, and military communications. Reversible steganography techniques include histogram shifting, image interpolation, and difference expansion. Histogram shifting methods apply shifting to pixel-domain histograms or to prediction error histograms. Prediction error histogram methods offer higher embedding capacity, but they are more complex, lack a guaranteed lower bound on PSNR, and are more susceptible to histogram-based steganalysis. Pixel-domain histogram shifting techniques are simpler and more efficient and have a theoretical PSNR bound, but generally have lower embedding capacity.
In this project, experiments are conducted on pixel-domain histogram-shifting-based techniques. The capacity and histogram are analyzed for varying numbers of non-overlapping image blocks and histogram blocks. Experimental results show that embedding in image blocks does not significantly enhance the capacity compared to embedding in histogram blocks. Analysis of histogram blocks shows that embedding in two blocks yields the best results. A method is developed for making histogram shifting adaptive to payload size, and a two-layer embedding is developed for improved hiding capacity. Compared to previous methods, the two-layer embedding achieves higher capacity and better resistance to steganalysis while keeping the PSNR acceptable for real-world applications.
Quantum computing is an advancing field that offers significant speed advantages over classical computing for certain computational tasks.
Notable examples include Shor's algorithm, which efficiently solves integer factorization and discrete logarithm problems, and Grover's algorithm, which accelerates search in unstructured databases. Quantum computation relies on quantum arithmetic operations, with addition at the core, since subtraction, multiplication, exponentiation, and division can all be reduced to repeated or modified forms of addition. Experiments are conducted to analyze the performance of quantum addition on quantum hardware. Quantum circuits for addition and comparison, including half adders, full adders, Toffoli-based adders, QFT-based adders (utilizing the Quantum Fourier Transform), and quantum comparators, are developed using IBM Qiskit. The circuits are first validated on ideal simulators to confirm correctness, then tested on noisy simulators to emulate real quantum hardware conditions. Final execution is carried out on IBM's Eagle 127-qubit Quantum Processing Unit (QPU). Results show that computation accuracy on actual hardware is limited by physical constraints such as short qubit coherence times and instability. A performance comparison shows that Toffoli-based adders outperform QFT-based adders in terms of accuracy, making them more reliable for precise arithmetic computations.
Quantum image representation provides exponential efficiency in image storage and processing. It relies on the fundamental principles of superposition and entanglement. NEQR (Novel Enhanced Quantum Representation) is a lossless encoding method used to represent digital images on a quantum computer. It is widely applicable in domains such as quantum machine learning, image steganography, and quantum image analysis.
This work introduces two enhancements to the NEQR framework: (1) optimizing the decomposition of Multi-Controlled NOT (MCX) gates into Toffoli gates, and (2) parallelizing NEQR through bit-plane encoding, in which the NEQR circuit is constructed simultaneously for each of the eight bit planes of an image, thereby reducing overall circuit depth. Experimental results demonstrate that these enhancements lead to reduced circuit depth and faster execution, thereby mitigating decoherence-related errors. Additionally, quantum image processing operations that demonstrate exponential speedup over classical approaches (such as image negation, rotation, and intensity superposition) are implemented and evaluated as part of this work.
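To see why splitting NEQR by bit plane can shorten the circuit, it helps to count gates. The sketch below assumes the textbook NEQR preparation, in which every 1 bit of every pixel value is set by one MCX gate controlled on the position qubits; it only tallies those gates per bit plane on a toy 2×2 image. The function name and pixel values are hypothetical, and the thesis's actual circuit construction is not described in the abstract.

```python
def neqr_mcx_targets(image, depth=8):
    """Map each bit plane to the pixel positions that need an MCX gate."""
    planes = {b: [] for b in range(depth)}
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            for b in range(depth):
                if (value >> b) & 1:
                    # One MCX controlled on position (y, x), targeting colour qubit b.
                    planes[b].append((y, x))
    return planes

# Toy 2x2 greyscale image with 8-bit intensities (illustrative values only).
image = [[0, 100],
         [200, 255]]

planes = neqr_mcx_targets(image)
total_mcx = sum(len(positions) for positions in planes.values())
largest_plane = max(len(positions) for positions in planes.values())
```

Here the single-circuit encoding needs 14 MCX gates in total, while the busiest bit plane needs only 3. If the eight planes can be built concurrently, the depth is governed by the largest plane instead of the sum, which is consistent with the depth reduction reported above; how the position register is shared across planes on real hardware is left open here.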

A STUDY ON DEEP LEARNING AND TRANSFORMER BASED MODELS FOR HAND GESTURE AND ACTION RECOGNITION

2025-07-08T08:48:56Z

Title: A STUDY ON DEEP LEARNING AND TRANSFORMER BASED MODELS FOR HAND GESTURE AND ACTION RECOGNITION
Authors: SUTTY, SAHIL
Abstract: Hand gesture and human action recognition are fundamental technologies in the evolution of human-computer interaction (HCI), enabling more natural, intuitive, and accessible interfaces across sectors including assistive technologies, robotics, virtual reality, and surveillance. Using the MSRA Hand Gesture Dataset and the UCF101 Dataset, this paper presents a thorough comparative analysis of state-of-the-art deep learning and transformer-based models for hand gesture recognition and for human action recognition.
Comprising 76,500 depth images distributed over 17 gesture classes, the MSRA Hand Gesture Dataset offers a strong basis for spatial feature extraction. ResNet101 obtained the highest F1-score (0.9978) among all architectures, closely followed by DenseNet 169 (0.9919) and DenseNet 201 (0.9901). MobileNetV2 demonstrated a good balance between computational efficiency and accuracy with an F1-score of 0.9847; the VGG variants lagged because they lack more sophisticated architectural elements.
Human action recognition on the UCF101 dataset, which consists of over 13,000 video clips in 101 action categories, was restricted to the 50 most frequent classes to ensure computational feasibility and class balance. With an F1-score of 0.9997, transformer-based models, especially ViT Tiny Patch, surpassed even the deepest CNNs. MobileNetV2 once again showed efficiency in resource-limited settings, while VGG16bn's performance revealed the limits of older CNN architectures on demanding tasks.
The results underline how architectural innovations, including residual connections, dense connectivity, and attention mechanisms, help to raise recognition accuracy and computational efficiency. The paper argues that transformer-based models are redefining benchmarks even though deep CNNs remain strong candidates.
In particular, hybrid CNN-transformer designs, explicit temporal modeling, and advanced augmentation techniques can help to increase recognition capability in practical settings.