Please use this identifier to cite or link to this item: http://dspace.dtu.ac.in:8080/jspui/handle/repository/22923
Full metadata record
DC Field | Value | Language
dc.contributor.author | HAQ, INJAMAMUL | -
dc.contributor.author | Bansal, Nipun (SUPERVISOR) | -
dc.date.accessioned | 2026-06-25T04:57:03Z | -
dc.date.available | 2026-06-25T04:57:03Z | -
dc.date.issued | 2025-05 | -
dc.identifier.uri | http://dspace.dtu.ac.in:8080/jspui/handle/repository/22923 | -
dc.description.abstract | The exponential growth of large-scale neural language models has opened up transformative possibilities across a broad range of natural language understanding problems. Systems built on the Transformer architecture [1] have demonstrated remarkable proficiency on tasks ranging from reading comprehension and open-domain question answering to code synthesis and multi-step reasoning. Yet the self-attention operation sitting at the heart of every Transformer block carries a computational burden that scales quadratically with input length, denoted O(n^2). As contexts grow beyond a few thousand tokens, a routine requirement in legal document analysis, multi-document question answering, and long-form summarisation, this quadratic growth translates into sluggish response times, swollen GPU memory footprints, and deployment costs that place the technology beyond reach for many organisations. Prior work on context compression has made genuine progress by shortening the input before it reaches the model [7, 8]. The unifying flaw across all fourteen methods examined in this thesis, however, is a deceptively simple assumption: that every incoming query deserves the same compression budget. A simple lookup question (“When was BERT published?”) can be answered from a single sentence. A comparative multi-hop question (“How do the pre-training strategies of BERT and GPT-3 differ in their downstream effect on reasoning tasks?”) may need a dozen passages spread across an entire document collection. Treating both with the same 40% removal rate is not an approximation; it is a category error that systematically harms the harder queries and wastes capacity on the easier ones. This thesis introduces the Query-Complexity-Aware Adaptive Context Compression (QCAC) framework, a model-agnostic preprocessing system that measures how demanding an incoming question is and adjusts the compression ratio accordingly.
QCAC computes a per-query complexity score C(q) ∈ [0, 1] from three lightweight surface features of the question (its normalised length, vocabulary entropy, and the presence of multi-hop syntactic cues) and derives a per-query removal ratio r(q) = r_max − (r_max − r_min) · C(q). Every sentence in the document is then ranked using a weighted combination of three signals extracted from BERT [2]: the attention-magnitude score A(s_i), the attentional entropy score H(s_i), and the cosine similarity between the sentence and query embeddings sim(s_i, q), inspired by dense retrieval research [19]. Experiments conducted on SQuAD v1.1 [22] and HotpotQA [23] (300 samples each) using BERT-base-uncased [2] as the scorer and RoBERTa-base [5] as the downstream QA model on NVIDIA Tesla T4 hardware show that QCAC with weights (α = 0.10, β = 0.30, γ = 0.60) achieves 70.4% F1 on SQuAD at 36.8% sentence removal, a +16.2 percentage-point improvement over the prior Attn+Entropy baseline at equivalent compression. Adaptive behaviour is statistically validated without per-dataset tuning: HotpotQA queries receive a higher mean C(q) score (0.587 vs. 0.539, p < 0.001, t = 11.2) and automatically receive less compression. A seven-variant ablation study reveals that query-semantic similarity is the strongest individual signal (58.0% F1 in isolation), while attention-only scoring actively underperforms random pruning (44.9%), consistent with the findings of Clark et al. [21]. End-to-end inference latency stands at 89.7 ms per sample, the lowest among all BERT-based compression methods evaluated, and the framework requires no modification or retraining of the downstream language model. | en_US
dc.language.iso | en | en_US
dc.relation.ispartofseries | TD-8831; | -
dc.subject | LARGE LANGUAGE MODEL INFERENCE | en_US
dc.subject | CONTEXT COMPRESSION | en_US
dc.subject | ADAPTIVE COMPRESSION | en_US
dc.subject | QUERY COMPLEXITY | en_US
dc.subject | SENTENCE SELECTION | en_US
dc.subject | SEMANTIC SIMILARITY | en_US
dc.subject | HOTPOTQA | en_US
dc.subject | BERT | en_US
dc.subject | SQUAD | en_US
dc.title | ADAPTIVE CONTEXT COMPRESSION TECHNIQUES FOR EFFICIENT LARGE LANGUAGE MODEL INFERENCE: A QUERY-COMPLEXITY-AWARE APPROACH | en_US
dc.type | Thesis | en_US
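
Illustrative sketch (not part of the deposited record): the abstract's removal-ratio formula and weighted sentence ranking could be read roughly as follows. This is a minimal sketch under stated assumptions, not the thesis implementation. The function names, the default r_min and r_max values, the floor rounding of the removal count, and the premise that the three signals arrive pre-normalised to a comparable scale are all assumptions. Only the formula for r(q) and the weights (0.10, 0.30, 0.60) come from the abstract.

```python
from typing import List, Sequence, Tuple

# Weights (alpha, beta, gamma) reported in the abstract.
QCAC_WEIGHTS: Tuple[float, float, float] = (0.10, 0.30, 0.60)


def removal_ratio(c_q: float, r_min: float = 0.2, r_max: float = 0.6) -> float:
    """r(q) = r_max - (r_max - r_min) * C(q).

    The default r_min / r_max values are illustrative assumptions;
    the abstract does not state them.
    """
    if not 0.0 <= c_q <= 1.0:
        raise ValueError("C(q) must lie in [0, 1]")
    return r_max - (r_max - r_min) * c_q


def sentence_score(attn: float, entropy: float, sim: float,
                   weights: Tuple[float, float, float] = QCAC_WEIGHTS) -> float:
    """Weighted combination alpha*A(s_i) + beta*H(s_i) + gamma*sim(s_i, q)."""
    alpha, beta, gamma = weights
    return alpha * attn + beta * entropy + gamma * sim


def compress(sentences: Sequence[str],
             signals: Sequence[Tuple[float, float, float]],
             c_q: float, r_min: float = 0.2, r_max: float = 0.6) -> List[str]:
    """Drop the lowest-scoring sentences, keeping the original order.

    `signals` holds one (A, H, sim) triple per sentence, assumed to be
    precomputed elsewhere. Flooring the removal count is an assumption.
    """
    if len(sentences) != len(signals):
        raise ValueError("need one signal triple per sentence")
    n = len(sentences)
    n_remove = int(removal_ratio(c_q, r_min, r_max) * n)
    scores = [sentence_score(*s) for s in signals]
    drop = set(sorted(range(n), key=lambda i: scores[i])[:n_remove])
    return [s for i, s in enumerate(sentences) if i not in drop]
```

Under this reading, a query with C(q) = 0 is compressed at r_max and a query with C(q) = 1 at r_min, which matches the abstract's statement that more complex queries automatically receive less compression.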
Appears in Collections:M.E./M.Tech. Computer Engineering

Files in This Item:
File | Description | Size | Format
INJAMAMUL HAQ M.Tech.pdf | | 1.75 MB | Adobe PDF
INJAMAMUL HAQ plag.pdf | | 1.69 MB | Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.