ADAPTIVE CONTEXT COMPRESSION TECHNIQUES FOR EFFICIENT LARGE LANGUAGE MODEL INFERENCE: A QUERY-COMPLEXITY-AWARE APPROACH

HAQ, INJAMAMUL; Bansal, Nipun (SUPERVISOR)

Please use this identifier to cite or link to this item: http://dspace.dtu.ac.in:8080/jspui/handle/repository/22923

Full metadata record

DC Field	Value	Language
dc.contributor.author	HAQ, INJAMAMUL	-
dc.contributor.author	Bansal, Nipun (SUPERVISOR)	-
dc.date.accessioned	2026-06-25T04:57:03Z	-
dc.date.available	2026-06-25T04:57:03Z	-
dc.date.issued	2025-05	-
dc.identifier.uri	http://dspace.dtu.ac.in:8080/jspui/handle/repository/22923	-
dc.description.abstract	The exponential growth of large-scale neural language models has opened up transformative possibilities across a broad range of natural language understanding problems. Systems built on the Transformer architecture [1] have demonstrated remarkable proficiency on tasks ranging from reading comprehension and open-domain question answering to code synthesis and multi-step reasoning. Yet the self-attention operation at the heart of every Transformer block carries a computational cost that scales quadratically with input length, O(n²). As contexts grow beyond a few thousand tokens (a routine requirement in legal document analysis, multi-document question answering, and long-form summarisation), this quadratic growth translates into sluggish response times, swollen GPU memory footprints, and deployment costs that place the technology beyond reach for many organisations. Prior work on context compression has made genuine progress by shortening the input before it reaches the model [7, 8]. The unifying flaw across all fourteen methods examined in this thesis, however, is a deceptively simple assumption: that every incoming query deserves the same compression budget. A simple lookup question, such as “When was BERT published?”, can be answered from a single sentence. A comparative multi-hop question, such as “How do the pre-training strategies of BERT and GPT-3 differ in their downstream effect on reasoning tasks?”, may need a dozen passages spread across an entire document collection. Treating both with the same 40% removal rate is not an approximation; it is a category error that systematically harms the harder queries and wastes capacity on the easier ones. This thesis introduces the Query-Complexity-Aware Adaptive Context Compression (QCAC) framework, a model-agnostic preprocessing system that measures how demanding an incoming question is and adjusts the compression ratio accordingly.
QCAC computes a per-query complexity score C(q) ∈ [0, 1] from three lightweight surface features of the question (its normalised length, vocabulary entropy, and the presence of multi-hop syntactic cues) and derives a per-query removal ratio r(q) = r_max − (r_max − r_min) · C(q). Every sentence in the document is then ranked using a weighted combination of three signals extracted from BERT [2]: the attention-magnitude score A(s_i), the attentional entropy score H(s_i), and the cosine similarity between the sentence and query embeddings sim(s_i, q), inspired by dense retrieval research [19]. Experiments conducted on SQuAD v1.1 [22] and HotpotQA [23] (300 samples each), using BERT-base-uncased [2] as the scorer and RoBERTa-base [5] as the downstream QA model on NVIDIA Tesla T4 hardware, show that QCAC with weights (α = 0.10, β = 0.30, γ = 0.60) achieves 70.4% F1 on SQuAD at 36.8% sentence removal, a +16.2 percentage-point improvement over the prior Attn+Entropy baseline at equivalent compression. Adaptive behaviour is statistically validated without per-dataset tuning: HotpotQA queries receive a higher mean C(q) score (0.587 vs. 0.539, p < 0.001, t = 11.2) and automatically receive less compression. A seven-variant ablation study reveals that query-semantic similarity is the strongest individual signal (58.0% F1 in isolation), while attention-only scoring actively underperforms random pruning (44.9%), consistent with the findings of Clark et al. [21]. End-to-end inference latency stands at 89.7 ms per sample, the lowest among all BERT-based compression methods evaluated, and the framework requires no modification or retraining of the downstream language model.	en_US
dc.language.iso	en	en_US
dc.relation.ispartofseries	TD-8831;	-
dc.subject	LARGE LANGUAGE MODEL INFERENCE	en_US
dc.subject	CONTEXT COMPRESSION	en_US
dc.subject	ADAPTIVE COMPRESSION	en_US
dc.subject	QUERY COMPLEXITY	en_US
dc.subject	SENTENCE SELECTION	en_US
dc.subject	SEMANTIC SIMILARITY	en_US
dc.subject	HOTPOTQA	en_US
dc.subject	BERT	en_US
dc.subject	SQUAD	en_US
dc.title	ADAPTIVE CONTEXT COMPRESSION TECHNIQUES FOR EFFICIENT LARGE LANGUAGE MODEL INFERENCE: A QUERY-COMPLEXITY-AWARE APPROACH	en_US
dc.type	Thesis	en_US
Appears in Collections:	M.E./M.Tech. Computer Engineering
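
The removal-ratio formula and weighted sentence ranking described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis implementation. The function names, the r_min and r_max values, and the mapping of α, β, γ to the attention, entropy, and similarity signals (in the order the abstract lists them) are assumptions made for illustration. The three signals are assumed to be pre-computed and normalised to comparable ranges.

```python
from typing import List, Sequence


def removal_ratio(c_q: float, r_min: float, r_max: float) -> float:
    """r(q) = r_max - (r_max - r_min) * C(q), with C(q) clamped to [0, 1].

    r_min and r_max are hypothetical bounds; the abstract does not state them.
    """
    c_q = min(1.0, max(0.0, c_q))
    return r_max - (r_max - r_min) * c_q


def combined_score(attn: float, entropy: float, sim: float,
                   alpha: float = 0.10, beta: float = 0.30,
                   gamma: float = 0.60) -> float:
    """Weighted sum of the three per-sentence signals.

    Assumption: alpha weights A(s_i), beta weights H(s_i), gamma weights sim(s_i, q).
    """
    return alpha * attn + beta * entropy + gamma * sim


def select_sentences(scores: Sequence[float], ratio: float) -> List[int]:
    """Return indices of the sentences to keep, in original document order."""
    n = len(scores)
    # Remove floor(n * ratio) sentences, but always keep at least one.
    n_remove = min(int(n * ratio), max(n - 1, 0))
    n_keep = n - n_remove
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


if __name__ == "__main__":
    # Hypothetical bounds: harder queries (higher C(q)) get less removal.
    ratio = removal_ratio(0.5, r_min=0.25, r_max=0.75)
    scores = [combined_score(a, h, s) for a, h, s in
              [(0.2, 0.4, 0.9), (0.8, 0.1, 0.1), (0.5, 0.5, 0.6), (0.1, 0.2, 0.3)]]
    print(ratio, select_sentences(scores, ratio))
```

With these defaults the similarity term dominates the ranking, in line with the abstract's finding that query-semantic similarity is the strongest individual signal.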

Files in This Item:

File	Description	Size	Format
INJAMAMUL HAQ M.Tech.pdf		1.75 MB	Adobe PDF	View/Open
INJAMAMUL HAQ plag.pdf		1.69 MB	Adobe PDF	View/Open
