Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time, arXiv:2305.17118
H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models, arXiv:2306.14048
Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs, arXiv:2310.01801
"Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs" (FastGen), from UIUC and Microsoft. This work introduces adaptive KV cache compression, a plug-and-play method that reduces the memory footprint of LLM generative inference. Unlike a conventional KV cache, which retains the Key and Value vectors of every context token, the authors first run a targeted profiling step to characterize the attention structure of each head, then construct the cache adaptively per head, e.g., keeping only special tokens, punctuation, a local window of recent tokens, or attention heavy hitters.
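To make the per-head idea concrete, here is a minimal sketch, assuming "profiling" means measuring how much attention mass each candidate eviction policy preserves on the prompt; the policy set, helper names, and thresholds below are illustrative assumptions, not the paper's code.

```python
# A minimal sketch of FastGen-style adaptive per-head KV eviction
# (illustrative; the paper's policy set and profiling are richer than this).
import torch

def recovered_mass(attn, keep_mask):
    """Fraction of attention probability mass that survives eviction."""
    return (attn * keep_mask).sum(dim=-1).mean().item()

def choose_policy(attn, recent=32, heavy=32, threshold=0.95):
    """attn: [num_queries, seq_len] softmax scores for ONE head on the prompt.
    Try the cheapest policies first; fall back to keeping the full cache."""
    _, s = attn.shape
    # Policy 1: keep only a local window of recent tokens.
    local = torch.zeros(s)
    local[-recent:] = 1.0
    if recovered_mass(attn, local) >= threshold:
        return "local", local.bool()
    # Policy 2: local window plus heavy hitters (largest cumulative attention).
    scores = attn.sum(dim=0)
    hh = torch.zeros(s)
    hh[scores.topk(min(heavy, s)).indices] = 1.0
    hybrid = ((local + hh) > 0).float()
    if recovered_mass(attn, hybrid) >= threshold:
        return "local+heavy", hybrid.bool()
    # Fallback: no compression for this head.
    return "full", torch.ones(s).bool()

# Usage: profile each head once, then evict KV entries per the chosen mask.
attn = torch.softmax(torch.randn(8, 128), dim=-1)  # fake attention for one head
policy, mask = choose_policy(attn)
print(policy, int(mask.sum()), "of", mask.numel(), "tokens kept")
```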
RazorAttention: Efficient KV Cache Compression Through Retrieval Heads. Paper link: https://arxiv.org/abs/2407.15891. Building on this observation, the work notes that a model's long-sequence capability is itself a subset of its in-context learning capability, and develops the discussion around that point. In experiments, RazorAttention compresses 70% of the KV cache while keeping the model's long-context performance nearly lossless. Algorithm description: the paper identifies a small set of retrieval heads, keeps the full cache only for those heads, and truncates the cache of the remaining heads to recent tokens plus a compensation token that summarizes the dropped entries.
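A hedged sketch of the truncation step for a single non-retrieval head follows; retrieval-head detection is omitted, and the mean-pooled compensation token is an assumption for illustration rather than the paper's exact construction.

```python
# A minimal sketch of RazorAttention-style cache truncation for one head.
import torch

def compress_head(K, V, is_retrieval_head, window=128):
    """K, V: [seq_len, head_dim] cache for one attention head."""
    if is_retrieval_head or K.shape[0] <= window:
        return K, V  # retrieval heads keep the full cache
    dropped_K, dropped_V = K[:-window], V[:-window]
    # Compensation token: summarize dropped entries with their mean,
    # so distant context is compressed rather than removed entirely.
    comp_K = dropped_K.mean(dim=0, keepdim=True)
    comp_V = dropped_V.mean(dim=0, keepdim=True)
    return torch.cat([comp_K, K[-window:]]), torch.cat([comp_V, V[-window:]])

K, V = torch.randn(1024, 64), torch.randn(1024, 64)
Kc, Vc = compress_head(K, V, is_retrieval_head=False)
print(Kc.shape)  # torch.Size([129, 64]): compensation token + recent window
```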
[2] Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference
[3] SnapKV: LLM Knows What You are Looking for Before Generation
[4] PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling
[9] Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time (Liu et al., 2023)
[10] Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs (Ge et al., 2023)
Training strategy: To train the DMC model, the paper proposes a stochastic reparametrization method to handle the discrete decision variable, along with intermediate compression steps to handle the continuous α values. In addition, a global one-sided loss is designed to push the model toward the target compression ratio.
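As a sketch of these two ingredients, the snippet below assumes a Gumbel-sigmoid relaxation as the stochastic reparametrization of the binary append-vs-merge decision, and a hinge-style one-sided penalty on the achieved compression ratio; all names, shapes, and the exact loss form are illustrative rather than DMC's actual implementation.

```python
# A minimal sketch of stochastic reparametrization + a global one-sided loss.
import torch

def gumbel_sigmoid(logits, temperature=1.0):
    """Differentiable sample of a relaxed Bernoulli decision in (0, 1)."""
    u = torch.rand_like(logits).clamp(1e-6, 1 - 1e-6)
    logistic_noise = torch.log(u) - torch.log1p(-u)
    return torch.sigmoid((logits + logistic_noise) / temperature)

def one_sided_compression_loss(alphas, target_cr):
    """Penalize only when the achieved compression falls BELOW the target.
    Here alpha ~ 1 is read as 'merge into the previous KV entry'."""
    achieved_cr = alphas.mean()          # fraction of merged tokens
    return torch.relu(target_cr - achieved_cr)

logits = torch.randn(512, requires_grad=True)   # per-token decision logits
alphas = gumbel_sigmoid(logits, temperature=0.5)
loss = one_sided_compression_loss(alphas, target_cr=0.75)
loss.backward()   # gradients flow through the continuous relaxation
print(float(loss), logits.grad.abs().mean().item())
```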
GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM. Paper: https://arxiv.org/html/2403.05527v2. Google Scholar citations: 9. Institutions: Georgia Tech, Intel. Main contributions:
1. Uniform quantization of the KV cache down to 4 bits.
2. A low-rank decomposition to approximate the quantization error.
3. A sparse matrix to absorb the error introduced by outlier entries.
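A minimal sketch of the three-part recipe on a single matrix is shown below; GEAR applies this per layer during streaming inference, so treat the function and its parameters as an illustration of the decomposition rather than the paper's implementation.

```python
# A minimal sketch of GEAR's recipe: 4-bit uniform quantization,
# a low-rank approximation of the residual, and a sparse outlier matrix.
import torch

def gear_compress(X, bits=4, rank=4, outlier_frac=0.01):
    # 1. Pull the largest-magnitude entries into a sparse outlier matrix.
    k = max(1, int(outlier_frac * X.numel()))
    thresh = X.abs().flatten().topk(k).values.min()
    S = torch.where(X.abs() >= thresh, X, torch.zeros_like(X))
    body = X - S
    # 2. Uniform quantization of the outlier-free remainder down to `bits`.
    lo, hi = body.min(), body.max()
    scale = (hi - lo) / (2**bits - 1)
    deq = torch.round((body - lo) / scale) * scale + lo
    # 3. Low-rank approximation of the quantization residual (truncated SVD).
    resid = body - deq
    U, sigma, Vh = torch.linalg.svd(resid, full_matrices=False)
    L = U[:, :rank] @ torch.diag(sigma[:rank]) @ Vh[:rank]
    return deq + L + S   # quantized body + low-rank residual + sparse outliers

X = torch.randn(256, 128)
err = ((X - gear_compress(X)).norm() / X.norm()).item()
print(f"relative reconstruction error: {err:.4f}")
```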
7. MiniCache. In [2405.14366] MiniCache: KV Cache Compression in Depth Dimension for Large Language Models, the authors observe that in the deeper layers of an LLM the KV cache of adjacent layers is highly similar, and the cache can be compressed across depth by exploiting this similarity. They also introduce a token retention strategy that leaves highly dissimilar KV entries unmerged.
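A rough sketch of the cross-layer merge with token retention follows; the paper merges via a magnitude/direction (slerp-style) decomposition, simplified here to a plain average, and the similarity threshold is an assumed hyperparameter.

```python
# A minimal sketch of MiniCache-style depth-wise merging: two adjacent deep
# layers share one merged KV state, except for tokens whose states differ
# too much, which are retained per layer.
import torch
import torch.nn.functional as F

def merge_adjacent_layers(kv_l, kv_l1, sim_threshold=0.9):
    """kv_l, kv_l1: [seq_len, dim] KV states of layers l and l+1."""
    sim = F.cosine_similarity(kv_l, kv_l1, dim=-1)   # per-token similarity
    merged = 0.5 * (kv_l + kv_l1)                    # shared cross-layer state
    retained = sim < sim_threshold                   # token retention strategy
    # Store `merged` once for both layers; keep the original per-layer rows
    # only where `retained` is True.
    return merged, retained

K_l, K_l1 = torch.randn(512, 128), torch.randn(512, 128)
K_l1 = 0.9 * K_l + 0.1 * K_l1   # make adjacent layers look similar
merged, retained = merge_adjacent_layers(K_l, K_l1)
print(f"tokens retained per layer: {int(retained.sum())} / {retained.numel()}")
```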
PyramidKV: the official implementation of PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling (Jupyter Notebook).
FMInference/H2O: [NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.