MEDIUM
The severity is rated MEDIUM: quantization discrepancies can reduce accuracy, but they are not directly exploitable unless combined with other vulnerabilities, and real-world exploitability remains low in both homelab and production environments given the complexity of abusing such discrepancies.

This advisory covers risks associated with KV cache quantization in several large language models, including Qwen3.5 9B, Qwen3 VL 8B, Gemma 3 12B, Ministral 3 8B, and Irix 12B (Mistral Nemo). The vulnerability lies in the quantization of the KV cache, which can introduce discrepancies between the original model's output distribution and that of its quantized counterpart. These discrepancies, measured with Kullback-Leibler Divergence (KLD), can reduce accuracy or cause unexpected behavior that attackers could exploit if left unmanaged. Engineers and sysadmins should carefully evaluate the trade-off between memory usage and model fidelity when deploying these models in production.
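As a rough illustration of the KLD measurement described above (this is a sketch, not the advisory's own tooling), the divergence between the reference model's next-token distribution and the quantized model's can be computed per token position from their logits; the sample logits below are hypothetical:

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) in nats between two discrete distributions over the same vocab."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical logits for one token position: fp16 reference vs. quantized-KV run
ref_logits = [2.0, 1.0, 0.5, -1.0]
quant_logits = [1.9, 1.1, 0.4, -0.9]
kld = kl_divergence(softmax(ref_logits), softmax(quant_logits))
```

A per-token KLD near zero means the quantized KV cache barely perturbs the output distribution; averaging it over a representative evaluation set gives the fidelity figure referenced in this advisory.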

Affected Systems
  • Qwen3.5 9B
  • Qwen3 VL 8B
  • Gemma 3 12B
  • Ministral 3 8B
  • Irix 12B (Mistral Nemo)
Affected Versions: all versions using KV cache quantization techniques
Remediation
  • Review and validate the accuracy of each model post-quantization by comparing KLD measurements against original models.
  • Implement additional logging for discrepancies in production deployments to monitor unexpected behavior.
  • Consider increasing GPU VRAM where possible, allowing less aggressive quantization schemes that preserve more fidelity.
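The first two remediation steps can be combined into a simple deployment gate. The sketch below (threshold and helper names are assumptions, not part of this advisory) rejects a quantized model whose mean per-token KLD against the original exceeds a limit, and logs the measurements for production monitoring:

```python
import logging
import statistics

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("quant-validation")

# Assumed acceptance threshold in nats; tune per model and workload.
KLD_THRESHOLD = 0.05

def validate_quantization(per_token_klds, threshold=KLD_THRESHOLD):
    """Gate deployment: mean per-token KLD vs. the reference model must stay under threshold."""
    mean_kld = statistics.mean(per_token_klds)
    max_kld = max(per_token_klds)
    log.info("mean KLD=%.4f, max KLD=%.4f", mean_kld, max_kld)
    if mean_kld > threshold:
        log.warning("quantized model exceeds KLD threshold %.4f; rejecting", threshold)
        return False
    return True
```

Keeping the same logger in production lets the discrepancy logging recommended above reuse this code path rather than a separate tool.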
Stack Impact

The impact on common homelab stacks is significant: limited VRAM (e.g. 6 GB) forces the use of already-quantized models, compounding the accuracy loss and risking unstable model behavior.
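To make the VRAM trade-off concrete, a back-of-the-envelope KV cache size estimate can be computed from the architecture; the layer/head counts below are hypothetical stand-ins for a 12B-class model, not published figures for any model named in this advisory:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem):
    """Approximate KV cache size: 2 tensors (K and V) per layer,
    each n_kv_heads * head_dim values per token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

# Hypothetical 12B-class model: 40 layers, 8 KV heads, head_dim 128, 8k context
fp16 = kv_cache_bytes(40, 8, 128, 8192, 2)  # fp16: 2 bytes/element
q8   = kv_cache_bytes(40, 8, 128, 8192, 1)  # 8-bit KV cache: 1 byte/element
print(f"fp16 KV cache: {fp16 / 2**30:.2f} GiB, q8: {q8 / 2**30:.2f} GiB")
```

On a 6 GB card already hosting quantized weights, the difference of a GiB or more at long contexts is exactly what pushes operators toward the aggressive KV cache quantization this advisory warns about.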
