Co-Activation Pattern Detection for Prompt Injection: A Mechanistic Interpretability Approach Using Sparse Autoencoders

This paper describes a prompt injection detector that uses sparse autoencoders (SAEs) and the FP-Growth algorithm to mine co-activation patterns. It applies Gemma Scope SAEs at layers 6, 12, and 18 and reports up to 14 times fewer false positives than single-feature scoring. The method is worth considering over conventional approaches such as one-class SVM or Isolation Forest, which often struggle with high-dimensional data and tend to produce more false positives.

The approach is evaluated on 2,067 held-out payloads from 110 attack categories, where it reaches a 95.2% detection rate with significantly fewer false positives than single-feature scoring. Detection relies on Gemma Scope SAEs at specific layers and on conjunctive co-activation patterns mined with the FP-Growth algorithm, so an input is flagged when a set of features fires together rather than when any single feature fires. The system also includes a trust boundary mechanism that excludes the beginning-of-sequence (BOS) token, which is intended to keep detection robust against malicious payloads. The reported p95 latency is 8.6 milliseconds on consumer-grade GPUs.
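
The difference between single-feature scoring and conjunctive co-activation matching can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the feature ids (101, 202), the per-token dictionary layout, the activation threshold, and all function names are hypothetical, and the BOS token is assumed to sit at position 0.

```python
from typing import Dict, FrozenSet, List, Set

BOS_POSITION = 0  # assumption: the BOS token is the first position of every sequence


def active_features(
    activations: List[Dict[int, float]],
    threshold: float = 0.0,
) -> Set[int]:
    """Collect SAE feature ids that fire on any non-BOS token.

    activations[t] maps a (hypothetical) feature id to its activation
    at token position t.
    """
    fired: Set[int] = set()
    for position, token_acts in enumerate(activations):
        if position == BOS_POSITION:
            continue  # the BOS token is excluded from scoring
        fired.update(f for f, a in token_acts.items() if a > threshold)
    return fired


def single_feature_flag(fired: Set[int], suspicious: Set[int]) -> bool:
    """Baseline: flag the input if any one suspicious feature fires."""
    return bool(fired & suspicious)


def conjunctive_flag(fired: Set[int], patterns: List[FrozenSet[int]]) -> bool:
    """Flag the input only if every feature of at least one pattern fires."""
    return any(pattern <= fired for pattern in patterns)
```

With a single mined pattern `frozenset({101, 202})`, an input on which only feature 101 fires outside the BOS position is flagged by `single_feature_flag` but not by `conjunctive_flag`. That is the kind of case where a conjunctive rule can avoid a false positive.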

For engineers and sysadmins, the practical concern is that prompt injection can lead to serious security breaches in systems built on machine learning models. A sysadmin running Proxmox VE 7.x or Docker CE 20.10 may, for instance, struggle to protect workloads on a Kubernetes cluster from such attacks. Applying Gemma Scope SAEs with the described method could improve detection and reduce false alarms, both of which matter for maintaining trust boundaries where models process user input. That in turn helps prevent unauthorized access to sensitive data or services managed by these systems.

The paper introduces a mechanism based on sparse autoencoders (SAEs) and co-activation patterns that detects prompt injection attacks with fewer false positives than single-feature scoring.
Using Gemma Scope SAEs at layers 6, 12, and 18 gives the detector features from several depths of the network, which supports more accurate pattern recognition on complex attack vectors.
The FP-Growth algorithm mines conjunctive co-activation patterns efficiently from large datasets, which matters for identifying rare or novel injection attacks that traditional methods may miss.
Excluding the BOS token as part of the trust boundary mechanism is meant to prevent false alarms, keeping detection focused on suspicious input rather than on legitimate prompts that merely start with common tokens.
The p95 latency of 8.6 milliseconds on consumer GPUs indicates that the method is fast enough for real-time monitoring in production environments.
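
The mining step mentioned above can be illustrated with a small frequent-itemset counter. This sketch is deliberately not FP-Growth: it is a brute-force stand-in that enumerates feature combinations directly. On small inputs it yields the same frequent itemsets that FP-Growth would, without the FP-tree that makes the real algorithm efficient. The function name, parameters, and feature ids are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, Set


def mine_coactivation_patterns(
    transactions: Iterable[Set[int]],
    min_support: int,
    min_len: int = 2,
    max_len: int = 3,
) -> Dict[FrozenSet[int], int]:
    """Return feature sets that co-fire in at least `min_support` inputs.

    Each transaction is the set of (hypothetical) SAE feature ids that
    fired on one known attack payload. Brute-force enumeration is used
    here for clarity; FP-Growth would reach the same itemsets faster.
    """
    counts: Counter = Counter()
    for fired in transactions:
        items = sorted(fired)
        for size in range(min_len, max_len + 1):
            for combo in combinations(items, size):
                counts[frozenset(combo)] += 1
    return {pattern: n for pattern, n in counts.items() if n >= min_support}


# Toy data: features fired by four hypothetical attack payloads.
attack_transactions = [
    {101, 202, 303},
    {101, 202},
    {101, 202, 404},
    {303, 404},
]
patterns = mine_coactivation_patterns(attack_transactions, min_support=3)
```

In this toy data only the pair of features 101 and 202 co-fires in three payloads, so it is the only pattern that survives the support threshold. For realistic feature counts a proper FP-Growth implementation is the better choice, since the number of combinations grows quickly with transaction size.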

Stack Impact

For homelab stacks running Proxmox VE 7.x or Docker CE 20.10 with Kubernetes (v1.23+), deploying a detector built on Gemma Scope SAEs could catch and mitigate prompt injection attacks more effectively than traditional methods.
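
Before relying on the quoted latency figure, it may be worth measuring p95 latency on your own hardware. The sketch below is a generic timing harness, not part of the paper: the function names are hypothetical, and `fn` stands for whatever callable runs one detection pass. On a GPU, work is typically launched asynchronously, so `fn` should block until its result is ready before returning, otherwise the timings will be too low.

```python
import time
from typing import Callable, List


def p95(samples_ms: List[float]) -> float:
    """Nearest-rank 95th percentile of latency samples in milliseconds."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    # Integer form of ceil(0.95 * n), giving a 1-based rank.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def measure_latency_ms(
    fn: Callable[[], object],
    runs: int = 200,
    warmup: int = 20,
) -> List[float]:
    """Time `fn` over `runs` calls after `warmup` untimed calls."""
    for _ in range(warmup):
        fn()  # warm caches and lazy initialization before timing
    samples: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples
```

A typical use is `p95(measure_latency_ms(run_detector))`, where `run_detector` is your own zero-argument wrapper around one detection call.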

Action Items

Require version 1.0.5 or later of your Gemma Scope SAE library to pick up the latest improvements in co-activation pattern detection and performance.
Update the Docker CE configuration at /etc/docker/daemon.json so that containers running Gemma Scope SAE models have GPU access and appropriate resource limits.
Consider upgrading Proxmox VE from 7.x to the latest stable release for improved compatibility and security features that can support your prompt injection detection setup.
