This advisory summarizes the performance of local large language models (LLMs) on real-world agent tasks, run through OpenClaw with Ollama on a Raspberry Pi 5 paired with an RTX 3090 GPU. The models were evaluated on email management, meeting scheduling, error detection, and browser automation, among other tasks. The qwen3.5:27b-q4_K_M model outperformed every other model by a wide margin with a score of 59.4%, while the larger qwen3.5:35b variant reached only 23.2%; the remaining models scored below 5%. The dominant performance factor was a model's ability to locate and invoke command-line tools correctly, which points to potential vulnerabilities in the underlying software configuration if those tool invocations are not properly secured.
Affected software:
- Ollama
- OpenClaw
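Since tool discovery was the dominant performance factor, one concrete check is verifying that the CLI tools an agent depends on are actually resolvable on the host's PATH before granting the model access to them. The sketch below uses only the Python standard library; the tool list is a hypothetical example and should be replaced with whatever your own agent setup invokes.

```python
import shutil

# Hypothetical list of CLI tools an agent deployment might depend on;
# substitute the tools your own OpenClaw/Ollama setup actually uses.
REQUIRED_TOOLS = ["sh", "ls", "grep"]

def missing_tools(tools):
    """Return the subset of tools that cannot be found on PATH."""
    return [t for t in tools if shutil.which(t) is None]

if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing CLI tools:", ", ".join(missing))
    else:
        print("All required CLI tools found on PATH.")
```

Running a check like this at startup makes tool-availability failures explicit instead of letting them surface as silent agent-task failures.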
Recommended mitigations:
- Review each model's configuration files, focusing on how the model discovers and invokes command-line tools.
- Update models and runtimes to the most recent released versions if patches or improvements have shipped since this test.
- Apply security best practices such as sandboxing and least-privilege access when running LLMs in a production environment.
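One way to apply the sandboxing and least-privilege recommendation is to run the model server in a container with capabilities dropped and the API bound to localhost. This is a minimal sketch assuming Docker is installed and the official ollama/ollama image is used; the container name, volume name, and port bindings are placeholders to adapt to your environment.

```shell
# Sketch: run Ollama in Docker with reduced privileges.
docker run -d --rm --name ollama-sandboxed \
  --cap-drop=ALL \
  --security-opt no-new-privileges \
  -v ollama-models:/root/.ollama \
  -p 127.0.0.1:11434:11434 \
  ollama/ollama
```

GPU passthrough would additionally require `--gpus=all` with the NVIDIA Container Toolkit installed; binding the port to 127.0.0.1 keeps the API off the local network.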
The findings indicate that homelab setups pairing a Raspberry Pi 5 with an RTX 3090 GPU can see large performance variance across models. This variance can undermine the reliability of automated tasks such as email management and error detection and, if the environment is not properly secured, expose systems to additional risk.