
A SaaS platform implementing a security analysis system based on separate actor and judge LLMs, with dynamic injection of safety experiences obtained through automated sandbox testing (similar to ToolShield).

Scouted 8 hours ago

6.5 / 10
Overall score



Score breakdown

Urgency: 9.0
Market size: 7.0
Feasibility: 6.0
Competition: 4.0

Pain point

LLMs that generate risky actions cannot correctly self-assess the security risk of those actions, which results in high attack success rates on multi-turn sequences.

Who'd pay for this

Companies that develop or integrate conversational agents or automated systems with access to executable tools, especially in the IT security, software development, and automation sectors.

Source signal

"Attack Success Rate ranges from 75% to 88% across three frontier models. That's the number we'd like to bring down."

Original post

[Feature]: LLM-as-guardrail SecurityAnalyzer with self-exploration safety-experience injection (ToolShield)

Published: 8 hours ago

Repository: OpenHands/OpenHands
Author: xli04

### Is there an existing feature request for this?

- [X] I have searched existing issues and feature requests, and this is not a duplicate.

### Problem or Use Case

## What problem or use case are you trying to solve?

The current default, `LLMSecurityAnalyzer`, asks the same LLM that generates a tool call to self-annotate `security_risk` in the same response. A model that intends to execute a risky tool call is unlikely to simultaneously flag that call as risky: the same context that causes the bad action also biases the self-assessment. The approach also has known reliability issues on weaker models, which silently omit or hallucinate the `security_risk` field ([ALL-3921](https://linear.app/all-hands-ai/issue/ALL-3921/bug-cli-v1-crashes-with-llm-provided-a-security-risk-but-no-security)), and the rule-based Invariant analyzer can't capture dynamic multi-turn attack structure (and has had stability problems, [#5264](<https://github.com/OpenHands/OpenHands/issues/5264>)). The design issue that introduced the current analyzer ([ALL-3014](https://linear.app/all-hands-ai/issue/ALL-3014/agent-improve-security-analyzer-with-llm-provided-risk)) itself noted that rule-based approaches aren't general enough.

We measured this on a 100-task subset of MT-AgentRisk (arXiv:2602.13379, Feb 2026), a benchmark that transforms single-turn harmful tasks from established works such as OpenAgentSafety and Safearena into multi-turn sequences. The distributed harmful goals bypass the current single-turn safety mechanism, leaving executor LLMs unable to reliably identify the risk. **Attack Success Rate ranges from 75% to 88% across three frontier models.** That's the number we'd like to bring down.

## Background: MT-AgentRisk and ToolShield

**MT-AgentRisk** (arXiv:2602.13379) is a multi-turn agent safety benchmark.
It takes single-turn harmful tasks from established benchmarks such as OpenAgentSafety and Safearena (365 tasks across Filesystem-MCP, Browser/Playwright-MCP, PostgreSQL-MCP, Notion-MCP, and Terminal), then transforms each one into a sequence of steps that together achieve the harmful goal. A single-turn `rm -rf /` is rejected; broken into three innocuous-seeming commands across turns, it executes.

**ToolShield** (same paper) is a training-free, tool-agnostic defense. When the agent first encounters a tool, it autonomously generates test cases, executes them in the sandbox, and observes what actually happens downstream. From those observations it distills structured safety experiences. At deployment, the relevant experiences are injected into the guardrail LLM's context to guide its judgement.

### Proposed Solution

We'd like to contribute an **opt-in** alternative that separates the actor and the judge: a distinct guardrail LLM reviews each proposed action rather than relying on the actor to self-report. Second, and more importantly, **ToolShield** distills safety experiences by red-teaming each tool in the sandbox once; the resulting guidelines are then injected into the guardrail's context at decision time. It's built on the existing `SecurityAnalyzer` interface, so there are no changes to core, tools, or the `ConfirmRisky` enforcement path. Two PRs: one SDK change and one small frontend change.

**PR 1 — `ToolShieldLLMSecurityAnalyzer` (SDK)**

* New `SecurityAnalyzerBase` subclass; for each `ActionEvent`, issues a separate completion to a guardrail LLM and returns `LOW / MEDIUM / HIGH / UNKNOWN`.
* Configurable fields:
  * `llm` — any SDK `LLM`
  * `history_window` — number of prior actions included as context (default 20)
  * `safety_experiences` — optional string injected into the system prompt
* Failure handling (opinionated, logged distinctly):
  * LLM infrastructure error → `UNKNOWN` (lets `ConfirmRisky` apply its configured fallback)
  * Unparseable guardrail output → `HIGH` (conservative)
* Pairs with `ConfirmRisky` unchanged; no changes to `Agent` or tool executors.
* Per-tool safety experiences come from the existing `toolshield` PyPI package via an optional extra (`pip install openhands-sdk[toolshield]`) — no new SDK submodule.
* Three helpers populate `safety_experiences`:
  * `default_safety_experiences()` — terminal + filesystem seed
  * `load_safety_experiences(tool_names)` — explicit list
  * `auto_detect_safety_experiences()` — probes localhost MCP ports

**PR 2 — Frontend dropdown entry** (OpenHands app repo)

* Adds `"ToolShield Analyzer"` to the analyzer dropdown in `verification-settings.tsx` (~10 lines + i18n keys). No new sub-form — the guardrail reads its LLM config from the existing SDK `LLM` profile system, and experiences come from the `toolshield` optional extra.
* Default remains `LLMSecurityAnalyzer`.

## Evaluation

100-task harmful subset of MT-AgentRisk; 50 benign tasks from Terminal-Bench (35 easy + 15 medium). Three models tested. For each row we use the same model as both the executor and the guardrail.

**Attack Success Rate on harmful tasks (lower is better)**

| Model | `LLMSecurityAnalyzer` (current default) | + LLM Guardrail | + LLM Guardrail + ToolShield |
| -- | -- | -- | -- |
| Claude Sonnet 4.5 | 88% | 16% | 7% |
| Qwen3.5-Plus | 75% | 14% | 10% |
| Gemini 3 Flash | 81% | 18% | 10% |

The current default security analyzer performs worst because distributed multi-turn harm bypasses the single-turn safety reasoning the actor LLM relies on.
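The actor/judge split and the failure-handling rules above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the SDK implementation: `GuardrailAnalyzer`, `Risk`, and `call_guardrail` are hypothetical stand-ins for the real `SecurityAnalyzerBase` subclass, the SDK's risk enum, and its `LLM` abstraction.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Risk(Enum):
    LOW = "LOW"
    MEDIUM = "MEDIUM"
    HIGH = "HIGH"
    UNKNOWN = "UNKNOWN"


@dataclass
class GuardrailAnalyzer:
    # (system_prompt, user_prompt) -> raw guardrail reply; stand-in for the SDK LLM
    call_guardrail: Callable[[str, str], str]
    safety_experiences: str = ""   # optional distilled ToolShield experiences
    history_window: int = 20       # prior actions kept as context (default 20)

    def security_risk(self, action: str, history: List[str]) -> Risk:
        system = "You are a security guardrail. Answer with exactly one of: LOW, MEDIUM, HIGH."
        if self.safety_experiences:
            # Inject distilled experiences into the static system prompt
            system += "\n\nSafety experiences:\n" + self.safety_experiences
        recent = "\n".join(history[-self.history_window:])
        user = f"Recent actions:\n{recent}\n\nProposed action:\n{action}"
        try:
            reply = self.call_guardrail(system, user).strip().upper()
        except Exception:
            return Risk.UNKNOWN  # infra error: let ConfirmRisky apply its fallback
        for level in (Risk.HIGH, Risk.MEDIUM, Risk.LOW):
            if level.value in reply:
                return level
        return Risk.HIGH  # unparseable output: fail conservative
```

Scanning for `HIGH` before `LOW` keeps ambiguous replies on the conservative side; a production version would presumably use structured output rather than substring matching.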
Adding a separate guardrail improves performance by introducing a second judgment. ToolShield builds on this further: the distilled safety experiences give the guardrail actionable, tool-specific grounded knowledge to guide its behavior.

**End-to-end cost on 50 benign Terminal-Bench tasks (USD)**, terminal + filesystem experiences injected:

| Model | Baseline | + LLM Guardrail | + LLM Guardrail + ToolShield | Completion Rate Across 3 Settings |
| -- | -- | -- | -- | -- |
| Claude Sonnet 4.5 | $30.06 | $31.24 | $32.29 | 80% |
| Qwen3.5-Plus | $7.99 | $8.19 | $8.49 | 78% |
| Gemini 3 Flash | $4.30 | $4.54 | $4.60 | 72% |

More importantly, **LLM Guardrail + ToolShield introduces 0 false positives on benign tasks**: no benign action was blocked. LLM Guardrail plus ToolShield adds only minimal overhead compared with the baseline. Once generated, experiences are transferable across models: the same artifacts can serve different guardrail LLMs without regeneration. For providers that support prompt caching, the guardrail's static context (system prompt + experiences) can also be cached server-side, further reducing per-call cost.

### Alternatives Considered

* **Keep the status quo (`LLMSecurityAnalyzer`)**: cheapest, but as shown above, same-LLM self-annotation is structurally weak against multi-turn attacks (75–88% ASR). The [ALL-3921](https://linear.app/all-hands-ai/issue/ALL-3921/bug-cli-v1-crashes-with-llm-provided-a-security-risk-but-no-security) reliability bug on weaker models further limits this option.
* **Extend the existing `LLMSecurityAnalyzer` with a second self-call**: we considered asking the actor LLM to re-evaluate its own action in a second pass. It's cheaper than a separate model, but the underlying bias (same context → same blind spot) persists; a distinct judge model is what actually breaks the pattern.
* **Rule-based `InvariantAnalyzer`**: can't capture dynamic multi-turn attack structure, and has had stability problems ([#5264](<https://github.com/OpenHands/OpenHands/issues/5264>)). The original design discussion ([ALL-3014](https://linear.app/all-hands-ai/issue/ALL-3014/agent-improve-security-analyzer-with-llm-provided-risk)) already acknowledges that rule-based approaches aren't general enough for agent safety.
* **Fine-tuning a dedicated guardrail model**: would give the best per-call accuracy, but requires a training pipeline, dataset curation, per-model retraining, and ongoing maintenance. We want a training-free path that works with any off-the-shelf LLM so operators can swap in whatever they already trust.

### Priority / Severity

Medium - Would improve experience

### Estimated Scope

Medium - New feature with moderate complexity

### Feature Area

Agent / AI behavior

### Technical Implementation Ideas (Optional)

## Do you have thoughts on the technical implementation?

* Lives in `openhands-sdk` at `openhands/sdk/security/` alongside `LLMSecurityAnalyzer`. No changes to `Agent`, tool executors, or the `ConfirmRisky` enforcement path.
* Reuses the existing `LLM` abstraction for the guardrail, so any provider LiteLLM supports works.
* ToolShield experiences are JSON-serializable and cached per tool name in the workspace/settings area.
* Graceful degradation: LLM infrastructure error → `UNKNOWN` (`ConfirmRisky` then prompts the user, matching what the current default does for missing annotations); unparseable guardrail output → `HIGH` (conservative, logged).
* By default, the guardrail is seeded with the filesystem and terminal experiences we generated. Users can also generate their own experiences with different models or tools, and can manually adjust the injected experiences as needed. The system also supports automatic injection for all active MCP tools by scanning localhost to identify them.
* Happy to gate this behind a feature flag for the first release.
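The per-tool-name JSON cache mentioned above could be as simple as one file per tool. A minimal sketch, assuming experiences are plain JSON-serializable dicts; the class name and directory layout are illustrative, not the SDK's actual workspace/settings path:

```python
import json
from pathlib import Path
from typing import Optional


class ExperienceCache:
    """Stores one JSON experience file per tool name under a cache root."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, tool_name: str) -> Path:
        # One file per tool: e.g. <root>/terminal.json
        return self.root / f"{tool_name}.json"

    def save(self, tool_name: str, experience: dict) -> None:
        self._path(tool_name).write_text(json.dumps(experience, indent=2))

    def load(self, tool_name: str) -> Optional[dict]:
        # Cache miss returns None so the caller can trigger sandbox exploration
        p = self._path(tool_name)
        return json.loads(p.read_text()) if p.exists() else None
```

Because the cached artifacts are plain JSON, the same files can be reused across guardrail models without regeneration, matching the transferability claim above.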
### Additional Context

## Future work and extensibility

For this initial PR we include experiences for the tools evaluated in the paper. The API is designed to let users contribute their own experiences; we'll document the format and seed a few examples. Ongoing maintenance of the experience library is a direction we're interested in, but we'd like to discuss governance with maintainers before committing.

## Additional context

* Paper: *Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Agents*, arXiv:2602.13379
* Related: [ALL-3014](https://linear.app/all-hands-ai/issue/ALL-3014/agent-improve-security-analyzer-with-llm-provided-risk) (LLM risk analyzer design), [ALL-3190](https://linear.app/all-hands-ai/issue/ALL-3190/frontend-implement-llm-risk-analyzer-ui) (selector UI), [ALL-3921](https://linear.app/all-hands-ai/issue/ALL-3921/bug-cli-v1-crashes-with-llm-provided-a-security-risk-but-no-security) (self-annotation reliability bug), [#5264](<https://github.com/OpenHands/OpenHands/issues/5264>) (Invariant stability), [ALL-2325](https://linear.app/all-hands-ai/issue/ALL-2325/security-enhancement-secure-handling-of-environment-variables-and) (unified security policy)