Modern incident response is not a single-turn classification task. In real SOC workflows, multiple specialists must coordinate under uncertainty, analyze partial evidence, make mitigation decisions, and verify outcomes over long trajectories. LogSentinel v2 is built to train and evaluate exactly this behavior.

LogSentinel v2 is an OpenEnv-compliant environment that simulates a SOC war-room. It introduces four collaborating roles: Incident Commander, App SRE, DB SRE, and Security Analyst. Each role receives role-filtered observations (partial observability), forcing explicit collaboration through handoffs and shared context rather than relying on a single omniscient agent.
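Role-filtered observations can be pictured as projecting a shared world state onto per-role views. The sketch below is illustrative only: the field names and role keys are hypothetical, not LogSentinel v2's actual observation schema.

```python
# Hypothetical per-role visibility sets (not the real schema).
ROLE_VIEWS = {
    "incident_commander": {"open_incidents", "phase", "team_messages"},
    "app_sre": {"error_rate", "service_health", "team_messages"},
    "db_sre": {"replication_lag", "queue_depth", "team_messages"},
    "security_analyst": {"attack_signal", "auth_logs", "team_messages"},
}

def filter_observation(world_state: dict, role: str) -> dict:
    """Return only the fields visible to the given role."""
    visible = ROLE_VIEWS[role]
    return {k: v for k, v in world_state.items() if k in visible}

state = {
    "error_rate": 0.31, "replication_lag": 12.5, "attack_signal": 0.9,
    "queue_depth": 4800, "service_health": "degraded",
    "open_incidents": [], "phase": "detect", "team_messages": [],
    "auth_logs": ["failed login x500"],
}
print(filter_observation(state, "db_sre"))
# Each role sees only its slice, so findings must be shared via handoffs.
```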

The environment is structured as a five-phase lifecycle: detect, triage, mitigate, verify, and report. Importantly, phase transitions are action-driven, not fixed by time. Agents must produce evidence-backed incident proposals, assign severity, execute targeted mitigations, and validate recovery before final reporting.
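Action-driven transitions amount to a small state machine in which each phase is gated on a specific action succeeding; waiting alone never advances the lifecycle. A minimal sketch, with hypothetical action names:

```python
PHASES = ["detect", "triage", "mitigate", "verify", "report"]

# Gating action that must be taken before leaving each phase (illustrative names).
ADVANCE_ON = {
    "detect": "file_incident_proposal",
    "triage": "assign_severity",
    "mitigate": "apply_mitigation",
    "verify": "confirm_recovery",
}

def step_phase(phase: str, action: str) -> str:
    """Advance only when the phase's gating action is taken; time alone never advances."""
    if ADVANCE_ON.get(phase) == action:
        return PHASES[PHASES.index(phase) + 1]
    return phase

assert step_phase("detect", "wait") == "detect"
assert step_phase("detect", "file_incident_proposal") == "triage"
```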

At the world-model level, LogSentinel v2 tracks latent operational state, including service health, replication lag, error rate, attack signal, queue depth, and containment status. Mitigations affect these variables with delayed dynamics, creating long-horizon credit-assignment challenges similar to real operations.
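The delayed dynamics can be sketched as a queue of pending effects: a mitigation applied at step t only starts moving the latent state several steps later, which is what stretches credit assignment over the horizon. Variable names and the delay value here are illustrative.

```python
import collections

class World:
    """Toy latent state with delayed mitigation effects (illustrative, not the real model)."""
    def __init__(self):
        self.error_rate = 0.5
        self.pending = collections.deque()  # [steps_remaining, effect]

    def apply_mitigation(self, effect: float, delay: int = 3):
        self.pending.append([delay, effect])

    def step(self):
        for item in self.pending:
            item[0] -= 1
        # Apply every effect whose delay has elapsed.
        while self.pending and self.pending[0][0] <= 0:
            _, effect = self.pending.popleft()
            self.error_rate = max(0.0, self.error_rate - effect)

w = World()
w.apply_mitigation(effect=0.3, delay=3)
for t in range(4):
    w.step()
# error_rate holds at 0.5 for the first two steps, then drops to 0.2 at step 3.
```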

To support RLVR/RLVE-style training, rewards are verifiable and composable:

  • Outcome quality (restoration + containment)
  • Detection F1 (precision/recall against ground truth incidents)
  • Severity accuracy (exact and near-miss scoring)
  • Efficiency (MTTR and wasted-step penalties)
  • Teamwork quality (useful handoffs, reduced redundancy)
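One way to picture composability is a weighted sum over the verifiable terms above; the weights and term values below are illustrative placeholders, not the environment's actual reward shaping.

```python
def total_reward(terms: dict, weights: dict) -> float:
    """Combine independently verifiable reward terms into one scalar."""
    return sum(weights[k] * terms[k] for k in weights)

# Hypothetical weights and per-episode term values.
weights = {"outcome": 0.35, "detection_f1": 0.25, "severity": 0.15,
           "efficiency": 0.15, "teamwork": 0.10}
terms = {"outcome": 1.0, "detection_f1": 0.8, "severity": 1.0,
         "efficiency": 0.6, "teamwork": 0.5}
print(total_reward(terms, weights))
```

Because each term is checkable on its own, ablating or reweighting a component does not require re-verifying the others.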

We also add anti-reward-hacking controls: duplicate incident spam penalties, blind mitigation penalties, unsupported report-claim penalties, and no-op repetition penalties.
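Two of these controls are easy to show in miniature: counting re-filed incidents and detecting no-op repetition. Thresholds and penalty magnitudes here are illustrative, not the environment's actual values.

```python
def duplicate_spam_penalty(proposals: list[str], per_dup: float = 0.5) -> float:
    """Penalize re-filing the same incident to farm detection reward."""
    dups = len(proposals) - len(set(proposals))
    return per_dup * dups

def noop_repetition_penalty(actions: list[str], window: int = 3,
                            pen: float = 0.2) -> float:
    """Penalize taking the same no-op action several steps in a row."""
    tail = actions[-window:]
    if len(tail) == window and set(tail) == {"noop"}:
        return pen
    return 0.0

assert duplicate_spam_penalty(["db-lag", "db-lag", "auth-attack"]) == 0.5
assert noop_repetition_penalty(["noop", "noop", "noop"]) == 0.2
```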

For training, we use GRPO with TRL + Unsloth. We compare a heuristic baseline against a trained policy on easy/medium/hard scenarios with reproducible seeds. In our runs, trained policies show clear gains across total reward, success rate, detection F1, severity accuracy, and teamwork score.
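GRPO's core step, independent of the TRL/Unsloth plumbing, is computing advantages relative to a group of rollouts from the same scenario instead of from a learned value function. A minimal sketch of just that step:

```python
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each rollout's reward against its group's mean and std."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

advs = group_relative_advantages([0.84, 0.40, 0.92, 0.10])
# Above-mean rollouts get positive advantage; below-mean ones get negative.
```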

LogSentinel v2 is designed as a practical benchmark for multi-agent, long-horizon, partially observable LLM behavior with tool-like actions and verifiable outcomes. We hope it helps push post-training beyond static prompt-response tasks toward realistic operational reasoning.

Links