Microsoft has introduced ExCyTIn-Bench, an open-source benchmarking tool developed to assess the performance of AI systems in cybersecurity investigations.
The tool simulates multistage cyberattack scenarios in a security operations centre (SOC) environment built on Microsoft Azure, using live queries across 57 log tables from Microsoft Sentinel and related services.
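As a loose illustration (not part of ExCyTIn-Bench itself), the sketch below shows how an agent might issue one such live query against the Log Analytics workspace behind Sentinel using Azure's Python Monitor Query SDK; the workspace ID and the choice of table are placeholders.

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

# Authenticate against Azure and target the Log Analytics workspace backing Sentinel.
client = LogsQueryClient(DefaultAzureCredential())

# Placeholder workspace ID; SecurityAlert is one of the standard Sentinel log tables.
response = client.query_workspace(
    workspace_id="<log-analytics-workspace-id>",
    query="SecurityAlert | project TimeGenerated, AlertName, Entities | take 10",
    timespan=timedelta(days=7),
)

# Print the returned rows (assumes the query succeeded in full).
for table in response.tables:
    for row in table.rows:
        print(row)
```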
Its methodology reflects the data volume and operational complexity that security teams encounter during real incidents.
Unlike earlier benchmarks that rely on static knowledge or multiple-choice questioning, ExCyTIn-Bench generates question-answer sets from incident graphs constructed by human analysts.
These bipartite alert-entity graphs allow for assessments grounded in authentic SOC data, requiring AI models to plan and execute investigative steps across multiple data sources.
The benchmark produces granular, stepwise feedback on each investigative action, moving beyond binary pass-fail grading.
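A toy sketch of the underlying idea follows; the structures and names are illustrative assumptions, not ExCyTIn-Bench's actual schema. Alerts and entities form the two sides of the graph, and a question withholds one entity that the model must recover by investigating the data.

```python
from dataclasses import dataclass, field

@dataclass
class Alert:
    """An alert node; its linked entities form the other side of the bipartite graph."""
    alert_id: str
    title: str
    entities: list[str] = field(default_factory=list)

# Hypothetical incident graph: two alerts sharing one entity (the compromised host).
alerts = [
    Alert("a1", "Suspicious sign-in", ["user:jdoe", "host:vm-web-01", "ip:203.0.113.7"]),
    Alert("a2", "Malware execution", ["host:vm-web-01", "file:payload.exe"]),
]

def make_question(alert: Alert, known: str, hidden: str) -> dict:
    """Build a question-answer pair: the model sees the alert context and a known
    entity, and must recover the hidden entity by querying the underlying logs."""
    return {
        "context": f"Alert '{alert.title}' involves {known}.",
        "question": "Which other entity is linked to this alert?",
        "answer": hidden,
    }

qa = make_question(alerts[1], known="host:vm-web-01", hidden="file:payload.exe")
print(qa["question"], "->", qa["answer"])
```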
Microsoft applies ExCyTIn-Bench internally to test AI-driven security features and identify detection or workflow gaps in its own models.
The company also uses it to evaluate integrations with Microsoft Security Copilot, Microsoft Sentinel, and Microsoft Defender, tracking both model performance and associated operational costs.
The framework aims to offer chief information security officers (CISOs), IT leaders, and buyers a consistent means of comparing AI capabilities in security contexts.
By capturing how AI agents decompose investigative goals, interact with tools, and synthesise evidence, ExCyTIn-Bench addresses the limitations seen in benchmarks based on static evidence or trivia-style questioning.
Microsoft points out that even recent industry efforts such as CyberSOCEval do not fully capture the requirement for agents to interact with live, noisy data in a controlled SOC environment.
ExCyTIn-Bench is available as an open-source resource on GitHub, with Microsoft inviting participation from model developers and security teams.
The company indicated that future updates would include options for tailoring benchmarks to specific threat scenarios at the customer tenant level.
In September 2025, Microsoft integrated Anthropic’s Claude models into Copilot Studio, enhancing its existing support for OpenAI’s large language models.
The rollout has started for early release customers and will be available in preview across all environments within two weeks, with full production deployment anticipated by the end of 2025.
