
Towards Autonomous Detection Engineering: Embedding-Based Retrieval & MCP Orchestration

Fatih Bulut, Anjali Mangal

Microsoft

12/10/2025


Autonomous Detection Engineering

[Figure: TTPs]


Today’s Detection Engineering

  • Rules are dispersed across multiple platforms, leading to duplication and inconsistency.
  • Workflows rely heavily on manual, ticket-based processes with fragile handoffs.
  • There is little end-to-end visibility into coverage, quality, and technical debt.


Goal

  • Objective: Enhance existing detection tools with AI to speed up creation, minimize duplication, and reveal coverage gaps.
  • Approach: Aggregate data, enrich with AI, index, use MCP tools, apply guarded generation, and involve humans in the loop.
  • Outcome: Achieve safe, gradual automation that supports analysts by reducing repetitive tasks, not replacing them.


Architecture

[Architecture diagram]


Preparation of Detection Metadata

  • Combine multi-source rules into a single schema (id, tactics, language, entities, sources, logic).
  • Use AI to extract ATT&CK mappings, entity types, detection logic, and dependencies.
  • Drive extraction with prompted analyzers: few-shot examples plus schema-driven JSON outputs (see the sketch below).
  • Maintain quality via self-consistency checks and schema validation.
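
A minimal sketch of one such prompted analyzer, assuming the OpenAI Python SDK and the jsonschema package; the schema shape, model choice, and function name are illustrative, not the production pipeline:

import json
from jsonschema import validate  # pip install jsonschema
from openai import OpenAI        # pip install openai

# Illustrative target schema; the real unified schema has more fields.
DETECTION_SCHEMA = {
    "type": "object",
    "required": ["id", "tactics", "language", "entities", "sources", "logic"],
    "properties": {
        "id": {"type": "string"},
        "tactics": {"type": "array", "items": {"type": "string"}},
        "language": {"type": "string"},
        "entities": {"type": "array", "items": {"type": "string"}},
        "sources": {"type": "array", "items": {"type": "string"}},
        "logic": {"type": "string"},
    },
}

def extract_metadata(rule_text: str, client: OpenAI) -> dict:
    # Prompted analyzer: few-shot examples would precede the user turn.
    resp = client.chat.completions.create(
        model="gpt-4.1",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": "Extract detection metadata as JSON with keys: "
                        "id, tactics, language, entities, sources, logic."},
            {"role": "user", "content": rule_text},
        ],
    )
    record = json.loads(resp.choices[0].message.content)
    validate(record, DETECTION_SCHEMA)  # schema validation gate
    # Self-consistency in practice: run the analyzer k times, keep the
    # majority answer per field before accepting the record.
    return record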


Cataloging for Information Retrieval

  • A vector database for efficient semantic access and retrieval
  • A relational database to manage metadata mapping and details
  • Embedding models for representation and inference (a combined sketch follows)
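
A combined sketch of the catalog, assuming SQLite stands in for the relational store, an in-memory NumPy index stands in for the vector database, and OpenAI embeddings are used; all names are illustrative:

import sqlite3
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    r = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(r.data[0].embedding, dtype=np.float32)

conn = sqlite3.connect("detections.db")
conn.execute("""CREATE TABLE IF NOT EXISTS detections
                (id TEXT PRIMARY KEY, tactics TEXT, language TEXT, logic TEXT)""")

vectors: dict[str, np.ndarray] = {}  # detection id -> embedding

def index_detection(record: dict) -> None:
    # Relational side: metadata mapping and details.
    conn.execute("INSERT OR REPLACE INTO detections VALUES (?, ?, ?, ?)",
                 (record["id"], ",".join(record["tactics"]),
                  record["language"], record["logic"]))
    # Vector side: semantic representation of the detection logic.
    vectors[record["id"]] = embed(record["logic"])

def semantic_search(query: str, k: int = 5) -> list[str]:
    q = embed(query)
    scores = {i: float(v @ q / (np.linalg.norm(v) * np.linalg.norm(q)))
              for i, v in vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]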


Democratize Use Across the Ecosystem



Model Context Protocol (MCP)


  • Standardized interface — Defines how LLMs interact with external tools, data sources, and APIs in a consistent, secure way.
  • Context orchestration — Enables models to dynamically request, retrieve, and use relevant context (files, queries, connectors) during reasoning.
  • Extensible ecosystem — Supports plug-and-play “tools” (e.g., code, search, calendar) so models can act as autonomous, context-aware agents.

https://modelcontextprotocol.io


Core Building Blocks of MCP Server

  • Server: Hosts tools, prompts, and resources.
  • Client: The model or app that connects to the server.
  • Session: Manages communication and context sharing.
  • Tools / Prompts / Resources: The capabilities the model can call to reason, retrieve data, or act (minimal server sketch below).
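
As a concrete illustration of these blocks, a minimal server sketch using the official MCP Python SDK (pip install mcp); the tool bodies are stubs that would call the catalog from the earlier slides:

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("detection-catalog")  # the server hosting the tools

@mcp.tool()
def semantic_search_detections(query: str, top_k: int = 5) -> list[dict]:
    """Perform semantic search across all detection repositories."""
    # Would call the vector index sketched earlier; placeholder result here.
    return [{"id": "D-001", "score": 0.91}]

@mcp.tool()
def get_detection_details(detection_id: str) -> dict:
    """Get comprehensive details about a specific detection by ID."""
    return {"id": detection_id, "tactics": ["TA0006"], "language": "KQL"}

if __name__ == "__main__":
    mcp.run()  # stdio transport by default; clients connect via a session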


What Tools Do We Need?


  • semantic_search_detections: Perform semantic search across all detection repositories.
  • get_detection_details: Get comprehensive details about a specific detection by ID.
  • get_detection_content: Read the actual detection logic/code from the database using the detection ID.
  • search_by_mitre: Search for detections by MITRE ATT&CK technique or subtechnique ID (e.g., T1003, T1059.001).
  • get_similar_detections: Find detections similar to a given detection using vector similarity.
  • search_by_platform: Search detections by platform, with optional language filtering.
  • get_telemetry_fields: Fetch available fields from the telemetry schema for a given platform and table/event type.
  • get_available_actions_and_tables: List all available detection actions and their corresponding tables for a platform.
  • get_best_table_for_action: Recommend the most relevant telemetry table/event for detecting a specific action.
  • list_available_prompt_platforms: List all available detection development platforms with their configurations.
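
To make one of these concrete, a sketch of search_by_mitre backed by the relational store; the techniques column and the substring-matching logic are illustrative assumptions, not the authors' implementation:

import sqlite3
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("detection-catalog")
conn = sqlite3.connect("detections.db")

@mcp.tool()
def search_by_mitre(technique_id: str) -> list[dict]:
    """Search detections by MITRE ATT&CK technique or subtechnique ID
    (e.g., T1003, T1059.001)."""
    # Substring match so T1059 also returns its subtechniques (T1059.001, ...).
    rows = conn.execute(
        "SELECT id, language, logic FROM detections WHERE techniques LIKE ?",
        (f"%{technique_id}%",),
    ).fetchall()
    return [{"id": r[0], "language": r[1], "logic": r[2]} for r in rows]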


Visual Studio Code for Orchestration

  • An implementation of agentic orchestration is already available
  • Offers built-in support for MCP server integration (example configuration below)
  • Ships with extra functionality such as web search
  • Widely used and recognized among developers
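
For reference, a local MCP server can be registered in a workspace's .vscode/mcp.json; the server name and command below are illustrative, and the exact schema should be checked against the current VS Code documentation:

{
  "servers": {
    "detection-catalog": {
      "type": "stdio",
      "command": "python",
      "args": ["detection_catalog_server.py"]
    }
  }
}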


Experiments

  • A total of 22 detections spanning two separate platforms
  • The primary languages were PySpark and KQL
  • The detections covered every MITRE ATT&CK tactic except Reconnaissance
  • Three methods were investigated (contrasted in the sketch after this list):
      • Baseline
      • Sequential
      • Agentic
  • OpenAI models were compared: GPT-4o, GPT-4.1, o1, o3, GPT-5, and GPT-5.1
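
The three methods can be caricatured as follows; this is our sketch of the general pattern, not the authors' experiment harness, and every name in it is illustrative:

def baseline(llm, task: str) -> str:
    return llm(task)                      # single shot, no tools

def sequential(llm, task: str, tools: list) -> str:
    ctx = task
    for tool in tools:                    # fixed, predetermined tool order
        ctx += "\n" + tool(ctx)
    return llm(ctx)

def agentic(llm, task: str, tools: dict, max_turns: int = 10) -> str:
    history = [task]
    for _ in range(max_turns):            # multi-turn; the model picks tools
        action = llm("\n".join(history))  # a tool call or the final detection
        if action.startswith("FINAL:"):
            return action.removeprefix("FINAL:")
        name, arg = action.split("(", 1)  # e.g. get_telemetry_fields(...)
        history.append(tools[name](arg.rstrip(")")))
    return history[-1]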


Evaluation

  • LLM as a judge, answering binary rubric questions such as:

{
  "ttp_match": true,
  "logic_equivalence": true,
  "schema_accuracy": true,
  "syntax_validity": true,
  "indicator_alignment": true,
  "exclusion_parity": true,
  "robustness": true,
  "data_source_correct": true,
  "output_alignment": true,
  "library_usage": true
}

  • Embedding cosine similarity between the gold and generated detections (see the sketch below)
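
A sketch of the second metric, assuming OpenAI embeddings; the model name is illustrative:

import numpy as np
from openai import OpenAI

client = OpenAI()

def embedding_similarity(gold: str, generated: str) -> float:
    # Embed both detections in one call, then compare directions.
    r = client.embeddings.create(model="text-embedding-3-small",
                                 input=[gold, generated])
    a, b = (np.array(d.embedding) for d in r.data)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))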


Setting the Right Expectation

Because we rely solely on the model's ability to produce code with a restricted set of tools, and never execute that code, we do not anticipate perfect scores.


Results


Approach      LLM-Judge   Embedding Similarity
Agentic       0.47        0.80
Sequential    0.39        0.73
Baseline      0.38        0.69

Which method proves to be more effective?

*Mean across high, medium, and low reasoning efforts, averaged over models.


Results


Model          LLM-Judge   Embedding Similarity
GPT-5.1        0.57        0.86
GPT-5          0.54        0.82
GPT-5.1-chat   0.44        0.74
GPT-4.1        0.43        0.83
o3             0.40        0.76
GPT-4o         0.31        0.75

Which model performs better?

*Mean across high, medium, and low reasoning efforts, agentic approach only.


Results


Model     Reasoning   LLM-Judge   Embedding Similarity
GPT-5.1   High        0.55        0.86
GPT-5.1   Medium      0.60        0.86
GPT-5.1   Low         0.55        0.86
GPT-5     High        0.55        0.83
GPT-5     Medium      0.54        0.85
GPT-5     Low         0.52        0.77
o3        High        0.45        0.80
o3        Medium      0.41        0.76
o3        Low         0.35        0.73

Which level of reasoning effort is more effective?

*Agentic approach only.


Token Consumption


[Chart: relative token consumption. Baseline: X, Sequential: 2X, Agentic: 40X. Cost and task completion time increase accordingly.]


Key Insights

  • An agentic workflow is the more effective option for detection generation:
      • It is a multi-turn process
      • It selects tools as needed
      • It allows for course correction and error fixing
  • High reasoning effort is advantageous if cost is not a major concern
      • Note: it requires more time (and expense)
  • Models tend to be unreliable judges when asked to score on a broad numeric range
      • It is better to pose specific questions that yield True/False answers (see the sketch below)
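
For instance, the binary rubric from the evaluation slide can be collapsed into a single score by averaging the True answers; whether the reported LLM-Judge numbers are aggregated exactly this way is our assumption:

import json

def judge_score(judge_json: str) -> float:
    # e.g. judge_json = '{"ttp_match": true, "logic_equivalence": false, ...}'
    verdicts = json.loads(judge_json)
    return sum(bool(v) for v in verdicts.values()) / len(verdicts)

print(judge_score('{"ttp_match": true, "logic_equivalence": false}'))  # 0.5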


Future Work

  • Conduct experiments with more detections, covering the majority of MITRE ATT&CK techniques (not just tactics).
  • Run evaluations with human validation to ensure that automated scoring aligns with human judgment.
  • The research so far centers solely on OpenAI models; we plan to investigate other model providers.


Thanks!
