Towards Autonomous Detection Engineering: Embedding-Based Retrieval & MCP Orchestration
Fatih Bulut, Anjali Mangal
Microsoft
12/10/2025
Autonomous Detection Engineering
TTPs
Today’s Detection Engineering
Goal
Architecture
Preparation of Detection Metadata
Cataloging for Information Retrieval
Democratize Use Across the Ecosystem
Model Context Protocol (MCP)
https://modelcontextprotocol.io
Core Building Blocks of MCP Server
What Tools Do We Need?
| Tool | Description |
| --- | --- |
| semantic_search_detections | Perform semantic search across all detection repositories |
| get_detection_details | Get comprehensive details about a specific detection by ID |
| get_detection_content | Read the actual detection logic/code from the database using a detection ID |
| search_by_mitre | Search for detections by MITRE ATT&CK technique or sub-technique ID (e.g., T1003, T1059.001) |
| get_similar_detections | Find detections similar to a given detection using vector similarity |
| search_by_platform | Search detections by platform, with optional language filtering |
| get_telemetry_fields | Fetch available fields from the telemetry schema for a given platform and table/event type |
| get_available_actions_and_tables | List all available detection actions and their corresponding tables for a platform |
| get_best_table_for_action | Recommend the most relevant telemetry table/event for detecting a specific action |
| list_available_prompt_platforms | List all available detection development platforms with their configurations |
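As a rough illustration of what `semantic_search_detections` and `get_similar_detections` do under the hood, the sketch below ranks detections by cosine similarity between embedding vectors. The catalog schema, detection IDs, and tiny 3-dimensional embeddings are illustrative assumptions, not the production implementation; real embeddings come from an embedding model and are far higher-dimensional.

```python
from math import sqrt

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_search_detections(query_vec, catalog, top_k=3):
    # catalog: list of {"id": ..., "embedding": [...]} records (hypothetical schema).
    ranked = sorted(catalog,
                    key=lambda d: cosine(query_vec, d["embedding"]),
                    reverse=True)
    return [d["id"] for d in ranked[:top_k]]

# Toy catalog with 3-dimensional embeddings for demonstration only.
catalog = [
    {"id": "DET-001", "embedding": [0.9, 0.1, 0.0]},
    {"id": "DET-002", "embedding": [0.0, 1.0, 0.0]},
    {"id": "DET-003", "embedding": [0.7, 0.6, 0.1]},
]
print(semantic_search_detections([1.0, 0.0, 0.0], catalog, top_k=2))
# → ['DET-001', 'DET-003']
```

`get_similar_detections` would reuse the same ranking, but seed the query vector with the stored embedding of an existing detection instead of an embedded natural-language query.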
Visual Studio Code for Orchestration
Experiments
Evaluation
{
"ttp_match": true,
"logic_equivalence": true,
"schema_accuracy": true,
"syntax_validity": true,
"indicator_alignment": true,
"exclusion_parity": true,
"robustness": true,
"data_source_correct": true,
"output_alignment": true,
"library_usage": true
}
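A simple way to collapse this ten-criterion rubric into a single score is the fraction of criteria passed. Equal weighting is our assumption for illustration; the deck does not specify how criteria are weighted.

```python
# Pass/fail outcomes for the ten evaluation criteria from the rubric.
rubric = {
    "ttp_match": True, "logic_equivalence": True, "schema_accuracy": True,
    "syntax_validity": True, "indicator_alignment": True, "exclusion_parity": True,
    "robustness": True, "data_source_correct": True, "output_alignment": True,
    "library_usage": True,
}

def rubric_score(rubric):
    # Fraction of criteria satisfied (equal weights assumed).
    return sum(rubric.values()) / len(rubric)

print(rubric_score(rubric))  # → 1.0
```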
Setting the Right Expectation
Because we rely solely on the model's ability to produce code with a restricted set of tools, without executing that code, we do not expect perfect scores.
Results
| Approach | LLM-Judge | Embedding Similarity |
| --- | --- | --- |
| Agentic | 0.47 | 0.80 |
| Sequential | 0.39 | 0.73 |
| Baseline | 0.38 | 0.69 |
Which method proves to be more effective?
*Mean value calculated from high, medium, and low reasoning efforts across various models.
Results
| Model | LLM-Judge | Embedding Similarity |
| --- | --- | --- |
| GPT-5.1 | 0.57 | 0.86 |
| GPT-5 | 0.54 | 0.82 |
| GPT-5.1-chat | 0.44 | 0.74 |
| GPT-4.1 | 0.43 | 0.83 |
| o3 | 0.40 | 0.76 |
| GPT-4o | 0.31 | 0.75 |
Which model performs better?
*Mean calculated across high, medium, and low levels of reasoning effort for the agentic approach.
Results
| Model | Reasoning | LLM-Judge | Embedding Similarity |
| --- | --- | --- | --- |
| GPT-5.1 | High | 0.55 | 0.86 |
| GPT-5.1 | Medium | 0.60 | 0.86 |
| GPT-5.1 | Low | 0.55 | 0.86 |
| GPT-5 | High | 0.55 | 0.83 |
| GPT-5 | Medium | 0.54 | 0.85 |
| GPT-5 | Low | 0.52 | 0.77 |
| o3 | High | 0.45 | 0.80 |
| o3 | Medium | 0.41 | 0.76 |
| o3 | Low | 0.35 | 0.73 |
Which level of reasoning effort is most effective?
* Only the agentic approach is used.
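The per-model LLM-Judge scores reported earlier appear to be the mean over the three reasoning-effort levels, as the footnotes state. The quick check below reproduces those summary values from the per-effort numbers, assuming a simple unweighted mean.

```python
# LLM-Judge scores per reasoning effort (high, medium, low) for the agentic approach.
judge = {
    "GPT-5.1": [0.55, 0.60, 0.55],
    "GPT-5":   [0.55, 0.54, 0.52],
    "o3":      [0.45, 0.41, 0.35],
}

for model, scores in judge.items():
    mean = sum(scores) / len(scores)
    print(f"{model}: {mean:.2f}")
# → GPT-5.1: 0.57, GPT-5: 0.54, o3: 0.40 — matching the per-model summary table.
```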
Token Consumption
[Chart: relative token consumption by approach]

| Approach | Token Consumption |
| --- | --- |
| Baseline | X |
| Sequential | 2X |
| Agentic | 40X |

As token consumption grows, cost and task-completion time increase accordingly.
Key Insights
Future Work
Thanks!