
Towards Autonomous Detection Engineering: Embedding-Based Retrieval & MCP Orchestration

Fatih Bulut, Anjali Mangal

Microsoft

12/10/2025


Autonomous Detection Engineering

[Figure: TTPs]


Today’s Detection Engineering

  • Rules are dispersed across multiple platforms, leading to duplication and inconsistency.
  • Workflows rely heavily on manual, ticket-based processes with fragile handoffs.
  • There is little end-to-end visibility into coverage, quality, and technical debt.


Goal

  • Objective: Enhance existing detection tools with AI to speed up creation, minimize duplication, and reveal coverage gaps.
  • Approach: Aggregate data, enrich with AI, index, use MCP tools, apply guarded generation, and involve humans in the loop.
  • Outcome: Achieve safe, gradual automation that supports analysts by reducing repetitive tasks, not replacing them.


Architecture

[Architecture diagram]


Preparation of Detection Metadata

  • Combine multi-source rules into a single schema (id, tactics, language, entities, sources, logic).
  • Use AI to extract ATT&CK mappings, entity types, detection logic, and dependencies.
  • Drive extraction with prompted analyzers: few-shot examples plus schema-driven JSON outputs (see the sketch below).
  • Maintain quality via self-consistency checks and schema validation.
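
A minimal sketch of one such prompted analyzer, assuming the OpenAI Python SDK and the jsonschema package; the schema shape, model choice, and function name are illustrative, not the production pipeline:

import json
from jsonschema import validate  # pip install jsonschema
from openai import OpenAI        # pip install openai

# Illustrative target schema; the real unified schema has more fields.
DETECTION_SCHEMA = {
    "type": "object",
    "required": ["id", "tactics", "language", "entities", "sources", "logic"],
    "properties": {
        "id": {"type": "string"},
        "tactics": {"type": "array", "items": {"type": "string"}},
        "language": {"type": "string"},
        "entities": {"type": "array", "items": {"type": "string"}},
        "sources": {"type": "array", "items": {"type": "string"}},
        "logic": {"type": "string"},
    },
}

def extract_metadata(rule_text: str, client: OpenAI) -> dict:
    # Prompted analyzer: few-shot examples would precede the user turn.
    resp = client.chat.completions.create(
        model="gpt-4.1",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": "Extract detection metadata as JSON with keys: "
                        "id, tactics, language, entities, sources, logic."},
            {"role": "user", "content": rule_text},
        ],
    )
    record = json.loads(resp.choices[0].message.content)
    validate(record, DETECTION_SCHEMA)  # schema validation gate
    # Self-consistency in practice: run the analyzer k times, keep the
    # majority answer per field before accepting the record.
    return record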


Cataloging for Information Retrieval

  • A vector database for efficient semantic access and retrieval
  • A relational database to manage metadata mapping and details
  • Embedding models for representation and inference (a combined sketch follows)
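
A combined sketch of the catalog, assuming SQLite stands in for the relational store, an in-memory NumPy index stands in for the vector database, and OpenAI embeddings are used; all names are illustrative:

import sqlite3
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    r = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(r.data[0].embedding, dtype=np.float32)

conn = sqlite3.connect("detections.db")
conn.execute("""CREATE TABLE IF NOT EXISTS detections
                (id TEXT PRIMARY KEY, tactics TEXT, language TEXT, logic TEXT)""")

vectors: dict[str, np.ndarray] = {}  # detection id -> embedding

def index_detection(record: dict) -> None:
    # Relational side: metadata mapping and details.
    conn.execute("INSERT OR REPLACE INTO detections VALUES (?, ?, ?, ?)",
                 (record["id"], ",".join(record["tactics"]),
                  record["language"], record["logic"]))
    # Vector side: semantic representation of the detection logic.
    vectors[record["id"]] = embed(record["logic"])

def semantic_search(query: str, k: int = 5) -> list[str]:
    q = embed(query)
    scores = {i: float(v @ q / (np.linalg.norm(v) * np.linalg.norm(q)))
              for i, v in vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]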


Democratize Use Across the Ecosystem



Model Context Protocol (MCP)


  • Standardized interface — Defines how LLMs interact with external tools, data sources, and APIs in a consistent, secure way.
  • Context orchestration — Enables models to dynamically request, retrieve, and use relevant context (files, queries, connectors) during reasoning.
  • Extensible ecosystem — Supports plug-and-play “tools” (e.g., code, search, calendar) so models can act as autonomous, context-aware agents.

https://modelcontextprotocol.io


Core Building Blocks of MCP Server

  • Server: Hosts tools, prompts, and resources.
  • Client: The model or app that connects to the server.
  • Session: Manages communication and context sharing.
  • Tools / Prompts / Resources: The capabilities the model can call to reason, retrieve data, or act (minimal server sketch below).
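
As a concrete illustration of these blocks, a minimal server sketch using the official MCP Python SDK (pip install mcp); the tool bodies are stubs that would call the catalog from the earlier slides:

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("detection-catalog")  # the server hosting the tools

@mcp.tool()
def semantic_search_detections(query: str, top_k: int = 5) -> list[dict]:
    """Perform semantic search across all detection repositories."""
    # Would call the vector index sketched earlier; placeholder result here.
    return [{"id": "D-001", "score": 0.91}]

@mcp.tool()
def get_detection_details(detection_id: str) -> dict:
    """Get comprehensive details about a specific detection by ID."""
    return {"id": detection_id, "tactics": ["TA0006"], "language": "KQL"}

if __name__ == "__main__":
    mcp.run()  # stdio transport by default; clients connect via a session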


What Tools Do We Need?


  • semantic_search_detections: Perform semantic search across all detection repositories.
  • get_detection_details: Get comprehensive details about a specific detection by ID.
  • get_detection_content: Read the actual detection logic/code from the database using the detection ID.
  • search_by_mitre: Search for detections by MITRE ATT&CK technique or subtechnique ID (e.g., T1003, T1059.001).
  • get_similar_detections: Find detections similar to a given detection using vector similarity.
  • search_by_platform: Search detections by platform, with optional language filtering.
  • get_telemetry_fields: Fetch available fields from the telemetry schema for a given platform and table/event type.
  • get_available_actions_and_tables: List all available detection actions and their corresponding tables for a platform.
  • get_best_table_for_action: Recommend the most relevant telemetry table/event for detecting a specific action.
  • list_available_prompt_platforms: List all available detection development platforms with their configurations.
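
To make one of these concrete, a sketch of search_by_mitre backed by the relational store; the techniques column and the substring-matching logic are illustrative assumptions, not the authors' implementation:

import sqlite3
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("detection-catalog")
conn = sqlite3.connect("detections.db")

@mcp.tool()
def search_by_mitre(technique_id: str) -> list[dict]:
    """Search detections by MITRE ATT&CK technique or subtechnique ID
    (e.g., T1003, T1059.001)."""
    # Substring match so T1059 also returns its subtechniques (T1059.001, ...).
    rows = conn.execute(
        "SELECT id, language, logic FROM detections WHERE techniques LIKE ?",
        (f"%{technique_id}%",),
    ).fetchall()
    return [{"id": r[0], "language": r[1], "logic": r[2]} for r in rows]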


Visual Studio Code for Orchestration

  • An implementation of agentic orchestration is already available
  • Offers built-in support for MCP server integration (example configuration below)
  • Ships with extra functionality such as web search
  • Widely used and recognized among developers
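
For reference, a local MCP server can be registered in a workspace's .vscode/mcp.json; the server name and command below are illustrative, and the exact schema should be checked against the current VS Code documentation:

{
  "servers": {
    "detection-catalog": {
      "type": "stdio",
      "command": "python",
      "args": ["detection_catalog_server.py"]
    }
  }
}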


Experiments

  • A total of 22 detections spanning two separate platforms
  • The primary languages were PySpark and KQL
  • The detections covered every MITRE ATT&CK tactic except Reconnaissance
  • Three methods were investigated (contrasted in the sketch after this list):
      • Baseline
      • Sequential
      • Agentic
  • OpenAI models were compared: GPT-4o, GPT-4.1, o1, o3, GPT-5, and GPT-5.1
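
The three methods can be caricatured as follows; this is our sketch of the general pattern, not the authors' experiment harness, and every name in it is illustrative:

def baseline(llm, task: str) -> str:
    return llm(task)                      # single shot, no tools

def sequential(llm, task: str, tools: list) -> str:
    ctx = task
    for tool in tools:                    # fixed, predetermined tool order
        ctx += "\n" + tool(ctx)
    return llm(ctx)

def agentic(llm, task: str, tools: dict, max_turns: int = 10) -> str:
    history = [task]
    for _ in range(max_turns):            # multi-turn; the model picks tools
        action = llm("\n".join(history))  # a tool call or the final detection
        if action.startswith("FINAL:"):
            return action.removeprefix("FINAL:")
        name, arg = action.split("(", 1)  # e.g. get_telemetry_fields(...)
        history.append(tools[name](arg.rstrip(")")))
    return history[-1]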


Evaluation

  • LLM as a judge, answering binary rubric questions such as:

{
  "ttp_match": true,
  "logic_equivalence": true,
  "schema_accuracy": true,
  "syntax_validity": true,
  "indicator_alignment": true,
  "exclusion_parity": true,
  "robustness": true,
  "data_source_correct": true,
  "output_alignment": true,
  "library_usage": true
}

  • Embedding cosine similarity between the gold and generated detections (see the sketch below)
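
A sketch of the second metric, assuming OpenAI embeddings; the model name is illustrative:

import numpy as np
from openai import OpenAI

client = OpenAI()

def embedding_similarity(gold: str, generated: str) -> float:
    # Embed both detections in one call, then compare directions.
    r = client.embeddings.create(model="text-embedding-3-small",
                                 input=[gold, generated])
    a, b = (np.array(d.embedding) for d in r.data)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))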


Setting the Right Expectation

Because we rely solely on the model's ability to produce code with a restricted set of tools, and never execute that code, we do not anticipate perfect scores.


Results


Approach      LLM-Judge   Embedding Similarity
Agentic       0.47        0.80
Sequential    0.39        0.73
Baseline      0.38        0.69

Which method proves to be more effective?

*Mean across high, medium, and low reasoning efforts, averaged over models.


Results


Model          LLM-Judge   Embedding Similarity
GPT-5.1        0.57        0.86
GPT-5          0.54        0.82
GPT-5.1-chat   0.44        0.74
GPT-4.1        0.43        0.83
o3             0.40        0.76
GPT-4o         0.31        0.75

Which model performs better?

*Mean across high, medium, and low reasoning efforts, agentic approach only.


Results


Model     Reasoning   LLM-Judge   Embedding Similarity
GPT-5.1   High        0.55        0.86
GPT-5.1   Medium      0.60        0.86
GPT-5.1   Low         0.55        0.86
GPT-5     High        0.55        0.83
GPT-5     Medium      0.54        0.85
GPT-5     Low         0.52        0.77
o3        High        0.45        0.80
o3        Medium      0.41        0.76
o3        Low         0.35        0.73

Which level of reasoning effort is more effective?

*Agentic approach only.


Token Consumption


[Chart: relative token consumption. Baseline: X, Sequential: 2X, Agentic: 40X. Cost and task completion time increase accordingly.]


Key Insights

  • An agentic workflow is the more effective option for detection generation:
      • It is a multi-turn process
      • It selects tools as needed
      • It allows for course correction and error fixing
  • High reasoning effort is advantageous if cost is not a major concern
      • Note: it requires more time (and expense)
  • Models tend to be unreliable judges when asked to score on a broad numeric range
      • It is better to pose specific questions that yield True/False answers (see the sketch below)
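
For instance, the binary rubric from the evaluation slide can be collapsed into a single score by averaging the True answers; whether the reported LLM-Judge numbers are aggregated exactly this way is our assumption:

import json

def judge_score(judge_json: str) -> float:
    # e.g. judge_json = '{"ttp_match": true, "logic_equivalence": false, ...}'
    verdicts = json.loads(judge_json)
    return sum(bool(v) for v in verdicts.values()) / len(verdicts)

print(judge_score('{"ttp_match": true, "logic_equivalence": false}'))  # 0.5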


Future Work

  • Conduct experiments with more detections, covering the majority of MITRE ATT&CK techniques (not just tactics).
  • Run evaluations with human validation to ensure that automated scoring aligns with human judgment.
  • The research so far centers solely on OpenAI models; we plan to investigate other model providers.


Thanks!
