1 of 49

Prompt Evaluation Pipelines: How to Make Your Prompts Production-Grade

Oleksandr Antonov - Sr. software engineer at SeniorDev

Software Development Life Cycle (SDLC) in the Age of AI - who's in control now?

Anton Minakov - CTO at X-Receipts

How does NUU build AI that helps recruiters stay sharp, consistent, and in control

Thomas Vervik - CEO at Senior Dev
Iryna Mohylenko - Product Designer at Senior Dev

Prompt Engineering Pipeline, Security in the Age of AI

2 of 49

Oleksandr Antonov

Oleksandr bridges business and technology, with a main focus on building smart, cost-efficient, and high-performance solutions that use AI where it truly creates value.

Expertise Highlight

Meet Oleksandr Antonov, Senior Software Engineer at SeniorDev.

He has over 14 years of experience in software development, working across startups and international consulting companies in Fintech, Healthcare, and E-commerce.

Oleksandr's expertise lies in software architecture, system design, and building scalable products from scratch while improving existing systems with business impact in mind.

Professional Background

3 of 49

Prompt Evaluation

Pipeline

How to Make Your Prompts Production-Grade?

Building confidence through systematic testing

4 of 49

What problem are we trying to solve?

5 of 49

Task

Create a feature that lets the user describe the requirements for a job position and generates the job description text

6 of 49

Result

A structured job description based on the input parameters from the user

7 of 49

✏️ Write a prompt

🔁 Try it a few times

👀 Looks good to me

🚀 Ship it

🤞 Cross fingers

Sounds easy! 😎

So how do we make it happen?

8 of 49

01 Edge cases

3-5 input sets tested manually do not guarantee consistent output. We have to ensure the best results for any possible input parameters.

02 Prompt Drift

Small tweaks compound over time. What worked in v1 silently degrades by v12. Nobody notices until a customer does.

03 Silent Regressions

A model update, a context change, a new edge case - any of these can break the result with zero warning.

04 Fear of Iteration

Without a safety net, teams stop improving prompts. The cost of being wrong outweighs the gain of being better.

And that's where the problems begin 😢

9 of 49

So what is a Prompt Evaluation Pipeline (Evals)?

A Prompt Evaluation Pipeline is a structured way to test, measure, and improve LLM prompts to ensure reliable outputs at scale.

  • Test prompts on real examples
  • Measure output quality
  • Detect regressions early
  • Compare results across prompt versions
  • Improve through feedback loops

10 of 49

Evals are to prompts what unit tests are to code

Prompt change → Run Eval Suite → Score & Analyze → Gate Decision

Change code - run tests, change prompt - run evals
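The gate at the end of that flow can be as small as one function. The sketch below is a minimal illustration, not the speaker's implementation: it assumes the eval suite yields an average score on the 1-10 scale, and the `min_score` floor and `max_drop` tolerance are hypothetical values you would tune yourself.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    passed: bool
    reason: str


def gate_decision(baseline_score: float, candidate_score: float,
                  min_score: float = 7.0, max_drop: float = 0.3) -> GateResult:
    """Decide whether a prompt change may ship, from average eval scores (1-10).

    min_score and max_drop are illustrative defaults, not recommended values.
    """
    if candidate_score < min_score:
        return GateResult(False, f"score {candidate_score:.2f} is below the floor {min_score:.2f}")
    drop = baseline_score - candidate_score
    if drop > max_drop:
        return GateResult(False, f"score dropped by {drop:.2f} vs. baseline")
    return GateResult(True, "no regression detected")


print(gate_decision(baseline_score=8.0, candidate_score=7.5))
```

In CI this plays the same role as a failing unit test: a prompt change whose result has `passed == False` does not merge.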

11 of 49

Anatomy of eval

The core components you need to start using it

🗂️ Dataset: A curated list of input sets with expected outputs or scoring criteria.

🔬 Grader: The function, model, or human that scores each output against expectations.

⚙️ Runner: The orchestration that runs all cases, aggregates, and reports results.

📊 Report: Metrics (a number from 1 to 10) with reasoning, which collapse the evaluation into a trackable signal.
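One possible way to model these components as data, shown as a minimal sketch. The class and field names (`EvalCase`, `Grade`, `Report`, `rubric`, and so on) are assumptions made for illustration; the talk does not prescribe a schema.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class EvalCase:
    """One dataset entry: inputs for the prompt plus the criteria to score against."""
    case_id: str
    inputs: dict[str, str]
    rubric: str


@dataclass
class Grade:
    """What a grader returns for one output: a 1-10 score with its reasoning."""
    case_id: str
    score: int
    reasoning: str


@dataclass
class Report:
    """Collapses all grades for one prompt version into a single trackable number."""
    prompt_version: str
    grades: list[Grade] = field(default_factory=list)

    @property
    def mean_score(self) -> float:
        return mean(g.score for g in self.grades)


# Hypothetical cases for the job-description feature.
dataset = [
    EvalCase("senior-backend", {"title": "Senior Backend Engineer", "skills": "Go, PostgreSQL"},
             rubric="Mentions both skills; has Responsibilities and Requirements sections."),
    EvalCase("empty-skills", {"title": "Office Manager", "skills": ""},
             rubric="Does not invent technical skills."),
]

report = Report("v2", [Grade("senior-backend", 9, "All sections present."),
                       Grade("empty-skills", 6, "Invented one skill.")])
print(f"{report.prompt_version}: mean score {report.mean_score:.1f}")
```

The runner is then just the loop that turns each `EvalCase` into a `Grade` and collects the grades into a `Report`.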

12 of 49

What is a good eval dataset?

A dataset is a collection of test cases used to verify prompt behavior and output quality.

Representative coverage: Include the full distribution of real inputs - not just the happy path.

Edge cases explicitly: Long inputs, empty inputs, adversarial phrasing, language mix.

Clear ground truth: Each example has an expected output or scoring rubric before you run anything.

More examples - better results: 20 can catch obvious regressions. 100 gives statistical reliability.

⚠️ Very important step: It is often underestimated, but in fact has a greater impact on the outcome than anything else.
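One way to read the "20 vs. 100" rule of thumb is through the margin of error of a measured pass rate. The sketch below assumes each case is an independent pass/fail trial and uses the normal approximation, which is crude for very small datasets; `pass_rate_margin` is a hypothetical helper name.

```python
import math


def pass_rate_margin(pass_rate: float, n_cases: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a pass rate measured on n_cases.

    Uses the normal approximation to the binomial; treat small-n results as rough.
    """
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n_cases)


for n in (5, 20, 100):
    print(f"{n:>3} cases: 80% pass rate, margin of about {pass_rate_margin(0.8, n):.0%}")
```

Under these assumptions, an 80% pass rate measured on 20 cases is uncertain by roughly 18 percentage points either way, and on 100 cases by roughly 8. A handful of manual tries cannot tell a small regression from noise.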

13 of 49

What is a grader and what types are there?

A grader evaluates model outputs against defined quality criteria or expected behavior.

Code (fastest & cheapest): Programmatically evaluates outputs. Used for structured outputs with a single correct answer.

Model (context-aware & flexible): A model scores the output. Captures quality, tone, and reasoning that rules can't.

Human (ground truth): A human scores the output. Expensive to scale, but essential for calibrating other evaluators.

  • Compliance with general rules
  • Completeness
  • Safety
  • Check length
  • Ensure word/mask existence in output
  • Syntax/structure/type validation
  • General response quality
  • Relevance
  • Depth
  • Final approval and grader adjustments
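A code grader covering the length, word/mask, and structure checks from the list above might look like the sketch below. The section names, word limits, and the `code_grader` signature are assumptions for the job-description example, not part of the talk.

```python
import re

REQUIRED_SECTIONS = ("Responsibilities", "Requirements")  # assumed section names


def code_grader(output: str, required_patterns: list[str],
                min_words: int = 50, max_words: int = 400) -> list[str]:
    """Return the list of failed checks; an empty list means the output passed."""
    failures = []

    # Check length.
    word_count = len(output.split())
    if not min_words <= word_count <= max_words:
        failures.append(f"length: {word_count} words, expected {min_words}-{max_words}")

    # Ensure each required word or mask (regular expression) exists in the output.
    for pattern in required_patterns:
        if not re.search(pattern, output, re.IGNORECASE):
            failures.append(f"missing pattern: {pattern}")

    # Structure validation: each section name must start a line,
    # optionally written as a markdown heading.
    for section in REQUIRED_SECTIONS:
        if not re.search(rf"^\s*#*\s*{re.escape(section)}\b", output,
                         re.MULTILINE | re.IGNORECASE):
            failures.append(f"missing section: {section}")

    return failures


sample = ("Responsibilities\n" + "Build APIs in Go. " * 20
          + "\nRequirements\nPostgreSQL experience.")
print(code_grader(sample, [r"\bGo\b", r"PostgreSQL"]))
print(code_grader("Short text", [r"\bGo\b"]))
```

Because these checks are deterministic and cost nothing to run, they are a natural first gate before any model or human grader is involved.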

14 of 49

Model grader tips & tricks

  • Describe critical points and format
  • Explicitly describe the criteria for removing/adding points during scoring
  • Ask for reasoning
  • Set temperature to 0
  • Describe nuances, tone of voice, mood, etc
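The tips above can be combined into one grader prompt, sketched below. The rubric wording and point values are invented for illustration, and `call_model(prompt, temperature)` is a hypothetical wrapper around whichever LLM client you use; no real client API is assumed here.

```python
import json
from typing import Callable

# Illustrative rubric: explicit point deductions, tone guidance, and a fixed reply format.
GRADER_PROMPT = """You are grading a generated job description.

Start from 10 points and apply these rules:
- Subtract 3 if a required skill from the input is missing.
- Subtract 2 if the tone is not professional and friendly.
- Subtract 2 if the text invents requirements that are not in the input.

Input parameters:
{inputs}

Job description to grade:
{output}

Reply with JSON only: {{"reasoning": "<one or two sentences>", "score": <integer 1-10>}}
"""


def grade_with_model(inputs: str, output: str,
                     call_model: Callable[[str, float], str]) -> tuple[int, str]:
    """Ask a model to grade `output`; returns (score, reasoning)."""
    prompt = GRADER_PROMPT.format(inputs=inputs, output=output)
    raw = call_model(prompt, 0.0)  # temperature 0 for repeatable grading
    data = json.loads(raw)
    score = int(data["score"])
    if not 1 <= score <= 10:
        raise ValueError(f"score out of range: {score}")
    return score, str(data["reasoning"])


# Stand-in for a real model call so the sketch runs offline.
def fake_model(prompt: str, temperature: float) -> str:
    return '{"reasoning": "All skills covered; tone is fine.", "score": 9}'


print(grade_with_model("title: QA Engineer; skills: Cypress", "We are hiring...", fake_model))
```

Asking for the reasoning before the score gives you something to audit when a grade looks wrong, which is where the human grader's calibration role comes in.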

15 of 49

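Inside the Runner

How an eval goes from dataset to report in one automated pass

Load Dataset → Execute Prompt → Run Grader → Store Result → Generate Report

The Execute Prompt, Run Grader, and Store Result steps loop through the whole dataset. Once the loop is finished, the report is generated.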
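The same pass, written as a minimal sketch. The JSONL dataset layout, the `must_mention` field, and the stand-in `fake_prompt` and `keyword_grader` functions are assumptions so the example runs without a model; in practice `execute_prompt` would call your LLM and `grade` could be any of the grader types.

```python
import json
from statistics import mean
from typing import Callable

Case = dict  # assumed shape: {"id": str, "inputs": dict, "must_mention": list[str]}


def load_dataset(jsonl_text: str) -> list[Case]:
    """Load Dataset: one JSON object per line."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def run_eval(dataset: list[Case],
             execute_prompt: Callable[[dict], str],
             grade: Callable[[Case, str], tuple[int, str]]) -> dict:
    results = []
    for case in dataset:                                  # loop through the whole dataset
        output = execute_prompt(case["inputs"])           # Execute Prompt
        score, reasoning = grade(case, output)            # Run Grader
        results.append({"id": case["id"], "score": score,
                        "reasoning": reasoning})          # Store Result
    return {                                              # Generate Report
        "mean_score": mean(r["score"] for r in results),
        "worst_case": min(results, key=lambda r: r["score"])["id"],
        "results": results,
    }


def fake_prompt(inputs: dict) -> str:
    return f"We are hiring a {inputs['title']}."


def keyword_grader(case: Case, output: str) -> tuple[int, str]:
    missing = [w for w in case["must_mention"] if w.lower() not in output.lower()]
    return (10, "all keywords present") if not missing else (4, f"missing: {missing}")


DATASET = """
{"id": "qa", "inputs": {"title": "QA Engineer"}, "must_mention": ["QA Engineer"]}
{"id": "designer", "inputs": {"title": "Product Designer"}, "must_mention": ["Figma"]}
"""

report = run_eval(load_dataset(DATASET), fake_prompt, keyword_grader)
print(report["mean_score"], report["worst_case"])
```

Keeping the prompt executor and the grader as plain function arguments makes it easy to swap prompt versions or grader types without touching the loop.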

16 of 49

Evals report example

17 of 49

Key takeaways

  1. Vibes don't scale. Evals do
  2. A prompt without an eval is a feature without a test
  3. Bad dataset + great evaluator = false confidence
  4. The eval you skip is the regression you ship
  5. Evals aren't a tool you adopt. They're a habit you build

18 of 49

Anton Minakov

Expertise Highlight

Meet Anton Minakov, CTO at X-Receipts.

Experienced software engineer and engineering manager (20+ years of experience) in Telecom and Fintech domains.

Professional Background

19 of 49

INTERNAL TECH TALK / RECEIPTS AS

Software Development

in the AI Era

Everyone's using AI. Almost nobody has updated their security model.

In 2018, attackers had 771 days to use a new bug.

In 2025, they had minus 7.

In 2026...

20 of 49

Too many bugs. Way too many.

Public security bugs (CVEs) by year

48,185 new CVEs in 2025 (+20.6% from 2024)

131 new CVEs every single day

~60,000 expected for 2026

15-20%: only this much gets full NIST review now. The rest, you sort yourself.

Sources: NVD, MITRE CVE.org, Jerry Gamblin CVE Review 2025

21 of 49

And bugs get used faster than ever.

Median time from bug found to bug used in attacks

2018: 771 days. Two years to patch. Plenty of time.

2021: 84 days. Three months. Still manageable.

2023: 6 days. Less than one sprint.

2024: 4 hours. Forget patching. You won't make it.

2025: -7 days. Attack happens BEFORE the patch exists.

2026: ... Every report so far this year says faster.

The patch-and-respond model the industry used for 30 years is broken.

22 of 49

Why so fast? AI works for both sides.

Attackers got AI tools first. Defenders are catching up.

Attackers with AI

- Working exploit in 10-15 minutes, costs about $1

- One AI agent found 100+ Linux bugs in 30 days for $600

- 32% of new exploits appear before the bug is even public

- 28% of bugs get weaponized within 24 hours

And the noise problem

- Linux team: 5,530 CVEs in 2025 alone (8-9 per day)

- Their policy: "any bug could be a security bug"

- AI scanners flood bug trackers with duplicates

- Maintainers burn out. Real bugs hide in the noise.

RECEIPTS AS / OUR VIEW

Every CVE in our payment libraries needs triage. We can't keep up. Nobody can.

23 of 49

Agile changed too. Quietly.

Agile has been around for 20+ years. The recent shift isn't agile vs. waterfall. It's agile with humans vs. agile with AI.

PRE-AI AGILE / ~2010-2022

Humans in the loop

- Two-week sprints. Stand-ups. Retros.

- Humans wrote the code. Humans reviewed it.

- PR volume matched review capacity.

- Devs picked dependencies. Slowly.

- Bottleneck: writing the code.

AI AGILE / 2023-NOW

Agents in the loop

- Same sprints. Same stand-ups. Same retros.

- Agents draft the code. Humans still review it.

- PR volume doubles. Review capacity doesn't.

- Agents also auto-install packages without asking

- The slow part moved. It didn't disappear.

Coding feels 55% faster (GitHub). Teams ship 19% slower (DORA 2025). The bottleneck moved to review and rework.

24 of 49

"Good enough" vs "perfect".

The market traded "perfect" for "good enough" years ago. AI made "good enough" much, much cheaper. And, it turns out, worse.

AI-WRITTEN CODE vs. HUMAN-WRITTEN CODE

+55% vulnerability density vs. prior model

+278% path traversal risks

+336% certain critical bug classes

2.74x more likely to introduce XSS

1.91x more insecure object references

Sources: SonarSource analysis of Claude Opus 4.6 (Feb 2026), CodeRabbit AI code safety study (Dec 2025)

"Good enough" used to mean "ships with known limits."

Now it often means "ships with bugs nobody read."

RECEIPTS AS / OUR VIEW

We can't ship buggy. A double charge can't be rolled back with a hotfix tomorrow. The money already moved.

25 of 49

We can ship that fast. We already do.

Modern deploy pipelines are not the bottleneck. They never were.

11.6s: Amazon. One deploy every 11.6 seconds. In 2014. Pre-AI.

1000s: Netflix, Google. Thousands of deploys a day across services.

50+: Etsy and CapitalOne. 50+ deploys per product per day.

< 1 hr: DORA elite teams. Code-to-prod lead time.

And the surprise:

Teams that deploy more also break less.

- Small batches = small blast radius. One bad change is easy to find and revert.

- Forced automation. Manual gates can't survive 1,000 deploys/day.

- Feature flags as security tools. Kill a vulnerable code path in seconds.

So our pipeline can move in seconds. Why does our patch SLA still say 30 days?

26 of 49

The math doesn't work anymore.

Our pipelines deploy in seconds. Our security SLAs are still measured in days.

WHAT THE STANDARD ALLOWS

30 days: PCI DSS 4.0 patch window for critical bugs

90 days: between mandatory vulnerability scans

70 days: median real-world exposure window

WHAT ATTACKERS DO

5 days: median time to weaponize a CVE

4 hours: for the fast ones (2024 data)

-7 days: mean, exploit before patch exists

RECEIPTS AS / OUR VIEW

We can be fully PCI DSS compliant on Monday and breached on Tuesday. Compliance is the floor, not the ceiling.

27 of 49

Your code isn't really yours.

97% of commercial apps include open-source code. The npm story below has played out on PyPI, Maven, and NuGet too.

REAL EXAMPLE / MARCH 31, 2026

axios got hijacked on npm

~100M weekly downloads

~3 hrs malicious version was live

RAT auto-deployed via postinstall

What happened

- Attackers stole a maintainer's npm token

- Published axios 1.14.1 and 0.30.4 with a hidden malicious dependency

- npm install ran a postinstall script, no user click needed

- The script dropped a remote-access trojan (Win, macOS, Linux)

Why it was so dangerous

- axios is in almost every React / Node app

- AI coding agents auto-install and run new packages

- An agent could fetch the poisoned version, run it, infect the dev machine

- All before any human looked at the diff

RECEIPTS AS / OUR VIEW

We are someone's dependency. Their compromise becomes ours. Ours becomes every merchant's.

28 of 49

So what do we actually do?

Concrete things a developer can do this quarter. No theory.

STOP

Habits that quietly hurt you

- Floating version ranges in dep manifests

- Auto-running install/build scripts from untrusted packages

- Merging AI-generated PRs without reading the diff

- Running coding agents with permission-bypass flags

- Treating PCI/SOC2 sign-off as proof of safety

- Letting a single engineer hold the publish token (npm, PyPI, Maven...)

START

Things to ship this quarter

- Pin and lockfile all deps (package-lock, poetry.lock, go.sum...)

- Dep scanner on every PR (Dependabot, Snyk, Socket, Trivy)

- Generate an SBOM (Syft, CycloneDX) and store it

- Branch protection: require review, even for AI PRs

- Secrets scanner in CI (gitleaks, trufflehog)

- Canary tokens in configs. Cheap. They actually work.

- Runtime detection (Falco, Tetragon) for what slips past
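As a small starting point for the first STOP item, the sketch below flags floating version ranges in an npm manifest. It is an illustration only: `find_floating_ranges` is a hypothetical helper, the example manifest is made up, and anything that is not an exact `x.y.z` version (including git URLs and tags such as `latest`) is treated as floating.

```python
import json
import re

# Exact versions only, e.g. "1.3.0" or "2.0.0-beta.1".
EXACT_VERSION = re.compile(r"^\d+\.\d+\.\d+(-[0-9A-Za-z.-]+)?$")


def find_floating_ranges(package_json_text: str) -> dict[str, str]:
    """Return {package: spec} for every dependency that is not pinned to an exact version."""
    manifest = json.loads(package_json_text)
    floating = {}
    for section in ("dependencies", "devDependencies"):
        for name, spec in manifest.get(section, {}).items():
            if not EXACT_VERSION.match(spec):
                floating[name] = spec
    return floating


# Made-up manifest for illustration.
example = """{
  "dependencies": {"axios": "^1.14.0", "left-pad": "1.3.0"},
  "devDependencies": {"jest": "~29.7.0", "typescript": "latest"}
}"""

print(find_floating_ranges(example))
```

A check like this complements a committed lockfile rather than replacing it: installing with `npm ci` reproduces exactly what the lockfile records, while the script catches ranges that would let the next lockfile refresh pull in an unreviewed version.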

29 of 49

The takeaway

Every line of code you ship today

lives in a world where it can be found,

weaponized and exploited

before your next stand-up.

Build for that world.

Monday: pick one thing from the last slide. Ship that. Then pick the next.

This applies to mobile apps. It applies more to payments middleware like ours.

It applies to everyone in this room.

30 of 49

Thomas Vervik is the CEO of SeniorDev and an experienced developer. Together with clients, he has built countless applications and platforms, and these days he is actively testing AI tools to find out how the best ones should be used to build successful IT tools going forward.

Thomas Vervik

31 of 49

AI that supports - you decide

What's coming in Nuu

32 of 49

33 of 49

2. High-Risk (Regulated)

These AI systems are permitted but face stringent legal obligations before they can be used or sold. They are classified based on their use-case (e.g., in critical infrastructure, education, employment, or law enforcement) or as safety components in regulated products. [1, 2, 3, 4, 5]

  • Examples: AI for recruitment (CV-sorting), credit scoring, medical devices, and systems used to evaluate eligibility for essential public or social services.
  • Requirements: Conformity assessments, strict risk management systems, high-quality training data, detailed documentation, robust cybersecurity, human oversight, and continuous logging. You can learn more about the exact implementation rules via the Draft Commission Guidelines. [1, 2, 3]

34 of 49

May 7 and 8 in Oslo

35 of 49

Suggestive

AI flags what's worth your attention

Not decisive

Every hire, every call, every judgment is yours

AI is here to inform your judgment and not replace it

36 of 49

From a prompt to a publish-ready job description

01

37 of 49

01 • Job description generation • The problem

A blank page is the wrong place to start.

Every time a blank page

All of it for every new role

38 of 49

01 • Job description generation • The solution

Describe the role. Nuu writes the first draft.

Structured inputs = better output

39 of 49

Meet your AI scoring assistant

02

40 of 49

02 • Candidate scoring • The problem

100+

applications per role

Candidates #1-5

Candidate #87

100 applications. One human.

41 of 49

02 • Candidate scoring • The solution

Set criteria at publish: Define what a great candidate looks like before applications arrive.

AI scores as they come in: Every new candidate is scored automatically, in real time.

You open a sorted pipeline: The work is half done before you even start.

42 of 49

02 • Candidate scoring • The solution

A score with a reason — not a black box.

43 of 49

‍Responsible Candidate Scoring Using AI

AI is increasingly being used in recruiting – but how do we ensure it helps rather than hurts?

In this talk, we’ll explore how to build AI-powered candidate scoring systems (currently referred to as “the most hated” feature) that are fair, transparent and actually useful for recruiters and hiring managers.

44 of 49

Better prep. Better interviews. Better hires.

03

45 of 49

03 • Interview questions • The problem

10 interviews in a week, each with a unique CV, cover letter, portfolio, and story worth exploring.

The insight that would have changed the conversation is buried on page two of the CV.

Not because the recruiter wasn't thorough — but because there simply wasn't time.

The best interviewers ask the most specific questions. AI helps you get there for every candidate.

Prep time shrinks as pipeline grows.

AI reads what you don't have time to. You ask what matters.

46 of 49

03 • Interview questions • The solution

Based on this candidate's documents

CV, portfolio, and cover letter are read and cross-referenced

Questions written for this candidate. Not any candidate.

Highlights gaps and interesting areas

Something that you might not have caught in a quick read

Yours to edit or ignore

A starting point, not a script

47 of 49

Responsible AI

Tools this powerful deserve careful use.

What you control

  • Which criteria matter for this role
  • Whether to accept, override, or ignore the score
  • Which questions to ask, and how
  • Every hiring decision

What AI does

  • Reads and cross-references candidate documents
  • Scores candidates against your defined criteria
  • Surfaces gaps, patterns, and questions worth asking
  • Always explains its reasoning

48 of 49

The certification

49 of 49

Thank you!

Q&A