Prompt Evaluation Pipelines: How to Make Your Prompts Production-Grade�Oleksandr Antonov - Sr. software engineer at SeniorDev�
Software Development Life Cycle (SDLC) in the Age of AI - who's in control now?
Anton Minakov - CTO at X-Receipts
How does NUU build AI that helps recruiters stay sharp, consistent, and in control
Thomas Vervik - CEO at Senior Dev�Iryna Mohylenko - Product Designer at Senior Dev
Prompt Engineering Pipeline, Security in the Age of AI
Oleksandr Antonov
Oleksandr bridges business and technology with main focus on building smart, cost-efficient and high-performance solutions that use AI where it truly creates value.
Expertise Highlight
Meet Oleksandr Antonov, Senior Software Engineer at SeniorDev.
He has over 14 years of experience in software development, working across startups and international consulting companies in Fintech, Healthcare, and E-commerce.
Oleksandr`s expertise lies in software architecture, system design, and building scalable products from scratch while improving existing systems with business impact in mind.
Professional Background
Prompt Evaluation
Pipeline
How to Make Your Prompts Production-Grade?
Building confidence through systematic testing
What problem are we trying to solve?
Create feature that provide for user ability to explain job position requirements and generate job description text
Task
Structured job description based on the input parameter from user
Result
Write
a prompt
Try it
a few times
Looks
good to me
Ship it
Cross
fingers
✏️
🔁
👀
🚀
🤞
Sounds easy! 😎
So how do we do make it happen?
Edge cases
01
Prompt Drift
02
Silent Regressions
03
Fear of Iteration
04
Without a safety net, teams stop improving prompts. The cost of being wrong outweighs the gain of being better.
Small tweaks compound over time. What worked in v1 silently degrades by v12. Nobody notices until a customer does.
3 - 5 input sets tested manually do not guarantee consistent output. We have to ensure that with any possible input parameters we deliver the best results.
A model update, a
context change, a new edge case - any of these can break result with zero warning.
And that's where the problems begin 😢
So what is Prompt Evaluation Pipeline (Evals)?
A Prompt Evaluation Pipeline is a structured way to test, measure, and improve LLM prompts to ensure reliable outputs at scale.
Evals are to prompts
what unit tests are to code
Prompt
change
Run
Eval Suite
Score &
Analyze
Gate
Decision
Change code - run tests, change prompt - run evals
Anatomy of eval
Dataset
Grader
Report
Runner
The function, model or human that scores each output against expectations
The orchestration that runs all cases, aggregates, and reports results
A curated list of input set with expected outputs or scoring criteria.
Metrics (number 1–10) with reasoning, that collapses evaluation into a trackable signal
The core components you need to start using it
🗂️
🔬
📊
⚙️
What is a good eval dataset
Representative coverage
Edge cases explicitly
Clear ground
truth
More examples - better results
Very important step
Long inputs, empty inputs, adversarial phrasing, language mix.
20 can catch obvious regressions. 100 gives statistical reliability.
Include the full distribution of real inputs - not just the happy path.
Each example has an expected output or scoring rubric before you run anything.
It is often underestimated, but in fact has a greater impact on the outcome than anything else
A dataset is a collection of test cases used to verify prompt behavior and output quality.
⚠️
What is a grader and what types are there
Code
Model
Human
Programmatically evaluate outputs. Used for structured outputs with a single correct answer.
Model score the output. Captures quality, tone, and reasoning that rules can't.
Human score the output. Expensive to scale, but essential for calibrating other evaluators.
A grader evaluates model outputs against defined quality criteria or expected behavior.
Fastest & cheapest
Context aware & flexible
Ground truth
Model grader tips & tricks
Inside the Runner
Load
Dataset
Run
Grader
Store
Result
Execute
Prompt
Generate
Report
How an eval goes from dataset to report in one automated pass
Loop through the all dataset
Finished
Evals report example
Key takeaways
Anton Minakov
Expertise Highlight
Meet Anton Minakov, CTO at X-Receipts.
Experienced software engineer and engineering manager
(20+ years of experience ) in Telecom and Fintech domains.
Professional Background
INTERNAL TECH TALK / RECEIPTS AS
Software Development
in the AI Era
Everyone's using AI. Almost nobody has updated their security model.
In 2018, attackers had 771 days to use a new bug.
In 2025, they had minus 7.
In 2026...
Too many bugs. Way too many.
Public security bugs (CVEs) by year
48,185
new CVEs in 2025 (+20.6% from 2024)
131
new CVEs every single day
~60,000
expected for 2026
15-20%
only this much gets full NIST review now. The rest, you sort yourself.
Sources: NVD, MITRE CVE.org, Jerry Gamblin CVE Review 2025
And bugs get used faster than ever.
Median time from bug found to bug used in attacks
0
2018
771 days
Two years to patch. Plenty of time.
2021
84 days
Three months. Still manageable.
2023
6 days
Less than one sprint.
2024
4 hours
Forget patching. You won't make it.
2025
-7 days
Attack happens BEFORE the patch exists.
2026
...
Every report so far this year says faster.
The patch-and-respond model the industry used for 30 years is broken.
Why so fast? AI works for both sides.
Attackers got AI tools first. Defenders are catching up.
Attackers with AI
- Working exploit in 10-15 minutes, costs about $1
- One AI agent found 100+ Linux bugs in 30 days for $600
- 32% of new exploits appear before the bug is even public
- 28% of bugs get weaponized within 24 hours
And the noise problem
- Linux team: 5,530 CVEs in 2025 alone (8-9 per day)
- Their policy: "any bug could be a security bug"
- AI scanners flood bug trackers with duplicates
- Maintainers burn out. Real bugs hide in the noise.
RECEIPTS AS / OUR VIEW
Every CVE in our payment libraries needs triage. We can't keep up. Nobody can.
Agile changed too. Quietly.
Agile has been around for 20+ years. The recent shift isn't agile vs. waterfall. It's agile with humans vs. agile with AI.
PRE-AI AGILE / ~2010-2022
Humans in the loop
- Two-week sprints. Stand-ups. Retros.
- Humans wrote the code. Humans reviewed it.
- PR volume matched review capacity.
- Devs picked dependencies. Slowly.
- Bottleneck: writing the code.
AI AGILE / 2023-NOW
Agents in the loop
- Same sprints. Same stand-ups. Same retros.
- Agents draft the code. Humans still review it.
- PR volume doubles. Review capacity doesn't.
- Agents also auto-install packages without asking
- The slow part moved. It didn't disappear.
Coding feels 55% faster (GitHub). Teams ship 19% slower (DORA 2025). The bottleneck moved to review and rework.
"Good enough" vs "perfect".
Market traded "perfect" for "good enough" years ago. AI made "good enough" much, much cheaper. And, it turns out, worse.
AI-WRITTEN CODE vs. HUMAN-WRITTEN CODE
+55%
vulnerability density vs. prior model
+278%
path traversal risks
+336%
certain critical bug classes
2.74x
more likely to introduce XSS
1.91x
more insecure object references
Sources: SonarSource analysis of Claude Opus 4.6 (Feb 2026), CodeRabbit AI code safety study (Dec 2025)
"Good enough" used to mean "ships with known limits."
Now it often means "ships with bugs nobody read."
RECEIPTS AS / OUR VIEW
We can't ship buggy. A double charge can't be rolled back with a hotfix tomorrow. The money already moved.
We can ship that fast. We already do.
Modern deploy pipelines are not the bottleneck. They never were.
11.6s
Amazon. One deploy every 11.6 seconds. In 2014. Pre-AI.
1000s
Netflix, Google. Thousands of deploys a day across services.
50+
Etsy and CapitalOne. 50+ deploys per product per day.
< 1 hr
DORA elite teams. Code-to-prod lead time.
And the surprise:
Teams that deploy more also break less.
- Small batches = small blast radius. One bad change is easy to find and revert.
- Forced automation. Manual gates can't survive 1,000 deploys/day.
- Feature flags as security tools. Kill a vulnerable code path in seconds.
So our pipeline can move in seconds. Why does our patch SLA still say 30 days?
The math doesn't work anymore.
Our pipelines deploy in seconds. Our security SLAs are still measured in days.
WHAT THE STANDARD ALLOWS
30 days
PCI DSS 4.0 patch window for critical bugs
90 days
between mandatory vulnerability scans
70 days
median real-world exposure window
WHAT ATTACKERS DO
5 days
median time to weaponize a CVE
4 hours
for the fast ones (2024 data)
-7 days
mean: exploit before patch exists
RECEIPTS AS / OUR VIEW
We can be fully PCI DSS compliant on Monday and breached on Tuesday. Compliance is the floor, not the ceiling.
Your code isn't really yours.
97% of commercial apps include open-source code. The npm story below has played out on PyPI, Maven, and NuGet too.
REAL EXAMPLE / MARCH 31, 2026
axios got hijacked on npm
~100M
weekly downloads
~3 hrs
malicious version was live
RAT
auto-deployed via postinstall
What happened
- Attackers stole a maintainer's npm token
- Published axios 1.14.1 and 0.30.4 with a hidden malicious dependency
- npm install ran a postinstall script, no user click needed
- The script dropped a remote-access trojan (Win, macOS, Linux)
Why it was so dangerous
- axios is in almost every React / Node app
- AI coding agents auto-install and run new packages
- An agent could fetch the poisoned version, run it, infect the dev machine
- All before any human looked at the diff
RECEIPTS AS / OUR VIEW
We are someone's dependency. Their compromise becomes ours. Ours becomes every merchant's.
So what do we actually do?
Concrete things a developer can do this quarter. No theory.
STOP
Habits that quietly hurt you
- Floating version ranges in dep manifests
- Auto-running install/build scripts from untrusted packages
- Merging AI-generated PRs without reading the diff
- Running coding agents with permission-bypass flags
- Treating PCI/SOC2 sign-off as proof of safety
- One engineer holds the publish token (npm, PyPI, Maven...)
START
Things to ship this quarter
- Pin and lockfile all deps (package-lock, poetry.lock, go.sum...)
- Dep scanner on every PR (Dependabot, Snyk, Socket, Trivy)
- Generate an SBOM (Syft, CycloneDX) and store it
- Branch protection: require review, even for AI PRs
- Secrets scanner in CI (gitleaks, trufflehog)
- Canary tokens in configs. Cheap. They actually work.
- Runtime detection (Falco, Tetragon) for what slips past
The takeaway
Every line of code you ship today
lives in a world where it can be found,
weaponized and exploited
before your next stand up.
Build for that world.
Monday: pick one thing from the last slide. Ship that. Then pick the next.
This applies to mobile apps. It applies more to payments middleware like ours.
It applies to everyone in this room.
Thomas Vervik er daglig leder i SeniorDev, og erfaren utvikler. ��Han har sammen med kunder bygget et utall applikasjoner og plattformer, og jobber aktivt disse dager med å teste ut KI verktøy for å finne ut hvordan de beste skal brukes for å bygge suksessfulle IT verktøy fremover.
Thomas Vervik
AI that supports - you decide
What's coming in Nuu
2. High-Risk (Regulated)
These AI systems are permitted but face stringent legal obligations before they can be used or sold. They are classified based on their use-case (e.g., in critical infrastructure, education, employment, or law enforcement) or as safety components in regulated products. [1, 2, 3, 4, 5]
7 and 8. May in Oslo
Suggestive
AI flags what's worth your attention
Not decisive
Every hire, every call, every judgment is yours
AI is here to inform your judgment and not replace it
From a prompt to a publish-ready job description
01
01 • Job description generation • The problem
A blank page is the wrong place to start.
Every time a blank page
All of it for every new role
01 • Job description generation • The solution
Describe the role. Nuu writes the first draft.
Structured inputs = better output
Meet your AI scoring assistant
02
02 • Candidate scoring • The problem
100+
applications per role
Candidates #1-5
Candidate #87
100 applications. One human.
Set criteria at publish
Define what a great candidate looks like before applications arrive.
AI scores as they come in
Every new candidate is scored automatically, in real time.
You open a sorted pipeline
The work is half done before you even start.
02 • Candidate scoring • The solution
02 • Candidate scoring • The solution
A score with a reason — not a black box.
Responsible Candidate Scoring Using AI
AI is increasingly being used in recruiting – but how do we ensure it helps rather than hurts?
In this talk, we’ll explore how to build AI-powered candidate scoring systems (currently referred to as “the most hated” feature) that are fair, transparent and actually useful for recruiters and hiring managers.
Better prep. Better interviews. Better hires.
03
03 • Interview questions • The problem
10
interviews in a week, each with a unique CV, cover letter, portfolio, and story worth exploring.
The insight that would have changed the conversation is buried on page two of the CV.
Not because the recruiter wasn't thorough — but because there simply wasn't time.
The best interviewers ask the most specific questions. AI helps you get there for every candidate.
shrinks as pipeline grows.
Prep time
AI reads what you don't have time to. You ask what matters.
03 • Interview questions • The solution
Based on this candidate's documents
CV, portfolio, and cover letter are read and cross-referenced
Questions written for this candidate. Not any candidate.
Highlights gaps and interesting areas
Something that you might not have caught in a quick read
Yours to edit or ignore
A starting point, not a script
Responsible AI
Tools this powerful deserve careful use.
What AI does
What you control
The certification
Thank you!
Q&A