It's 2026, and We're Still Talking Evals

MLOps.community40mApril 21, 2026

Get the full intelligence

Search transcripts, export clips, track mentions, and explore all topics from “It's 2026, and We're Still Talking Evals” inside PodZeus.

AI-Generated Summary

In this episode of MLOps.community, the host and guest dive deep into the evolving challenges of evaluating AI agents in 2026, emphasizing that evaluation (evals) is not a one-time testing phase but a continuous, foundational practice embedded from the very beginning of product development. The discussion highlights how traditional software testing falls short for LLMs due to their non-deterministic behavior, context drift, and emergent failure modes—such as agents losing track after repeated success or failing in unexpected ways when users ask open-ended questions like 'surprise me.' The speakers stress that effective evals require simulating diverse user personas, designing intentional failure scenarios, and conducting thorough error analysis to uncover hidden pain points. They critique over-reliance on generic metrics like accuracy and advocate for business-aligned, context-specific evaluation frameworks tied to real-world outcomes like conversion and user satisfaction. The episode also explores the limitations of current eval tooling, with the host expressing frustration over tools that are slow, inflexible, and poorly designed for multi-turn analysis, ultimately favoring custom-built solutions that enable faster iteration and deeper insight. A recurring theme is that evals should be a team-wide, iterative discipline—not a chore—because they directly impact product quality and user trust.

Key Takeaways
1

Start evaluations from the very first idea of a product, not after shipping.

2

Use simulated user personas and real-world failure modes to stress-test agents before and after deployment.

3

Avoid over-reliance on accuracy; instead, focus on business-aligned metrics like conversion, satisfaction, and error patterns.

4

Error analysis is critical but often skipped—yet it’s where the most valuable insights emerge.

5

Custom evaluation pipelines beat off-the-shelf tools when it comes to flexibility, speed, and alignment with team goals.

…and 2 more takeaways available in PodZeus

Chapters
0:00
10 min

Evals Are Not a Post-Shipment Afterthought

Evals should be constant within the development team. Evals start the moment the idea of the product starts.

Highlight
10:00
10 min

The Problem with Accuracy and Generic Metrics

My agent is accurate 95% of time. What does it mean? Do you know? Yeah, I have no idea.

Highlight
20:00
10 min

Simulating Real Users: Personas and Failure Modes

You design a product. So as a person that designed a product and built it, you are actually having in mind what kind of scenarios this person would take.

Highlight
30:00
10 min

The Tooling Gap: Why Off-the-Shelf Evaluators Fall Short

The host critiques current eval tooling for being slow, inflexible, and poorly designed for multi-turn conversations. They express frustration with tools that require manual data exports, lack sampling, and don’t support custom evaluator creation.

40:00
10 min

Building Custom Evaluators: The Path to Better Insights

The episode concludes with a strong advocacy for custom-built evaluation pipelines. The guest reveals they’ve built an internal tool to handle data sampling, labeling, and evaluation training—proving that off-the-shelf solutions often fail to meet real team needs.

High-Impact Quotes
Evals should be constant within the development team. Evals start the moment the idea of the product starts.
Maggie2:04
Viral: 85.0
My agent is accurate 95% of time. What does it mean? Do you know? Yeah, I have no idea.
Maggie6:24
Viral: 80.0
Why would I leave the labeling part first? That's the greatest opportunity for you to learn what's happening.
Host38:52
Viral: 78.0
Speakers

Host

Host

Guest

Maggie
Topics Discussed
custom evaluation pipelines92%evaluation lifecycle90%failure mode analysis88%LLM non-determinism86%user persona simulation85%error analysis and iteration84%evaluator tooling limitations82%business-aligned metrics80%
People & Brands

Host

person

15xPositive

Maggie

person

12xPositive

A-B testing

other

2xPositive

Tiago

person

2xNeutral

Arise

organization

2xNeutral

Phoenix

organization

1xNeutral

Meta

organization

1xNeutral

DSPY

organization

1xPositive

Excel coding

other

1xNegative

Datadog

organization

1xNeutral

Get the full intelligence

Search transcripts, export clips, track mentions, and explore all topics from “It's 2026, and We're Still Talking Evals” inside PodZeus.

Start discovering podcast insights today

Start with a 7-day trial and explore a growing catalog of popular podcasts. No credit card required.

No credit card required • 7-day trial • Cancel anytime