It's 2026, and We're Still Talking Evals
In this episode of MLOps.community, the host and guest dive deep into the evolving challenges of evaluating AI agents in 2026. They emphasize that evaluation (evals) is not a one-time testing phase but a continuous, foundational practice embedded from the very beginning of product development. The discussion highlights how traditional software testing falls short for LLMs because of their non-deterministic behavior, context drift, and emergent failure modes, such as agents losing track after repeated success or failing in unexpected ways when users ask open-ended questions like 'surprise me.'
The speakers stress that effective evals require simulating diverse user personas, designing intentional failure scenarios, and conducting thorough error analysis to uncover hidden pain points. They critique over-reliance on generic metrics like accuracy and advocate for business-aligned, context-specific evaluation frameworks tied to real-world outcomes like conversion and user satisfaction.
The episode also explores the limitations of current eval tooling. The host expresses frustration with tools that are slow, inflexible, and poorly designed for multi-turn analysis, and ultimately favors custom-built solutions that enable faster iteration and deeper insight. A recurring theme is that evals should be a team-wide, iterative discipline rather than a chore, because they directly impact product quality and user trust.
Start evaluations from the very first idea of a product, not after shipping.
Use simulated user personas and real-world failure modes to stress-test agents before and after deployment.
Avoid over-reliance on accuracy; instead, focus on business-aligned metrics like conversion, satisfaction, and error patterns.
Error analysis is critical but often skipped—yet it’s where the most valuable insights emerge.
Custom evaluation pipelines beat off-the-shelf tools when it comes to flexibility, speed, and alignment with team goals.
Evals Are Not a Post-Shipment Afterthought
“Evals should be constant within the development team. Evals start the moment the idea of the product starts.”
The Problem with Accuracy and Generic Metrics
“My agent is accurate 95% of time. What does it mean? Do you know? Yeah, I have no idea.”
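To make the point about a bare accuracy figure concrete, here is a minimal Python sketch. It assumes each eval run is tagged with a segment; the segment names and the numbers are invented for illustration and do not come from the episode. It shows how a 95% overall figure can sit on top of a segment that fails half the time.

```python
from collections import defaultdict


def accuracy_by_segment(records):
    """Return (overall accuracy, accuracy per segment).

    Each record is a (segment, passed) pair. Both fields are hypothetical
    and stand in for whatever a team actually logs per eval run.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for segment, passed in records:
        totals[segment] += 1
        passes[segment] += int(passed)
    overall = sum(passes.values()) / sum(totals.values())
    per_segment = {s: passes[s] / totals[s] for s in totals}
    return overall, per_segment


# Made-up data: 95 of 100 runs pass overall.
records = (
    [("order_lookup", True)] * 90
    + [("open_ended", True)] * 5
    + [("open_ended", False)] * 5
)

overall, per_segment = accuracy_by_segment(records)
print(f"overall: {overall:.2f}")
for segment, acc in per_segment.items():
    print(f"{segment}: {acc:.2f}")
```

In this toy data the open-ended requests fail half the time, which the headline number alone would not reveal.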
Simulating Real Users: Personas and Failure Modes
“You design a product. So as a person that designed a product and built it, you are actually having in mind what kind of scenarios this person would take.”
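One way to turn personas and intentional failure scenarios into test cases is to cross them into a grid. The Python sketch below is an illustration under stated assumptions: the `Persona` and `Scenario` classes, their fields, and the example entries are hypothetical, apart from the 'surprise me' request mentioned in the episode summary.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Persona:
    name: str
    style: str  # how this simulated user tends to phrase requests


@dataclass(frozen=True)
class Scenario:
    name: str
    opening_message: str
    designed_to_fail: bool  # True for intentionally hard cases


def build_test_cases(personas, scenarios):
    """Cross every persona with every scenario into one flat list."""
    return [
        {
            "persona": p.name,
            "scenario": s.name,
            "prompt": f"[{p.style}] {s.opening_message}",
            "designed_to_fail": s.designed_to_fail,
        }
        for p, s in product(personas, scenarios)
    ]


personas = [
    Persona("hurried_shopper", "terse"),
    Persona("curious_browser", "chatty"),
]
scenarios = [
    Scenario("specific_request", "Find me running shoes in size 42.", False),
    Scenario("open_ended", "Surprise me.", True),
]

cases = build_test_cases(personas, scenarios)
for case in cases:
    print(case["persona"], "|", case["scenario"], "|", case["prompt"])
```

Each case would then be fed to the agent under test; how the agent is invoked and scored is left out because the episode summary does not describe it.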
The Tooling Gap: Why Off-the-Shelf Evaluators Fall Short
The host critiques current eval tooling for being slow, inflexible, and poorly designed for multi-turn conversations. They express frustration with tools that require manual data exports, lack sampling, and don’t support custom evaluator creation.
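As an illustration of the sampling step the host finds missing, the following Python sketch draws a reproducible random sample of whole multi-turn conversations for review. The data layout (a dict of conversation id to list of turns) and the function name are assumptions made for this example, not a description of any specific tool.

```python
import random


def sample_conversations(conversations, k, seed=0):
    """Pick up to k whole conversations uniformly at random.

    Conversations are sampled as units so that every turn of a chosen
    conversation stays together for multi-turn review. A fixed seed
    makes the sample reproducible across runs.
    """
    rng = random.Random(seed)
    ids = sorted(conversations)  # stable order before sampling
    chosen = rng.sample(ids, min(k, len(ids)))
    return {cid: conversations[cid] for cid in chosen}


# Hypothetical traces: conversation id -> ordered list of turns.
conversations = {
    "c1": ["user: hi", "agent: hello"],
    "c2": ["user: surprise me", "agent: ...", "user: not that"],
    "c3": ["user: track my order", "agent: which order?"],
    "c4": ["user: cancel", "agent: done"],
}

sample = sample_conversations(conversations, k=2, seed=42)
for cid, turns in sample.items():
    print(cid, len(turns), "turns")
```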
Building Custom Evaluators: The Path to Better Insights
The episode concludes with a strong advocacy for custom-built evaluation pipelines. The guest reveals they’ve built an internal tool to handle data sampling, labeling, and evaluation training—proving that off-the-shelf solutions often fail to meet real team needs.
“Why would I leave the labeling part first? That's the greatest opportunity for you to learn what's happening.”
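A small Python sketch of what the labeling step can feed into: once conversations have been hand-labeled, counting the failure labels shows which problems dominate. The label names and data here are hypothetical and only illustrate the idea of tallying error categories.

```python
from collections import Counter


def tally_failure_modes(labels):
    """Count failure labels, most frequent first, ignoring passing runs.

    `labels` is a list of (conversation_id, label) pairs, where the
    label "ok" marks a conversation with no observed failure.
    """
    counts = Counter(label for _, label in labels if label != "ok")
    return counts.most_common()


# Hypothetical hand-assigned labels from a review session.
labels = [
    ("c1", "ok"),
    ("c2", "lost_context"),
    ("c3", "lost_context"),
    ("c4", "ignored_constraint"),
    ("c5", "ok"),
]

failure_modes = tally_failure_modes(labels)
for mode, count in failure_modes:
    print(mode, count)
```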
Host
Guest
Host (person)
Maggie (person)
A/B testing (other)
Tiago (person)
Arize (organization)
Phoenix (organization)
Meta (organization)
DSPy (organization)
Excel coding (other)
Datadog (organization)
