We Cut LLM Latency by 70% in Production
In this episode of MLOps.community, a senior engineering leader shares their journey of building and optimizing AI in production at an HR tech company, achieving up to 70% latency reduction through strategic infrastructure and model optimization. Starting as a non-AI expert, they embraced a 'learning alongside the team' mindset, leading to the development of a 'flywheel framework' for AI adoption that balances business impact with technical feasibility.

Key technical wins include using TensorRT LLM for model optimization, implementing scheduled and dynamic GPU scaling based on usage patterns, and baking models into container images to eliminate cold start delays. The speaker emphasizes the importance of focusing on a 'sweet spot' in the AI pyramid (balancing cost, performance, latency, and throughput) rather than optimizing all metrics simultaneously.

They also discuss the evolution of AI adoption within their engineering team, from skepticism to near-universal use, and the cultural shift toward responsible AI with guardrails, human-in-the-loop workflows, and a self-hosted AI platform that supports both product and internal agentic engineering. The episode concludes with reflections on the future of self-hosted coding agents and the critical role of transparency, data governance, and customer trust in scaling AI responsibly.
Use TensorRT LLM to optimize models for specific GPU architectures, achieving up to 70% latency reduction.
Implement scheduled and dynamic GPU scaling based on usage patterns to reduce costs by up to 50%.
Bake models into container images to eliminate model download delays and drastically reduce cold start times.
Focus on a 'sweet spot' in the AI pyramid—optimize for your use case (e.g., cost and accuracy over raw speed) rather than all metrics at once.
Build AI as a platform, not a product: abstract guardrails, design patterns, and infrastructure to enable reuse across features.
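The scaling takeaway above combines a scheduled baseline with a dynamic adjustment. A minimal Python sketch of that idea follows. The episode summary does not describe the actual policy, so every name and number here (`ScalingPolicy`, `desired_replicas`, the peak hours, the replica counts) is a hypothetical assumption chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    # All values are illustrative assumptions, not figures from the episode.
    business_hours: range = range(8, 18)  # hours treated as peak usage
    peak_baseline: int = 4                # scheduled replicas during peak hours
    offpeak_baseline: int = 1             # scheduled replicas outside peak hours
    requests_per_replica: int = 10        # in-flight requests one replica can serve
    max_replicas: int = 8                 # hard cap that bounds GPU spend


def desired_replicas(hour: int, in_flight: int,
                     policy: ScalingPolicy = ScalingPolicy()) -> int:
    """Return the replica count: scheduled floor, raised by live demand, capped."""
    if hour in policy.business_hours:
        scheduled = policy.peak_baseline
    else:
        scheduled = policy.offpeak_baseline
    # Dynamic part: enough replicas to cover the current in-flight requests.
    dynamic = math.ceil(in_flight / policy.requests_per_replica)
    return min(policy.max_replicas, max(scheduled, dynamic))


if __name__ == "__main__":
    print(desired_replicas(hour=10, in_flight=5))    # peak hours, light load
    print(desired_replicas(hour=3, in_flight=35))    # off-peak burst
    print(desired_replicas(hour=10, in_flight=500))  # demand above the cap
```

The scheduled floor keeps GPUs warm ahead of predictable usage, while the dynamic term only adds capacity when traffic exceeds what the schedule anticipated. The cap is what keeps the cost side of the trade-off bounded.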
From Non-AI Leader to AI Champion
The guest shares their personal journey from being unfamiliar with AI to leading AI initiatives in enterprise, driven by market shifts and a need to learn alongside their team. They emphasize humility, curiosity, and the importance of adapting to the AI revolution as a senior leader.
The AI Iceberg: Beyond the Hype
“Behind it, if you want to build it internally and you know if you manage it at scale, all of these invisible things that you know performance, latency, throughput, accuracy, quality of the responses and cost.”
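One way to read the 'sweet spot' idea is as a constrained choice across the dimensions named in the quote: set limits on the metrics that matter for the use case, then minimize the remaining one. The Python sketch below assumes that reading. The `Candidate` type, the `pick_sweet_spot` helper, and all metric values are hypothetical and do not come from the episode.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    p95_latency_ms: float
    accuracy: float              # fraction of acceptable responses, 0..1
    cost_per_1k_requests: float  # in any consistent currency unit


def pick_sweet_spot(candidates: Sequence[Candidate],
                    max_latency_ms: float,
                    min_accuracy: float) -> Optional[Candidate]:
    """Return the cheapest candidate that meets both limits, or None."""
    feasible = [
        c for c in candidates
        if c.p95_latency_ms <= max_latency_ms and c.accuracy >= min_accuracy
    ]
    return min(feasible, key=lambda c: c.cost_per_1k_requests, default=None)


# Invented deployment options, for illustration only.
OPTIONS = [
    Candidate("large-model", 1200.0, 0.93, 4.00),
    Candidate("mid-model-optimized", 400.0, 0.90, 1.50),
    Candidate("small-model", 150.0, 0.80, 0.40),
]

if __name__ == "__main__":
    choice = pick_sweet_spot(OPTIONS, max_latency_ms=500.0, min_accuracy=0.88)
    print(choice.name if choice else "no option meets the limits")
```

With these invented numbers the fastest option fails the accuracy limit and the most accurate one fails the latency limit, so the middle option is selected even though it is best on neither metric.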
Cutting Latency by 70% with TensorRT LLM
“We saved at least 50% latency. That was like a, again, that's why I told you about the throughput, latency, cost and performance and accuracy, like we have been saving cost and saving on latency and saving on cold starts.”
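The headline cites up to 70% and the quote cites at least 50%, so it helps to be explicit about how such a figure is computed. The Python sketch below shows the arithmetic on median latencies. The sample values are invented so that the result lands on 70%; they are not measurements from the episode, and `percent_reduction` is a hypothetical helper name.

```python
import statistics
from typing import Sequence


def percent_reduction(before_ms: float, after_ms: float) -> float:
    """Relative latency reduction, as a percentage of the original latency."""
    if before_ms <= 0:
        raise ValueError("before_ms must be positive")
    return 100.0 * (before_ms - after_ms) / before_ms


def median_reduction(before: Sequence[float], after: Sequence[float]) -> float:
    """Compare median latencies of two sets of request timings."""
    return percent_reduction(statistics.median(before), statistics.median(after))


# Invented request timings in milliseconds, for illustration only.
BASELINE_MS = [1000, 1100, 900, 1000, 1000]
OPTIMIZED_MS = [300, 330, 270, 300, 300]

if __name__ == "__main__":
    print(median_reduction(BASELINE_MS, OPTIMIZED_MS))  # 70.0
```

Reporting which statistic was compared (median, p95, or worst case) matters, because the same optimization can yield quite different percentages depending on the percentile chosen.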
Scaling Smart: From Scheduled to Dynamic
“We saved minutes of the cold start to just spin up new... But the GPU itself takes a while, right? Yeah, the GPU means that time we cannot do much about it, but the downloading part was the triggering, yeah.”
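The quote separates the part of a cold start that cannot be shortened (GPU provisioning) from the part that can (downloading the model). A rough Python sketch of that breakdown follows. The phase names and durations are hypothetical, and it assumes that baking the weights into the image makes the image pull somewhat slower while removing the separate download entirely.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ColdStart:
    # Durations in seconds; all values are illustrative assumptions.
    gpu_provision_s: float   # waiting for a GPU instance; not reducible here
    image_pull_s: float      # pulling the container image
    model_download_s: float  # fetching weights at startup
    model_load_s: float      # loading weights into GPU memory

    def total_s(self) -> float:
        return (self.gpu_provision_s + self.image_pull_s
                + self.model_download_s + self.model_load_s)


# Weights downloaded at startup (hypothetical numbers).
DOWNLOAD_AT_START = ColdStart(gpu_provision_s=90, image_pull_s=30,
                              model_download_s=240, model_load_s=45)

# Weights baked into the image: no download, but a larger image to pull.
BAKED_IMAGE = replace(DOWNLOAD_AT_START, image_pull_s=60, model_download_s=0)

if __name__ == "__main__":
    saved = DOWNLOAD_AT_START.total_s() - BAKED_IMAGE.total_s()
    print(DOWNLOAD_AT_START.total_s(), BAKED_IMAGE.total_s(), saved)
```

Under these assumptions the GPU provisioning time is unchanged in both cases, which matches the point made in the quote: the saving comes from the download step alone.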
The Flywheel Framework: From Planning to Optimization
“Start there and keep it small in terms of scope and execution so you don't want to spin up these huge projects and lose control in the middle.”
“The missing piece in AI sometimes is that and I talk about this a lot where it's all about the data at the end of the day. The data which we forget to mention, talk about the most is the data that was used to train the foundation model.”
“Use AI to save time, but spend 30-40% of that time reviewing output.”
HR Tech (other)
AWS (organization)
NVIDIA (organization)
TensorRT LLM (product)
OpenAI (organization)
AI Engineering Lab (other)
GitHub Copilot (product)
Flywheel Framework (other)
SageMaker (product)
MLOps (other)
