We Cut LLM Latency by 70% in Production

MLOps.community • 1h 5m • April 10, 2026

AI-Generated Summary

In this episode of MLOps.community, a senior engineering leader shares their journey of building and optimizing AI in production at an HR tech company, achieving up to 70% latency reduction through infrastructure and model optimization. Starting without an AI background, they embraced a 'learning alongside the team' mindset, which led to a 'flywheel framework' for AI adoption that balances business impact with technical feasibility. Key technical wins include using TensorRT-LLM for model optimization, implementing scheduled and dynamic GPU scaling based on usage patterns, and baking models into container images to remove the model download step from cold starts. The speaker emphasizes focusing on a 'sweet spot' in the AI pyramid (balancing cost, performance, latency, and throughput) rather than optimizing all metrics simultaneously. They also discuss the evolution of AI adoption within their engineering team, from skepticism to near-universal use, and the cultural shift toward responsible AI with guardrails, human-in-the-loop workflows, and a self-hosted AI platform that supports both product features and internal agentic engineering. The episode concludes with reflections on the future of self-hosted coding agents and the role of transparency, data governance, and customer trust in scaling AI responsibly.

Key Takeaways

1. Use TensorRT-LLM to optimize models for specific GPU architectures, achieving up to 70% latency reduction.

2. Implement scheduled and dynamic GPU scaling based on usage patterns to reduce costs by up to 50%.

3. Bake models into container images to eliminate model download delays and drastically reduce cold start times.

4. Focus on a 'sweet spot' in the AI pyramid: optimize for your use case (e.g., cost and accuracy over raw speed) rather than all metrics at once.

5. Build AI as a platform, not a product: abstract guardrails, design patterns, and infrastructure to enable reuse across features.
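The combination of scheduled and dynamic GPU scaling in the second takeaway can be sketched roughly as follows. This is a minimal illustration, not the guest's implementation: the schedule hours, replica counts, the `capacity_per_replica` parameter, and the function names are all hypothetical. The idea is that a time-of-day schedule sets a floor on replicas, and observed load can raise the count above that floor up to a cap.

```python
import math
from datetime import datetime

# Hypothetical schedule: (start_hour, end_hour, baseline_replicas).
# The hours and counts are illustrative, not figures from the episode.
SCHEDULE = [
    (0, 7, 1),    # overnight: keep a single warm replica
    (7, 19, 4),   # business hours: higher baseline
    (19, 24, 2),  # evening: taper down
]


def scheduled_baseline(now: datetime) -> int:
    """Return the scheduled minimum replica count for the current hour."""
    for start, end, replicas in SCHEDULE:
        if start <= now.hour < end:
            return replicas
    return 1  # fallback if the schedule has a gap


def desired_replicas(now: datetime, requests_per_sec: float,
                     capacity_per_replica: float, max_replicas: int = 8) -> int:
    """Combine the scheduled floor with a load-based (dynamic) estimate."""
    if capacity_per_replica <= 0:
        raise ValueError("capacity_per_replica must be positive")
    # Dynamic part: replicas needed to serve the observed request rate.
    dynamic = math.ceil(requests_per_sec / capacity_per_replica)
    # Never drop below the schedule, never exceed the cap.
    return min(max(scheduled_baseline(now), dynamic), max_replicas)


if __name__ == "__main__":
    mid_morning = datetime(2026, 4, 10, 10, 0)
    print(desired_replicas(mid_morning, requests_per_sec=30, capacity_per_replica=5))
```

In a real deployment the output of a function like this would feed whatever autoscaling mechanism the serving platform provides; the episode page does not say which mechanism was used.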

…and 3 more takeaways available in PodZeus

Chapters

0:00 (10 min): From Non-AI Leader to AI Champion

The guest shares their personal journey from being unfamiliar with AI to leading AI initiatives in the enterprise, driven by market shifts and a need to learn alongside their team. They emphasize humility, curiosity, and the importance of adapting to the AI revolution as a senior leader.

10:00 (10 min): The AI Iceberg: Beyond the Hype

Highlight: “Behind it, if you want to build it internally and you know if you manage it at scale, all of these invisible things that you know performance, latency, throughput, accuracy, quality of the responses and cost.”

20:00 (10 min): Cutting Latency by 70% with TensorRT-LLM

Highlight: “We saved at least 50% latency. That was like a, again, that's why I told you about the throughput, latency, cost and performance and accuracy, like we have been saving cost and saving on latency and saving on cold starts.”
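A latency reduction figure depends on which statistic is compared. The sketch below computes the percentage reduction at the median (p50) and at the tail (p95) from two sets of request timings. The sample values are invented purely for illustration and are not measurements from the episode, and the page does not say which statistic the 50% and 70% figures refer to.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def reduction_pct(before, after):
    """Percentage reduction going from `before` to `after`."""
    return 100.0 * (before - after) / before


# Hypothetical request latencies in milliseconds (made up for this example).
BEFORE_MS = [100, 120, 110, 400, 130, 105, 115, 125, 500, 140]
AFTER_MS = [50, 60, 55, 120, 65, 52, 58, 62, 150, 70]

if __name__ == "__main__":
    for pct in (50, 95):
        before = percentile(BEFORE_MS, pct)
        after = percentile(AFTER_MS, pct)
        print(f"p{pct}: {before} ms -> {after} ms "
              f"({reduction_pct(before, after):.0f}% reduction)")
```

With these invented samples the median and tail reductions differ, which is why it helps to state the percentile alongside any headline number.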

30:00 (10 min): Scaling Smart: From Scheduled to Dynamic

Highlight: “We saved minutes of the cold start to just spin up new... But the GPU itself takes a while right yeah the gpu means that time we cannot do much about it but the downloading part was the was the triggering yeah”
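The cold start described in the quote can be thought of as a sum of phases, of which only the model download is removed by baking the model into the container image. The sketch below makes that arithmetic explicit. The phase names, the `ColdStart` class, and every duration are hypothetical, chosen only to illustrate the shape of the saving. It also assumes the larger image adds some pull time, which may or may not apply depending on image caching.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ColdStart:
    """Hypothetical breakdown of a cold start, in seconds."""
    gpu_provision_s: float   # time for the GPU instance to become available
    image_pull_s: float      # time to pull the container image
    model_download_s: float  # time to fetch model weights at startup
    model_load_s: float      # time to load weights into GPU memory

    def total_s(self) -> float:
        return (self.gpu_provision_s + self.image_pull_s
                + self.model_download_s + self.model_load_s)


def bake_model_into_image(cs: ColdStart, extra_image_pull_s: float) -> ColdStart:
    """Model ships inside the image: no startup download, but a larger image to pull."""
    return replace(
        cs,
        image_pull_s=cs.image_pull_s + extra_image_pull_s,
        model_download_s=0.0,
    )


if __name__ == "__main__":
    # All numbers below are invented for illustration.
    baseline = ColdStart(gpu_provision_s=90, image_pull_s=30,
                         model_download_s=240, model_load_s=45)
    baked = bake_model_into_image(baseline, extra_image_pull_s=20)
    print(f"baseline: {baseline.total_s():.0f} s, baked: {baked.total_s():.0f} s, "
          f"saved: {baseline.total_s() - baked.total_s():.0f} s")
```

As the guest notes, GPU provisioning time is unchanged in this picture; only the download phase is eliminated.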

40:00 (10 min): The Flywheel Framework: From Planning to Optimization

Highlight: “Start there and keep it small in terms of scope and execution so you don't want to spin up these huge projects and lose control in the middle.”

High-Impact Quotes

“The missing piece in AI sometimes is that and I talk about this a lot where it's all about the data at the end of the day. The data which we forget to mention, talk about the most is the data that was used to train the foundation model.”

Guest • 63:21 • Viral: 90.0

“Use AI to save time, but spend 30-40% of that time reviewing output.”

Guest • 52:30 • Viral: 85.0

“Behind it, if you want to build it internally and you know if you manage it at scale, all of these invisible things that you know performance, latency, throughput, accuracy, quality of the responses and cost.”

Guest • 3:20 • Viral: 85.0

Speakers

Host: MLOps.community

Guest: Engineering Leader at HR Tech Company

Topics Discussed

AI Latency Optimization: 95%
GPU Scaling and Cost Management: 90%
Responsible AI and Trust: 85%
Self-Hosted AI Infrastructure: 85%
AI Engineering Culture: 80%
AI Platform Strategy: 80%
Agentic Engineering: 75%
AI ROI and Business Impact: 70%

People & Brands

HR Tech • other • 10x • Positive
AWS • organization • 8x • Neutral
NVIDIA • organization • 6x • Positive
TensorRT-LLM • product • 5x • Positive
OpenAI • organization • 5x • Negative
AI Engineering Lab • other • 5x • Positive
GitHub Copilot • product • 4x • Mixed
Flywheel Framework • other • 4x • Positive
SageMaker • product • 3x • Positive
MLOps • other • 3x • Positive

Related Episodes

Spec Driven Development, Workflows, and the Recent Coding Agent Conference

MLOps.community • 59m • 3/31/2026

Fixing GPU Starvation in Large-Scale Distributed Training

MLOps.community • 52m • 4/3/2026

Getting Humans Out of the Way: How to Work with Teams of Agents

MLOps.community • 50m • 4/7/2026

The Modern Software Engineer

MLOps.community • 53m • 4/14/2026

Why Agents are Driving Software Development to the Cloud

MLOps.community • 51m • 4/17/2026
