Kubernetes v1.34: Moving Volume Group Snapshots to v1beta2 (2 minute read)
Kubernetes v1.34 introduces a second beta (v1beta2) of volume group snapshots, a feature that went alpha in v1.27 and beta in v1.32 and enables crash-consistent snapshots across a group of volumes provisioned by CSI drivers. v1beta2 adds a VolumeSnapshotInfo struct, replacing VolumeSnapshotHandlePairList, to fix an issue where the restoreSize field was left unset on VolumeSnapshotContents and VolumeSnapshots when the CSI driver didn't implement the ListSnapshots RPC call. Depending on feedback and adoption, the Kubernetes project plans to move the volume group snapshot implementation to general availability (GA) in a future release.
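
As a rough illustration, here is a minimal sketch of checking that restoreSize is now populated, using the Kubernetes Python client to read VolumeGroupSnapshotContent objects. The group/version and field names (groupsnapshot.storage.k8s.io/v1beta2, status.volumeSnapshotInfoList) are assumptions inferred from the change described above; verify them against the external-snapshotter CRDs in your cluster.

```python
# Hedged sketch: inspect per-snapshot restoreSize in v1beta2 group snapshot contents.
# The group/version and field names below are assumptions based on the summary above
# and may differ from the CRDs installed in your cluster.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

contents = api.list_cluster_custom_object(
    group="groupsnapshot.storage.k8s.io",
    version="v1beta2",
    plural="volumegroupsnapshotcontents",
)

for item in contents.get("items", []):
    name = item["metadata"]["name"]
    for info in item.get("status", {}).get("volumeSnapshotInfoList", []):
        # restoreSize should now be populated even when the CSI driver
        # does not implement the ListSnapshots RPC.
        print(name, info.get("snapshotHandle"), info.get("restoreSize"))
```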
|
Distributed performance testing for Kubernetes environments: Grafana k6 Operator 1.0 is here (9 minute read)
Grafana's k6 Operator, a Kubernetes operator for running distributed k6 tests, has reached its 1.0 release. The release brings bug fixes, improved Helm chart configuration, and a more predictable release process: a new minor version every eight weeks, following Semantic Versioning 2.0. The operator simplifies spreading k6 tests across multiple machines, enables synchronized testing inside private networks, and integrates with Grafana Cloud k6 (see the sketch below for launching a test run).
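
For a sense of how a distributed run is kicked off, here is a hedged sketch that creates a TestRun custom resource with the Kubernetes Python client. The apiVersion/kind and field names (k6.io/v1alpha1, TestRun, spec.parallelism, spec.script.configMap) reflect the operator's documented CRD but should be treated as assumptions to check against the 1.0 release.

```python
# Hedged sketch: launch a distributed k6 test via the operator's TestRun CRD.
# The apiVersion/kind and field names are assumptions; adjust to match the
# CRDs shipped with k6 Operator 1.0. Assumes a "checkout-script" ConfigMap
# containing test.js already exists in the "k6" namespace.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

test_run = {
    "apiVersion": "k6.io/v1alpha1",
    "kind": "TestRun",
    "metadata": {"name": "checkout-load-test", "namespace": "k6"},
    "spec": {
        "parallelism": 4,  # split the test across 4 runner pods
        "script": {"configMap": {"name": "checkout-script", "file": "test.js"}},
    },
}

api.create_namespaced_custom_object(
    group="k6.io",
    version="v1alpha1",
    namespace="k6",
    plural="testruns",
    body=test_run,
)
```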
|
|
Enterprise AKS Multi-Instance GPU (MIG) vLLM Deployment Guide (10 minute read)
This guide walks through deploying vLLM on Azure Kubernetes Service with NVIDIA H100 GPUs using Multi-Instance GPU (MIG) technology, which partitions a single GPU so multiple AI models can run simultaneously with hardware isolation. The author reports roughly 50% cost savings versus dedicating whole GPUs per model, and the setup covers production-grade management, security controls, and integration with Azure API Management for hybrid AI infrastructure.
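
Since vLLM exposes an OpenAI-compatible API, a minimal smoke test against a deployed MIG-backed service might look like the sketch below. The base_url and model name are placeholders, not values from the guide.

```python
# Hedged sketch: smoke-test a vLLM deployment behind a Kubernetes Service.
# vLLM serves an OpenAI-compatible API; the base_url and model ID below are
# placeholders for whatever the guide actually deploys on AKS.
from openai import OpenAI

client = OpenAI(
    base_url="http://vllm-mig.example.internal/v1",  # placeholder service endpoint
    api_key="not-needed",                            # vLLM ignores the key by default
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",        # placeholder model ID
    messages=[{"role": "user", "content": "Say hello from a MIG slice."}],
    max_tokens=32,
)
print(response.choices[0].message.content)
```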
|
|
Cachey (GitHub Repo)
Cachey is a high-performance read-through cache for object storage, mapping requests to 16 MiB page-aligned ranges and using standard HTTP semantics. Throughput stats are provided as JSON via GET /stats, while a comprehensive set of metrics in Prometheus text format is available via GET /metrics.
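
A quick way to poke the two documented endpoints is sketched below; only GET /stats (JSON) and GET /metrics (Prometheus text format) come from the project description, and the host/port is a placeholder for wherever your Cachey instance listens.

```python
# Hedged sketch: read Cachey's documented endpoints. Only /stats and /metrics
# are taken from the project description; the address is a placeholder.
import requests

BASE = "http://localhost:8080"  # placeholder address for a running Cachey instance

stats = requests.get(f"{BASE}/stats", timeout=5).json()
print("throughput stats:", stats)

metrics = requests.get(f"{BASE}/metrics", timeout=5).text
print("\n".join(metrics.splitlines()[:10]))  # first few Prometheus metric lines
```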
|
Task (GitHub Repo)
Task is a task runner/build tool that aims to be simpler and easier to use than alternatives like make.
|
|
P50 vs P95 vs P99 Latency: What These Percentiles Actually Mean (And How to Use Them) (4 minute read)
Use latency percentiles like P50, P95, and P99, rather than averages, to understand user experience and set SLOs. Collect them with histograms, which preserve the shape of the latency distribution and expose the systemic friction hiding in the tail. Taming that tail usually requires architectural changes such as pre-warming, partitioning, caching layers, concurrency isolation, and adaptive retries.
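
To make the percentile idea concrete, here is a small sketch that estimates P50/P95/P99 from a cumulative latency histogram (the same shape Prometheus-style histograms use). The bucket bounds and counts are illustrative, not from the article.

```python
# Sketch: estimate P50/P95/P99 from a cumulative latency histogram.
# Bucket bounds (ms) and counts are made up for illustration.
import bisect

bounds = [5, 10, 25, 50, 100, 250, 500, 1000]          # bucket upper bounds in ms
cumulative = [120, 480, 1900, 3800, 4700, 4950, 4990, 5000]  # requests at or below each bound
total = cumulative[-1]

def percentile(p):
    """Return the bucket upper bound that covers the p-th percentile."""
    target = total * p / 100
    idx = bisect.bisect_left(cumulative, target)
    return bounds[idx]

for p in (50, 95, 99):
    print(f"P{p} <= {percentile(p)} ms")
```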
|
Monitor your LiteLLM AI proxy with Datadog (7 minute read)
Datadog has released an Agent integration and LLM Observability SDK support for LiteLLM, letting teams monitor, troubleshoot, and optimize LLM-powered applications. The LLM Observability SDK traces every request end to end for insight into model and provider performance, while the Agent integration monitors the LiteLLM proxy service itself, tracking metrics like request volume and error rates. Together they provide full-stack observability across LLM workflows.
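
A hedged sketch of wiring LiteLLM calls into Datadog's LLM Observability SDK is shown below. LLMObs.enable() and litellm.completion() are real entry points, but the exact parameters, required environment variables (e.g. DD_API_KEY), and the LiteLLM callback name are assumptions to verify against the current Datadog and LiteLLM docs.

```python
# Hedged sketch: trace LiteLLM calls with Datadog's LLM Observability SDK.
# Parameters and the callback name below are assumptions; check current docs.
import litellm
from ddtrace.llmobs import LLMObs

LLMObs.enable(ml_app="litellm-demo")  # assumes DD_API_KEY is set in the environment

# Optionally have LiteLLM emit success callbacks to Datadog as well (assumed name).
litellm.success_callback = ["datadog"]

response = litellm.completion(
    model="gpt-4o-mini",  # any provider/model that LiteLLM routes for you
    messages=[{"role": "user", "content": "Summarize today's error-rate spike."}],
)
print(response.choices[0].message.content)
```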
|
|
Love TLDR? Tell your friends and get rewards!
|
Share your referral link below with friends to get free TLDR swag!
|
|
Track your referrals here.
|
Want to advertise in TLDR? 📰
If your company is interested in reaching an audience of devops professionals and decision makers, you may want to advertise with us.
Want to work at TLDR? 💼
Apply here or send a friend's resume to jobs@tldr.tech and get $1k if we hire them!
If you have any comments or feedback, just respond to this email!
Thanks for reading,
Kunal Desai & Martin Hauskrecht