LLMs Replacing SREs ❓, Cutting AWS Compute Costs 💰, Slack's Anomaly Event Response 🚑

TLDR DevOps <dan@tldrnewsletter.com>

September 5, 11:09 am

TLDR

Together With IBM

TLDR DevOps 2025-09-05

IBM TechXchange 2025 – DevOps IRL (Sponsor)

This is your space to build, break, and level up with real tools.
💻 Get hands-on with IaC, CI/CD, GitOps, and open source automation.
🛠️ Learn from real builders in live coding labs.
🎓 Earn certs. Meet peers. Leave with deployable skills you can use on Monday.
🌐 Choose from Full, Single-Day, or even Free Pass options to customize your experience.

Explore the Dev experience →
See all pass types →
Register now →
📱

News & Trends

Kubernetes v1.34: Service Account Token Integration for Image Pulls Graduates to Beta (3 minute read)

Service Account Token Integration for Kubelet Credential Providers has moved to beta in Kubernetes v1.34, bringing the community closer to eliminating long-lived image pull secrets and bolstering container image security. The beta introduces a required cacheType field, and administrators can now revoke access to previously pulled images by deleting and recreating the ServiceAccount.

What's new in the Infinity data source for Grafana: support for JQ parser, additional HTTP methods, and more (7 minute read)

Updates to Grafana's Infinity data source now include support for JQ parsing and customization for OAuth2.0 client credentials, expanding its flexibility when interacting with APIs. A Server-Side Request Forgery (SSRF) vulnerability (CVE-2025-8341) discovered in older versions has been fixed in version 3.4.1, and users are encouraged to upgrade.

🚀

Opinions & Tutorials

Can LLMs replace on call SREs today? (21 minute read)

AI models are not yet capable of fully replacing Site Reliability Engineers for autonomous root cause analysis, as even advanced systems like Claude Sonnet 4, OpenAI o3, and GPT-5 frequently require guidance and sometimes fail to identify issues. Their best use today is assisting engineers by summarizing logs, drafting RCA reports, and suggesting investigation steps within a fast observability stack.

How I Cut AWS Compute Costs by 70% with a Multi-Arch EKS Cluster and Karpenter (5 minute read)

This developer reduced AWS compute costs by 70% by replacing the Kubernetes Cluster Autoscaler with Karpenter, leveraging spot and Graviton instances in a multi-architecture EKS cluster. The setup improved pod scheduling latency from minutes to under 20 seconds, increased CPU utilization efficiency, and enabled seamless AMD64 and ARM64 workload support with a modular Terraform and CI/CD pipeline.

Look Out For Bugs (3 minute read)

A key shift in programming practice comes from moving beyond rapid iteration and bug-fixing toward proactively preventing bugs by writing cleaner code and carefully reading existing implementations. Slow, deliberate reading (focusing on control flow, state, and error-prone patterns) can reveal subtle issues and strengthen mental models, making bug discovery and prevention feel like a superpower.

🧑‍💻

Resources & Tools

Brittle IT and security is disrupting mission-critical work for most organizations (Sponsor)

64% of organizations experienced disruptions to their most essential workflows in 2024, according to new research from Mattermost and Ponemon Institute. Cyberattacks caused half these failures, yet only 47% of IT teams feel confident about their risk profiles. Read the full report →

RunsOn (GitHub Repo)

RunsOn is a simplified solution for self-hosted GitHub Actions runners on AWS that offers 10x cost reduction, 30% faster speeds, and unlimited caching. It's positioned as a superior alternative to Actions Runner Controller, enabling fully self-hosted runners within a user's AWS account and available in 10 AWS regions.

Faster Rust builds on Mac (5 minute read)

On macOS, Rust build scripts and test binaries often run much slower because each executable is scanned by the XProtect antivirus service, which serially checks for malware. By designating Terminal as a “developer tool” in System Settings, developers can bypass these checks, trading some security for speed, and see build and test times drop dramatically, with benefits extending to other compiled languages as well.

🎁

Miscellaneous

When Fast Flow Delivers A Real Blow: A PIR (7 minute read)

A minor framework patch on July 18 triggered uncontrolled RabbitMQ queue creation, exhausting memory and causing a full platform outage at Uptime Labs. The team resolved the issue by rolling back the patch, rebuilding the broker, and implementing stronger monitoring, testing, and alerting practices while reaffirming their commitment to fast, frequent delivery with resilience built in.

Troubleshooting network connectivity and performance with Cloudflare AI (7 minute read)

Cloudflare announced two new AI-powered tools in Cloudflare One to simplify troubleshooting of device and network performance issues: the WARP diagnostic analyzer and an MCP server for Digital Experience Monitoring. The WARP analyzer uses AI to interpret diagnostic logs for faster root cause analysis, while the DEX MCP server allows admins to query device performance data in natural language and receive actionable insights without building custom analytics pipelines.

⚡

Quick Links

How Space Force and USAF run mission-critical ops (Sponsor)

See how Mattermost Enterprise Advanced handles classified data spillage, post-quantum defense, and DoD Zero Trust. Built for environments where failure isn't an option. Watch the demo

Ansible Register: How to Store and Reuse Task Output (15 minute read)

Ansible streamlines infrastructure and application management by automating tasks, using playbooks to define systems consistently at scale.

Building Slack's Anomaly Event Response (7 minute read)

Slack's Anomaly Event Response (AER) is a proactive defense mechanism that autonomously identifies and terminates suspicious user sessions in minutes.

Addressing the unauthorized issuance of multiple TLS certificates for 1.1.1.1 (14 minute read)

Unauthorized certificates for Cloudflare's 1.1.1.1 public DNS resolver were issued by Fina CA between February 2024 and August 2025, but have since been revoked.

Love TLDR? Tell your friends and get rewards!

Share your referral link below with friends to get free TLDR swag!
Track your referrals here.

Want to advertise in TLDR? 📰

If your company is interested in reaching an audience of DevOps professionals and decision makers, you may want to advertise with us.

Want to work at TLDR? 💼

Apply here or send a friend's resume to jobs@tldr.tech and get $1k if we hire them!

If you have any comments or feedback, just respond to this email!

Thanks for reading,
Kunal Desai & Martin Hauskrecht

