Can LLMs replace on-call SREs today? (21 minute read)
AI models are not yet capable of fully replacing Site Reliability Engineers for autonomous root cause analysis, as even advanced systems like Claude Sonnet 4, OpenAI o3, and GPT-5 frequently require guidance and sometimes fail to identify issues. Their best use today is assisting engineers by summarizing logs, drafting RCA reports, and suggesting investigation steps within a fast observability stack.
|
How I Cut AWS Compute Costs by 70% with a Multi-Arch EKS Cluster and Karpenter (5 minute read)
A developer reduced AWS compute costs by 70% by replacing the Kubernetes Cluster Autoscaler with Karpenter and running spot and Graviton instances in a multi-architecture EKS cluster. The setup cut pod scheduling latency from minutes to under 20 seconds, raised CPU utilization, and supports both AMD64 and ARM64 workloads through a modular Terraform and CI/CD pipeline.
|
Look Out For Bugs (3 minute read)
A key shift in programming practice comes from moving beyond rapid iteration and bug-fixing toward proactively preventing bugs by writing cleaner code and carefully reading existing implementations. Slow, deliberate reading (focusing on control flow, state, and error-prone patterns) can reveal subtle issues and strengthen mental models, making bug discovery and prevention feel like a superpower.
|
RunsOn (GitHub Repo)
RunsOn is a simplified solution for self-hosted GitHub Actions runners on AWS that offers 10x cost reduction, 30% faster speeds, and unlimited caching. It's positioned as a superior alternative to Actions Runner Controller, enabling fully self-hosted runners within a user's AWS account and available in 10 AWS regions.
|
Faster Rust builds on Mac (5 minute read)
On macOS, Rust build scripts and test binaries often run much slower because each executable is scanned by the XProtect antivirus service, which serially checks for malware. By designating Terminal as a “developer tool” in System Settings, developers can bypass these checks, trading some security for speed, and see build and test times drop dramatically, with benefits extending to other compiled languages as well.
|
When Fast Flow Delivers A Real Blow: A PIR (7 minute read)
A minor framework patch on July 18 triggered uncontrolled RabbitMQ queue creation, exhausting memory and causing a full platform outage at Uptime Labs. The team resolved the issue by rolling back the patch, rebuilding the broker, and implementing stronger monitoring, testing, and alerting practices while reaffirming their commitment to fast, frequent delivery with resilience built in.
|
Troubleshooting network connectivity and performance with Cloudflare AI (7 minute read)
Cloudflare announced two new AI-powered tools in Cloudflare One to simplify troubleshooting of device and network performance issues: the WARP diagnostic analyzer and an MCP server for Digital Experience Monitoring. The WARP analyzer uses AI to interpret diagnostic logs for faster root cause analysis, while the DEX MCP server allows admins to query device performance data in natural language and receive actionable insights without building custom analytics pipelines.
|
Want to advertise in TLDR? 💰
If your company is interested in reaching an audience of DevOps professionals and decision makers, you may want to advertise with us.
Want to work at TLDR? 💼
Apply here or send a friend's resume to jobs@tldr.tech and get $1k if we hire them!
If you have any comments or feedback, just respond to this email!
Thanks for reading,
Kunal Desai & Martin Hauskrecht