AWS DynamoDB Outage ☁️, Grafana Mimir 🆕, AI Platform At Pinterest 🧷

TLDR DevOps <dan@tldrnewsletter.com>

November 5, 12:10 pm

Together With Octopus Deploy

TLDR DevOps 2025-11-05

🧱 Build-it-yourself platforms work great, until they don't. (Sponsor)

Every platform story starts the same way: a few scripts, shared templates, maybe a shiny UI. It feels fast, flexible, and totally under control.

Then the team grows. Environments multiply. Suddenly you're spending more time maintaining what you built than improving how software gets delivered.

Platform Hub from Octopus Deploy helps you escape that dead end. It scales what teams already do well—automating pipelines, enforcing policies, and standardizing delivery without killing speed.

Stop patching your homegrown platform. Start focusing on delivery.

Request a demo →

📱

News & Trends

AWS and OpenAI announce multi-year strategic partnership (1 minute read)

OpenAI will now run its advanced AI workloads on Amazon Web Services' infrastructure. The partnership is effective immediately.
Grafana Mimir 3.0 release: performance improvements, a new query engine, and more (7 minute read)

Grafana Mimir 3.0 has been released, featuring a decoupled architecture using Apache Kafka to improve reliability, performance, and cost efficiency. The new architecture separates read and write paths, allowing each to scale independently, and also includes the Mimir Query Engine, which reduces peak memory usage by up to 92%. Users should consult the upgrade guide and release notes before upgrading to the new version.
🚀

Opinions & Tutorials

AWS DynamoDB Outage Analysis (22 minute read)

Applying STPA to the DynamoDB DNS-management outage shows that although the root causes seem obvious in hindsight, a pre-incident analysis would have exposed the same issues—missing feedback between Planner and Enactors, timing gaps, the risk of deleting active plans, and failure to recover when no plan is active. The analysis demonstrates that STPA can uncover both known and latent failure modes efficiently, suggesting its regular use could have prevented the outage and should be part of standard reliability practice.
Don't give Postgres too much memory (4 minute read)

Benchmarking PostgreSQL's GIN index builds shows that raising maintenance_work_mem from 64 MB to 16 GB slowed performance by ~30%, even on a fully cached, CPU-bound system. The slowdown stems mainly from exceeding L3 cache capacity—forcing expensive main-memory access—and from kernel write stalls when large dirty buffers accumulate. Thus, smaller memory settings often yield faster, steadier performance.
How to Get Meaningful Feedback on Your Design Document (11 minute read)

A strong design review process helps teams catch flaws early, align on goals, and move projects forward efficiently. Key practices include writing clear, broadly understandable introductions, using collaborative tools for inline comments, creating editable diagrams, letting reviewers read asynchronously, starting with one focused reviewer, resolving feedback directly in the document, limiting unresolved threads, holding meetings only for contentious issues, and running postmortems to improve future reviews.
🧑‍💻

Resources & Tools

Logging Best Practices: Structured Logs, Frameworks, Filters, and Observability Platforms (Sponsor)

How do you debug cloud-native environments when one user action can trigger logs across dozens of microservices? This 71-page Manning ebook (sponsored by Chronosphere) shows you how to extract signal from noise, control log volumes, and handle PII/compliance requirements. Get your copy of Logging Best Practices
zeropod (GitHub Repo)

Zeropod, a Kubernetes runtime, automatically checkpoints containers to disk after a period of TCP connection inactivity, scaling down to zero and restoring the container on the next connection in milliseconds. While scaled down, Zeropod listens on the application's port and migrates pods between nodes to prevent resource spikes, with most programs working out-of-the-box.
Serena (GitHub Repo)

Serena, a free and open-source coding agent toolkit, combines semantic code retrieval with editing and shell execution via its MCP server and LSP-based language server integrations, and can be integrated with LLMs like Claude Code to save tokens and time. Serena can be further customized through Modes and Contexts, which allow users to tailor its behavior to their workflow and environment.
pg_lake (GitHub Repo)

pg_lake integrates Iceberg and data lake files into Postgres. With the pg_lake extensions, you can use Postgres as a stand-alone lakehouse system that supports transactions and fast queries on Iceberg tables, and can directly work with raw data files in object stores like S3.
🎁

Miscellaneous

You don't need Kafka: Building a message queue with only two UNIX signals (14 minute read)

A message broker can be built using only two UNIX signals—SIGUSR1 and SIGUSR2—to transmit bits between processes in Ruby. By trapping signals, shifting bits, and using null-terminated messages, this experiment recreates a basic producer–broker–consumer system, demonstrating how simple IPC and binary operations can emulate message queuing.
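
The bit-level idea can be sketched without sending any real signals. The following is a minimal Python illustration, not the article's Ruby code: it assumes SIGUSR1 stands for a 0 bit and SIGUSR2 for a 1 bit, sends each byte most significant bit first, and ends every message with a null byte. The names `encode` and `Decoder` are hypothetical, and the mapping and bit order are assumptions made for this example.

```python
import signal

# Assumption for this sketch: SIGUSR1 carries a 0 bit, SIGUSR2 carries a 1 bit.
BIT_TO_SIGNAL = {0: signal.SIGUSR1, 1: signal.SIGUSR2}
SIGNAL_TO_BIT = {sig: bit for bit, sig in BIT_TO_SIGNAL.items()}


def encode(message: str) -> list:
    """Return the signals for a message: 8 per byte, MSB first, null-terminated."""
    signals = []
    for byte in message.encode("utf-8") + b"\x00":
        for shift in range(7, -1, -1):
            signals.append(BIT_TO_SIGNAL[(byte >> shift) & 1])
    return signals


class Decoder:
    """Rebuilds messages from a stream of signals, one bit per signal."""

    def __init__(self):
        self.current = 0           # bits of the byte being assembled
        self.bit_count = 0         # number of bits received for that byte
        self.buffer = bytearray()  # bytes of the message in progress
        self.messages = []         # completed messages

    def on_signal(self, signum, frame=None):
        # Shift the partial byte left and append the new bit.
        self.current = (self.current << 1) | SIGNAL_TO_BIT[signum]
        self.bit_count += 1
        if self.bit_count == 8:
            if self.current == 0:
                # A null byte marks the end of a message.
                self.messages.append(self.buffer.decode("utf-8"))
                self.buffer = bytearray()
            else:
                self.buffer.append(self.current)
            self.current = 0
            self.bit_count = 0
```

On a Unix system, `on_signal` has the shape of a handler that could be registered with `signal.signal` for both signals, with a sender delivering each element of `encode(...)` via `os.kill`. Standard UNIX signals are not queued, so a real sender would need to pace or acknowledge each bit to avoid losing some.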
A Decade of AI Platform at Pinterest (18 minute read)

Pinterest's decade-long AI evolution turned fragmented ML stacks into a unified platform through shared layers like UFR, MLEnv, and the Dataset Store, with adoption accelerating once incentives and leadership aligned. Today, modeling and infrastructure are fused: GPU efficiency, Ray pipelines, and hybrid CPU/GPU serving drive both speed and capability, showing that success depends on knowing when to unify and when to explore.

Quick Links

How to Stop (or Limit) AI Models From Scraping Your PDFs (Sponsor)

AI is coming for your PDFs. Learn several options for allowing, limiting, or preventing AI crawlers from scraping your documents. Read the blog by Datalogics
Accelerate your Azure integration setup with guided onboarding (3 minute read)

Datadog has launched a new guided onboarding flow for its Azure integration that automates the setup experience directly within the Datadog platform.
Absurd Workflows: Durable Execution With Just Postgres (4 minute read)

A new project called Absurd demonstrates how durable workflows—long-running, crash-resilient tasks—can be implemented using only Postgres.
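
To make "durable" concrete, here is a generic checkpointing sketch in Python. It is not Absurd's API: the `checkpoints` table and the `open_store` and `run_step` helpers are hypothetical, and the standard-library `sqlite3` module stands in for Postgres so the example runs on its own. The idea shown is that each step's result is stored in the database, so a task retried after a crash skips the steps that already completed.

```python
import json
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the checkpoint store (SQLite here as a stand-in for Postgres)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        " task_id TEXT, step TEXT, result TEXT,"
        " PRIMARY KEY (task_id, step))"
    )
    return conn


def run_step(conn, task_id, step, fn):
    """Run fn once per (task_id, step); later calls return the stored result."""
    row = conn.execute(
        "SELECT result FROM checkpoints WHERE task_id = ? AND step = ?",
        (task_id, step),
    ).fetchone()
    if row is not None:
        # The step already completed in an earlier attempt: reuse its result.
        return json.loads(row[0])
    result = fn()
    conn.execute(
        "INSERT INTO checkpoints (task_id, step, result) VALUES (?, ?, ?)",
        (task_id, step, json.dumps(result)),
    )
    conn.commit()
    return result
```

An in-memory database only survives for the life of the process, so a real deployment would point the store at a persistent database. The step functions also need to return JSON-serializable values for this sketch to work.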

Want to advertise in TLDR? 📰

If your company is interested in reaching an audience of DevOps professionals and decision makers, you may want to advertise with us.

Want to work at TLDR? 💼

Apply here or send a friend's resume to jobs@tldr.tech and get $1k if we hire them!

If you have any comments or feedback, just respond to this email!

Thanks for reading,
Kunal Desai & Martin Hauskrecht

