Networking Hardware For AI 🌐, Chaos Engineering, Zero-Trust AI On Kubernetes 🔒

TLDR DevOps <dan@tldrnewsletter.com>

October 15, 11:10 am

Together With Dynatrace

TLDR DevOps 2025-10-15

AI is everywhere, but trust, performance, and alignment are still evolving. (Sponsor)

For platform engineering teams, observability is becoming the strategic layer that makes AI explainable, auditable, and resilient.

In The State of Observability 2025, Dynatrace surveyed 800+ IT leaders to uncover insights, such as:

  • 46% expect AI-driven observability to deliver the greatest ROI for optimizing AI model configurations
  • 37% expect observability to help identify data drift, forecast system load, and alert users to potential issues
  • 69% of AI decisions are still verified by humans

Explore the full report to see how observability is enabling real-time automation and smarter AI infrastructure.

📱

News & Trends

OCP Summit 2025: The Open Future of Networking Hardware for AI (5 minute read)

Meta announced milestones for its data center networking, including a Disaggregated Scheduled Fabric (DSF) evolution to support 18,432 XPUs for AI clusters and a new Non-Scheduled Fabric (NSF) architecture. It also introduced Minipack3N, a new 51T Ethernet switch, and is expanding its optics portfolio with 2x400G FR4 LITE (500-m) optics and the 400G DR4 OSFP-RHS optics. Meta is a founding participant in the new Ethernet for Scale-Up Networking (ESUN) initiative.
Docker Model Runner on the new NVIDIA DGX Spark: a new paradigm for developing AI locally (6 minute read)

Docker Model Runner now supports NVIDIA DGX Spark, allowing developers to easily run and iterate on larger models locally with a familiar Docker experience. The combination of DGX Spark's workstation-class AI system and Docker Model Runner's sandboxed environment offers plug-and-play GPU acceleration and Docker-level simplicity.
🚀

Opinions & Tutorials

Instantly respond to changes in your data with Datadog automation rules (4 minute read)

Datadog Datastore automation rules allow workflows to trigger instantly when data is added, updated, or deleted, ensuring all processes and integrations stay current without manual intervention. This feature keeps data consistent across apps and workflows, streamlines incident management, and enables faster, automated responses to operational changes.
Managing Kubernetes Workloads Using the App of Apps Pattern in ArgoCD-2 (5 minute read)

The App of Apps pattern in ArgoCD uses a parent application to manage and deploy multiple child applications, enabling modular, version-controlled, and GitOps-driven Kubernetes deployments. This approach improves visibility, reusability, and consistency across environments while allowing automated syncing and self-healing of applications like NGINX Ingress and Cert-Manager.
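For a rough feel of the pattern, here is a minimal Python sketch that builds a parent Application manifest whose source directory would hold the child Application manifests. The repository URL, the `apps` path, and the `root-app` name are hypothetical placeholders. The field layout follows the Argo CD `Application` resource as commonly documented, so treat this as an illustration under those assumptions, not the article's own configuration.

```python
import json


def parent_application(name, repo_url, apps_path, revision="HEAD"):
    """Build a parent Argo CD Application manifest.

    The directory at `apps_path` is assumed to contain one child
    Application manifest per managed app (for example an ingress
    controller or cert-manager).
    """
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": name, "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {
                "repoURL": repo_url,          # hypothetical Git repository
                "path": apps_path,            # folder of child Application manifests
                "targetRevision": revision,
            },
            "destination": {
                "server": "https://kubernetes.default.svc",
                "namespace": "argocd",
            },
            # Automated sync with pruning and self-healing, as the summary describes.
            "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
        },
    }


manifest = parent_application(
    "root-app", "https://example.com/org/gitops.git", "apps"
)
print(json.dumps(manifest, indent=2))
```

Adding a new child app then becomes a Git commit that drops another manifest into the `apps` folder; the parent picks it up on the next sync.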
A blueprint for zero-trust AI on Kubernetes (6 minute read)

Tigera highlighted the need to secure AI workloads on Kubernetes, pointing out that while AI security challenges aren't new, the stakes are higher due to the sensitive data and powerful APIs involved. Solutions include securing API endpoints with the Kubernetes Gateway API, using network policies enforced by CNIs like Calico or Cilium, restricting API keys using IP address restrictions or egress gateways, and implementing observability with tools like OpenTelemetry.
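The network-policy piece of that list can be sketched in a few lines of Python that emit standard Kubernetes `NetworkPolicy` manifests: a default-deny policy for a namespace, plus one narrow allow rule. The namespace, labels, and port below are hypothetical, and this is a generic zero-trust starting point, not Tigera's published blueprint.

```python
import json


def default_deny(namespace):
    """NetworkPolicy that selects every pod in the namespace and allows no traffic."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        # An empty podSelector matches all pods; listing both policy types
        # with no rules blocks ingress and egress alike.
        "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]},
    }


def allow_ingress(namespace, name, target_labels, source_labels, port):
    """NetworkPolicy that lets pods with source_labels reach target pods on one TCP port."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": target_labels},
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    "from": [{"podSelector": {"matchLabels": source_labels}}],
                    "ports": [{"protocol": "TCP", "port": port}],
                }
            ],
        },
    }


# Hypothetical namespace and labels for an inference service behind a gateway.
policies = [
    default_deny("ai-inference"),
    allow_ingress(
        "ai-inference",
        "allow-gateway-to-model",
        target_labels={"app": "model-server"},
        source_labels={"app": "api-gateway"},
        port=8080,
    ),
]
print(json.dumps(policies, indent=2))
```

These manifests only take effect when the cluster's CNI enforces network policy, which is where plugins such as Calico or Cilium come in.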
πŸ§‘β€πŸ’»

Resources & Tools

Free resources for migrating Linux workloads to Azure (Sponsor)

Aging infrastructure? Rising costs? Pressure to modernize? It might be time to move your Linux workloads to Azure. VIAcode has partnered with Microsoft and AMD to make that easier. Get the guide for a step-by-step migration framework; or jump right in with a free, no-risk migration assessment.
Onedump (GitHub Repo)

Onedump, a database administration tool, streamlines database backup and restore tasks across multiple databases and storage destinations using binaries and Docker images. The tool supports direct network access or SSH for database connections, can load configurations from local directories or AWS S3 buckets, and offers a native MySQL dumper with limitations, as well as binlog sync-s3 and restore commands for point-in-time recovery.
OpenZL (GitHub Repo)

Meta is using OpenZL, a BSD-licensed data compression framework, extensively in production. OpenZL delivers high compression ratios while preserving high speed and consists of a core library and tools to generate specialized compressors.
🎁

Miscellaneous

From key sprawl to scalable control: Rethinking SSH access (5 minute read)

Static SSH keys create security and operational challenges at scale due to long lifetimes, sharing, and lack of auditability, while SSH certificates provide short-lived, signed access controlled by a certificate authority. HashiCorp Vault and Boundary streamline SSH certificate management by dynamically generating and injecting ephemeral credentials, enabling secure, auditable, and scalable SSH access with centralized policy and minimal user overhead.
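To make the "short-lived, signed" idea concrete, here is a toy Python sketch of a credential that carries an expiry and a signature from an issuing authority. It is a conceptual model only: the HMAC secret stands in for a real CA's private key (actual SSH certificates use asymmetric signatures), and none of the names below come from Vault, Boundary, or OpenSSH.

```python
import hashlib
import hmac
import json
import time

# Hypothetical demo secret; a real CA signs with a private key instead.
CA_SECRET = b"demo-only-secret"


def issue(principal, ttl_seconds, now=None):
    """Issue a signed credential that expires ttl_seconds after `now`."""
    now = time.time() if now is None else now
    claims = {"principal": principal, "valid_before": now + ttl_seconds}
    payload = json.dumps(claims, sort_keys=True).encode()
    signature = hmac.new(CA_SECRET, payload, hashlib.sha256).hexdigest()
    return {"claims": claims, "signature": signature}


def verify(cert, now=None):
    """Accept the credential only if the signature matches and it has not expired."""
    now = time.time() if now is None else now
    payload = json.dumps(cert["claims"], sort_keys=True).encode()
    expected = hmac.new(CA_SECRET, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, cert["signature"]):
        return False
    return now < cert["claims"]["valid_before"]


cert = issue("alice", ttl_seconds=300, now=1000.0)
print(verify(cert, now=1100.0))  # True: inside the 5-minute window
print(verify(cert, now=1400.0))  # False: expired, no revocation step needed
```

The point of the short lifetime is that a leaked credential stops working on its own, which is the property static keys lack.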
Cloudflare Stock Rises: The Reason Behind the Surge (3 minute read)

Cloudflare shares rose 3.5% after announcing a partnership with Oracle to integrate its connectivity cloud platform into Oracle Cloud Infrastructure (OCI). The collaboration aims to enable joint customers to use Cloudflare's security and performance services within OCI environments. Cloudflare also introduced new solution bundles to simplify security and application management for its partners.
⚡

Quick Links

Everyone hates working with PDFs. This SDK will make you hate it less (Sponsor)

Solve complex PDF workflows with the Adobe PDF Library SDK by Datalogics. Create, edit, and manage PDF documents and workflows with easy-to-use APIs and expert support. Start your free trial.
Amazon CloudWatch Application Signals new enhancements for application monitoring (5 minute read)

Amazon CloudWatch Application Signals introduces enhanced features for monitoring large-scale distributed applications, including automatic and custom service grouping, deployment tracking, and audit insights for SLI breaches.
How Sysdig secures your containers and Kubernetes (8 minute read)

Sysdig secures containers and Kubernetes by providing end-to-end visibility across the container lifecycle.
Getting started with chaos engineering (5 minute read)

Chaos engineering is a methodology popularized by Netflix that introduces controlled disruptions to identify vulnerabilities.
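A controlled disruption can be as small as a wrapper that makes a fraction of calls fail on purpose. The Python sketch below injects faults into a stand-in service call and counts how many requests still fail after retries. The function names, failure rate, and retry count are illustrative assumptions, not taken from the linked article or from any chaos tooling.

```python
import random


class InjectedFault(Exception):
    """Raised by the chaos wrapper to simulate a dependency failure."""


def with_chaos(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of calls raise InjectedFault."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("simulated outage")
        return func(*args, **kwargs)
    return wrapper


def call_with_retry(func, attempts):
    """Call func up to `attempts` times; return None if every attempt fails."""
    for _ in range(attempts):
        try:
            return func()
        except InjectedFault:
            continue
    return None


def fetch_price():
    return 42  # hypothetical stand-in for a real service call


rng = random.Random(7)  # fixed seed keeps the experiment repeatable
flaky_fetch = with_chaos(fetch_price, failure_rate=0.3, rng=rng)

results = [call_with_retry(flaky_fetch, attempts=3) for _ in range(1000)]
unrecovered = results.count(None)
print(f"{unrecovered} of {len(results)} requests failed after retries")
```

Raising the failure rate or lowering the retry count shows how quickly the unrecovered share grows, which is the kind of weakness a chaos experiment is meant to surface before a real outage does.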

Want to advertise in TLDR? 📰

If your company is interested in reaching an audience of DevOps professionals and decision makers, you may want to advertise with us.

Want to work at TLDR? 💼

Apply here or send a friend's resume to jobs@tldr.tech and get $1k if we hire them!

If you have any comments or feedback, just respond to this email!

Thanks for reading,
Kunal Desai & Martin Hauskrecht

