
Quote

“Hi everyone! For those of you who regularly attend my bi-weekly sessions, please note that I’ll be pausing this series for about two sessions. I’m doing this to focus on another goal for the year, and I’ll definitely be bringing the series back!

So, this will be the final session for the next month or so. I hope you can make it and enjoy the session.”

Data Engineer

1. Medium - Data Engineering Was Hard Until I Learned These 15 System Design Concepts.

This article by Akanksha Singh is impressive in how it delivers knowledge, acting like a dictionary for system design in data architecture.
It covers 15 diverse and significant concepts, such as CDC, ETL & ELT, and Batch & Streaming Ingestion.
When you read this article, you’ll gain detailed information on each design type, including its advantages and the problems it solves. I particularly appreciate the article’s flow, with its clear illustrations and explanations, which make it accessible both to those starting their data engineering journey and to experienced professionals looking to explore more use cases.
As I often say in my blogs and articles about data, data engineering is full of mysteries and curiosities that keep it charming. So enjoy this article and let it seed your next journey!
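To make one of those concepts concrete, here is a minimal sketch of the idea behind CDC: replaying an ordered log of row-level changes onto a replica instead of re-copying the whole table. This is my own illustration, not code from the article. The event shape (`op`, `key`, `row`) and the function name `apply_changes` are hypothetical, and real CDC tools define their own formats.

```python
from typing import Any

# Hypothetical change event shape (an assumption for illustration):
#   {"op": "insert" | "update" | "delete", "key": <primary key>, "row": {...}}
ChangeEvent = dict[str, Any]


def apply_changes(
    replica: dict[Any, dict[str, Any]], events: list[ChangeEvent]
) -> dict[Any, dict[str, Any]]:
    """Replay an ordered change log onto a replica keyed by primary key."""
    for event in events:
        op, key = event["op"], event["key"]
        if op in ("insert", "update"):
            # Upsert: merge the changed columns into the existing row, if any.
            replica[key] = {**replica.get(key, {}), **event["row"]}
        elif op == "delete":
            # Deleting a key that is already gone is treated as a no-op.
            replica.pop(key, None)
        else:
            raise ValueError(f"unknown op: {op!r}")
    return replica


if __name__ == "__main__":
    events = [
        {"op": "insert", "key": 1, "row": {"name": "Ada", "city": "Hanoi"}},
        {"op": "update", "key": 1, "row": {"city": "Hue"}},
        {"op": "insert", "key": 2, "row": {"name": "Bob", "city": "Da Nang"}},
        {"op": "delete", "key": 2},
    ]
    print(apply_changes({}, events))
```

Because only the changes travel, the replica stays in sync without a full reload, which is roughly the trade-off that separates CDC from a plain batch copy.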

Kubernetes

1. Alibaba - A Multi-Cloud and Multi-Cluster Architecture with Kubernetes

A multi-cloud, multi-cluster Kubernetes architecture is incredibly complex, because it demands managing and controlling numerous Kubernetes clusters across various clouds. The difficulty grows with hybrid setups that combine cloud and on-premises environments.
Throughout the article, the author highlights the benefits of this architecture: high availability, reduced vendor dependency, and solutions for resource-demanding workloads, disaster recovery, and multi-site active-active real-time data synchronization for global environments.
Given Kubernetes’ inherent complexity and the need to operate with multiple strategies across different clouds and clusters, this architecture is tough to implement. The challenges include Kubernetes API accessibility, centralized GitOps, and authentication across clusters.
In this article, Alibaba offers several ideas to reduce that complexity, notably the concept of a Kubernetes API Tunnel, which is intriguing and worth exploring further after reading the core content.
The content also covers several case studies and projects, such as Federation v2, KubeCDN, and self-hosted CDN implementations, along with many other insights that will not disappoint.

2. Outshift - Four ways to build hybrid clouds with Kubernetes

As with the topic above, I’m genuinely fascinated by how Kubernetes can serve as a hybrid cloud solution to address significant challenges in scaling and global latency across services. It’s an ambitious and certainly not easy endeavor.
This article provides various solutions for hybrid clouds, such as Cluster Groups, Federation, Service Mesh, and Hybrid Cloud Controllers.
With compelling illustrations and open discussions, you’ll gain a clear vision of what’s required to build a hybrid Kubernetes environment. At the end, you’ll find a comparison of these four options to help you choose. If this topic interests you as well, I highly recommend diving into it.

3. OpenAI - Scaling Kubernetes to 7,500 nodes

This article presents a real-world case study of how OpenAI runs Kubernetes at an astonishing 7,500 nodes. This is a massive scale, and it makes one wonder how they manage control and debugging. Given that most organizations operate clusters only a tiny fraction of this size, the methods and insights into their hosting approach are well worth studying.
Throughout the article, OpenAI engineers share the significant challenges they encountered when managing such an enormous cluster, including complexities with kube-scheduler, GPU reservation, storage, and networking, among many other considerations. The article provides a wealth of solutions, stories, and experiences shared directly by the engineers who operate these clusters.
Further articles to help you delve into OpenAI’s Kubernetes setup:
- Medium - How OpenAI Scaled Kubernetes with Azure CNI to Handle 7,500 Nodes
- Medium - How Kubernetes Powers OpenAI’s Infrastructure: A Deep Dive into Scaling for AI

4. The New Stack - Tutorial: Set Up a Cloud Native GPU Testbed With Nvkind Kubernetes 🌟 (Recommended)

This article delves into the specifics of testing GPU scheduling and inference for AI models using NVKind, a wrapper around Kind designed specifically for GPU environments.
The article provides a comprehensive tutorial on self-hosting NVKind, outlining the necessary prerequisites and demonstrating how to verify that your NVKind cluster is operating correctly.
Additionally, I will mention this stack in an upcoming CD, where the topic of Kubernetes Sandbox environments will be explored in greater detail.


DueWeekly Tech: 11-08-2025 to 17-08-2025
