center

AIOps / MLOps / LLMOps

1. Medium - From Beginner to AI/ML Pro in 2025: The Step-by-Step Roadmap that Gets You Hired

  • This article presents a great roadmap for individuals looking to begin their journey into the field of AI/ML in 2025. It emphasizes mastering the fundamental basics, which is crucial for building a strong foundation and avoiding common pitfalls often encountered when getting drawn to more advanced aspects of AI/ML.
  • The roadmap highlights the importance of Python, Mathematics, and AI Fundamentals as essential building blocks for aspiring AI/ML engineers. It also provides relevant resources and topics. I highly recommend this roadmap for beginners as it accurately reflects the core activities of AI/ML engineers in practice.

2. Medium - What Every AI Engineer Should Know About A2A, MCP & ACP

  • Following the discussion about AI/ML standards in the above article, this piece introduces three increasingly popular and rapidly developing concepts being adopted by major companies. Understanding MCP, ACP, and A2A can provide a significant advantage and help you seize opportunities to advance your AI/ML career in 2025 and beyond.
  • The article provides a detailed explanation of each concept – MCP, ACP, and A2A – including illustrations, core fundamentals, components, architecture, and their key characteristics. Ultimately, the author compares these approaches and explores scenarios where combining two or all three technologies could lead to greater success in building AI/ML systems.
  • However, it’s important to note that these protocols are still under active development. This means there are likely ongoing efforts to enhance, patch, and establish them as more robust standards. While learning about them and potentially contributing to the community is valuable, careful consideration should be given before fully implementing them in production environments at this stage (June 2025). They hold significant promise for the near future.

3. Medium - LangChain + MCP + RAG + Ollama = The Key To Powerful Agentic AI

Kubernetes

1. Medium - Using Kong to access Kubernetes services, using a Gateway resource with no cloud provided LoadBalancer

2. Medium - Why do I need an API Gateway on a Kubernetes cluster

3. Blog - Explore Kubernetes Gateway API With Kong Ingress Controller

  • These articles offer a great opportunity to learn about Kong Gateway and how to utilize it within a Kubernetes environment.
  • Through these resources, you will gain detailed walkthroughs and knowledge about deploying Kong in your production setup and understand the advantages and considerations when choosing this platform.

4. Medium - Kubernetes Multi-tenancy: Are You Doing It the Hard Way?

  • This article focuses on operating multi-tenancy within a Kubernetes environment using the vCluster project. vCluster allows you to create virtualized Kubernetes environments inside a physical cluster, helping to optimize resource utilization, especially for expensive resources like GPUs, by enabling multiple teams to use Kubernetes with different configurations on the same hardware.
  • The article provides a complete walkthrough for implementing multi-tenancy with vCluster in Kubernetes. It details enhancement points and various techniques for isolating containers, networks, resources, and nodes. For example, it concludes with an illustration of how to operate AI/ML workloads within a multi-tenant Kubernetes cluster using vCluster.

5. Medium - 9 Proven Steps to Bulletproof Your Kubernetes Disaster Recovery

  • If you want to understand and learn more about planning for disaster recovery, specifically within a Kubernetes environment, this article is a good resource to review. The author lists common scenarios that can lead to disruptions, such as Cloud Provider Outages, Security Breaches, and Network failure. They also highlight key factors to strengthen your disaster recovery strategies.
  • The article outlines nine steps for Kubernetes disaster recovery, providing clear explanations and illustrations for implementation. It also references popular tools like Velero for backups and restores, ChaoMesh for chaos engineering to test resilience, and Argo CD for GitOps-based recovery.

6. Medium - Struggling with GPU Waste on Kubernetes? How KAI-Scheduler’s Sharing Unlocks Efficiency

  • The author addresses the challenges encountered when using GPUs for AI/ML workloads in Kubernetes environments, emphasizing the need for solutions to optimize GPU utilization. They introduce KAI Scheduler, an open-source project by NVIDIA, as a tool that can tackle this through GPU sharing techniques.
  • The article explains how standard Kubernetes scheduling can struggle with defining flexible GPU utilization, often leading to complex workarounds or the development of custom controllers. In contrast, KAI Scheduler has built-in logic that allows a single physical GPU to be divided and shared among smaller workloads.
  • Furthermore, the author draws a comparison between KAI Scheduler’s approach (soft isolation) and HAMI’s (hard isolation), suggesting that this distinction presents different behaviors and trade-offs to consider when deciding which approach is most suitable for specific needs. The article also outlines the pros and cons of KAI Scheduler and its ongoing development progress, indicating its potential as a viable solution in the near future.

7. Medium - Reclaiming Idle GPUs in Kubernetes: A Practical Approach (and a Call for Ideas!)

  • Once again, highlighting the insightful content from the author Nimbus, this article discusses how to implement a smart GPU recycling system to optimize resource utilization in Kubernetes, addressing the lack of standard strategies for evicting or reclaiming idle GPU workloads.
  • The author suggests leveraging existing tools like Prometheus and Kubernetes annotations to build either an API for detecting idle GPU processes or a Custom Resource Definition (CRD) within Kubernetes to automate the detection and eviction of these idle processes.
  • Furthermore, the author is actively seeking community input and discussion to find the best solution for this challenge, inviting alternative ideas. If you have a potential solution, you are encouraged to visit their page and share your thoughts.

8. Medium - Struggling with AI/ML on Kubernetes? Why Specialized Schedulers Are Key to Efficiency

  • In this article, the author shares their insights into the challenges of efficiently scheduling GPUs for AI/ML workloads within a Kubernetes environment, highlighting that it can be a complex aspect of Kubernetes management.
  • The article points out common problems that many users might encounter, such as achieving 100% GPU utilization and dealing with resource fragmentation. These are presented as typical scenarios when running AI/ML workloads on Kubernetes, and the author suggests that these pitfalls are common.
  • However, the article also emphasizes that there are multiple solutions to these challenges, primarily involving the use of custom schedulers specifically designed for GPU workloads, as the default Kubernetes scheduler is not well-suited for this task. The author provides a helpful list of such schedulers for further exploration, including Kueue, HAMI, KAI Scheduler, and Apache Yunikorn.
  • I have to use Gemini to deep researching about GPU, if you have time, you can go double-check about them, that really good documentation, cover a lot of information and diverse technique will be useful in somehow or situations
    1. Kubernetes GPU Scheduling Solutions
    2. HAMI: Open-Source GPU Scheduling Insight

9. PerfectScale - How Talk to Your Kubernetes Cluster Using AI (Yes, Really)

  • This article provides insights into the evolving integration of Kubernetes with Artificial Intelligence, which is becoming increasingly common. The potential of AI, particularly with natural language models, offers significant benefits for interacting with Kubernetes, promising to reduce the workload associated with monitoring, managing, and diagnosing root causes.
  • The article discusses various AI concepts that can be applied to Kubernetes, such as LLM (Large Language Models) with Function Calling, RAG (Retrieval-Augmented Generation), MCP (likely referring to a Management Control Plane or similar concept in an AI context), and AI Agents. These approaches offer different responsibilities and functionalities to help you effectively manage your Kubernetes Cluster. Some examples of relevant tools mentioned are k8s-gpt and botkube.
  • However, it’s important to remember the author’s advice regarding the trade-offs and considerations that come with using AI to directly interact with your Kubernetes environment.

10. Medium - Tracking Down β€œInvisible” OOM Kills in Kubernetes

  • This article raises a very important point, as I have personally encountered the frustrating issue of β€œinvisible” Out-of-Memory (OOM) Kills within my Kubernetes cluster. These can be particularly difficult to diagnose because the process is terminated at the kernel level, often without triggering obvious alerts.
  • The author provides valuable insights into how to debug, validate, and resolve these β€œinvisible” OOM Kills. It’s a worthwhile read if you are looking for new techniques and perspectives to investigate this issue.
  • Interestingly, the article highlights that you might not need external tools. If you are already familiar with journalctl or systemctl, these built-in Linux utilities can be effective in observing host-level events and potentially detecting these kernel-level OOM Kills. Follow the article to explore this approach further.

11. Medium - Kagent: When AI Agents Meet Kubernetes

  • In the growing era of Agents being integrated everywhere, including Kubernetes, the author introduces a new tool for building AI Agents specifically for Kubernetes environments.
  • This tool goes beyond simply analyzing or detecting problems in Kubernetes; it aims to resolve issues, provide consultation, and even take over various tasks typically handled by DevOps Engineers, which is a significant advancement.
  • This new tool is called kagent. According to the author, it has been accepted by the Cloud Native Computing Foundation (CNCF), which is great news for the community.
  • If you are interested in exploring solutions for building an AI Assistant or co-worker for Kubernetes management, you can read the article and learn more about this promising Agent.