AIOps / MLOps / LLMOps
1. Medium - From Beginner to AI/ML Pro in 2025: The Step-by-Step Roadmap that Gets You Hired
- This article presents a solid roadmap for anyone looking to begin their journey into AI/ML in 2025. It emphasizes mastering the fundamentals first, which is crucial for building a strong foundation and avoiding the common pitfall of being drawn too early into more advanced aspects of AI/ML.
- The roadmap highlights the importance of Python, Mathematics, and AI Fundamentals as essential building blocks for aspiring AI/ML engineers. It also provides relevant resources and topics. I highly recommend this roadmap for beginners as it accurately reflects the core activities of AI/ML engineers in practice.
2. Medium - What Every AI Engineer Should Know About A2A, MCP & ACP
- Following the discussion of AI/ML standards in the article above, this piece introduces three increasingly popular and rapidly developing concepts being adopted by major companies. Understanding MCP, ACP, and A2A can provide a significant advantage and help you seize opportunities to advance your AI/ML career in 2025 and beyond.
- The article provides a detailed explanation of each concept (MCP, ACP, and A2A), including illustrations, core fundamentals, components, architecture, and key characteristics. Ultimately, the author compares these approaches and explores scenarios where combining two or all three technologies could lead to greater success in building AI/ML systems (a minimal MCP sketch follows after this item).
- However, it's important to note that these protocols are still under active development, with ongoing efforts to enhance, patch, and establish them as more robust standards. While learning about them and potentially contributing to the community is valuable, careful consideration should be given before fully implementing them in production environments at this stage (June 2025). They hold significant promise for the near future.
3. Medium - LangChain + MCP + RAG + Ollama = The Key To Powerful Agentic AI
- While not an AI/ML engineer myself, I found this article to be a good introduction to the topic. The author explains complex concepts in an accessible way, focusing on relevant technologies (MCP vs RAG) and models (like Mistral AI) through a practical example of building an agentic chatbot.
- The article includes a step-by-step walkthrough with code examples and explanations of the functionality (a comparable minimal sketch follows after this item).
- For those who prefer video, the author also provides a YouTube tutorial explaining how to create a multi-agent chatbot using LangChain, MCP, web scraping, and Ollama.
- For more about concepts like A2A and MCP, the articles above are worth exploring; you may love them.
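To complement the article's walkthrough (without reproducing the author's code), here is a minimal RAG-flavoured sketch with LangChain and Ollama; it assumes a locally running Ollama daemon with the `mistral` model pulled, and the toy documents, prompt, and retriever settings are placeholders of mine.

```python
# Minimal RAG sketch with LangChain + Ollama (assumes `pip install
# langchain langchain-ollama langchain-community faiss-cpu` and a local
# Ollama daemon with the `mistral` model pulled).
from langchain_ollama import ChatOllama, OllamaEmbeddings
from langchain_community.vectorstores import FAISS

# Toy knowledge base standing in for scraped or ingested documents.
docs = [
    "MCP is a protocol for connecting models to external tools and data.",
    "RAG retrieves relevant context and injects it into the prompt.",
]

# Index the documents locally with Ollama embeddings.
vectorstore = FAISS.from_texts(docs, OllamaEmbeddings(model="mistral"))
retriever = vectorstore.as_retriever(search_kwargs={"k": 1})

# Retrieve context for a question and let the chat model answer with it.
question = "How does RAG help a chatbot?"
context = "\n".join(d.page_content for d in retriever.invoke(question))

llm = ChatOllama(model="mistral")
answer = llm.invoke(f"Answer using this context:\n{context}\n\nQuestion: {question}")
print(answer.content)
```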
Kubernetes
2. Medium - Why do I need an API Gateway on a Kubernetes cluster
3. Blog - Explore Kubernetes Gateway API With Kong Ingress Controller
- These articles offer a great opportunity to learn about Kong Gateway and how to use it within a Kubernetes environment.
- Through these resources, you will find detailed walkthroughs of deploying Kong in a production setup and come to understand the advantages and considerations when choosing this platform (a hedged Gateway API sketch follows after this item).
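As a taste of what the Gateway API side of this looks like, below is a hedged sketch that creates an HTTPRoute via the Kubernetes Python client; it assumes the Gateway API CRDs and Kong Ingress Controller are already installed, and the gateway name `kong`, hostname, and backend service are placeholders of mine rather than anything from the articles.

```python
# Sketch: create a Gateway API HTTPRoute that a Kong-managed Gateway can
# serve (assumes `pip install kubernetes`, the Gateway API CRDs installed,
# and an existing Gateway named "kong" in the "default" namespace).
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside a pod

http_route = {
    "apiVersion": "gateway.networking.k8s.io/v1",
    "kind": "HTTPRoute",
    "metadata": {"name": "echo-route", "namespace": "default"},
    "spec": {
        "parentRefs": [{"name": "kong"}],          # the Gateway to attach to
        "hostnames": ["echo.example.com"],
        "rules": [{
            "matches": [{"path": {"type": "PathPrefix", "value": "/echo"}}],
            "backendRefs": [{"name": "echo-service", "port": 80}],
        }],
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="gateway.networking.k8s.io",
    version="v1",
    namespace="default",
    plural="httproutes",
    body=http_route,
)
```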
4. Medium - Kubernetes Multi-tenancy: Are You Doing It the Hard Way?
- This article focuses on operating multi-tenancy within a Kubernetes environment using the vCluster project. vCluster allows you to create virtualized Kubernetes environments inside a physical cluster, helping to optimize resource utilization, especially for expensive resources like GPUs, by enabling multiple teams to use Kubernetes with different configurations on the same hardware.
- The article provides a complete walkthrough for implementing multi-tenancy with vCluster in Kubernetes. It details key considerations and various techniques for isolating containers, networks, resources, and nodes, and it concludes with an illustration of how to operate AI/ML workloads within a multi-tenant Kubernetes cluster using vCluster (one of these isolation building blocks is sketched after this item).
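As a small, hedged illustration of the resource-isolation angle (not vCluster itself), here is one building block you would typically pair with virtual clusters: a per-tenant namespace with a ResourceQuota. The tenant name and limits below are made up for the example.

```python
# Sketch of one isolation building block the article touches on: a
# per-tenant namespace with a ResourceQuota capping CPU, memory, and GPUs
# (assumes `pip install kubernetes`; tenant name and limits are made up).
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

tenant = "team-a"

core.create_namespace(client.V1Namespace(metadata=client.V1ObjectMeta(name=tenant)))

quota = client.V1ResourceQuota(
    metadata=client.V1ObjectMeta(name=f"{tenant}-quota", namespace=tenant),
    spec=client.V1ResourceQuotaSpec(hard={
        "requests.cpu": "8",
        "requests.memory": "32Gi",
        "requests.nvidia.com/gpu": "2",   # cap expensive GPU requests per tenant
    }),
)
core.create_namespaced_resource_quota(namespace=tenant, body=quota)
```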
5. Medium - 9 Proven Steps to Bulletproof Your Kubernetes Disaster Recovery
- If you want to understand and learn more about planning for disaster recovery, specifically within a Kubernetes environment, this article is a good resource to review. The author lists common scenarios that can lead to disruptions, such as Cloud Provider Outages, Security Breaches, and Network Failures, and also highlights key factors to strengthen your disaster recovery strategies.
- The article outlines nine steps for Kubernetes disaster recovery, providing clear explanations and illustrations for implementation. It also references popular tools like Velero for backups and restores, Chaos Mesh for chaos engineering to test resilience, and Argo CD for GitOps-based recovery (a minimal Velero sketch follows after this item).
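For a flavour of the Velero piece, here is a hedged sketch that triggers a backup by creating a Velero `Backup` custom resource with the Kubernetes Python client; Velero is more commonly driven through its CLI, and the backup name, namespace selection, and TTL below are placeholders of mine.

```python
# Sketch: trigger a Velero backup by creating a Backup custom resource
# (assumes Velero is installed in the "velero" namespace and
# `pip install kubernetes`; the backup name and namespaces are placeholders).
from kubernetes import client, config

config.load_kube_config()

backup = {
    "apiVersion": "velero.io/v1",
    "kind": "Backup",
    "metadata": {"name": "nightly-apps", "namespace": "velero"},
    "spec": {
        "includedNamespaces": ["apps"],   # what to back up
        "ttl": "168h0m0s",                # keep the backup for 7 days
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="velero.io",
    version="v1",
    namespace="velero",
    plural="backups",
    body=backup,
)
```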
- The author addresses the challenges encountered when using GPUs for AI/ML workloads in Kubernetes environments, emphasizing the need for solutions to optimize GPU utilization. They introduce KAI Scheduler, an open-source project by NVIDIA, as a tool that can tackle this through GPU sharing techniques.
- The article explains how standard Kubernetes scheduling struggles to express flexible GPU utilization, often leading to complex workarounds or the development of custom controllers. In contrast, KAI Scheduler has built-in logic that allows a single physical GPU to be divided and shared among smaller workloads (a hedged example follows after this item).
- Furthermore, the author compares KAI Scheduler's approach (soft isolation) with HAMi's (hard isolation), suggesting that this distinction brings different behaviors and trade-offs to consider when deciding which approach best suits specific needs. The article also outlines the pros and cons of KAI Scheduler and its ongoing development progress, indicating its potential as a viable solution in the near future.
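As a hedged illustration of the GPU-sharing idea, here is what requesting a fraction of a GPU from KAI Scheduler might look like; the queue label and `gpu-fraction` annotation follow the project's GPU-sharing examples as I understand them, so verify the exact keys against the KAI Scheduler documentation for your version before relying on this.

```python
# Sketch of a pod that asks KAI Scheduler for half a GPU. The schedulerName,
# queue label, and gpu-fraction annotation are assumptions based on the
# project's GPU-sharing examples; check the docs for your version.
from kubernetes import client, config

config.load_kube_config()

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {
        "name": "half-gpu-job",
        "labels": {"kai.scheduler/queue": "team-a"},   # assumed queue label
        "annotations": {"gpu-fraction": "0.5"},        # assumed sharing annotation
    },
    "spec": {
        "schedulerName": "kai-scheduler",              # hand the pod to KAI
        "restartPolicy": "Never",
        "containers": [{
            "name": "trainer",
            "image": "nvcr.io/nvidia/pytorch:24.04-py3",
            "command": ["python", "-c", "print('training...')"],
        }],
    },
}

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```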
7. Medium - Reclaiming Idle GPUs in Kubernetes: A Practical Approach (and a Call for Ideas!)
- Once again, highlighting the insightful content from the author Nimbus, this article discusses how to implement a smart GPU recycling system to optimize resource utilization in Kubernetes, addressing the lack of standard strategies for evicting or reclaiming idle GPU workloads.
- The author suggests leveraging existing tools like Prometheus and Kubernetes annotations to build either an API for detecting idle GPU processes or a Custom Resource Definition (CRD) within Kubernetes that automates the detection and eviction of these idle processes (a rough sketch of this idea follows after this item).
- Furthermore, the author is actively seeking community input and discussion to find the best solution for this challenge, inviting alternative ideas. If you have a potential solution, you are encouraged to visit their page and share your thoughts.
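In the spirit of the author's proposal (and explicitly not their implementation), here is a rough sketch that asks Prometheus for idle GPUs via DCGM exporter metrics and deletes pods that opted in through an annotation; the Prometheus URL, metric window, threshold, and annotation key are all placeholders of mine.

```python
# Rough sketch of the idea: find GPU pods that have been idle according to
# Prometheus (NVIDIA DCGM exporter metrics) and evict the ones that opted in
# via an annotation. URLs, thresholds, and the annotation key are placeholders.
import requests
from kubernetes import client, config

PROM_URL = "http://prometheus.monitoring:9090"
IDLE_QUERY = 'avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 1'  # ~0% util for 30m
RECLAIM_ANNOTATION = "gpu-reclaim/allowed"                    # hypothetical opt-in

config.load_kube_config()
core = client.CoreV1Api()

# Ask Prometheus which GPUs have been idle; the DCGM exporter labels each
# series with the pod and namespace currently holding that GPU.
result = requests.get(f"{PROM_URL}/api/v1/query", params={"query": IDLE_QUERY}).json()

for series in result.get("data", {}).get("result", []):
    labels = series["metric"]
    pod_name, namespace = labels.get("pod"), labels.get("namespace")
    if not pod_name or not namespace:
        continue

    pod = core.read_namespaced_pod(pod_name, namespace)
    annotations = pod.metadata.annotations or {}
    if annotations.get(RECLAIM_ANNOTATION) == "true":
        print(f"Reclaiming idle GPU pod {namespace}/{pod_name}")
        core.delete_namespaced_pod(pod_name, namespace)
```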
8. Medium - Struggling with AI/ML on Kubernetes? Why Specialized Schedulers Are Key to Efficiency
- In this article, the author shares their insights into the challenges of efficiently scheduling GPUs for AI/ML workloads within a Kubernetes environment, highlighting that it can be a complex aspect of Kubernetes management.
- The article points out common problems many users encounter, such as struggling to reach full (100%) GPU utilization and dealing with resource fragmentation, presenting these as typical pitfalls when running AI/ML workloads on Kubernetes.
- However, the article also emphasizes that there are multiple solutions to these challenges, primarily involving the use of custom schedulers specifically designed for GPU workloads, as the default Kubernetes scheduler is not well-suited for this task. The author provides a helpful list of such schedulers for further exploration, including Kueue, HAMI, KAI Scheduler, and Apache Yunikorn.
- I also used Gemini to do some deep research on GPU scheduling; if you have time, it is worth double-checking the output, as the resulting documentation is quite good, covers a lot of ground, and describes a diverse set of techniques that may prove useful in certain situations.
9. PerfectScale - How to Talk to Your Kubernetes Cluster Using AI (Yes, Really)
- This article provides insights into the evolving integration of Kubernetes with Artificial Intelligence, which is becoming increasingly common. The potential of AI, particularly with natural language models, offers significant benefits for interacting with Kubernetes, promising to reduce the workload associated with monitoring, managing, and diagnosing root causes.
- The article discusses various AI concepts that can be applied to Kubernetes, such as LLMs (Large Language Models) with function calling, RAG (Retrieval-Augmented Generation), MCP (Model Context Protocol, covered earlier in this issue), and AI agents. These approaches offer different responsibilities and functionalities to help you manage your Kubernetes cluster effectively. Some examples of relevant tools are K8sGPT and Botkube (a minimal function-calling sketch follows after this item).
- However, it's important to remember the author's advice regarding the trade-offs and considerations that come with using AI to directly interact with your Kubernetes environment.
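To illustrate the LLM-with-function-calling pattern the article mentions, here is a minimal, read-only sketch using the OpenAI Python SDK as one possible backend and the Kubernetes Python client for the actual cluster call; the tool schema, model name, and the single `list_pods` tool are my own illustrative choices.

```python
# Sketch of the "LLM + function calling" pattern for talking to a cluster:
# expose one read-only tool (list pods) to an LLM and execute whatever tool
# call it returns (assumes `pip install openai kubernetes` and OPENAI_API_KEY).
import json
from kubernetes import client, config
from openai import OpenAI

config.load_kube_config()
core = client.CoreV1Api()
llm = OpenAI()

def list_pods(namespace: str) -> str:
    """Return pod names and phases in a namespace."""
    pods = core.list_namespaced_pod(namespace)
    return json.dumps({p.metadata.name: p.status.phase for p in pods.items})

tools = [{
    "type": "function",
    "function": {
        "name": "list_pods",
        "description": "List pods and their status in a Kubernetes namespace.",
        "parameters": {
            "type": "object",
            "properties": {"namespace": {"type": "string"}},
            "required": ["namespace"],
        },
    },
}]

resp = llm.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is running in the default namespace?"}],
    tools=tools,
)

# Execute the tool call the model asked for (read-only, so relatively safe).
for call in resp.choices[0].message.tool_calls or []:
    if call.function.name == "list_pods":
        args = json.loads(call.function.arguments)
        print(list_pods(**args))
```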
10. Medium - Tracking Down "Invisible" OOM Kills in Kubernetes
- This article raises a very important point, as I have personally encountered the frustrating issue of "invisible" Out-of-Memory (OOM) kills within my Kubernetes cluster. These can be particularly difficult to diagnose because the process is terminated at the kernel level, often without triggering obvious alerts.
- The author provides valuable insights into how to debug, validate, and resolve these "invisible" OOM kills. It's a worthwhile read if you are looking for new techniques and perspectives to investigate this issue.
- Interestingly, the article highlights that you might not need external tools. If you are already familiar with journalctl or systemctl, these built-in Linux utilities can be effective for observing host-level events and potentially detecting these kernel-level OOM kills. Follow the article to explore this approach further.
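Along the lines the article suggests, here is a small helper that scans the kernel journal with `journalctl` for OOM-kill messages; the matched phrases are typical kernel log wording and can vary slightly between kernel versions, and running it requires access to the node's journal (for example via a node shell or privileged debug pod).

```python
# Small helper in the spirit of the article: scan the kernel journal on a
# node for OOM-kill events from the last hour. The matched phrases can vary
# slightly between kernel versions.
import subprocess

def recent_oom_kills(since: str = "1 hour ago") -> list[str]:
    """Return kernel log lines that look like OOM kills."""
    out = subprocess.run(
        ["journalctl", "-k", "--since", since, "--no-pager"],
        capture_output=True, text=True, check=True,
    ).stdout
    markers = ("Out of memory", "oom-kill", "Killed process")
    return [line for line in out.splitlines() if any(m in line for m in markers)]

if __name__ == "__main__":
    for line in recent_oom_kills():
        print(line)
```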
11. Medium - Kagent: When AI Agents Meet Kubernetes
- In the growing era of Agents being integrated everywhere, including Kubernetes, the author introduces a new tool for building AI Agents specifically for Kubernetes environments.
- This tool goes beyond simply analyzing or detecting problems in Kubernetes; it aims to resolve issues, provide consultation, and even take over various tasks typically handled by DevOps Engineers, which is a significant advancement.
- This new tool is called kagent. According to the author, it has been accepted by the Cloud Native Computing Foundation (CNCF), which is great news for the community.
- If you are interested in exploring solutions for building an AI Assistant or co-worker for Kubernetes management, you can read the article and learn more about this promising Agent.