
Can ChatGPT Save Collective Kubernetes Troubleshooting?

Companies like OpenAI have been training models on public data from Stack Overflow, Reddit and others. With AI-driven DevOps platforms, more knowledge is locked inside proprietary models.
Sep 8th, 2023 7:54am

Decades ago, sysadmins started flooding the internet with questions about the technical problems they faced daily. They had long, vibrant and valuable discussions about how to investigate and troubleshoot their way to understanding the root cause of the problem; then they detailed the solution that ultimately worked for them.

This flood has never stopped, only changed the direction of its flow. Today, these same discussions still happen on Stack Overflow, Reddit and postmortems on corporate engineering blogs. Each one is a valuable contribution to the global anthology of IT system troubleshooting.

Kubernetes has profoundly altered the flow as well. Microservice architectures are far more complex than the virtual machines (VMs) and monolithic applications that troubled sysadmins and IT folks for decades. Local reproductions of K8s-scale bugs are often impossible to set up. Observability data gets fragmented across multiple platforms, if it's captured at all, given Kubernetes' lack of data persistence. Mapping the interconnectedness of dozens or hundreds of services, resources and dependencies is an exercise in futility.

Now your intuition, driven by experience, isn’t necessarily enough. You need to know how to debug the cluster for clues as to your next step.
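
A first pass at gathering those clues often looks something like the following sketch; the namespace name is hypothetical, and kubectl top assumes metrics-server is installed in the cluster:

kubectl get pods -n shop                                          # which pods are Pending, CrashLoopBackOff or OOMKilled?
kubectl get events -n shop --sort-by=.metadata.creationTimestamp  # recent scheduling, image-pull or probe failures
kubectl top pods -n shop                                          # spot CPU or memory pressure
kubectl get nodes                                                 # any NotReady nodes behind the symptoms?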

This complexity means that public troubleshooting discussions are more important than ever, but we're starting to see this valuable flood not redirected, but dammed up entirely. You've seen this in Google searches: any query about a Kubernetes-related issue brings up a half-dozen paid ads and at least a page of SEO-driven articles that lack technical depth. Stack Overflow is losing its dominance as the go-to Q&A resource for technical folks, and Reddit's last few years have been mired in controversy.

Now, every DevOps platform for Kubernetes is building one last levee: Centralize your troubleshooting knowledge within their platform, and replace it with AI and machine learning (ML) until the entire stack becomes a black box to even your most experienced cloud native engineers. When this happens, you lose the skills for individually probing, troubleshooting and fixing your system. This trend turns what used to be a flood of crowdsourced troubleshooting know-how into a mere trickle compared to what was available in the past.

When we become dependent on platforms, the collective wisdom of troubleshooting techniques disappears.

The Flood Path of Troubleshooting Wisdom

In the beginning, sysadmins relied on printed books for technical documentation and holistic best practices to implement in their organizations. As the internet proliferated in the '80s and '90s, these folks generally adopted Usenet to chat with peers and ask technical questions about their work in newsgroups like comp.lang.*, which operated like stripped-down versions of the forums we know today.

The general availability of the World Wide Web quickly and almost completely diverted the flood of troubleshooting wisdom. Instead of newsgroups, engineers and administrators flocked to thousands of forums, including Experts Exchange, which went live in 1996. After amassing a repository of questions and answers, the team behind Experts Exchange put all answers behind a $250-a-year paywall, which isolated countless valuable discussions from public consumption and ultimately led to the site’s sinking relevance.

Stack Overflow came next, opening up these discussions to the public again and gamifying discussions through reputation points, which could be earned by providing insights and solutions. Other users then vote for and validate the “best” solution, which helps follow-on searchers find an answer quickly. The gamification, self-moderation and community around Stack Overflow made it the singular channel where the flood of troubleshooting know-how flowed.

But, as in every era before it, nothing good lasts forever. Folks have been predicting the "decline of Stack Overflow" for nearly 10 years, arguing that it "hates new users" because of its combative culture and a moderation structure run by whoever has the most reputation points. While Stack Overflow has certainly declined in relevance and popularity, with Reddit's development- and engineering-focused subreddits filling the void, it remains the largest repository of publicly accessible troubleshooting knowledge.

Particularly so for Kubernetes and the cloud native community, which is still experiencing major growing pains. And that’s an invaluable resource, because if you think Kubernetes is complex now …

The Kubernetes Complexity Problem

In a fantastic article about the downfall of “intuitive debugging,” software delivery consultant Pete Hodgson argues that the modern architectures for building and delivering software, like Kubernetes and microservices, are far more complex than ever. “The days of naming servers after Greek gods and sshing into a box to run tail and top are long gone for most of us,” he writes, but “this shift has come at a cost … traditional approaches to understanding and troubleshooting production environments simply will not work in this new world.”

Cynefin model. Source: Wikipedia

Hodgson uses the Cynefin model to illustrate how software architecture used to be complicated, in that given enough experience, one could understand the cause-and-effect relationship between troubleshooting and resolution.

He argues that distributed microservice architectures are instead complex, in that even experienced folks only have a “limited intuition” as to the root cause and how to troubleshoot it. Instead of driving straight toward results, they must spend more time asking and answering questions with observability data to eventually hypothesize what might be going wrong.

If we agree with Hodgson's premise that Kubernetes is inherently complex and demands much more time analyzing an issue before responding, then engineers working with Kubernetes need to learn which questions are most important to ask, and then answer them with observability data, to make the optimal next move.

That’s exactly the type of wisdom disappearing into this coming generation of AI-driven troubleshooting platforms.

Two Paths for AI in Kubernetes Troubleshooting

For years, companies like OpenAI have been scraping public data published on Stack Overflow, Reddit and elsewhere to train their models, which means these AI models carry a great deal of knowledge about systems and applications, Kubernetes included. Others recognize that an organization's observability data is a valuable resource for training AI/ML models to analyze new scenarios.

They’re both asking the same question: How can we leverage this existing data about Kubernetes to simplify the process of searching for the best solution to an incident or outage? The products they’re building take very different paths.

First: Augment the Operator’s Analysis Efforts

These tools automate and streamline access to that existing flood of troubleshooting knowledge published publicly online. They don’t replace the human intuition and creativity that’s required to do proper troubleshooting or root-cause analysis (RCA), but rather thoughtfully automate how an operator finds relevant information.

For example, if a developer new to Kubernetes struggles with deploying their application because they see a CrashLoopBackOff status when running kubectl get pods, they can query an AI-powered tool to provide recommendations, like running kubectl describe $POD or kubectl logs $POD. Those steps might in turn lead the developer to investigate the relevant deployment with kubectl describe $DEPLOYMENT.
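
In practice, that guided back-and-forth might look something like this sketch; the pod and deployment names here are hypothetical:

kubectl get pods -n demo                                  # spot the pod stuck in CrashLoopBackOff
kubectl describe pod checkout-7d9f6c5b8-x2kqz -n demo     # check events: failing probes, OOMKilled, bad image?
kubectl logs checkout-7d9f6c5b8-x2kqz -n demo --previous  # read logs from the last crashed container
kubectl describe deployment checkout -n demo              # inspect the spec the pod was created from

Each command answers one question and suggests the next, which is exactly the loop an AI assistant can help shortcut.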

At Botkube, we found ourselves invested in this concept of using AI, trained on the flood of troubleshooting wisdom, to automate this back-and-forth querying process. Users should be able to ask questions directly in Slack, like “How do I troubleshoot this nonfunctional service?” and receive a response penned by ChatGPT. During a companywide hackathon, we followed through, building a new plugin for our collaborative troubleshooting platform designed around this concept.

With Doctor, you can tap into the flood of troubleshooting know-how without trawling through Stack Overflow or Google search ads; Botkube acts as the bridge between your Kubernetes cluster and your messaging/collaboration platform. That's particularly useful for newer Kubernetes developers and operators.

The plugin also takes automation a step further by generating a Slack message with a Get Help button for any error or anomaly, which then queries ChatGPT for actionable solutions and next steps. You can even pipe the results from the Doctor plugin into other actions or integrations to streamline how you actively use the existing breadth of Kubernetes troubleshooting knowledge to debug more intuitively and sense the problem faster.

Second: Remove the Operator from Troubleshooting

These tools don't care about the flood of public knowledge. If they can train generalist AI/ML models on real observability data, then fine-tune them for your particular architecture, they can cut the human operator out of RCA and remediation entirely.

Causely is one such startup, and they're not shying away from their vision of using AI to "eliminate human troubleshooting." The platform hooks into your existing observability data and processes it to fine-tune causality models, which theoretically take you straight to remediation steps, no probing or kubectl-ing required.

I’d be lying if I said a Kubernetes genie doesn’t sound tempting on occasion, but I’m not worried about a tool like Causely taking away operations jobs. I’m worried about what happens to our valuable flood of troubleshooting knowledge in a Causely-led future.

The Gap Between These Paths: The Data

I’m not priming a rant about how “AI will replace all DevOps jobs.” We’ve all read too many of these doomsday scenarios for every niche and industry. I’m far more interested in the gap between these two paths: What data is used for training and answering questions or presenting results?

The first path generally uses existing public data. Despite concerns around AI companies crawling these sites for training data — looking at you, Reddit and Twitter — the openness of this data still provides an incentive loop to keep developers and engineers contributing to the continued flood of knowledge on Reddit, Stack Overflow and beyond.

The cloud native community is also generally amenable to an open source-esque sharing of technical knowledge and the idea that a rising tide (of Kubernetes troubleshooting tips) lifts all boats (of stressed-out Kubernetes engineers).

The second path looks bleaker. With the rise of AI-driven DevOps platforms, more troubleshooting knowledge gets locked inside these dashboards and the proprietary AI models that power them. We all agree that Kubernetes infrastructure will continue to get more complex, not less, which means that over time, we’ll understand even less about what’s happening between our nodes, pods and containers.

When we stop helping each other analyze a problem and sense a solution, we become dependent on platforms. That feels like a losing path for everyone but the platforms.

How Can We Not Lose (or Lose Less)?

The best thing we can do is continue to publish amazing content online about our troubleshooting endeavors in Kubernetes and beyond, like “A Visual Guide on Troubleshooting Kubernetes Deployments”; create apps that educate through gamification, like SadServers; take our favorite first steps when troubleshooting a system, like “Why I Usually Run ‘w’ First When Troubleshooting Unknown Machines”; and conduct postmortems that detail the stressful story of probing, sensing and responding to potentially disastrous situations, like the July 2023 Tarsnap outage.

We can go beyond technical solutions, too, like talking about how we can manage and support our peers through stressful troubleshooting scenarios, or building organizationwide agreement on what observability is.

Despite their current headwinds, Stack Overflow and Reddit will continue to be reliable outlets for discussing troubleshooting and seeking answers. If they end up in the same breath as Usenet and Experts Exchange, they’ll likely be replaced by other publicly available alternatives.

Regardless of when and how that happens, I hope you'll join us at Botkube, and try the new Doctor plugin, to build new channels for collaboratively troubleshooting complex issues in Kubernetes.

It doesn’t matter if AI-powered DevOps platforms continue to train new models based on scraped public data about Kubernetes. As long as we don’t willingly and wholesale deposit our curiosity, adventure and knack for problem-solving into these black boxes, there will always be a new path to keep the invaluable flood of troubleshooting know-how flowing.
