Today, we’re joined by Anand Naidu, a development expert with deep proficiency across both frontend and backend systems. As agentic AI begins to reshape IT operations, it’s also reintroducing old challenges in a new guise, reminiscent of the unpredictable “cowboy sysadmin” era. We’ll explore how to harness the power of these advanced AI tools for exploratory design and troubleshooting without letting them run wild in production environments. Our discussion will cover the critical distinction between deterministic and non-deterministic tasks, the risks of creating unrepeatable AI-driven fixes, and the essential role of modern practices like GitOps in establishing necessary guardrails.
The concept of “cowboy chaos” has re-emerged with agentic AI, mirroring how sysadmins once made unrepeatable changes. How does this AI-driven chaos differ from the manual version, and what initial guardrails should IT ops teams establish to prevent it? Please share some details.
It’s a fascinating and slightly terrifying parallel. The old cowboy sysadmin would SSH into a box, make a series of undocumented changes, and fix the immediate problem, but no one could ever reproduce that fix. The new AI-driven chaos is more insidious because it feels automated and, therefore, more trustworthy. The key difference is its non-deterministic nature. An LLM, by its very nature, can produce a different solution to the same prompt at different times. A manual fix was at least the result of a single human’s logic, however flawed or unrecorded. The AI’s logic is probabilistic and can drift. The most crucial initial guardrail is to enforce a strict boundary: AI can propose, but it can never apply. This means any output from an agent—be it a script, a playbook, or a configuration file—must be committed to a version control system like Git as an artifact. From there, it must pass through the same rigorous, deterministic CI/CD pipeline that any human-generated code would, including automated tests and human review.
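To make that boundary concrete, here is a minimal sketch of what the review gate could look like, assuming GitHub Actions and the kubeconform schema validator; the repository layout, directory names, and job names are illustrative, not a prescribed implementation.

```yaml
# Sketch of a "propose, never apply" gate (illustrative names and paths).
# AI-generated manifests arrive only as pull requests; this job validates
# them, and branch protection still requires a human approval to merge.
name: validate-proposed-manifests
on:
  pull_request:
    paths:
      - "manifests/**"   # hypothetical directory where proposed YAML lands
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Lint Kubernetes manifests against the schema
        run: |
          go install github.com/yannh/kubeconform/cmd/kubeconform@latest
          ~/go/bin/kubeconform -strict -summary manifests/
      # Deployment is not done here. It happens only after merge, driven by
      # the GitOps controller watching the main branch.
```

Branch protection on the main branch then demands a human approval before anything merges, and only the merge can trigger a deployment.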
When a critical app fails at 3 a.m., the impulse to let an AI agent log in and fix it is strong. Can you describe the specific long-term risks of this, such as creating a “time bomb,” and outline a safer, alternative protocol for that high-pressure moment?
That 3 a.m. scenario is where discipline is tested the most. Giving an AI shell access feels like a magic bullet, but you’re actually creating a ticking time bomb. Let’s say the agent successfully improvises a fix by installing a package or changing a config directly on the server. That system has now drifted from every other system in your environment and, more importantly, from its own declarative definition. When it’s time to apply a security patch or perform a migration months later, you have no deterministic recipe to rebuild that system. The original “fix” is lost in the probabilistic ether of the LLM that performed it. A much safer protocol, even under pressure, is to use the AI as an expert assistant. Prompt it to analyze logs and propose a fix in the form of a code change or a revised configuration manifest. The on-call engineer then takes that proposed artifact, reviews it for sanity, and pushes it through the established Git-driven workflow. It might feel a few minutes slower in the moment, but it ensures the fix is auditable, repeatable, and doesn’t create a snowflake system that will cause a much bigger explosion down the line.
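As an illustration, the artifact the agent hands back at 3 a.m. might be nothing more exotic than a small resource patch, proposed as a pull request instead of typed into a live shell. The service name and values below are hypothetical.

```yaml
# Hypothetical artifact an agent might propose during an incident: a
# strategic-merge patch that raises the memory limit of a crashing service.
# It is committed and reviewed, never applied by hand on the server.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-service        # hypothetical service name
spec:
  template:
    spec:
      containers:
        - name: checkout
          resources:
            limits:
              memory: "1Gi"     # proposed bump from 512Mi to stop OOM kills
```

The on-call engineer reviews exactly this diff, and the same pipeline that deploys every other change deploys this one.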
A core idea is to use AI for non-deterministic, exploratory work while keeping production changes deterministic. Could you walk us through a practical example of an AI helping to sketch out a Kubernetes manifest, and then detail the mandatory, human-led steps before that code is deployed?
Absolutely. Imagine you need to deploy a new application that includes a web front end, an application tier, and a database. This is a perfect non-deterministic task for an AI. You could prompt it to “design a Kubernetes manifest for a three-tier application with a public-facing ingress, a stateful set for the database, and autoscaling for the application deployment.” The AI would excel at this, sketching out the YAML for all the necessary deployments, services, stateful sets, and autoscalers, likely much faster than a human could. But that’s where the AI’s job ends. The mandatory next steps are purely human-led and deterministic. First, the generated YAML must be committed to a Git repository. Second, a pull request is opened, triggering a peer review where another engineer validates the logic, security contexts, and resource limits. Third, once approved, the change is merged, which automatically triggers a CI/CD pipeline. This pipeline tests the manifest, perhaps by deploying it to a staging cluster, before a GitOps controller declaratively and safely applies the final, vetted configuration to the production cluster. The AI provides the creative spark, but the platform ensures the execution is rigid and predictable.
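For a sense of scale, here is a heavily condensed sketch of what that first AI-generated draft might look like. Every name, image, and size is illustrative, and this is precisely the artifact that then has to survive review and the pipeline.

```yaml
# Condensed first draft for the three-tier prompt above (illustrative values).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-tier
spec:
  replicas: 2
  selector:
    matchLabels: { app: app-tier }
  template:
    metadata:
      labels: { app: app-tier }
    spec:
      containers:
        - name: app
          image: registry.example.com/shop/app:1.0.0   # hypothetical image
          resources:
            requests: { cpu: 250m, memory: 256Mi }
            limits: { cpu: "1", memory: 512Mi }
---
apiVersion: v1
kind: Service
metadata:
  name: app-tier
spec:
  selector: { app: app-tier }
  ports: [{ port: 80, targetPort: 8080 }]
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress
spec:
  rules:
    - host: shop.example.com            # hypothetical hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend: { service: { name: app-tier, port: { number: 80 } } }
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db
  replicas: 1
  selector:
    matchLabels: { app: db }
  template:
    metadata:
      labels: { app: db }
    spec:
      containers:
        - name: postgres
          image: postgres:16
          env:
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef: { name: db-credentials, key: password }  # secret assumed to exist
  volumeClaimTemplates:
    - metadata: { name: data }
      spec:
        accessModes: ["ReadWriteOnce"]
        resources: { requests: { storage: 10Gi } }
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-tier
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: app-tier }
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target: { type: Utilization, averageUtilization: 70 }
```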
Organizations that still allow direct SSH access to production are considered most vulnerable to the misuse of AI agents. Why are these environments particularly at risk, and what cultural and technical shifts are necessary to prepare them for safely integrating AI design tools?
Those environments are at extreme risk because their entire operational culture is built on the idea of direct, manual intervention. If your senior engineers are already accustomed to SSH-ing into servers to perform one-off fixes, the mental leap to “just let the agent try” is dangerously small. The very existence of that access path creates a temptation that will eventually be acted upon, especially when a team is under pressure. The risk is catastrophic because an AI can make changes at a scale and speed that a human cowboy never could. The necessary shift is twofold. Technically, you must move toward immutable infrastructure and a declarative model. This means building golden images and enforcing Git-driven workflows where the only way to change a system is by merging code that a pipeline then executes. Direct SSH access to production should be eliminated or restricted to rare, audited break-glass scenarios. Culturally, the team must shift from valuing individual heroics to valuing repeatable, automated processes. The hero is no longer the person who fixes a server at 3 a.m. but the one who builds the resilient, automated platform that prevents the failure in the first place.
In complex cloud-native environments, an application might span multiple services and deployments. How does this complexity amplify the dangers of letting an AI “just handle it,” and what specific GitOps practices can help a team safely use AI to manage that scale?
Modern applications aren’t single servers; they’re constellations of interconnected systems. A simple e-commerce app might have a dozen microservices, caches, and databases spread across multiple namespaces or even clusters. This complexity is a massive amplifier for risk. If you let an AI “just handle” a problem with one service, you have no way of knowing what cascading, unpredictable effects its non-deterministic changes will have on the other ten services. It’s like letting an amateur electrician rewire one part of a skyscraper—the lights might come on in one office, but the whole building could burn down. GitOps is the perfect antidote to this. It forces you to treat the entire cluster as a single, declarative system. The state of that whole constellation is defined in Git. An AI can be a powerful assistant in helping humans design and test the manifests and Helm charts for this complex system. But the actual application of those changes is managed by a GitOps controller that continuously reconciles the cluster’s state with the desired state in Git. This ensures that even at massive scale, every change is intentional, version-controlled, and applied deterministically across the entire system.
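As a concrete sketch, and assuming Argo CD as the GitOps controller, the handoff can be as simple as one Application resource pointing at the repository; the URL, path, and namespaces here are illustrative.

```yaml
# Minimal GitOps handoff, assuming Argo CD. Humans (with AI help at design
# time) change only what is in Git; the controller alone applies it.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: shop-platform
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/shop-manifests.git
    targetRevision: main
    path: environments/production
  destination:
    server: https://kubernetes.default.svc
    namespace: shop
  syncPolicy:
    automated:
      prune: true        # delete resources that are removed from Git
      selfHeal: true     # revert any out-of-band drift, human or AI
```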
An opinionated tech stack can protect enterprises from the temptation of misusing AI. What key components or rules would you build into such a stack to ensure AI is only used at design time, effectively keeping it out of direct contact with production systems?
An opinionated stack is your best defense. The core rule is that the platform must enforce a single, audited path to production. First, I would build in strict, role-based access controls that programmatically prevent shell access or direct API calls to production infrastructure from any entity that isn’t the core CI/CD system. No exceptions. Second, the stack would be built entirely around a Git-driven workflow. The only way to declare a change is to commit a code artifact—like a playbook or an image definition—to a repository. Third, every commit must trigger a mandatory pipeline that includes automated testing, security scanning, and a review gate. This pipeline becomes the sole actor authorized to deploy changes. In this world, an AI can be an incredibly valuable design-time tool. A developer can use it to generate a Dockerfile, but that Dockerfile is useless until it’s committed to Git and survives the gauntlet of the pipeline. The stack itself becomes the guardrail, quietly protecting the organization from its own worst impulses by making it technically impossible for anyone, human or machine, to improvise on production.
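In Kubernetes terms, a fragment of that rule set might look like the following sketch; the role and service account names are hypothetical. The important detail is what is absent: no pods/exec for anyone, and write verbs only for the pipeline’s service account.

```yaml
# Sketch of the "single audited path" in RBAC terms (illustrative names).
# Engineers are read-only; only the CI/CD service account can write.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prod-read-only
rules:
  - apiGroups: ["", "apps", "networking.k8s.io"]
    resources: ["pods", "services", "deployments", "ingresses", "events"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: ci-deployer
rules:
  - apiGroups: ["", "apps", "networking.k8s.io", "autoscaling"]
    resources: ["*"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: ci-deployer-binding
subjects:
  - kind: ServiceAccount
    name: ci-pipeline            # hypothetical CI service account
    namespace: ci
roleRef:
  kind: ClusterRole
  name: ci-deployer
  apiGroup: rbac.authorization.k8s.io
```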
What is your forecast for the role of agentic AI in IT operations over the next five years?
Over the next five years, I forecast that agentic AI will become an indispensable partner at design time but will remain strictly firewalled from runtime operations in mature organizations. We’ll see AI assistants seamlessly integrated into IDEs and platform dashboards, becoming incredibly adept at proposing high-quality, secure, and efficient infrastructure-as-code, Dockerfiles, and Kubernetes manifests. They will analyze telemetry to suggest optimizations and even draft entire incident post-mortems. However, the hype around “AI agents running your infrastructure” will give way to a more sober reality. The fundamental problem of non-determinism won’t be solved. The smart organizations will double down on deterministic platforms, using GitOps and immutable infrastructure as the non-negotiable contract through which all changes flow. The role of AI will be to help humans write better contracts, faster—not to become an ungoverned actor improvising directly on the systems our businesses depend on. The true innovation will be in the human-AI collaboration that perfects the artifacts, while the underlying platform ensures execution remains flawlessly predictable and safe.
