What Happens to CloudOps in the AI Era?
AIOps & the Transformation of One of Engineering's Most Vital Functions
TL;DR
I predict a major shift in influence away from software engineering and towards CloudOps in the next 3 to 5 years as AI becomes a key responsibility of infrastructure teams & dev roles decline — this will lead to a new discipline called AIOps.
How It Started vs. How It’s Going
When I first began in the software industry, the people responsible for server infrastructure & networking were called system admins & network engineers. There was no CloudOps or DevOps to speak of.
In those days developers thought little of sys admins & network engineers. The engineers who wrote the code saw themselves as the stars of the show, with everyone else there just to support them (including Product Management, which also barely existed).
Anyone who has been in the industry long enough remembers this dynamic well.
But ever so slowly over the years, products & systems became more complex, release cycles became continuous, and infrastructure could no longer play second fiddle to developers. The industry therefore created “DevOps,” which attempted to close the gap & strike a better balance between the two sides.
Then after years of DevOps, as the Cloud matured in the Enterprise and infrastructure became more programmable, many larger organizations evolved again into what we now refer to as CloudOps — teams responsible not only for maintaining infrastructure but for architecting and operating highly automated cloud environments at scale.
This entire evolution took about 20 years:
From Sys Admins & Network Engineers managing physical hardware
To DevOps, a new discipline connecting infra & engineering closer together
To CloudOps operating complex, large-scale cloud environments
Infrastructure is now so central to the way software systems are built and managed that the balance of power between software engineers who write code and CloudOps engineers who manage production systems has become a lot more equitable.
With AI exploding in popularity, though, all of this is going to change for the 4th time in 25 years…to the detriment of software engineering (once again).
AI will shift the balance of power even more towards CloudOps because of all the responsibilities for AI that essentially live in the infrastructure layer…this will lead to a new discipline called AIOps.
Additionally, Engineering will get hammered as AI takes over coding duties and will shrink further, ceding more power to CloudOps.
Let's look at the top 5 reasons for the shift:
Reason 1: AI is Hollowing Out Engineering
AI is beginning to take over large parts of traditional software engineering work. As more coding, testing, debugging, and even basic architecture becomes automated, the center of gravity inside technology organizations shifts.
If fewer people are needed to produce application code, and if more of that code is increasingly generated rather than handcrafted, then the strategic bottleneck moves away from pure software production and toward the environments in which that software actually runs.
That shift elevates the importance of CloudOps. If application development becomes faster, cheaper, and more automated, then the infrastructure layer becomes even more central because it remains the place where reliability, scalability, governance, cost control, and security have to be enforced.
In other words, if AI reduces the distinctiveness of traditional engineering labor, the teams that control production infrastructure become relatively more important because they still determine whether the resulting systems can operate economically at scale.
Seen this way, the easier it becomes to generate software, the more valuable the teams that control the platform, the runtime environment, and the operational rules of the system become.
Caveat
Traditional engineering is not disappearing overnight, and strong product and engineering leadership will still matter enormously for some time to come. But the trend is important to consider.
Reason 2: AI is Driving Complexity into CloudOps
AI introduces an additional layer of complexity that sits on top of the typical infrastructure & software design patterns most software orgs have spent the last decade refining.
Production environments that once consisted primarily of application services and data platforms increasingly need to accommodate GPU clusters, model inference systems, embedding pipelines, vector databases, and large-scale data processing workflows that support training and experimentation with AI.
Running these systems reliably requires CloudOps teams to think not only about uptime and scalability, but also about data movement, model lifecycle management, and the computational demands of training and inference workloads.
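To make the vector-database piece of that stack concrete, here is a toy vector-search loop in pure Python. It is an illustration only: real vector databases use approximate-nearest-neighbor indexes (e.g. HNSW) rather than a linear scan, and the document IDs and embedding values below are made up for the example.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query: list[float], index: dict[str, list[float]]) -> str:
    """Return the id of the stored vector most similar to the query."""
    return max(index, key=lambda doc_id: cosine(query, index[doc_id]))

# Hypothetical 2-dimensional embeddings standing in for real model output.
index = {"doc-a": [1.0, 0.0], "doc-b": [0.0, 1.0], "doc-c": [0.7, 0.7]}
print(nearest([0.9, 0.1], index))  # → doc-a
```

Every inference request in a retrieval-augmented product runs some version of this loop at scale, which is why the serving and indexing infrastructure behind it lands squarely in CloudOps territory.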
The result is that production environments begin to evolve into hybrid systems that combine traditional application infrastructure with elements of high-performance computing and large-scale data platforms driving AI. The architecture starts to feel less like a typical SaaS stack and more like something out of The Matrix: layers of systems operating other systems beneath the surface.
As infrastructure complexity increases in this way, CloudOps teams naturally gain influence inside the organization. Architectural decisions about how models are served, how data pipelines are constructed, and how computational resources are provisioned directly shape what the rest of the engineering organization can (and should) realistically build.
Caveat
Many companies will initially interact with AI through external APIs rather than operating the infrastructure themselves. However, as AI capabilities become more deeply integrated into products and organizations begin optimizing performance and cost, portions of that infrastructure almost inevitably migrate in-house.
Reason 3: AI Turns CloudOps into a Finance Team
The other shift AI introduces is economic. Historically, cloud infrastructure costs tend to scale in relatively predictable ways. Compute resources grow with usage, storage grows with data, and over time most organizations develop a reasonable understanding of how their cloud bill behaves as the product grows.
AI workloads disrupt that predictability. The discipline is still so immature that training jobs, inference pipelines, and experimentation environments create cost profiles that can fluctuate dramatically depending on how models are used and how computational resources are provisioned. GPU infrastructure in particular introduces a category of high-performance compute that can accumulate costs extremely quickly if not carefully managed.
Anyone who has accidentally left a GPU cluster running overnight already understands how quickly this can happen. 😀
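The back-of-the-envelope math is sobering. The hourly rate below is an assumption for illustration only; real prices vary widely by provider, region, and instance type.

```python
def idle_gpu_cost(nodes: int, hourly_rate_per_node: float, hours: float) -> float:
    """Cost of a cluster that nobody remembered to shut down."""
    return nodes * hourly_rate_per_node * hours

# Example: an 8-node cluster at an assumed $32 per node-hour,
# left idle for 14 hours overnight.
overnight = idle_gpu_cost(nodes=8, hourly_rate_per_node=32.0, hours=14.0)
print(f"${overnight:,.2f}")  # → $3,584.00
```

One forgotten cluster, one night, several thousand dollars. Multiply by the number of teams experimenting with AI and the finance conversation becomes unavoidable.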
The broader implication is organizational rather than technical. In many modern software companies the cloud bill has become one of the largest line items in the entire technology function. CloudOps teams therefore end up controlling a spending category that rivals many other operational budgets. It is not surprising that CloudOps leaders increasingly find themselves in conversations not only with devs but also with finance teams and CFOs who are trying to understand how infrastructure spending translates into business value.
At that point CloudOps begins to evolve into something closer to a hybrid discipline that sits at the intersection of engineering, infrastructure, and finance. This further cements CloudOps as a key player in the business.
Caveat
Higher infrastructure spending is not necessarily a problem if it corresponds with meaningful productivity or ARR gains. AI tools may allow engineering teams to build capabilities that would have previously been infeasible, which means infrastructure spending must increasingly be evaluated in terms of the business outcomes it enables rather than purely as a cost to minimize.
Reason 4: AI Shifts Product Design to CloudOps
Sounds crazy, right? But let’s look at this…
Traditional infrastructure failures tend to be deterministic; platforms usually break in ways that are observable and diagnosable. AI systems, however, introduce a completely different category of operational risk for the product.
An AI platform may remain technically operational while still producing degraded or inconsistent results for the customer. Model performance can drift as data changes, and external AI APIs may change behavior or latency characteristics without warning.
Thus, ensuring reliability in AI-enabled systems requires infrastructure teams to monitor not only the health of the systems but also the behavior of the AI models running on top of them: signals related to model drift, inference latency variability, prompt regressions, and dependency risks associated with external AI services.
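A minimal sketch of what such model-level signals might look like, layered on top of normal infrastructure monitoring. The threshold values and the crude mean-shift drift metric are assumptions for illustration; production systems would use proper drift tests (e.g. PSI or Kolmogorov-Smirnov) and real percentile estimation.

```python
from statistics import mean

def p95(samples: list[float]) -> float:
    """Rough 95th-percentile latency from a sample window."""
    ordered = sorted(samples)
    return ordered[int(0.95 * (len(ordered) - 1))]

def mean_shift(baseline: list[float], current: list[float]) -> float:
    """Crude drift signal: relative shift of the mean model score."""
    return abs(mean(current) - mean(baseline)) / (abs(mean(baseline)) or 1.0)

def model_health(latencies_ms, baseline_scores, current_scores,
                 p95_budget_ms=500.0, drift_budget=0.10):
    """Return a list of human-readable alerts; empty list means healthy."""
    alerts = []
    if p95(latencies_ms) > p95_budget_ms:
        alerts.append("inference latency p95 over budget")
    if mean_shift(baseline_scores, current_scores) > drift_budget:
        alerts.append("model score drift over budget")
    return alerts
```

The key point is that both checks can fire while every host, pod, and load balancer reports green, which is exactly the gap CloudOps is being pulled in to cover.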
In other words, for the first time in history CloudOps is getting into the game of product design. I know it sounds like heresy, but I’ve already seen it happen. For example, if your brand spanking new Voice AI isn’t giving customers the right response, CloudOps now has a pretty big say in that.
Caveat
Engineers have dealt with probabilistic systems before. Distributed systems and large-scale recommendation engines already introduced elements of non-deterministic behavior. But as AI gets injected into almost every product it amplifies these dynamics and makes them more central to how the product behaves.
Reason 5: AI Merges Data Science & CloudOps
Historically, CloudOps and data science teams operated in relatively separate domains. Infrastructure engineers focused on operating production environments while data scientists concentrated on experimentation, modeling, and training workflows.
But AI-enabled products collapse that boundary.
Model training pipelines, data processing infrastructure, evaluation frameworks, and production inference systems require tight coordination between infrastructure ops and machine learning teams. This overlap is what the industry broadly refers to as MLOps.
As a result, CloudOps teams increasingly need to understand the lifecycle of machine learning systems, including how models are trained, deployed, evaluated, and monitored once they are in production.
AI will accelerate this trend through rapid advances in data science technologies, pushing CloudOps teams to develop skills that traditionally belonged to data scientists. In many ways this shift is already underway: there is an obvious blurring of the lines between the two disciplines.
Caveat
Some organizations will build dedicated MLOps teams while others will integrate those responsibilities into CloudOps.
The 4th Evolution: AIOps
If the history of infrastructure ops teaches us anything, it is that every major shift in software eventually forces a corresponding evolution in the teams responsible for operating those systems.
The move from sys admins managing physical infrastructure to automated deployment pipelines gave rise to DevOps. The migration from on-premises environments to hyper-scale cloud platforms produced CloudOps. Each transformation reflected a deeper level of abstraction, automation, and system complexity.
AI is now driving the next evolution.
As AI becomes embedded directly into production software, infrastructure teams increasingly find themselves operating environments that combine cloud infrastructure, large-scale data platforms, and AI systems. These environments are fundamentally different from traditional application stacks. They involve probabilistic systems, dynamic model behavior, and computational workloads that behave more like scientific computing than traditional web infrastructure.
At the same time, the tools used to operate these systems are beginning to incorporate AI themselves. Monitoring platforms, anomaly detection systems, incident analysis tools, and infrastructure optimization engines are increasingly using machine learning to analyze operational data and assist engineers in managing complex environments.
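As a toy illustration of the kind of statistical detection such tooling applies to operational metrics: the sketch below flags metric points that deviate sharply from a trailing window. The window size and z-score threshold are arbitrary assumptions; commercial AIOps platforms use far more sophisticated models, but the underlying idea is similar.

```python
from statistics import mean, stdev

def anomalies(series: list[float], window: int = 10,
              z_threshold: float = 3.0) -> list[int]:
    """Flag indices whose value sits more than z_threshold standard
    deviations away from the trailing window's mean."""
    flagged = []
    for i in range(window, len(series)):
        trailing = series[i - window:i]
        sigma = stdev(trailing)
        if sigma == 0:
            continue  # flat window: no meaningful baseline to deviate from
        z = abs(series[i] - mean(trailing)) / sigma
        if z > z_threshold:
            flagged.append(i)
    return flagged

# A stable latency metric followed by a sudden spike at the last point.
metric = [10.0, 10.1, 9.9, 10.0, 10.2, 9.8, 10.1, 10.0, 9.9, 10.1, 50.0]
print(anomalies(metric))  # → [10]
```

The engineer's job shifts from staring at dashboards to supervising detectors like this one and deciding which of their findings matter.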
This convergence is giving rise to a fourth operational paradigm: AIOps.
In this new model, AI is both part of the workload being operated and part of the tooling used to operate it. Infrastructure teams are responsible not only for running AI systems but increasingly for supervising intelligent systems that help manage the infrastructure itself. At times the environment begins to take on a slightly familiar sci-fi quality: engineers overseeing systems that are partially overseeing themselves. Fortunately we are still a long way from Skynet, but the direction of travel is clear.
AIOps is not simply a new observability tool category. It represents a fundamental change in how modern software systems are built & operated.
When AI becomes a core component of the production architecture, the infrastructure layer becomes the place where models are deployed, governed, monitored, and secured. The teams responsible for that layer therefore become central to how AI-enabled systems function inside the organization.
Taken together, these shifts are gradually transforming the role CloudOps plays inside technology companies. Infrastructure teams were once viewed primarily as operational support groups responsible for maintaining reliability while product and engineering teams drove the roadmap.
That model is increasingly outdated.
As infrastructure complexity rises and infrastructure spending becomes one of the largest operational costs in the technology organization, the teams managing that infrastructure inevitably gain influence. Decisions about architecture, compute allocation, and platform tooling directly shape how quickly new capabilities can be developed and how economically they can be delivered.
The teams that control AI spend increasingly control the operating model of engineering.
AI will not eliminate Engineering or CloudOps. But it will completely change them. As AI systems move deeper into production environments, the influence of the teams operating the infrastructure will grow rather than diminish while AI will take a big bite out of software development.
Good organizations will recognize these shifts and act accordingly: CloudOps is evolving into a strategic player while software engineering is fighting simply to stay alive.

Interesting and realistic take on how the future will unfold for Ops. Yes, I was that DevOps person, way before DevOps became a thing.
I agree with most of the reasoning, though AIOps assuming product design needs more pondering.
My supplemental thoughts.
FinOps is already a thing, with companies like ProsperOps saving enterprises money through good cost reductions. This will gravitate toward on-prem ops too, as many enterprises are shifting data back into on-prem systems, both because their data is the gold and to rein in abnormal cloud costs.
MLOps is already in place and is one of the top reasons enterprises like mine go for DataRobot and/or Databricks.
DataOps is going to be huge, especially since data pipelines and data engineering are at the heart of AI/ML.
I have also been hearing from 'Agentic Ops' companies who promise to keep all my agents in check, though I do think this will be rolled into observability kings like Datadog, Dynatrace, etc.
CodeOps is one I would add, as we need to know which agent checked in that bug, how it skipped peer code reviews, passed the battery of tests, and was promoted to prod… and brought the enterprise crashing down :)
Yes, we are surely on the way to another interesting transformation. Tech will still be a decider of great business outcomes, though AI coding is leveling the playing field.