
AI & Automation, Cloud & Hybrid IT, Networking & Data centers

Google Cloud Next 2026: AI Operations Start with IT Operations

May 1, 2026

"The era of AI pilots is over. Now it's about running agents at scale."

That was Saran Sundar, Head of Cloud at Astreya, in a recent interview.

Ninety seconds into the Google Cloud Next 2026 keynote, in front of fifteen thousand people, the same read on the market arrived from the main stage.

Companies that spent 2024 and 2025 piloting AI are now running agents in production, across functions, in numbers that demand a management discipline of their own. Every announcement Google Cloud made downstream of the opening assumed the same thing: enterprises are deploying AI, and the open question is how they keep thousands of agents running, governed, observable, and improving, every day.

Most of the coverage will land on the model news. Gemini 3 Flash Image went into preview. Support for Anthropic's Claude Opus 4.7 landed. The eighth-generation TPUs were unveiled. Apple was confirmed as a preferred cloud customer building the next generation of Apple foundation models on Gemini. All of that was expected.

A bigger story sits underneath: the same partners who have operated the rest of their stack for decades are uniquely positioned to operate their agents as well. 

Companies that build out the AI operations layer will thrive. Companies that try to scale models without scaling that layer will not.

The pilot phase is over

Thomas Kurian, CEO of Google Cloud, opened with a number that did most of the work. About three quarters of Google Cloud customers now use its AI products in their core business. Thousands of agents run across every industry, reaching billions of users.

Then a line from Kurian framed every minute that followed:

"You have moved beyond the pilot. The experimentation phase is behind us, and now the real challenge begins: how do you move AI into production across your entire enterprise?"

That line compressed the conference into fifteen seconds. Production is the new center of gravity.

Sundar Pichai, CEO of Google, later put the operational point on it directly:

"The conversation has gone from 'can we build an agent?' to 'how do we manage thousands of them?'"

Each agent needs context, secured actions, audited decisions, observed behavior, governed permissions, and a feedback loop from every incident. Multiply that by a fleet of thousands and the operations problem of the decade comes into view: AI that has to keep running long after the pilot ends.

Humans cannot manually troubleshoot at this scale

The operational reality showed up most clearly in Amin Vahdat's segment on the AI Hypercomputer. Vahdat, Google's SVP and Chief Technologist for AI and Infrastructure, walked through the eighth-generation TPUs, the new Virgo network, and the storage upgrades that let a single cluster reach 1.7 million exaFLOPs of compute.

Then came the line that mattered for IT and operations leaders:

"At this scale, you cannot have humans manually troubleshooting configurations. You need a cloud that drives itself."

That is a hardware engineer's way of stating the operations principle. Tens of thousands of agents, billions of telemetry events, exabytes of model traffic, and a continuously shifting set of model versions and data pipelines have run past what manual operations can support. 

The cloud has to drive itself. Operations do too. 

Agent fleets need their own operations layer because the humans cannot keep up.

Google built that layer piece by piece across the rest of the keynote.

  • Agent Identity gives each agent a unique cryptographic ID with traceable, auditable authorization
  • Agent Gateway centralizes policy enforcement
  • Model Armor guards models and proprietary data against sensitive-data leakage
  • Agent Observability instruments the full execution path of every agent through OTel-compliant telemetry
  • Agent Registry indexes every internal agent and tool so nothing rogue runs unaccounted for
  • Skills and Tools Registries package reusable instructions so agents do specialized work the same way every time
  • Agent-to-agent orchestration handles handoffs between specialized agents on a single workflow
  • Agent Marketplace exposes vetted agents from partners including Atlassian, Box, ServiceNow, and Workday
  • Native Model Context Protocol turns every Google Cloud service into something an agent can call directly
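The first two pieces of that list share one pattern: an agent presents a verifiable identity, a central gateway checks policy before any tool call executes, and the decision is logged for audit. A minimal sketch of that pattern, with hypothetical names throughout (`POLICY`, `gateway_call`, and the tools are illustrative, not Google Cloud APIs):

```python
import hashlib
import json
import time

# Hypothetical sketch: a gateway that enforces per-agent policy and
# writes an audit record before any tool call runs. Illustrative only.

POLICY = {
    "refund-agent": {"allowed_tools": {"lookup_order", "issue_refund"},
                     "max_refund_usd": 500},
    "triage-agent": {"allowed_tools": {"read_ticket", "route_ticket"}},
}

AUDIT_LOG = []

def agent_id(name: str) -> str:
    # Stand-in for a cryptographic agent identity.
    return hashlib.sha256(name.encode()).hexdigest()[:16]

def gateway_call(agent: str, tool: str, args: dict) -> str:
    """Central policy check: block the call before execution, not after."""
    policy = POLICY.get(agent)
    allowed = policy is not None and tool in policy["allowed_tools"]
    if allowed and tool == "issue_refund":
        allowed = args.get("amount_usd", 0) <= policy["max_refund_usd"]
    AUDIT_LOG.append({
        "ts": time.time(),
        "agent_id": agent_id(agent),
        "tool": tool,
        "args": json.dumps(args, sort_keys=True),
        "decision": "allow" if allowed else "block",
    })
    if not allowed:
        raise PermissionError(f"{agent} may not call {tool} with {args}")
    return f"executed {tool}"

print(gateway_call("refund-agent", "issue_refund", {"amount_usd": 120}))
try:
    gateway_call("refund-agent", "issue_refund", {"amount_usd": 9000})
except PermissionError as e:
    print("blocked:", e)
```

The design point is the ordering: the audit record is written whether the call is allowed or blocked, so the trail exists even for actions that never ran.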

That is a long list. Read it as one sentence: every layer that IT operations teams have spent thirty years building for human-and-machine workflows is being rebuilt for agent-and-machine workflows.

AI operations is the layer that makes a fleet of agents legible, governable, observable, and improvable. Without it, an agent that hallucinates a customer refund, leaks a credential, or makes the wrong call on a P1 incident is indistinguishable from one that performed correctly. There is no audit trail, no signal-from-noise filter, no human judgment at the right moments, no learning loop that turns this morning's misfire into an instruction the next agent reads before it acts.

Google Cloud did not name AI operations as a category from the stage. The case for one was assembled across three hours.

The teams who already run things at scale have a head start

Most coverage will miss this. The new operations layer Google Cloud is building rests on the IT operations discipline, applied to a new set of nouns.

Keeping an AI fleet running maps onto the work of keeping an IT estate running.

Triage. A flood of alerts, events, and signals has to be reduced to the few that matter and routed to the right place. In IT operations, that is PagerDuty, Splunk, and the runbook. For AI operations, the toolset is Agent Observability, telemetry traces, and the human-in-the-loop checkpoint.
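In code, the triage move is the same for agents as for servers: collapse duplicates, drop the one-off blips, route what remains to an owner. A minimal sketch, with hypothetical event fields and routing table:

```python
from collections import Counter

# Hypothetical triage sketch: reduce a flood of raw events to the few
# signals that matter, then route each one to an owner.

ROUTES = {"model": "ml-platform-oncall", "auth": "security-oncall"}

def triage(events, min_count=3):
    """Group events by (source, kind); escalate only repeated ones."""
    counts = Counter((e["source"], e["kind"]) for e in events)
    signals = []
    for (source, kind), n in counts.items():
        if n >= min_count:  # noise filter: ignore one-off blips
            signals.append({
                "source": source,
                "kind": kind,
                "count": n,
                "route_to": ROUTES.get(kind, "it-ops-queue"),
            })
    return sorted(signals, key=lambda s: -s["count"])

events = (
    [{"source": "agent-42", "kind": "model"}] * 5
    + [{"source": "agent-7", "kind": "auth"}] * 3
    + [{"source": "agent-9", "kind": "latency"}]  # one-off, filtered out
)
for s in triage(events):
    print(s["route_to"], s["count"])
```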

Governance. Every action has to be auditable, scored, and either approved or contained. In IT, that is change management and least-privilege access. For AI, that is Agent Identity, Model Armor, and the policy engine that decides which actions get blocked before execution.

Root cause and resolution. When something breaks, somebody has to know which system did what, in which order, and why. In IT, that is incident response. For AI, the same investigation runs on a fleet of agents whose reasoning paths are not human-legible by default.

Learning. The operation has to get smarter every time it resolves an incident, or you pay full price for the same mistake twice. That is true of help desks. It is exponentially more true of agent fleets, where one bad pattern can replicate across thousands of decisions before anyone notices.
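The learning loop has a simple shape: every resolved incident leaves behind an instruction, and every agent consults those instructions before acting, so the same mistake is paid for once. A hypothetical sketch (the playbook structure and function names are illustrative):

```python
# Hypothetical learning-loop sketch: resolved incidents become
# instructions that future agent runs read before acting.

PLAYBOOK = {}  # pattern -> instruction learned from a past incident

def resolve_incident(pattern: str, instruction: str) -> None:
    """Close an incident and fold the fix back into the playbook."""
    PLAYBOOK[pattern] = instruction

def act(agent_task: str) -> str:
    """Check learned instructions before doing anything new."""
    for pattern, instruction in PLAYBOOK.items():
        if pattern in agent_task:
            return f"known pattern, applying: {instruction}"
    return "no prior guidance, escalating to human review"

# This morning's misfire escalates to a human...
print(act("refund request with mismatched currency"))
# ...and the resolution becomes an instruction the next agent reads.
resolve_incident("mismatched currency",
                 "convert to invoice currency before issuing refund")
print(act("refund request with mismatched currency"))
```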

The companies that already run global IT operations at enterprise scale already have most of the muscle. They know how to ingest events from hundreds of sources and reduce them to a handful of signals. Routing incidents to the right team with the right context is solved territory. Policy enforcement without slowing the business is muscle memory. Feedback loops that make next quarter's operation cheaper than this one are the basic compounding mechanic of the work.

What those companies need is the layer that handles the new objects:

  • The ID system
  • The policy engine
  • The observability stack
  • The learning graph

That layer is what Google Cloud spent three hours of the keynote describing. The next twelve months will be about who deploys it first.

How Astreya operates on Google Cloud

Another point from Astreya's Head of Cloud, Saran Sundar:

"Models are becoming commodities. Operating AI is the real advantage."

Astreya has been building toward the moment Sundar described for years. Six of the Magnificent Seven are clients, with operations running across 40+ countries and a global workforce of 2,400+.

At the center of these solutions is LogicFabric, a knowledge graph encoding 200+ person-years of operational judgment across 2.7 million interconnected relationships and 15,000 atomic resolution tasks. 

In plain terms, LogicFabric is the experienced brain behind Astreya’s stack of AI solutions:

  • AI OpsHub is the operational explainability platform, where every agent action and incident decision is scored, logged, governed, and reconstructable for the auditor or the board.
  • Lynx is the predictive insights engine, surfacing recurring incident patterns, data quality gaps, and SLA risks before they escalate into the human queue.
  • Pictor is the process intelligence platform, recording how work actually flows through the organization, classifying process variants, and preserving the rationale behind every automation and redesign decision so intent survives platform migrations.
  • Pyxis is the IT data insights framework, correlating signals across ITSM, ITAM, CSAT, call logs, and event sources to produce hypotheses, validated multi-domain insights, and plain-language answers any stakeholder can act on.
  • Ara is the AIOps and automation readiness assessment, mapping where a customer sits on the path from manual operations to governed automation and specifying which use cases to automate, in what sequence, and with what projected ROI.

Taken together, these AI solutions let enterprises defend every number, see outages before they happen, fix issues before costs spiral, and know where they stand at any time with every agent.

Enterprise IT operations look the same; only the objects in the queue are new. Every AI deployment has a lifecycle: model, data, agent, policy, audit, human, next incident. Those pieces run across people and systems that have to keep working together for years.

Astreya is certain that the team running the rest of the stack is the team to run the agent layer on top of it.

The next twelve months

Every enterprise running AI in production now has a second IT estate to manage. 

It is made of agents, models, prompts, and policies, and it has the same failure modes as the first one: alerts that pile up, decisions that go unaudited, incidents whose root causes get lost, learning that never compounds. The estate is also new enough that most organizations have no operations function pointed at it yet.

The companies that win the next year will be the ones that treat agent fleets the way they treat their server fleets and ship the operations layer that makes those fleets legible and governable.

AI is an operations problem now.

The companies that already operate at scale have the head start. Whether they convert it into a competitive advantage depends on how fast they start running with the same partners who have operated the rest of their stack for decades.

The team that runs your stack should run your agents

For twenty-five years, Astreya has been building the operations layer the keynote described. Six of the Magnificent Seven already run stacks with our support. The same teams are now partnering with us to run agents on our layer.

A Google Cloud Partner with migration, optimization, and Agentic AI engineering on the bench, Astreya is ready to operate the estate Google Cloud Next 2026 described. Astreya’s AI IP is already built and deployed globally at hyperscale.

For an operations layer that keeps up with AI in production, [talk to us].

[See Astreya's Enterprise AI Services]
