Ellison’s inflection point—and what it signals
Larry Ellison’s story begins far from trillion‑dollar AI factories. Born in New York to an unwed Jewish mother and adopted by relatives in Chicago’s South Shore, he grew up in a Reform Jewish home, skeptical of dogma but intensely drawn to engineering and entrepreneurship. Those formative years—adoption, reinvention, and an early brush with mainframe programming—set the stage for Oracle and an era‑defining bet on databases and, later, cloud infrastructure.
In September 2025, Ellison briefly became the world’s richest person as Oracle shares rocketed on a wave of AI optimism; the crown passed back to Elon Musk within hours, but the message was unmistakable: infrastructure for AI is the new oil field.
One catalyst was a reported cloud contract of historic scale: The Wall Street Journal reported that OpenAI signed a deal to purchase $300 billion in Oracle compute over roughly five years starting in 2027—among the largest cloud agreements on record. Reuters and The Verge echoed the report; Oracle declined comment. Whether every dollar materializes is beside the point. The direction of travel is clear: AI’s future requires gargantuan, reliable compute and the supply chains to build and power it.
That future is not only about cloud‑scale LLMs. It’s about embodied intelligence—putting software into hardware so machines can perceive, decide, and act in the physical world.
What large language models do well—and where they stall
Strengths. LLMs compress the world’s text into a powerful predictive engine: outstanding at drafting, summarizing, dialogue, and increasingly at tool use and code generation.
Limits (today).
- Grounding and action. LLMs reason over symbols; they lack direct sensorimotor grounding. Connecting words to forces, friction, or balance requires perception and control stacks beyond pure language.
- Temporal continuity. Long‑horizon tasks (think: “clean the kitchen”) require persistent memory, state tracking, and closed‑loop control—capabilities that exceed simple prompt‑response paradigms.
- Latency & cost. Running frontier models interactively is energy‑intensive and slow relative to the control cycles needed for safe manipulation or locomotion (often ~1 kHz); a rough latency budget follows this list.
- Reliability. Hallucination and non‑determinism are unacceptable in safety‑critical settings without additional verification and control layers.
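To put the latency gap in numbers, here is a back‑of‑the‑envelope sketch in Python; the per‑stage latencies are assumptions for illustration, not measurements of any particular model or robot.

```python
# Back-of-the-envelope: how many 1 kHz control ticks elapse during one
# model call? All latency figures below are illustrative assumptions.

CONTROL_RATE_HZ = 1_000                        # typical torque/position loop rate
CONTROL_PERIOD_MS = 1_000 / CONTROL_RATE_HZ    # 1 ms per control tick

# Assumed end-to-end latencies (network + queueing + inference), in ms.
assumed_latencies_ms = {
    "cloud frontier LLM call": 800,
    "on-robot VLA policy step": 30,
    "local joint-level controller": 0.2,
}

for stage, latency_ms in assumed_latencies_ms.items():
    ticks_missed = latency_ms / CONTROL_PERIOD_MS
    print(f"{stage:30s} ~{latency_ms:7.1f} ms -> {ticks_missed:6.1f} control ticks")

# Waiting 800 ms for a cloud model means ~800 missed control ticks --
# far too long to recover from a slip or a failing grasp. Language models
# set goals; fast local controllers keep the loop closed.
```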
Bottom line: LLMs remain essential for task understanding and high‑level planning. But the revolution comes when language is fused with vision, motion, and control—software in hardware.
From software to hardware: the rise of physical AI
A new stack is coalescing around on‑robot computers powerful enough to run multi‑modal foundation models in real time:
- Edge AI compute for robots. NVIDIA’s Jetson Thor brings up to 2,070 FP4 “teraflops” with 128 GB memory at 40–130 W, enabling on‑board vision‑language‑action (VLA) inference and low‑latency sensor fusion. Think of it as a data‑center brain shrunk for a biped.
- Data‑center backends. Blackwell‑class GPU racks (e.g., GB200 NVL72) stitch dozens of GPUs/CPUs into a “single, massive GPU” for training trillion‑parameter models and generating synthetic robot data at scale.
- Control frequencies. Closing torque/position loops at or near 1 kHz is common in high‑performance humanoids; you cannot safely grasp, balance, or recover from slips at LLM latencies. That’s why inference must move onto the robot or be paired with ultra‑fast local controllers.
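A minimal sketch of that two‑rate split, assuming a toy unit‑inertia joint, a hypothetical 10 Hz planner (`slow_policy`), and hand‑picked PD gains; a real stack would swap the planner for a learned policy and the toy dynamics for actual hardware.

```python
# Two-rate control sketch: a slow planner updates the target a few times per
# second, while a 1 kHz PD loop tracks it every millisecond. Gains, rates,
# and the 1-D unit-inertia "joint" are illustrative assumptions.

DT = 0.001                  # 1 kHz control period (seconds)
POLICY_EVERY = 100          # replan every 100 ticks (i.e., a 10 Hz planner)
KP, KD = 50.0, 5.0          # PD gains (assumed)

def slow_policy(t):
    """Stand-in for a high-level planner: pick a joint target (radians)."""
    return 0.5 if t < 1.0 else 1.0

pos, vel, target = 0.0, 0.0, 0.0
for step in range(2_000):                        # simulate 2 seconds
    t = step * DT
    if step % POLICY_EVERY == 0:                 # slow loop: update the goal
        target = slow_policy(t)
    torque = KP * (target - pos) - KD * vel      # fast loop: PD control
    vel += torque * DT                           # toy unit-inertia dynamics
    pos += vel * DT
    if step % 500 == 0:
        print(f"t={t:4.2f}s  target={target:.2f}  pos={pos:.3f}")
```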
In short: the next decade belongs to physical AI—language for goals and constraints; video and proprioception for perception; and high‑rate control on energy‑efficient silicon for action.
Seeing vs. seeing‑in‑time: images and video are different problems
Analyzing a picture is fundamentally a spatial problem (classification, detection, segmentation): what objects, what pose, what affordances—in a single frame. Modern models (CNNs, Vision Transformers) excel here.
Analyzing a video adds time—you must model motion and causality:
- Motion cues. Early breakthroughs used two‑stream networks: one net for appearance (RGB), one for motion (optical flow); a toy version follows this list.
- Spatiotemporal fusion. SlowFast networks split low‑rate spatial semantics (“Slow”) from high‑rate motion (“Fast”), fusing them for robust recognition.
- Attention over space‑time. Transformer architectures like TimeSformer attend jointly across frames to capture temporal dependencies.
- Self‑supervision. VideoMAE learns from unlabeled clips by reconstructing heavily masked video “tubes,” making video representation learning data‑efficient.
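The appearance/motion split is easy to see in code. Below is a toy two‑stream classifier in PyTorch, using frame differences as a crude stand‑in for optical flow; the layer sizes, clip shape, and class count are illustrative assumptions, not any published architecture.

```python
# Toy "two-stream" video classifier: one tiny CNN for appearance (a middle
# frame), one for motion (mean frame difference), fused late.
import torch
import torch.nn as nn

class TinyStream(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
    def forward(self, x):
        return self.net(x)

class TwoStreamClassifier(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.appearance = TinyStream(3)   # RGB of a single (middle) frame
        self.motion = TinyStream(3)       # frame difference as a flow proxy
        self.head = nn.Linear(64, num_classes)
    def forward(self, clip):              # clip: (B, T, 3, H, W)
        mid = clip[:, clip.shape[1] // 2]            # appearance input
        diff = (clip[:, 1:] - clip[:, :-1]).mean(1)  # crude motion input
        feats = torch.cat([self.appearance(mid), self.motion(diff)], dim=1)
        return self.head(feats)

clip = torch.randn(2, 8, 3, 64, 64)       # 2 clips, 8 frames each
print(TwoStreamClassifier()(clip).shape)  # -> torch.Size([2, 10])
```

Swapping the crude frame‑difference stream for real optical flow, or the late concatenation for attention over space‑time, is exactly the design axis the architectures above explore.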
Recognizing movement (the “verb” of the world) goes beyond pixels:
- Skeleton/pose pipelines estimate 2D/3D keypoints (e.g., OpenPose, BlazePose) and run spatial‑temporal graph neural nets (ST‑GCN) or transformers on joint trajectories for action understanding; a simplified sketch follows this list.
- RGB‑D and multimodal fusion exploit depth sensors to disambiguate occlusions and infer contact, improving action recognition and manipulation.
- Event cameras (DVS) capture intensity changes at microsecond latencies—ideal for high‑speed motion, low power, and reduced blur—promising for agile control.
- SLAM & VIO (visual‑inertial odometry) provide a persistent 3D map and state estimate so robots can plan and act coherently over time and space.
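Here is the simplified skeleton‑route sketch referenced above: temporal convolutions over 2‑D keypoint trajectories stand in for a full ST‑GCN, and the joint count, layer widths, and class count are assumptions for illustration.

```python
# Skeleton-based action recognition sketch: classify a clip from its
# sequence of 2-D joint positions rather than from raw pixels.
import torch
import torch.nn as nn

NUM_JOINTS, COORDS, NUM_CLASSES = 17, 2, 5     # e.g. a COCO-style 17-keypoint skeleton

class SkeletonActionNet(nn.Module):
    def __init__(self):
        super().__init__()
        in_ch = NUM_JOINTS * COORDS                # flatten joints per frame
        self.temporal = nn.Sequential(
            nn.Conv1d(in_ch, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, NUM_CLASSES)

    def forward(self, poses):                      # poses: (B, T, J, 2)
        b, t, j, c = poses.shape
        x = poses.reshape(b, t, j * c).transpose(1, 2)   # (B, J*2, T)
        return self.head(self.temporal(x))

poses = torch.randn(4, 30, NUM_JOINTS, COORDS)     # 4 clips, 30 frames each
print(SkeletonActionNet()(poses).shape)            # -> torch.Size([4, 5])
```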
Put simply: images tell you what is; video tells you what is happening and what will happen next. Robotics needs the latter.
From industry to homes: usable humanoids arrive in stages
- Warehouses & logistics. Agility Robotics’ Digit is already piloted in commercial facilities (e.g., GXO; trials with Amazon), handling tote moves and back‑of‑house tasks—mundane, repeatable, valuable.
- Factory assistants. Startups like Figure and 1X are integrating foundation models and teleoperation to bootstrap dexterity, while NVIDIA’s Isaac GR00T family aims to standardize VLA brains for generalist humanoids.
- Consumer‑adjacent pilots. Tesla’s Optimus is showcased as a path to high‑volume humanoids; plans are ambitious, and timelines remain contested, but the strategic bet—vision‑driven learning at scale—is clear.
The through‑line: progress depends on movement understanding, on‑device inference, and high‑rate control—not just bigger language models.
The new software stack: Vision‑Language‑Action (VLA)
VLA models marry perception (vision), instruction following (language), and low‑level control (action). They translate “Put the red cup on the top shelf” into grasp pose, trajectory, and force profiles—then adjust when reality deviates.
- Surveys now map the field; taxonomies propose “action tokens” that move from language to code to trajectories.
- NVIDIA’s Isaac GR00T N1/N1.5 positions itself as an open, extensible robot foundation model for humanoids, tightly coupled to simulation and synthetic data generation—critical for safety and coverage.
VLA is how we bridge LLM intent to millisecond‑level action—the missing rung between words and world.
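To make the “action token” idea concrete, here is a schematic sketch: a stand‑in policy emits one discrete token per action dimension, which is then decoded into a continuous end‑effector command. The bin count, action ranges, and the hypothetical `fake_policy` are assumptions; a real VLA would replace `fake_policy` with a trained vision‑language model.

```python
# Schematic VLA interface: image + instruction -> discrete action tokens ->
# continuous end-effector deltas. All constants are illustrative assumptions.
import numpy as np

NUM_BINS = 256
ACTION_DIMS = ["dx", "dy", "dz", "droll", "dpitch", "dyaw", "gripper"]
ACTION_RANGE = (-0.05, 0.05)        # metres / radians per step (assumed)

def decode_tokens(tokens):
    """Map one discrete token per action dimension back to a continuous value."""
    lo, hi = ACTION_RANGE
    return {dim: lo + (hi - lo) * tok / (NUM_BINS - 1)
            for dim, tok in zip(ACTION_DIMS, tokens)}

def fake_policy(image, instruction):
    """Stand-in for a trained VLA: returns one token per action dimension."""
    rng = np.random.default_rng(hash(instruction) % 2**32)
    return rng.integers(0, NUM_BINS, size=len(ACTION_DIMS))

image = np.zeros((224, 224, 3), dtype=np.uint8)
tokens = fake_policy(image, "put the red cup on the top shelf")
print(decode_tokens(tokens))        # small per-step end-effector deltas
```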
Why “I am robot” is the right mental model
The arc of AI is bending from disembodied intelligence (text‑only) to embodied competence (perceive‑decide‑act). In this paradigm:
- Cloud AI trains the priors and coordinates fleets (think: policy updates, global memory).
- Edge AI runs the reflexes at ~1 kHz, plus local planning at lower rates, to keep robots balanced, safe, and useful (a schematic follows this list).
- Sensing & mapping give continuity (SLAM/VIO), while action recognition and VLA turn perception into behavior.
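A schematic of that cloud/edge division of labor, as noted in the list above: the edge agent keeps its control loop ticking while occasionally swapping in a policy update “pushed” from the cloud. The cadence, tick rate, and toy policies are all illustrative assumptions.

```python
# The fast path (reflexes) never waits on the slow path (fleet learning).
import itertools

def policy_v1(obs):          # stand-in for the currently deployed on-robot policy
    return 0.0

def policy_v2(obs):          # stand-in for a retrained policy pushed from the cloud
    return 0.1

CONTROL_HZ = 1_000
UPDATE_CHECK_EVERY = 5 * CONTROL_HZ              # poll for updates every ~5 s of ticks
incoming_updates = itertools.chain([None, policy_v2], itertools.repeat(None))

policy = policy_v1
for tick in range(15_000):                       # ~15 s of simulated control ticks
    if tick % UPDATE_CHECK_EVERY == 0:           # slow path: check for a new policy
        new_policy = next(incoming_updates)
        if new_policy is not None:
            policy = new_policy
            print(f"tick {tick}: swapped in {policy.__name__}")
    action = policy(obs=None)                    # fast path: reflexes keep running
```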
This is not speculative; it is already moving markets. Oracle’s surge on the reported $300 billion OpenAI deal, the bloom of “AI factories,” and the proliferation of humanoid pilots in logistics are all early signals that the next platform transition is here: software becoming bodies.
Implications—for operators, regulators, and investors
- Capex shifts to AI infrastructure. Data centers and edge compute for robots will dominate balance sheets. Suppliers of GPUs, sensors, batteries, and actuators will be strategic choke points.
- Safety and liability move center‑stage. Regulators will require transparent control stacks, telemetry, and fail‑safes for robots operating near people.
- Data is the new supply chain. Synthetic data, simulation fidelity, and fleet learning become competitive moats.
- Standards will crystallize. Expect de facto standards around VLA interfaces, safety cases, and robot ops—analogous to the cloud APIs of the 2010s.
Conclusion
LLMs gave us fluent, generalizable brains. The frontier now is to grant those brains eyes, hands, and balance—and to run them where the physics is: on robots. Ellison’s ascent on the back of record AI infrastructure spending is a reminder that value accrues where bits meet atoms. The future is not only “large language models—what next?” It is “I am robot.”