CUDA Proves Nvidia Is a Software Company

Forgive me for starting with a cliché, a piece of finance jargon that has recently slipped into the tech lexicon, but I’m afraid I must talk about “moats.” Popularized decades ago by Warren Buffett to refer to a company’s competitive advantage, the word found its way into Silicon Valley pitch decks when a memo purportedly leaked from Google, titled “We Have No Moat, and Neither Does OpenAI,” fretted that open-source AI would pillage Big Tech’s castle.

A few years on, the castle walls remain safe. Apart from a brief bout of panic when DeepSeek first appeared, open-source AI models have not overtaken proprietary ones. Still, none of the frontier labs—OpenAI, Anthropic, Google—has a moat to speak of.

The company that does have a moat is Nvidia. CEO Jensen Huang has called it his most precious “treasure.” It is not, as you might assume for a chip company, a piece of hardware. It’s something called CUDA. What sounds like a chemical compound banned by the FDA may be the one true moat in AI.

CUDA technically stands for Compute Unified Device Architecture, but much like laser or scuba, no one bothers to expand the acronym; we just say “KOO-duh.” So what is this all-important treasure good for? If forced to give a one-word answer: parallelization.

Here’s a simple example. Let’s say we task a machine with filling out a 9×9 multiplication table. Using a computer with a single core, all 81 operations are executed dutifully one by one. But a GPU with nine cores can assign tasks so that each core takes a different column—one from 1×1 to 1×9, another from 2×1 to 2×9, and so on—for a ninefold speed gain. Modern GPUs can be even cleverer. For example, if programmed to recognize commutativity—7×9 = 9×7—they can avoid duplicate work, reducing 81 operations to 45, nearly halving the workload. When a single training run costs a hundred million dollars, every optimization counts.
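If you’re curious what that division of labor looks like on the page, here is a minimal sketch of my own in CUDA—the kernel name and layout are illustrative, not anything Nvidia ships—with one thread per column, so all nine columns are filled at once:

```cuda
// Minimal sketch (illustrative, not production code): one CUDA thread per
// column of the 9x9 table, so the nine columns are computed in parallel.
#include <cstdio>

__global__ void fill_table(int *table) {
    int col = threadIdx.x + 1;                     // columns 1..9, one per thread
    for (int row = 1; row <= 9; ++row) {
        table[(row - 1) * 9 + (col - 1)] = row * col;
    }
}

int main() {
    int *table;
    cudaMallocManaged(&table, 81 * sizeof(int));   // unified memory, for simplicity
    fill_table<<<1, 9>>>(table);                   // one block, nine threads
    cudaDeviceSynchronize();
    printf("7 x 9 = %d\n", table[6 * 9 + 8]);      // row 7, column 9
    cudaFree(table);
    return 0;
}
```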

Nvidia’s GPUs were originally built to render graphics for video games. In the early 2000s, a Stanford PhD student named Ian Buck, who first got into GPUs as a gamer, realized their architecture could be repurposed for general high-performance computing. He created a programming language called Brook, was hired by Nvidia, and, with John Nickolls, led the development of CUDA. If AI ushers in the age of a permanent white-collar underclass and autonomous weapons, just know that it would all be because someone somewhere playing Doom thought a demon’s scrotum should jiggle at 60 frames per second.

CUDA is not a programming language in itself but a “platform.” I use that weasel word because, not unlike how The New York Times is a newspaper that’s also a gaming company, CUDA has, over the years, become a nested bundle of software libraries for AI. Each function shaves nanoseconds off single mathematical operations—added up, they make GPUs, in industry parlance, go brrr.

A modern graphics card is not just a circuit board crammed with chips and memory and fans. It’s an elaborate confection of cache hierarchies and specialized units called “tensor cores” and “streaming multiprocessors.” In that sense, what chip companies sell is like a professional kitchen, and more cores are akin to more grilling stations. But even a kitchen with 30 grilling stations won’t run any faster without a capable head chef deftly assigning tasks—as CUDA does for GPU cores.

To extend the metaphor, hand-tuned CUDA libraries optimized for one matrix operation are the equivalent of kitchen tools designed for a single job and nothing more—a cherry pitter, a shrimp deveiner—which are indulgences for home cooks but not if you have 10,000 shrimp guts to yank out. Which brings us back to DeepSeek. Its engineers went below this already deep layer of abstraction to work directly in PTX, a kind of assembly language for Nvidia GPUs. Let’s say the task is peeling garlic. An unoptimized GPU would go: “Peel the skin with your fingernails.” CUDA can instruct: “Smash the clove with the flat of a knife.” PTX lets you dictate every sub-instruction: “Lift the blade 2.35 inches above the cutting board, make it parallel to the clove’s equator, and strike downward with your palm at a force of 36.2 newtons.”
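To make the difference concrete: CUDA lets you embed PTX directly in a kernel through inline assembly. A rough sketch of my own (the kernel and variable names are illustrative) of the same multiply-add written at both levels:

```cuda
// Sketch only: the same multiply-add at two levels of abstraction.
__global__ void scale_and_add(float *out, const float *a, const float *b,
                              float c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // The CUDA way: write the math and let the compiler pick the instruction.
    // out[i] = a[i] * b[i] + c;

    // The PTX way: spell out the fused multiply-add yourself.
    float d;
    asm("fma.rn.f32 %0, %1, %2, %3;" : "=f"(d) : "f"(a[i]), "f"(b[i]), "f"(c));
    out[i] = d;
}
```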

You can begin to see why CUDA is so valuable to Nvidia—and so hard for anyone else to touch. Tuning GPU performance is a gnarly problem. You can’t just conscript some tenderfoot undergrad on Market Street, hand them a Claude Max plan, and expect them to hack GPU kernels. Writing at this level is a grind—unless you’re a crackerjack programmer at DeepSeek.

A disclosure: In previous Machine Readable columns, I was already familiar with the languages I was analyzing. Not so here. In the interest of maintaining this standard, I decided to spend a day with CUDA. It ruined my afternoon.

A simple matrix multiplication that usually takes me three lines in PyTorch—a popular machine-learning framework—took me 50-plus lines in CUDA. Wringing out the last drop of performance, it turns out, is an admirable but tedious business. Having dipped my toe in the moat, I can report that it is indeed deep and forbidding.
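For a sense of the gap, here is roughly what the kernel half of that exercise looks like—a sketch rather than the exact code I wrote; the host-side boilerplate of allocations, copies, the launch, and cleanup is what pushes the full program past 50 lines:

```cuda
// Sketch of a naive CUDA matrix multiply: C = A * B, all square N x N.
// This is only the kernel; the host code (cudaMalloc, cudaMemcpy, the launch,
// cudaDeviceSynchronize, cudaFree) roughly doubles the line count again.
__global__ void matmul(const float *A, const float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N || col >= N) return;

    float sum = 0.0f;
    for (int k = 0; k < N; ++k) {
        sum += A[row * N + k] * B[k * N + col];    // dot product of row and column
    }
    C[row * N + col] = sum;
}

// Illustrative launch: 16x16 threads per block, enough blocks to cover the matrix.
// dim3 threads(16, 16);
// dim3 blocks((N + 15) / 16, (N + 15) / 16);
// matmul<<<blocks, threads>>>(d_A, d_B, d_C, N);
```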

CUDA’s dominance is built not just on the quality of its ecosystem but on a lock-in effect. Because modern machine-learning frameworks are built on CUDA, which crucially runs on Nvidia chips, AMD’s chips underperform even when they have more cores and memory. Comparing chips by spec sheets is like comparing race cars by cylinder count, whereas real performance can only be measured on the track.

A second disclosure: I intended to benchmark two chips, but there was no way to expense an Nvidia H100 and an AMD MI300X without landing on Condé Nast’s blacklist. Instead, you will have to take the word of independent researchers who found that even with better specs on paper, AMD was outmatched by Nvidia.

Nvidia’s edge in software may stem from the fact that, unusually for a chip company, it employs more software engineers than hardware engineers. If I were running AMD, I might follow suit. (But who’s asking me?)

Every year, there are new hopefuls attempting to drain Nvidia’s moat, only to drown in it. OpenCL, an open standard backed by a consortium that included Apple, AMD, and Qualcomm, was a kind of Android manqué to CUDA’s iOS. It barely gained traction.

Meanwhile, AMD’s answer to CUDA, ROCm, has an even worse name than CUDA—is it pronounced “rock cum”? (Forget about hiring more programmers—get a new marketing team.) It has also been so plagued by bugs and compatibility issues that its subreddit reads like a support group.

Let’s not forget Intel. While it’s easy to brush it off as a failing chipmaker, its recent history reveals it’s also a failing software company. In a last-gasp bid for relevance, it launched oneAPI, but as of 2026, CUDA still reigns. If there’s any challenger, it’s Modular, led by Chris Lattner, the legendary language designer who counts among his creations Apple’s Swift and LLVM.

But the open secret is that, much as theoretical physicists cannot change a tire to save their lives, most AI researchers can’t so much as write a single line of C++. There are very few good GPU kernel engineers alive, and many of them are employed by Nvidia. Long before AI researchers started trafficking in clout, these engineers were diligently working on CUDA without kudos. Even trusty coding agents still hobble through kernel code.

Nvidia, in the end, may be closer to Apple than to AMD or Intel. It’s a great hardware company because it’s a software company. Apple’s moat against Android was never just the iPhone but the ecosystem: iOS, the App Store, and its developers. Sure, you can fold a Samsung Galaxy in half, but do you really want to use Samsung Pay? In the meantime, the industry will have to live with Nvidia’s offensive price tags.

This is the first of a three-part Machine Readable series on AI-enabling languages.