Origin of Grok 3 - the Colossus Data Center in Memphis
How Musk leapfrogged the LLM competition by procuring 200,000 NVIDIA H100 GPUs and building a cutting-edge data center in Memphis
Here’s the story of how xAI built the Memphis data center to train Grok 3, pieced together from what’s known about the project as of February 19, 2025.
The Reasoning for Building the Data Center
xAI, driven by Elon Musk’s ambition to rival AI giants like OpenAI, needed a massive leap in computational power to train Grok 3, touted as the “smartest AI on Earth.” Grok 2 had been trained on a comparatively modest 8,000 GPUs, but Grok 3 demanded a scale-up to tackle complex math, science, and coding tasks, aiming to outpace competitors like ChatGPT and DeepSeek. The goal was clear: create a model capable of real-time reasoning and self-correction, requiring not just more GPUs but a tightly integrated, high-speed cluster. This wasn’t something they could piece together from scattered capacity; they needed a dedicated, monstrous supercomputer, and they needed it fast to stay on Musk’s aggressive timeline for a December 2024 release (later adjusted to February 2025).
Evaluating Cloud Providers
Initially, xAI explored partnering with existing cloud providers like Oracle, which had supplied GPU capacity for earlier Grok models. Musk reportedly negotiated a $10 billion deal for GPU clusters with Oracle, but talks fell apart by mid-2024. The sticking point? Time. Cloud providers quoted 18–24 months to provision and interconnect the 100,000+ GPUs xAI demanded—a delay that would’ve killed their timeline. Even with Oracle’s 16,000 H100 GPUs already in use by xAI, scaling to the level Grok 3 required meant custom infrastructure beyond what cloud vendors could deliver quickly. Frustrated, Musk pivoted: xAI would build its own data center, controlling every aspect from hardware to deployment speed.
Site Selection: Why Memphis?
Memphis wasn’t a random choice; it was a calculated move for speed and practicality. xAI scouted locations and landed on a 785,000-square-foot former Electrolux factory in southwest Memphis, a site real-estate firm Phoenix Investors had acquired after Electrolux pulled out. The site offered a ready-made shell with industrial zoning, bypassing the 18–24-month construction timeline of a new build. Memphis also had logistical perks: access to the Tennessee Valley Authority (TVA) power grid and the Memphis Aquifer for cooling, plus a pro-business local government eager to fast-track approvals via the Greater Memphis Chamber. Musk’s team saw a chance to flip a vacant factory into a “Gigafactory of Compute” in months, not years. The decision was finalized by June 2024, with construction starting almost immediately.
Choosing H100 GPUs and Procurement Challenges
The NVIDIA H100 GPU was the obvious pick: each chip delivers up to 4 PFLOPS of FP8 performance (with sparsity), backed by 80 GB of HBM3 memory and 3.35 TB/s of bandwidth in the SXM form factor xAI deployed, perfect for the dense, parallel workloads of Grok 3’s training. xAI initially aimed for 65,000 H100s, a number floated in early plans that balanced ambition against what NVIDIA could realistically supply amid global shortages. Procurement wasn’t smooth sailing: H100s were in high demand, with Meta snapping up 350,000 that year and Tesla diverting 12,000 originally slated for its own projects to xAI. Musk leaned on his NVIDIA ties (and those diverted Tesla shipments) to secure the initial batch, but cosmic-ray bit flips, mismatched BIOS firmware, and network cabling issues plagued early testing. Still, the first 65,000 GPUs came online by July 22, 2024; the cluster, dubbed “Colossus,” was stood up in just 122 days from start to finish.
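To put the per-chip spec in perspective, here is a quick back-of-the-envelope sketch of aggregate peak compute at each stage of the buildout. It assumes the ~4 PFLOPS sparse-FP8 peak figure per H100; sustained training throughput is far lower than peak.

```python
# Back-of-the-envelope aggregate peak compute for Colossus at each stage.
# Assumes ~4 PFLOPS of sparse FP8 per H100 (a peak spec, not sustained).
PEAK_FP8_PFLOPS = 4.0

for gpus in (65_000, 100_000, 200_000):
    exaflops = gpus * PEAK_FP8_PFLOPS / 1_000  # 1 EFLOPS = 1,000 PFLOPS
    print(f"{gpus:>7,} GPUs -> ~{exaflops:,.0f} EFLOPS peak FP8")
```

Even the initial 65,000-GPU batch lands around 260 EFLOPS of peak sparse FP8, which is why interconnect quality, not raw chip count, becomes the limiting factor at this scale.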
Expansion to 100,000 and 200,000 GPUs
The jump from 65,000 to 100,000 GPUs came fast—by September 2024, Colossus hit that mark, driven by Musk’s realization that Grok 3’s complexity outstripped initial estimates. The Reasoning Beta mode, with its “Big Brain” feature, demanded more compute to refine self-correcting algorithms. Doubling to 200,000 GPUs by December 2024 was announced at the Memphis Chamber’s Chairmen’s Luncheon, fueled by xAI’s $6 billion funding round and a vision to dwarf rivals. This expansion wasn’t just about scale—it aimed to make Colossus the “most powerful AI training cluster in the world,” syncing all GPUs via NVIDIA’s Spectrum-X Ethernet for unmatched data throughput. Plans for 1 million GPUs by 2026 were teased, signaling xAI’s long-term bet on exponential growth.
Electrical and Cooling Needs
Powering 200,000 H100s was a beast of a challenge. Each GPU consumes up to 700 W at peak, so 100,000 GPUs alone demanded 70 MW, ballooning to roughly 250 MW once servers, networking, and cooling overhead are counted: enough for a small city. Memphis started with 8 MW from an existing substation, scaled to 50 MW by August 2024 via upgrades by Memphis Light, Gas and Water (MLGW) costing $760,000. xAI then invested $24 million in a new substation for 150 MW, buffered by 14 VoltaGrid natural gas generators (35 MW total) and $38 million of Tesla Megapacks to handle surges. Cooling was trickier: 200,000 GPUs needed a custom liquid-cooling system, not fans. xAI rented a quarter of the U.S.’s mobile cooling units early on, then built a closed-loop setup pulling 1.3 million gallons daily from the aquifer (a planned $78 million wastewater recycling plant aims to cut this reliance). Locals raised concerns about grid strain and emissions, but xAI pressed on.
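The power math above is simple multiplication, sketched here for the GPU draw alone using the ~700 W peak figure; the roughly 250 MW site total reflects servers, networking, and cooling on top of this.

```python
# GPU-only power draw at ~700 W peak per H100. Site totals run far higher
# once server, networking, and cooling overhead are added.
GPU_PEAK_WATTS = 700

for gpus in (100_000, 200_000):
    megawatts = gpus * GPU_PEAK_WATTS / 1e6  # 1 MW = 1,000,000 W
    print(f"{gpus:>7,} GPUs -> {megawatts:.0f} MW of GPU draw alone")
```

At 200,000 GPUs the chips alone pull 140 MW, which is why the 150 MW substation needed gas-turbine and battery buffering on top of grid supply.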
Comparison to ChatGPT and DeepSeek Data Centers
Colossus dwarfs most known LLM training setups. OpenAI’s GPT-3 consumed roughly 3 million GPU-hours on a cluster of about 10,000 V100s, and GPT-4 reportedly trained on some 25,000 A100s across Microsoft’s Azure clusters: big, but not a single, cohesive 200,000-GPU beast. DeepSeek V3 trained on 2,048 H800s for a reported $5.5 million, and R1 might’ve used 50,000 GPUs: efficient, but a fraction of Colossus’s scale. Meta’s Llama 3.1 (405B parameters) took 31 million GPU-hours on ~20,000 H100s. At 200,000 GPUs, Colossus’s 250 MW draw and an estimated 200 million GPU-hours for Grok 3 outstrip these runs by anywhere from roughly six times (Llama 3.1) to nearly seventy times (GPT-3), though its $1 billion-plus cost contrasts with DeepSeek’s lean efficiency. Size-wise, it’s closer to the Frontier supercomputer (~37,000 GPUs in roughly 7,300 sq ft) than to typical AI clusters, but its single-site density is unmatched.
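The scale gap can be made concrete with the GPU-hour estimates quoted above. These are rough public figures, not official disclosures, so treat the ratios as illustrative.

```python
# GPU-hours = GPU count x wall-clock training hours. Values are the rough
# public estimates quoted in the text, in millions of GPU-hours.
gpu_hours_millions = {
    "GPT-3": 3,
    "Llama 3.1 405B": 31,
    "Grok 3": 200,
}

grok = gpu_hours_millions["Grok 3"]
for model, hours in gpu_hours_millions.items():
    ratio = grok / hours  # how many times smaller this run was than Grok 3
    print(f"{model:<15} {hours:>4}M GPU-hours ({ratio:.1f}x vs Grok 3)")
```

By these estimates Grok 3 used about 6.5 times the compute of Llama 3.1 405B and nearly 67 times that of GPT-3: a big jump, though within an order of magnitude of the largest prior open run.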
From a factory shell to a 200,000-GPU titan in under a year, Memphis became xAI’s bold answer to the AI race—a story of speed, scale, and stubborn ingenuity.