Alan Siefert
npub17fn…j4zs
2025-01-28 23:11:21


Nvidia’s upcoming Digits computer seems similar in specs to the M4 Ultra at a lower price ($3k). We’re definitely going to see the price of consumer hardware that’s decent for AI come down overall!

Market close: $NVDA: -16.91% | $AAPL: +3.21%

Why is DeepSeek great for Apple?

Here's a breakdown of the chips on the market right now that can run DeepSeek V3 and R1:
NVIDIA H100: 80GB @ 3TB/s, $25,000, $312.50 per GB
AMD MI300X: 192GB @ 5.3TB/s, $20,000, $104.17 per GB
Apple M2 Ultra: 192GB @ 800GB/s, $5,000, $26.04(!!) per GB

Apple's M2 Ultra (released in June 2023) is 4x more cost efficient per unit of memory than AMD MI300X and 12x more cost efficient than NVIDIA H100!
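
To sanity-check those ratios, here's the same cost-per-GB arithmetic as a quick Python sketch (prices and capacities as quoted above):

```python
# Cost per GB of memory for each chip, using the list prices quoted above.
chips = {
    "NVIDIA H100":    {"memory_gb": 80,  "price_usd": 25_000},
    "AMD MI300X":     {"memory_gb": 192, "price_usd": 20_000},
    "Apple M2 Ultra": {"memory_gb": 192, "price_usd": 5_000},
}

cost_per_gb = {name: c["price_usd"] / c["memory_gb"] for name, c in chips.items()}
for name, cpg in cost_per_gb.items():
    print(f"{name}: ${cpg:.2f}/GB")

# Ratios relative to the M2 Ultra
m2 = cost_per_gb["Apple M2 Ultra"]
print(f"MI300X / M2 Ultra: {cost_per_gb['AMD MI300X'] / m2:.1f}x")   # ~4x
print(f"H100   / M2 Ultra: {cost_per_gb['NVIDIA H100'] / m2:.1f}x")  # ~12x
```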

Why is this relevant to DeepSeek?
DeepSeek V3/R1 are MoE models with 671B total parameters, but only 37B are active each time a token is generated. We don't know exactly which 37B will be active when we generate a token, so all 671B parameters need to be resident in fast memory.

We can't use normal system RAM because it's too slow to load the 37B active parameters for every token (we'd get <1 tok/sec). GPUs, on the other hand, have fast memory, but that memory is expensive. Apple Silicon, however, uses Unified Memory and UltraFusion to fuse dies, a tradeoff that favors a large amount of medium-fast memory at a much lower cost.
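
As a rough illustration of why bandwidth is the bottleneck (my own back-of-the-envelope numbers, not measurements): decode speed for a memory-bound model is capped at roughly memory bandwidth divided by the bytes of active weights read per token. Real throughput lands well below these ceilings once compute, KV cache, and expert-routing overhead are factored in; the DDR5 bandwidth figure here is an assumption for a typical dual-channel desktop.

```python
# Theoretical ceiling: tokens/sec ≈ memory bandwidth / bytes of active weights per token.
ACTIVE_PARAMS = 37e9      # DeepSeek V3/R1 active parameters per generated token
BYTES_PER_PARAM = 0.5     # 4-bit quantization ≈ 0.5 bytes per parameter

bytes_per_token = ACTIVE_PARAMS * BYTES_PER_PARAM   # ~18.5 GB read per token

bandwidths_gb_per_s = {
    "Dual-channel DDR5 (assumed ~90 GB/s)": 90,
    "Apple M2 Ultra (800 GB/s)": 800,
    "AMD MI300X (5.3 TB/s)": 5300,
}

for name, bw in bandwidths_gb_per_s.items():
    ceiling = (bw * 1e9) / bytes_per_token
    print(f"{name}: ~{ceiling:.0f} tok/sec ceiling")
```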

Unified memory shares a single pool of memory between the CPU and GPU rather than giving each its own, so there's no need to copy data back and forth between them.
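
As a concrete illustration, here's a minimal sketch using MLX (Apple's array framework, discussed further below): because arrays live in one shared pool, the same buffers can be used by operations running on either the CPU or the GPU, with no copies between device memories.

```python
import mlx.core as mx

# Arrays are allocated once in unified memory.
a = mx.random.normal((4096, 4096))
b = mx.random.normal((4096, 4096))

# The same arrays can be consumed by GPU or CPU ops; no host/device copies.
c_gpu = mx.matmul(a, b, stream=mx.gpu)
c_cpu = mx.matmul(a, b, stream=mx.cpu)
mx.eval(c_gpu, c_cpu)  # MLX is lazy; force both computations to run
```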

UltraFusion is Apple's proprietary interconnect technology for connecting two dies with a very high-bandwidth, low-latency link (2.5TB/s). Apple's M2 Ultra is literally two M2 Max dies fused together with UltraFusion. This is what enables Apple to reach such a large amount of memory (192GB) and memory bandwidth (800GB/s).

The Apple M4 Ultra is rumored to use the same UltraFusion technology to fuse together two M4 Max dies. This would give the M4 Ultra 256GB(!!) of unified memory @ 1146GB/s. Two of these could run DeepSeek V3/R1 (4-bit) at 57 tok/sec.
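
Sketching out that sizing claim with the rumored specs (rough estimates on my part, not measurements):

```python
# How many M4 Ultras are needed for DeepSeek V3/R1 at 4-bit, and the per-machine
# bandwidth-bound decode ceiling. Memory/bandwidth figures are the rumored specs above.
TOTAL_PARAMS = 671e9
ACTIVE_PARAMS = 37e9
BYTES_PER_PARAM = 0.5            # 4-bit quantization
MEMORY_GB = 256                  # rumored M4 Ultra unified memory
BANDWIDTH_GB_PER_S = 1146        # rumored M4 Ultra memory bandwidth

weights_gb = TOTAL_PARAMS * BYTES_PER_PARAM / 1e9       # ≈ 336 GB of weights
machines = -(-weights_gb // MEMORY_GB)                  # ceiling division → 2
print(f"4-bit weights: ~{weights_gb:.0f} GB → {int(machines)} machines needed")

bytes_per_token = ACTIVE_PARAMS * BYTES_PER_PARAM       # ~18.5 GB read per token
ceiling = BANDWIDTH_GB_PER_S * 1e9 / bytes_per_token
print(f"Bandwidth-bound ceiling: ~{ceiling:.0f} tok/sec")  # ≈ 62 tok/sec
```

That ~62 tok/sec ceiling ignores interconnect and compute overhead, so it's in the same ballpark as the 57 tok/sec figure.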

On top of all this, Apple has managed to package it in a small form factor for consumers, with great power efficiency and great open-source software (uncharacteristic of Apple!). MLX has made it possible to leverage Apple Silicon for ML workloads, and exolabs has made it possible to cluster multiple Apple Silicon devices to run large models, demonstrating DeepSeek R1 (671B) running on 7 M4 Mac Minis.
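
For a sense of what running a model locally with MLX looks like, here's a minimal sketch assuming the mlx-lm package (pip install mlx-lm) and some 4-bit MLX-format checkpoint already on disk or on the Hugging Face hub; the model path below is a placeholder, not a specific DeepSeek release.

```python
from mlx_lm import load, generate

# Placeholder path: substitute any 4-bit MLX-format checkpoint you want to run.
model, tokenizer = load("path/to/a-4bit-mlx-checkpoint")

text = generate(
    model,
    tokenizer,
    prompt="Explain mixture-of-experts models in one short paragraph.",
    max_tokens=200,
)
print(text)
```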

It's unclear who will build the best AI models, but it seems likely that AI will run on American hardware, on Apple Silicon.


