Comments by "Mikko Rantalainen" (@MikkoRantalainen) on "NetworkChuck" channel.
15:30 Testing a 1B model on a 64 GB system is like asking a smartphone what 1+1 is. The 1B model is designed to run on your phone, not on a real computer. Even a 70B model can run on a single one of these Studio Macs, even if you only had 64 GB RAM per box. You should be trying to run the true non-distilled DeepSeek if you're going to purchase 5 Studio Macs for your AI needs. You would need to use a quantized Q3 model to fit the whole DeepSeek model on these 5 boxes (getting the 128 GB RAM models would have been a much better idea for AI loads).
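A back-of-the-envelope check of the Q3 claim (my assumptions, not the commenter's: full DeepSeek-R1 has 671B parameters and Q3 averages roughly 3.5 bits per weight):

```python
# Rough sizing: does full DeepSeek fit on 5 x 64 GB boxes at Q3?
# Assumed numbers: 671e9 parameters, Q3 ~ 3.5 bits/weight on average.
params = 671e9
bits_per_weight = 3.5
model_gb = params * bits_per_weight / 8 / 1e9   # bits -> bytes -> GB
total_ram_gb = 5 * 64

print(f"Q3 model: ~{model_gb:.0f} GB, cluster RAM: {total_ram_gb} GB")
# ~294 GB of weights in 320 GB of RAM: it fits, but with very little
# headroom for KV cache and the OS; 5 x 128 GB would be comfortable.
```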
If you plan to do something like this, get as much RAM as possible per box, because the interconnect is going to be the limiting factor, and with more RAM per box you need to transfer less data over the interconnect.
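The RAM-vs-interconnect trade-off in rough numbers (hypothetical figures: a ~300 GB quantized model split layer-wise across boxes, so each token crosses the network once per box boundary):

```python
import math

# More RAM per box -> fewer boxes -> fewer trips over the slow interconnect.
model_gb = 300  # hypothetical quantized model size

for ram_gb in (64, 128):
    boxes = math.ceil(model_gb / ram_gb)
    hops = boxes - 1  # layer-split: one interconnect hop per box boundary
    print(f"{ram_gb} GB/box -> {boxes} boxes, {hops} interconnect hops per token")
```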
If your 405B model only takes 200 GB to download, you're downloading a quantized Q4 model instead of the full FP16 model, which should take about 800 GB for a 405B model.
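The download size alone reveals the quantization: divide bytes by parameter count. A quick sketch of that arithmetic:

```python
# Bytes per parameter tells you the quantization level.
params = 405e9
download_gb = 200

bytes_per_param = download_gb * 1e9 / params  # ~0.49 B/param ~= 4 bits -> Q4
fp16_gb = params * 2 / 1e9                    # FP16 = 2 bytes/param -> ~810 GB

print(f"~{bytes_per_param:.2f} B/param; full FP16 would be ~{fp16_gb:.0f} GB")
```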
Get a Linux PC with 256 GB RAM and an RTX 3090 or 4090; with llama.cpp you can use --gpu-layers N to run distilled models with N layers on the GPU and the rest on the CPU. The result would probably be faster and cheaper than this 5-box cluster. And if you just want to run a 70B model, get an RTX 5090 and run a quantized Q3 model entirely on the GPU. Not cheap, but still much cheaper than this set of 5 Macs.
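A rough way to pick N for --gpu-layers: estimate GB per layer and see how many layers fit in VRAM. All numbers below are hypothetical (a 70B-class model with 80 layers at ~40 GB in Q4, on a 24 GB card):

```python
# Hypothetical 70B-class model: 80 layers, ~40 GB total at Q4.
layers = 80
model_gb = 40
vram_gb = 24  # RTX 3090 / 4090

gb_per_layer = model_gb / layers
# Reserve ~10% of VRAM for KV cache and activations.
gpu_layers = min(int(vram_gb * 0.9 / gb_per_layer), layers)

print(f"--gpu-layers {gpu_layers}")  # the rest of the layers run on the CPU
```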
10:00 Real AI clusters use 400 Gbps network links per node. Since you had 4 other machines, each machine would need 4 x 400 Gbps links to the switch to be comparable with real AI clusters when it comes to interconnect technology (and you would obviously need a 20 x 400 Gbps switch for the cluster network alone).
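The port count follows directly from the topology the comment describes (one dedicated 400 Gbps link from each node to every other node, all terminated on one switch):

```python
# One 400 Gbps link per peer, per node, all landing on a single switch.
nodes = 5
links_per_node = nodes - 1       # 4 x 400 Gbps links on each machine
switch_ports = nodes * links_per_node

print(f"{switch_ports} x 400 Gbps switch ports needed")
```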