nvidia/Llama-3.1-Nemotron-70B-Instruct-HF · Any way I can run it on my low-mid tier HP desktop? Specs attached as a .png. BTW, I know it's probably a long shot.

vgrowhouse

about 18 hours ago

vgrowhouse

about 18 hours ago

100% stock, no upgraded RAM. Also, if you are reading this, could my old GTS 450 run this?

Noxi-V

about 10 hours ago

The simple answer is no. It's like trying to fit a train in a car, or rather, on a bike.
It's on HuggingChat, so use it there instead.

iiBLACKii

about 10 hours ago

WTF, are you running it on a system like that, lol? Bro, you can't even run it on Kaggle or Colab (the best freely available notebooks).

tarruda

about 2 hours ago

A refurbished Mac Studio M1 Ultra with 128 GB RAM can be found on eBay for $2.5k-$3k and can run 70B models at q8 at ~7.5 tokens/sec, which IMO is perfect for chatting (slightly above my reading speed). With up to 8k tokens in the context it is still OK at ~5 tokens/sec.
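For a rough sense of why 128 GB is enough for q8 while a stock desktop is not, here is a back-of-the-envelope sketch of the memory needed for the weights alone. The bits-per-weight values are nominal assumptions (real quantized files carry extra overhead for scales and metadata), 70e9 is a rounded parameter count, and `weight_memory_gb` is a hypothetical helper name, not part of any library.

```python
# Rough estimate of the memory needed just to hold the model weights.
# Assumption: nominal bits per weight; real quantized formats add some overhead.

def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Return the approximate weight size in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 70e9  # "70B", rounded; the exact parameter count differs slightly

for name, bits in [("fp16", 16), ("q8", 8), ("q4", 4)]:
    print(f"{name}: ~{weight_memory_gb(N_PARAMS, bits):.0f} GB")
```

Under these assumptions q8 needs roughly 70 GB for weights before any context is allocated, which fits in 128 GB of unified memory but is far beyond a typical low-mid tier desktop.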

It can also fit a 64k context in VRAM if you mess around with iogpu.wired_limit_mb (which raises the max VRAM allocation), but with 32k tokens in the context the speed drops to around 2 tokens/sec, which is not good for interactive chat but still usable if you are not in a rush (e.g., ask it to summarize a big document and go for a walk).
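To see roughly how much extra memory a long context takes, here is a minimal sketch of a KV cache estimate. The architecture numbers (80 layers, 8 KV heads, head dimension 128) are my assumption based on the Llama 3.1 70B config, the cache is assumed to be stored in fp16, and `kv_cache_gib` is a hypothetical helper name.

```python
# Rough KV cache size for a given context length.
# Assumptions: Llama-3.1-70B-style shape (80 layers, 8 KV heads via GQA,
# head_dim 128) and an fp16 cache (2 bytes per element).

def kv_cache_gib(n_tokens: int, n_layers: int = 80, n_kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """Return the approximate KV cache size in GiB."""
    # Factor of 2 because both keys and values are cached for every layer.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return n_tokens * bytes_per_token / 2**30

for ctx in (8 * 1024, 32 * 1024, 64 * 1024):
    print(f"{ctx} tokens: ~{kv_cache_gib(ctx):.1f} GiB")
```

With these assumptions a 64k context adds about 20 GiB on top of the q8 weights, which is consistent with needing to raise iogpu.wired_limit_mb before everything fits in VRAM.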