Daniel Wigton on Nostr:
Llama 3.3 70B is the most useful model that I have tried. Any smaller and you have to stick to very general knowledge; asking for niche knowledge from a small model is a recipe for hallucinations.
I use it both for conversational AI, to help me narrow down search terms, and for coding help via the Continue VS Code plugin. For autocomplete I use, I think, DeepSeek Coder.
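For illustration, here is a minimal sketch of the "narrow my search terms" use, assuming an Ollama-style local server on its default port; the server, port, and model tag are assumptions, since the actual serving setup behind the plugin may differ:

```python
# Ask a locally served model to suggest better search queries.
# Assumes an Ollama-compatible endpoint on localhost:11434 and that the
# tag "llama3.3:70b" is already pulled (both are assumptions).
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.3:70b",
        "prompt": "Give me three precise search queries for diagnosing DDR5 memory training failures.",
        "stream": False,
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["response"])
```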
The main drawback is that it is somewhat slow. I get 3.3 tokens per second, which is roughly equivalent to talking to someone who types 150 words per minute.
That is actually helpful: it is not so slow as to be intolerable, but not so instant that I stop trying to figure things out on my own first.
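That 150 wpm figure is just the usual rule of thumb of roughly 0.75 English words per token; a quick back-of-the-envelope check (the words-per-token ratio is an approximation, not something the model reports):

```python
# Convert generation speed to an equivalent typing speed.
# ~0.75 words per token is a rough average for English text (assumption).
tokens_per_second = 3.3
words_per_token = 0.75
words_per_minute = tokens_per_second * words_per_token * 60
print(round(words_per_minute))  # ~148, i.e. roughly a 150 wpm typist
```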
It does require some decent hardware though. I've got a 4090, a 13900K, and 64 GB of RAM running at 6400 MT/s, and that last number is key. The 4-bit quantization of Llama 3.3 is 42 GB; with 24 GB of VRAM, that leaves 18 GB of weights that have to be processed by the CPU for each token.
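A rough sketch of why the RAM speed is the limiting factor: every generated token has to stream the CPU-resident slice of the weights out of system RAM, so memory bandwidth caps the token rate (the dual-channel and ideal-bandwidth figures below are assumptions):

```python
# Each token reads the CPU-resident weights once, so system-RAM bandwidth
# sets a hard ceiling on throughput for the offloaded part of the model.
model_gb = 42                      # 4-bit quant of Llama 3.3 70B
vram_gb = 24                       # RTX 4090
cpu_gb = model_gb - vram_gb        # ~18 GB read from system RAM per token

# Dual-channel DDR5-6400: 2 channels x 8 bytes x 6400 MT/s (ideal figure)
ram_bandwidth_gbs = 2 * 8 * 6400 / 1000      # ~102.4 GB/s

ceiling_tps = ram_bandwidth_gbs / cpu_gb
print(f"ceiling ~ {ceiling_tps:.1f} tok/s")  # ~5.7 in theory; 3.3 observed
```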
The result is that the GPU is actually not doing much. You probably don't need a 4090, just as much VRAM as you can get. A 5090 with 32 GB of VRAM should manage about 6 tokens per second simply because only 10 GB would be left for the CPU to process.
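Same arithmetic behind that estimate, scaling the measured 3.3 tok/s by how much of the model would remain on the CPU side (a pure memory-traffic assumption that ignores everything else):

```python
# Scale measured throughput by the shrinking CPU-resident share of the weights.
measured_tps = 3.3
cpu_gb_now = 42 - 24     # 18 GB offloaded with a 24 GB card
cpu_gb_5090 = 42 - 32    # 10 GB offloaded with a hypothetical 32 GB card

print(round(measured_tps * cpu_gb_now / cpu_gb_5090, 1))  # ~5.9 tok/s
```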