

My target model is Qwen/Qwen3-235B-A22B-FP8. Ideally its maxium context lenght of 131K but i’m willing to compromise. I find it hard to give an concrete t/s awnser, let’s put it around 50. At max load probably around 8 concurrent users, but these situations will be rare enough that oprimizing for single user is probably more worth it.
My current setup is already: Xeon w7-3465X 128gb DDR5 2x 4090
It gets nice enough peformance loading 32B models completely in vram, but i am skeptical that a simillar system can run a 671B at higher speeds then a snails space, i currently run vLLM because it has higher peformance with tensor parrelism then lama.cpp but i shall check out ik_lama.cpp.
Thanks! Ill go check it out.