isn’t qwen like 40-50GB? that could work i think. performance is okay even quantised down to 10.
And then add 200k context on top
And then add hundreds of users needing to do things in parallel
If it’s a large enough company to have hundreds of users, it can afford several beefy machines tbh
It’s a capex and that type of hardware needs to be replaced every 3 years minimum and you need people to set it up and maintain a cluster. And it’s not straightforward.
You are never going to get that approved without a serious business case.
Claude on the other hand is an opex and much easier to just try out and then build a solution on
Not saying it doesn’t happen but it’s not as easy as people make it sound
It’s 3 years if you’re trying to be competitive on frontier models and generally capex is preferred to opex because opex never ends
I don’t think anyone’s building a cluster for their business right now, but one single rack after Claude gets rid of their subscription options? Might be a good deal.
Capex never ends either if it’s hardware. Also you need opex to run it
400k on a DGX node starts seeming like a great deal when your employees each start using a few hundred dollars worth of Claude tokens every month. That one node can handle a lot of users depending on the model used.
It’s an expense once every maybe 5 or 6 years in reality and you don’t need to hire new people, you just give your existing sysadmins some extra work. They’ll complain, but they’ll still do it.
Of course the sensible alternative is to use a decent model off openrouter for peanuts but then you’re sending all your sensitive business secrets to China which is even worse than sharing them with a US AI company. And people WILL be sharing secrets lol
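For anyone who wants to sanity-check the capex vs. opex argument, here is a rough break-even sketch. Only the 400k node price and "a few hundred dollars" per employee come from the thread; the user count, the exact per-user spend and the monthly running cost are made-up inputs you should swap for your own.

```python
def breakeven_months(hardware_cost, users, spend_per_user_per_month,
                     monthly_running_cost=0.0):
    """Months until a one-off hardware purchase equals the API spend it replaces.

    monthly_running_cost (power, hosting, admin time) is a hypothetical input;
    returns None if the hardware never pays for itself.
    """
    monthly_savings = users * spend_per_user_per_month - monthly_running_cost
    if monthly_savings <= 0:
        return None
    return hardware_cost / monthly_savings


# Hypothetical scenario: 100 employees at 300 USD/month each, 400k node.
no_overhead = breakeven_months(400_000, 100, 300)
with_overhead = breakeven_months(400_000, 100, 300, monthly_running_cost=5_000)
print(f"break-even without running costs: {no_overhead:.1f} months")
print(f"break-even with 5k/month running costs: {with_overhead:.1f} months")
```

Compare the result against however long you expect the hardware to stay useful (3 years or 5 to 6 years, depending on who you believe above). It also assumes one node really can serve all those users, which is a separate question.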
If only it was that simple
You don’t have to run Claude Opus for it to be useful lol
nobody said anything about it being a large company :P
anyway, seems the framework is hampered by a slow gpu so the memory issues are apparently moot.
deleted by creator
Qwen3.6 27b beats Claude Opus 4.5 in most benchmarks. Qwen3.6 35b beats Opus 4.5 in a few specific benchmarks, but most benchmarks have Opus 4.5 beating Qwen3.6 35b, although there is not a big gap between Opus 4.5 and Qwen3.6 27b or 35b either way.
deleted by creator
https://github.com/QwenLM/Qwen3.6#benchmarks
deleted by creator
“I don’t think any of that is true. show me data” is shown data “I won’t accept that data!” Lol. Lmao even.
Yeah, I’m not going to play this game of trying to anticipate which numbers you’re willing to accept and which you aren’t. You have the same access to a search engine as I do. All of the results I have seen align with the numbers that Qwen released and are well within margins of error.
This model’s release caused such a stir and was a big deal due to the fact that it reproducibly meets or beats Claude Opus 4.5 while being locally runnable. If you won’t believe it, okay, I don’t care. 🤷
deleted by creator
I run 27b at q8 with unquantized KV cache and 256k context on two Instinct MI60 GPUs. Definitely the best model that I have been able to run locally at a reasonable speed. 35b generates tokens as fast as you’d expect from any cloud provider. 27b is slower than 35b, of course, but token generation is still faster than my reading speed and suitable for use with coding agents.
deleted by creator
It’s not like the Qwen team hasn’t already built a lot of trust with the community. They’ve never been misleading with previous releases, the “marketing material” (🙄) is for a free product, so they have no incentive to lie, and it would be extra stupid because anyone can run the benchmarks and verify their numbers independently anyway. What would be the point?
we were talking about 3.6.
deepseek distilled is an alternative that works on more modest hardware.
and i’m not really interested in what claude and chatgpt, mistral and the others are doing, i would never touch those models with a ten foot pole. if i can’t run it, it does not get run.
At Q8 it is around 35-40GB I think + memory for required context.
I have a Framework desktop. It gets you around 6t/s. Not suitable for professional use but for personal use I think it is fine. I do prefer Gemma 4 though, but that comes with similar requirements.
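A back-of-the-envelope way to get at the "35-40GB + context" figure, as a sketch only. It assumes llama.cpp-style Q8_0 at roughly 8.5 bits per weight and a plain attention KV cache in every layer. The layer count, KV head count and head dimension below are hypothetical placeholders, not the real Qwen config, and models that don't use standard attention in every layer will come out differently.

```python
def weights_gb(n_params, bits_per_weight):
    """Approximate size of the model weights in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_tokens,
                bytes_per_value=2):
    """Approximate KV cache size in GB for standard attention.

    The leading 2 covers keys plus values; bytes_per_value=2 means an
    unquantized fp16 cache.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * context_tokens * bytes_per_value / 1e9)


# 35B parameters at ~8.5 bits per weight (Q8_0-style quantization).
print(f"weights: {weights_gb(35e9, 8.5):.1f} GB")

# Hypothetical architecture numbers, 200k tokens of context, fp16 cache.
print(f"kv cache: {kv_cache_gb(64, 8, 128, 200_000):.1f} GB")
```

The point is mostly that long context can cost as much memory as the weights themselves, so plug in the real numbers from the model's config before buying anything.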
huh, i thought that ryzen ai thing would perform better than that. my 7900xtx regularly gets 30+tps with qwen, up to hundreds with more compressed models.
My system runs at 100W TDP though. That is maybe 140W at the power outlet, incl. monitor and everything.
This is also the dense 27B model at Q8. But yeah, it is not terribly fast. I think the best use case is on MoE models. GPT-OSS-120B runs on it for example and at 50T/s speed is not an issue anymore either. (I could get it to run even on just 64GB but the new llama.cpp might need a tiny bit more memory which pushed it just across the limit. yeah I know, for seriously using it you’d need the 128GB version)
that’s fair, i’m at like 7x the power. the gpu alone easily pulls 350-400W and the rest of the system isn’t exactly running lean either.
…man now i really want more vram.
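Since the power comparison came up: energy per token is just watts divided by tokens per second. A quick sketch using the figures quoted above (140 W at the wall and about 6 t/s for the Strix Halo box, 30 t/s for the 7900xtx). The 700 W total for the 7900xtx system is my assumption, taken from "7x the power" against the 100 W TDP figure.

```python
def joules_per_token(watts, tokens_per_second):
    """Energy spent per generated token, in joules (watts = joules/second)."""
    return watts / tokens_per_second


# Figures from the thread; the 700 W system draw is an assumed value.
strix_halo = joules_per_token(140, 6)
rx_7900xtx = joules_per_token(700, 30)
print(f"Strix Halo: {strix_halo:.1f} J/token")
print(f"7900xtx system: {rx_7900xtx:.1f} J/token")
```

Not apples to apples, since the two setups are not necessarily running the same model or quant, but it gives a rough feel for how speed and power draw trade off.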
Yes I think Strix Halo makes sense when low power use is a requirement. I built a custom fanless Strix Halo system for the fun of it and I guess there aren’t too many out there running Gemma 4 31B Q8 without a single fan, anywhere.
And for MoE models that need 60-80GB + context it is perfect. Those are decently fast then as well.
PS: If VRAM is all you care about the maxed out Mac Studio is fascinating. 512GB unified memory for around 10K EUR (pre crazy bubble prices). That should be able to run pretty large MoE models but dense models of that size would probably run glacially.
i’m not buying any hardware for the foreseeable future :P it’s all just wishful thinking at the moment. but unified memory architecture is probably going to become more common so maybe in five years when some new motherboard standard becomes the norm…
I fully understand. ;) Buying hardware now means you’d be either crazy or desperate.