csabakecskemeti posted an update 3 days ago
Looking for some help to test an INT8 DeepSeek 3.2:
SGLang supports channel-wise INT8 quants on CPUs with AMX instructions (Xeon 5 and above, AFAIK):
https://lmsys.org/blog/2025-07-14-intel-xeon-optimization/

Currently uploading an INT8 version of DeepSeek 3.2 Speciale:
DevQuasar/deepseek-ai.DeepSeek-V3.2-Speciale-Channel-INT8
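
For reference, the kind of launch I'm aiming for looks roughly like this (a minimal sketch; the flag names follow the Intel Xeon blog post above, so double-check them against your SGLang version with `python -m sglang.launch_server --help`):

```python
# Minimal sketch of launching SGLang's CPU backend with the channel-wise INT8 quant.
# Flag names are assumptions based on the linked Intel Xeon blog post and may differ
# between SGLang versions.
import subprocess

subprocess.run([
    "python", "-m", "sglang.launch_server",
    "--model-path", "DevQuasar/deepseek-ai.DeepSeek-V3.2-Speciale-Channel-INT8",
    "--trust-remote-code",
    "--device", "cpu",              # CPU backend; uses the AMX kernels on supported Xeons
    "--quantization", "w8a8_int8",  # channel-wise INT8 weights + INT8 activations
])
```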

I cannot test this myself since I'm on AMD:
"AssertionError: W8A8Int8LinearMethod on CPU requires that CPU has AMX support"
(I assumed it could fall back to some non-optimized kernel, but apparently not.)
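
If anyone wants to sanity-check their box first: on Linux the AMX support shows up as CPU feature flags (quick sketch, nothing SGLang-specific):

```python
# Quick check for the AMX feature flags the SGLang CPU INT8 path needs.
# They only show up on AMX-capable Xeons; my Threadripper reports neither,
# hence the assertion above.
with open("/proc/cpuinfo") as f:
    flags = set(f.read().split())

for flag in ("amx_tile", "amx_int8"):
    print(flag, "supported" if flag in flags else "missing")
```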

If anyone with the required resources (Intel Xeon 5/6 + roughly 768 GB to 1 TB of RAM) can help test this, that would be awesome.

If you have hints on how to make this work on an AMD Threadripper 7000 Pro series, please guide me.

Thanks all!

@ubergarm you might have the resources!? 😀


Oh interesting, you're playing more with SGLang-specific quantizations of the big ones!

No, I haven't had access to the big dual Intel Xeon 6980P rig in a while and do most everything on a big AMD EPYC now.

Unfortunately, AMX extensions are an Intel Xeon-only thing, and the Intel PyTorch team worked with SGLang to add support for those instructions in that project.

This is the Intel PyTorch guy who added a subset of AMX support to llama.cpp: https://github.com/mingfeima. He might have access to a big Intel rig?

Hey! I will probably be trying this. Did you use the ktransformers convert_cpu_weights.py script?


I've used this:
https://huggingface.co/meituan/DeepSeek-R1-Channel-INT8/tree/main/inference

I hoped I could make it work on my CPU... :P
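
For anyone wondering what "channel-wise INT8" means here: roughly one scale per output channel of each weight matrix, symmetric INT8. The linked inference/conversion code is the real thing; this is just a toy sketch of the idea:

```python
# Toy sketch of channel-wise (per-output-channel) symmetric INT8 quantization.
# NOT the linked conversion script, just an illustration of the general idea.
import torch

def quantize_per_channel_int8(w: torch.Tensor):
    # w: [out_features, in_features] weight matrix
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0   # one scale per output channel
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

w = torch.randn(8, 16)
q, scale = quantize_per_channel_int8(w)
w_hat = q.float() * scale                                # dequantized approximation
print("max abs error:", (w - w_hat).abs().max().item())
```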

If you want to do pure-CPU or hybrid CPU+GPU inferencing on a big AMD EPYC rig, I generally advise going with https://github.com/ikawrakow/ik_llama.cpp/. It can run your standard Q8_0 llama.cpp quants as well as ik's newer SOTA quants, which he has optimized for both AVX2 and the newer "real 512-bit" avx512_vnni instructions; those make a big difference for prompt processing and work on both Intel and AMD CPUs.

But of course I'm biased! ;p
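
You can confirm what the Threadripper/EPYC actually exposes with the same /proc/cpuinfo trick as above (just a sketch; on Zen 4 you should see avx2 and avx512_vnni, but no AMX):

```python
# Check which SIMD features the AMD box reports (Linux).
# ik_llama.cpp's fast paths use AVX2 / AVX512-VNNI, which Zen 4 has; AMX it does not.
with open("/proc/cpuinfo") as f:
    flags = set(f.read().split())

for flag in ("avx2", "avx512f", "avx512_vnni", "amx_tile"):
    print(flag, "yes" if flag in flags else "no")
```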

Interesting. I always thought KTransformers was basically an Intel-only thing, but recently I noticed people running SGLang+KTransformers on AMD rigs quite successfully, too. For example:

https://www.reddit.com/r/LocalLLaMA/comments/1pdrist/comment/ns8zm8m


I've been using SGLang+KTransformers for a while. I happen to have access to a server with an EPYC 9004 CPU. Here are my takeaways:

The hardware setup is a single-socket 9V74 with 288GB RAM and a single RTX 3090. I usually run glm4.5air on it (GPU weights in bf16, CPU weights in int8), getting ~390 tokens/s prefill and ~34 tokens/s decode. For the AMD CPU to reach these speeds, you need to specifically install the BLIS library. The documentation for both AMD and ktransformers was seriously lacking, and the CPU weights also require a specific quantization method.

The above speeds were achieved by running with the --kt-max-deferred-experts-per-token 7 flag. Without it, the decode speed drops by about 1.4x, though the prefill slowdown is less dramatic.