feihu.hf committed · Commit 061a2ac · Parent(s): c3fe46f

update README

README.md CHANGED
@@ -233,6 +233,13 @@ For full technical details, see the [Qwen2.5-1M Technical Report](https://arxiv.
 
 Replace the content of your `config.json` with `config_1m.json`, which includes the config for length extrapolation and sparse attention.
 
+```bash
+export MODELNAME=Qwen3-235B-A22B-Thinking-2507
+huggingface-cli download Qwen/${MODELNAME} --local-dir ${MODELNAME}
+mv ${MODELNAME}/config.json ${MODELNAME}/config.json.bak
+mv ${MODELNAME}/config_1m.json ${MODELNAME}/config.json
+```
+
 #### Step 2: Launch Model Server
 
 After updating the config, proceed with either **vLLM** or **SGLang** for serving the model.
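The added snippet swaps the config in place; a one-line sanity check can confirm the swap took effect. This is a minimal sketch, assuming the `${MODELNAME}` layout from the added lines above, and relying only on `diff -q` exiting non-zero when the two files differ (the expected outcome after a successful swap):

```bash
# Optional check: the active config.json should now differ from the backup.
diff -q ${MODELNAME}/config.json ${MODELNAME}/config.json.bak \
  && echo "WARNING: config.json still matches the backup; swap may have failed" \
  || echo "OK: config.json replaced with the 1M config"
```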
@@ -251,7 +258,7 @@ Then launch the server with Dual Chunk Flash Attention enabled:
 
 ```bash
 VLLM_ATTENTION_BACKEND=DUAL_CHUNK_FLASH_ATTN VLLM_USE_V1=0 \
-vllm serve \
+vllm serve ./Qwen3-235B-A22B-Thinking-2507 \
 --tensor-parallel-size 8 \
 --max-model-len 1010000 \
 --enable-chunked-prefill \
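Once the vLLM server is up, a quick smoke test can confirm the endpoint responds. This is a minimal sketch, assuming vLLM's default port 8000 (the command above does not override it) and that the served model name defaults to the path passed to `vllm serve`:

```bash
# Smoke test against vLLM's OpenAI-compatible chat endpoint (default port 8000 assumed).
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "./Qwen3-235B-A22B-Thinking-2507",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }'
```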
@@ -288,7 +295,7 @@ Launch the server with DCA support:
 
 ```bash
 python3 -m sglang.launch_server \
---model-path \
+--model-path ./Qwen3-235B-A22B-Thinking-2507 \
 --context-length 1010000 \
 --mem-frac 0.75 \
 --attention-backend dual_chunk_flash_attn \
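SGLang exposes the same OpenAI-compatible API once launched. A minimal sketch, assuming `sglang.launch_server`'s default port 30000 (no `--port` is set above) and a served model name mirroring the `--model-path` value:

```bash
# Smoke test against SGLang's OpenAI-compatible chat endpoint (default port 30000 assumed).
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "./Qwen3-235B-A22B-Thinking-2507",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }'
```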