feihu.hf committed · Commit 061a2ac · Parent(s): c3fe46f

update README

README.md CHANGED
@@ -233,6 +233,13 @@ For full technical details, see the [Qwen2.5-1M Technical Report](https://arxiv.
 
 Replace the content of your `config.json` with `config_1m.json`, which includes the config for length extrapolation and sparse attention.
 
+```bash
+export MODELNAME=Qwen3-235B-A22B-Thinking-2507
+huggingface-cli download Qwen/${MODELNAME} --local-dir ${MODELNAME}
+mv ${MODELNAME}/config.json ${MODELNAME}/config.json.bak
+mv ${MODELNAME}/config_1m.json ${MODELNAME}/config.json
+```
+
 #### Step 2: Launch Model Server
 
 After updating the config, proceed with either **vLLM** or **SGLang** for serving the model.
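The added snippet swaps the config in place; a one-line sanity check can confirm the swap took effect. This is a minimal sketch, assuming the `${MODELNAME}` layout from the added lines above, and relying only on `diff -q` exiting non-zero when the two files differ (the expected outcome after a successful swap):

```bash
# Optional check: the active config.json should now differ from the backup.
diff -q ${MODELNAME}/config.json ${MODELNAME}/config.json.bak \
  && echo "WARNING: config.json still matches the backup; swap may have failed" \
  || echo "OK: config.json replaced with the 1M config"
```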
@@ -251,7 +258,7 @@ Then launch the server with Dual Chunk Flash Attention enabled:
 
 ```bash
 VLLM_ATTENTION_BACKEND=DUAL_CHUNK_FLASH_ATTN VLLM_USE_V1=0 \
-vllm serve \
+vllm serve ./Qwen3-235B-A22B-Thinking-2507 \
 --tensor-parallel-size 8 \
 --max-model-len 1010000 \
 --enable-chunked-prefill \
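Once the vLLM server is up, a quick smoke test can confirm the endpoint responds. This is a minimal sketch, assuming vLLM's default port 8000 (the command above does not override it) and that the served model name defaults to the path passed to `vllm serve`:

```bash
# Smoke test against vLLM's OpenAI-compatible chat endpoint (default port 8000 assumed).
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "./Qwen3-235B-A22B-Thinking-2507",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }'
```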
@@ -288,7 +295,7 @@ Launch the server with DCA support:
 
 ```bash
 python3 -m sglang.launch_server \
---model-path \
+--model-path ./Qwen3-235B-A22B-Thinking-2507 \
 --context-length 1010000 \
 --mem-frac 0.75 \
 --attention-backend dual_chunk_flash_attn \
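SGLang exposes the same OpenAI-compatible API once launched. A minimal sketch, assuming `sglang.launch_server`'s default port 30000 (no `--port` is set above) and a served model name mirroring the `--model-path` value:

```bash
# Smoke test against SGLang's OpenAI-compatible chat endpoint (default port 30000 assumed).
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "./Qwen3-235B-A22B-Thinking-2507",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }'
```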