Update README.md
README.md (changed)
@@ -212,7 +212,7 @@ We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-h
 | Benchmark | | | |
 |----------------------------------|----------------|------------------------|---------------------------|
 | | Qwen/Qwen3-8B | pytorch/Qwen3-8B-INT4 | pytorch/Qwen3-8B-AWQ-INT4 |
-| bbh |
+| bbh | 79.33 | 74.92 | |


 <details>
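The bbh rows above are produced with lm-evaluation-harness; the README's own `lm_eval` CLI invocation appears in the next hunk's header. As a rough sketch only, the same evaluation can be driven from Python through the harness's `simple_evaluate` API. The model ids come from the table; the batch size is an assumption, since the CLI flags after `--device cuda:0` are truncated in the hunk header.

```python
# Sketch only: reproducing the bbh rows with lm-evaluation-harness's Python API.
# Assumes `pip install lm-eval` and a CUDA device; the quantized checkpoints may
# additionally need torchao installed to load (assumption).
import lm_eval

for model_id in [
    "Qwen/Qwen3-8B",
    "pytorch/Qwen3-8B-INT4",
    "pytorch/Qwen3-8B-AWQ-INT4",
]:
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={model_id}",
        tasks=["bbh"],
        device="cuda:0",
        batch_size=8,  # assumption: the real flag is cut off in the hunk header below
    )
    print(model_id, results["results"])
```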
@@ -242,8 +242,8 @@ lm_eval --model hf --model_args pretrained=$MODEL --tasks bbh --device cuda:0 --

 | Benchmark | | | |
 |------------------|----------------|--------------------------------|--------------------------------|
-| | Qwen/Qwen3-8B
-| Peak Memory (GB) |
+| | Qwen/Qwen3-8B | pytorch/Qwen3-8B-INT4 | pytorch/Qwen3-8B-AWQ-INT4 |
+| Peak Memory (GB) | 16.47 | 6.27 (62% reduction) | 6.27 (62% reduction) |


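The peak-memory figures come from a snippet whose final line, `print(f"Peak Memory Usage: {mem:.02f} GB")`, is visible in the next hunk's header; the measurement call itself is not part of this diff. A minimal sketch, assuming `transformers` for loading and `torch.cuda.max_memory_reserved()` as the measurement:

```python
# Sketch only: one way the peak-memory numbers could be measured.
# The choice of max_memory_reserved() (vs. max_memory_allocated()) is an assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pytorch/Qwen3-8B-INT4"  # or "Qwen/Qwen3-8B" / "pytorch/Qwen3-8B-AWQ-INT4"

torch.cuda.reset_peak_memory_stats()
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda:0"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer(
    "Give me a short introduction to large language models.", return_tensors="pt"
).to("cuda:0")
model.generate(**inputs, max_new_tokens=128)

mem = torch.cuda.max_memory_reserved() / 1e9
print(f"Peak Memory Usage: {mem:.02f} GB")
```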
@@ -299,11 +299,11 @@ print(f"Peak Memory Usage: {mem:.02f} GB")

 # Model Performance

-## Results (
-| Benchmark (Latency) | |
-|
-| | Qwen/Qwen3-8B
-| latency (batch_size=1) |
+## Results (H100 machine)
+| Benchmark (Latency) | | | |
+|----------------------------------|----------------|---------------------------|---------------------------|
+| | Qwen/Qwen3-8B | pytorch/Qwen3-8B-INT4 | pytorch/Qwen3-8B-AWQ-INT4 |
+| latency (batch_size=1) | 2.46s | 1.40s (1.76x speedup) | |

 <details>
 <summary> Reproduce Model Performance Results </summary>
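The latency numbers are reproduced via the collapsed "Reproduce Model Performance Results" section, which this diff does not include. As an illustrative stand-in only, a timing loop along these lines could produce a comparable batch_size=1 number; the prompt, output length, and generation settings here are assumptions, not the measured protocol.

```python
# Sketch only: a simple batch_size=1 latency measurement; not the README's exact procedure.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pytorch/Qwen3-8B-INT4"  # compare against "Qwen/Qwen3-8B"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda:0"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer(
    "Give me a short introduction to large language models.", return_tensors="pt"
).to("cuda:0")

# Warm-up pass so one-time compilation / allocation is not timed.
model.generate(**inputs, max_new_tokens=128)
torch.cuda.synchronize()

start = time.perf_counter()
model.generate(**inputs, max_new_tokens=128)
torch.cuda.synchronize()
print(f"latency (batch_size=1): {time.perf_counter() - start:.2f}s")
```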