Add 2.57bpw
README.md CHANGED

@@ -74,9 +74,13 @@ The base quants use the new "MCG" multiplier from https://github.com/turboderp-o
> [!TIP]
> 🛈 HuggingFace reports file sizes in GB while VRAM is measured in GiB; there is a factor of (1024/1000)³ ≈ 1.0737 between the two.

+> [!WARNING]
+> ⚠️ For the 2.57bpw weights, the maximum context length is a very tight fit: there is only ~250 MiB of spare VRAM, and if you have a graphical environment ...
+
| Quant | Size | Context / VRAM | KL-div (quant, FP16) | KL-div (FP16, quant) | Perplexity | Top-1 | Top-2 | Top-3 | Top-4 | Top-5 |
| -------------------------------------------------------------------------------- | ---------- | ----------------------------------------- | -------------------- | -------------------- | ---------- | ------ | ------ | ------ | ------ | ------ |
| [2.10bpw-tuned🂱](https://huggingface.co/mratsim/GLM-4.7-EXL3/tree/2.10bpw-tuned)| 86 GiB | 131072 tokens, k5v4 for 96 GiB VRAM | 0.54398251 | 0.61162654 | 7.15544606 | 0.7584 | 0.4237 | 0.1948 | 0.0801 | 0.0306 |
+| [2.57bpw-tuned🂱](https://huggingface.co/mratsim/GLM-4.7-EXL3/tree/2.57bpw-tuned)| 105 GiB | 160000 tokens, k5v4 for 128 GiB VRAM<br/>102400 tokens, k5v4 for 120 GiB VRAM | 0.41910998 | 0.44874423 | 6.63182633 | 0.7903 | 0.4787 | 0.2463 | 0.1132 | 0.0482 |
| [3.15bpw-tuned🂱](https://huggingface.co/mratsim/GLM-4.7-EXL3/tree/3.15bpw-tuned)| 129 GiB | 102400 tokens, k5v4 for 144 GiB VRAM | 0.21854555 | 0.21465828 | 6.35729832 | 0.8573 | 0.6119 | 0.3776 | 0.2107 | 0.1071 |
| [3.84bpw-tuned🂱](https://huggingface.co/mratsim/GLM-4.7-EXL3/tree/3.84bpw-tuned)| 158 GiB | 202752 tokens (max), k6v5 for 192 GiB VRAM | 0.15823333 | 0.15401253 | 6.41935951 | 0.8854 | 0.6743 | 0.4587 | 0.2832 | 0.1638 |

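As a quick sanity check of the GiB↔GB factor from the tip above, the conversion can be reproduced with a few lines of Python. The 105 GiB figure is the 2.57bpw size from the table; the 128 GB figure is only an illustrative example.

```python
# GB (decimal, used by HuggingFace file listings) vs GiB (binary, used for VRAM).
GB_PER_GIB = (1024 / 1000) ** 3   # ≈ 1.0737
GIB_PER_GB = 1 / GB_PER_GIB       # ≈ 0.9313

def gib_to_gb(gib: float) -> float:
    """Size in GiB -> size in GB, as HuggingFace would report it."""
    return gib * GB_PER_GIB

def gb_to_gib(gb: float) -> float:
    """HuggingFace-reported GB -> GiB of VRAM actually required."""
    return gb * GIB_PER_GB

# The 2.57bpw quant weighs 105 GiB, which HuggingFace lists as roughly 112.7 GB.
print(f"105 GiB ≈ {gib_to_gb(105):.1f} GB")
# Conversely, a listing of 128 GB only amounts to ~119.2 GiB of VRAM.
print(f"128 GB ≈ {gb_to_gib(128):.1f} GiB")
```
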
@@ -155,13 +159,13 @@ CMD ["main.py"]
### Detailed measurements of KL-div improvements

<details>
-<summary>
+<summary>Measurements for autotuning</summary>
Exllamav3 offers tools to measure per-layer (with `-l2`) or even per-tensor (with `-l3`) contributions to KL-div improvements.
These measurements may take 2 to 5 hours when comparing 2 quants, up to 12 hours when comparing 3 quants, and up to 24 hours of compute when comparing all quants.

The resulting JSON file can be fed to https://github.com/turboderp-org/exllamav3/blob/v0.0.14/util/optimize.py with a target `bpw` to output an optimized quant.

-Please note that, from experimentation, manual tuning using the heuristics below can achieve better KL-divergence than optimizing by mixing only 3 quants, and it is less likely to overfit the calibration set. Having `shared experts` or `self_attn` layers use 6-bit or even 8-bit weights provides a very large improvement to KL-divergence. Even a measurement with all available quants currently doesn't match manual-tuning results.
+Please note that, from experimentation, manual tuning using the heuristics below can achieve better KL-divergence than optimizing by mixing only 3 quants, and it is less likely to overfit the calibration set. Having `shared experts` or `self_attn` layers use 6-bit or even 8-bit weights provides a very large improvement to KL-divergence. Even a measurement with all available quants currently doesn't match manual-tuning results. Hence, for now, I don't plan to add measurements.
</details>

## Quantization theory and heuristics for manual tuning
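To make the optimization step and the manual-tuning heuristic above concrete, here is a minimal, self-contained Python sketch of the underlying idea: given per-tensor KL-div measurements at a few candidate bitrates, greedily pick per-tensor bitrates that fit a target average bpw, optionally with manual floors such as keeping `self_attn` and shared-expert tensors at 6-bit or more. The tensor names, sizes, and KL-div numbers are invented for illustration; this is not the actual logic or interface of `util/optimize.py`.

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class TensorMeasurement:
    name: str
    params: int                    # number of weights in the tensor
    kl_by_bpw: dict[float, float]  # hypothetical KL-div when this tensor is quantized at a given bpw

# Invented example data (NOT real GLM measurements): lower KL-div is better.
MEASUREMENTS = [
    TensorMeasurement("layers.0.self_attn",          60_000_000, {2.0: 0.70, 4.0: 0.45, 6.0: 0.40, 8.0: 0.39}),
    TensorMeasurement("layers.0.mlp.shared_experts", 90_000_000, {2.0: 0.72, 4.0: 0.48, 6.0: 0.41, 8.0: 0.40}),
    TensorMeasurement("layers.0.mlp.experts",     1_500_000_000, {2.0: 0.66, 4.0: 0.52, 6.0: 0.50, 8.0: 0.49}),
]

def average_bpw(assignment: dict[str, float]) -> float:
    """Size-weighted average bits-per-weight of a per-tensor bitrate assignment."""
    total_bits = sum(assignment[t.name] * t.params for t in MEASUREMENTS)
    return total_bits / sum(t.params for t in MEASUREMENTS)

def optimize(target_bpw: float, floors: dict[str, float] | None = None) -> dict[str, float]:
    """Greedy sketch: start every tensor at its lowest bpw (or its manual floor), then keep
    upgrading the tensor with the best KL-div reduction per extra bit of storage while the
    size-weighted average stays under target_bpw."""
    assignment = {t.name: min(t.kl_by_bpw) for t in MEASUREMENTS}
    for t in MEASUREMENTS:  # apply manual heuristic floors, e.g. self_attn >= 6-bit
        for pattern, floor in (floors or {}).items():
            if pattern in t.name:
                assignment[t.name] = max(assignment[t.name], floor)
    while True:
        best = None
        for t in MEASUREMENTS:
            current = assignment[t.name]
            higher = sorted(b for b in t.kl_by_bpw if b > current)
            if not higher:
                continue
            trial = {**assignment, t.name: higher[0]}
            if average_bpw(trial) > target_bpw:
                continue
            gain = (t.kl_by_bpw[current] - t.kl_by_bpw[higher[0]]) / ((higher[0] - current) * t.params)
            if best is None or gain > best[0]:
                best = (gain, t.name, higher[0])
        if best is None:
            return assignment
        assignment[best[1]] = best[2]

# Budget-only optimization vs. the manual heuristic of pinning attention / shared experts high.
print(optimize(target_bpw=2.57))
print(optimize(target_bpw=2.57, floors={"self_attn": 6.0, "shared_experts": 6.0}))
```

The real tool works from measured per-tensor contributions rather than toy numbers, but the trade-off it navigates is the same: marginal KL-div gain per extra bit against a fixed average-bpw budget.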