Add 2.57bpw
README.md CHANGED

@@ -74,9 +74,13 @@ The base quants use the new "MCG" multiplier from https://github.com/turboderp-o
> [!TIP]
> 🛈 HuggingFace reports file sizes in GB while VRAM is measured in GiB; there is a factor of (1024/1000)³ ≈ 1.0737 between the two.

+> [!WARNING]
+> ⚠️ For the 2.57bpw weights, the maximum context length is a very tight fit: there is only ~250 MiB of spare VRAM, and if you have a graphical environment ...
+
| Quant | Size | Context / VRAM | KL-div (quant, FP16) | KL-div (FP16, quant) | Perplexity | Top-1 | Top-2 | Top-3 | Top-4 | Top-5 |
| -------------------------------------------------------------------------------- | ---------- | ----------------------------------------- | -------------------- | -------------------- | ---------- | ------ | ------ | ------ | ------ | ------ |
| [2.10bpw-tuned🂱](https://huggingface.co/mratsim/GLM-4.7-EXL3/tree/2.10bpw-tuned)| 86 GiB | 131072 tokens, k5v4 for 96 GiB VRAM | 0.54398251 | 0.61162654 | 7.15544606 | 0.7584 | 0.4237 | 0.1948 | 0.0801 | 0.0306 |
+| [2.57bpw-tuned🂱](https://huggingface.co/mratsim/GLM-4.7-EXL3/tree/2.57bpw-tuned)| 105 GiB | 160000 tokens, k5v4 for 128 GiB VRAM<br/>102400 tokens, k5v4 for 120 GiB VRAM | 0.41910998 | 0.44874423 | 6.63182633 | 0.7903 | 0.4787 | 0.2463 | 0.1132 | 0.0482 |
| [3.15bpw-tuned🂱](https://huggingface.co/mratsim/GLM-4.7-EXL3/tree/3.15bpw-tuned)| 129 GiB | 102400 tokens, k5v4 for 144 GiB VRAM | 0.21854555 | 0.21465828 | 6.35729832 | 0.8573 | 0.6119 | 0.3776 | 0.2107 | 0.1071 |
| [3.84bpw-tuned🂱](https://huggingface.co/mratsim/GLM-4.7-EXL3/tree/3.84bpw-tuned)| 158 GiB | 202752 tokens (max), k6v5 for 192 GiB VRAM | 0.15823333 | 0.15401253 | 6.41935951 | 0.8854 | 0.6743 | 0.4587 | 0.2832 | 0.1638 |

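As a quick sanity check of the GiB↔GB factor from the tip above, the conversion can be reproduced with a few lines of Python. The 105 GiB figure is the 2.57bpw size from the table; the 128 GB figure is only an illustrative example.

```python
# GB (decimal, used by HuggingFace file listings) vs GiB (binary, used for VRAM).
GB_PER_GIB = (1024 / 1000) ** 3   # ≈ 1.0737
GIB_PER_GB = 1 / GB_PER_GIB       # ≈ 0.9313

def gib_to_gb(gib: float) -> float:
    """Size in GiB -> size in GB, as HuggingFace would report it."""
    return gib * GB_PER_GIB

def gb_to_gib(gb: float) -> float:
    """HuggingFace-reported GB -> GiB of VRAM actually required."""
    return gb * GIB_PER_GB

# The 2.57bpw quant weighs 105 GiB, which HuggingFace lists as roughly 112.7 GB.
print(f"105 GiB ≈ {gib_to_gb(105):.1f} GB")
# Conversely, a listing of 128 GB only amounts to ~119.2 GiB of VRAM.
print(f"128 GB ≈ {gb_to_gib(128):.1f} GiB")
```
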
@@ -155,13 +159,13 @@ CMD ["main.py"]
### Detailed measurements of KL-div improvements

<details>
-<summary>
+<summary>Measurements for autotuning</summary>
Exllamav3 offers tools to measure per-layer (with `-l2`) or even per-tensor (with `-l3`) contributions to KL-div improvements.
These measurements may take 2 to 5 hours when comparing 2 quants, up to 12 hours when comparing 3 quants, and up to 24 hours of compute when comparing all quants.

The resulting JSON file can be fed to https://github.com/turboderp-org/exllamav3/blob/v0.0.14/util/optimize.py with a target `bpw` to output an optimized quant.

-Please note that, from experimentation, manual tuning using the heuristics below can achieve better KL-divergence than optimizing by mixing only 3 quants, and it is less likely to overfit the calibration set. Having `shared experts` or `self_attn` layers use 6-bit or even 8-bit weights provides a very large improvement to KL-divergence. Even a measurement with all available quants currently doesn't match manual-tuning results.
+Please note that, from experimentation, manual tuning using the heuristics below can achieve better KL-divergence than optimizing by mixing only 3 quants, and it is less likely to overfit the calibration set. Having `shared experts` or `self_attn` layers use 6-bit or even 8-bit weights provides a very large improvement to KL-divergence. Even a measurement with all available quants currently doesn't match manual-tuning results. Hence, for now, I don't plan to add measurements.
</details>

## Quantization theory and heuristics for manual tuning
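To make the optimization step and the manual-tuning heuristic above concrete, here is a minimal, self-contained Python sketch of the underlying idea: given per-tensor KL-div measurements at a few candidate bitrates, greedily pick per-tensor bitrates that fit a target average bpw, optionally with manual floors such as keeping `self_attn` and shared-expert tensors at 6-bit or more. The tensor names, sizes, and KL-div numbers are invented for illustration; this is not the actual logic or interface of `util/optimize.py`.

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class TensorMeasurement:
    name: str
    params: int                    # number of weights in the tensor
    kl_by_bpw: dict[float, float]  # hypothetical KL-div when this tensor is quantized at a given bpw

# Invented example data (NOT real GLM measurements): lower KL-div is better.
MEASUREMENTS = [
    TensorMeasurement("layers.0.self_attn",          60_000_000, {2.0: 0.70, 4.0: 0.45, 6.0: 0.40, 8.0: 0.39}),
    TensorMeasurement("layers.0.mlp.shared_experts", 90_000_000, {2.0: 0.72, 4.0: 0.48, 6.0: 0.41, 8.0: 0.40}),
    TensorMeasurement("layers.0.mlp.experts",     1_500_000_000, {2.0: 0.66, 4.0: 0.52, 6.0: 0.50, 8.0: 0.49}),
]

def average_bpw(assignment: dict[str, float]) -> float:
    """Size-weighted average bits-per-weight of a per-tensor bitrate assignment."""
    total_bits = sum(assignment[t.name] * t.params for t in MEASUREMENTS)
    return total_bits / sum(t.params for t in MEASUREMENTS)

def optimize(target_bpw: float, floors: dict[str, float] | None = None) -> dict[str, float]:
    """Greedy sketch: start every tensor at its lowest bpw (or its manual floor), then keep
    upgrading the tensor with the best KL-div reduction per extra bit of storage while the
    size-weighted average stays under target_bpw."""
    assignment = {t.name: min(t.kl_by_bpw) for t in MEASUREMENTS}
    for t in MEASUREMENTS:  # apply manual heuristic floors, e.g. self_attn >= 6-bit
        for pattern, floor in (floors or {}).items():
            if pattern in t.name:
                assignment[t.name] = max(assignment[t.name], floor)
    while True:
        best = None
        for t in MEASUREMENTS:
            current = assignment[t.name]
            higher = sorted(b for b in t.kl_by_bpw if b > current)
            if not higher:
                continue
            trial = {**assignment, t.name: higher[0]}
            if average_bpw(trial) > target_bpw:
                continue
            gain = (t.kl_by_bpw[current] - t.kl_by_bpw[higher[0]]) / ((higher[0] - current) * t.params)
            if best is None or gain > best[0]:
                best = (gain, t.name, higher[0])
        if best is None:
            return assignment
        assignment[best[1]] = best[2]

# Budget-only optimization vs. the manual heuristic of pinning attention / shared experts high.
print(optimize(target_bpw=2.57))
print(optimize(target_bpw=2.57, floors={"self_attn": 6.0, "shared_experts": 6.0}))
```

The real tool works from measured per-tensor contributions rather than toy numbers, but the trade-off it navigates is the same: marginal KL-div gain per extra bit against a fixed average-bpw budget.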