mratsim committed · verified · Commit 5d7edf2 · Parent(s): 6811f96

Add 2.57bpw

Files changed (1): README.md (+6 -2)
@@ -74,9 +74,13 @@ The base quants use the new "MCG" multiplier from https://github.com/turboderp-o
  > [!TIP]
  > 🛈 HuggingFace reports file sizes in GB while VRAM is in GiB; there is a factor of (1024/1000)³ ≈ 1.0737 between the two.
 
+ > [!WARNING]
+ > ⚠️ For the 2.57bpw weights, the maximum number of tokens in context is a very tight fit: there is only ~250 MiB of spare space, and if you have a graphical environment ...
+
  | Quant | Size | Context / VRAM | KL-div (quant, FP16) | KL-div (FP16, quant) | Perplexity | Top-1 | Top-2 | Top-3 | Top-4 | Top-5 |
  | -------------------------------------------------------------------------------- | ---------- | ----------------------------------------- | -------------------- | -------------------- | ---------- | ------ | ------ | ------ | ------ | ------ |
  | [2.10bpw-tuned🂱](https://huggingface.co/mratsim/GLM-4.7-EXL3/tree/2.10bpw-tuned)| 86 GiB | 131072 tokens, k5v4 for 96 GiB VRAM | 0.54398251 | 0.61162654 | 7.15544606 | 0.7584 | 0.4237 | 0.1948 | 0.0801 | 0.0306 |
+ | [2.57bpw-tuned🂱](https://huggingface.co/mratsim/GLM-4.7-EXL3/tree/2.57bpw-tuned)| 105 GiB | 160000 tokens, k5v4 for 128 GiB VRAM<br>102400 tokens, k5v4 for 120 GiB VRAM | 0.41910998 | 0.44874423 | 6.63182633 | 0.7903 | 0.4787 | 0.2463 | 0.1132 | 0.0482 |
  | [3.15bpw-tuned🂱](https://huggingface.co/mratsim/GLM-4.7-EXL3/tree/3.15bpw-tuned)| 129 GiB | 102400 tokens, k5v4 for 144 GiB VRAM | 0.21854555 | 0.21465828 | 6.35729832 | 0.8573 | 0.6119 | 0.3776 | 0.2107 | 0.1071 |
  | [3.84bpw-tuned🂱](https://huggingface.co/mratsim/GLM-4.7-EXL3/tree/3.84bpw-tuned)| 158 GiB | 202752 tokens (max), k6v5 for 192 GiB VRAM | 0.15823333 | 0.15401253 | 6.41935951 | 0.8854 | 0.6743 | 0.4587 | 0.2832 | 0.1638 |
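The GB↔GiB factor from the tip can be sanity-checked in a few lines of Python. The 105 GiB figure is taken from the size column above; the function name is just for illustration:

```python
# HuggingFace reports file sizes in GB (10^9 bytes),
# while VRAM is measured in GiB (2^30 bytes).
GB_PER_GIB = (1024 / 1000) ** 3  # = 1.073741824, i.e. ~1.0737

def gib_to_gb(gib: float) -> float:
    """Convert a size in GiB to the GB figure HuggingFace would display."""
    return gib * GB_PER_GIB

# The 105 GiB quant would show up as roughly 112.7 GB on HuggingFace
print(f"105 GiB ≈ {gib_to_gb(105):.1f} GB")
```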
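As background for the two KL-div columns in the table, KL divergence is asymmetric, which is why both orderings (quant, FP16) and (FP16, quant) are reported. A minimal sketch of the per-token computation, with toy distributions made up for illustration rather than taken from the models:

```python
import math

def kl_div(p, q):
    """KL(p || q) = sum of p_i * ln(p_i / q_i), in nats.

    p and q are next-token probability distributions over the same vocabulary.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy next-token distributions standing in for FP16 and quantized outputs
fp16  = [0.70, 0.20, 0.10]
quant = [0.60, 0.25, 0.15]

# KL(quant || FP16) differs from KL(FP16 || quant), hence both columns
print(kl_div(quant, fp16), kl_div(fp16, quant))
```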
@@ -155,13 +159,13 @@ CMD ["main.py"]
  ### Detailed measurements of KL-div improvements
 
  <details>
- <summary>Quantization quality benchmarking in Exllama v3</summary>
+ <summary>Measurements for autotuning</summary>
  Exllamav3 offers tools to measure per-layer (with `-l2`) or even per-tensor (with `-l3`) contributions to KL-div improvements.
  These runs may take 2 to 5 hours when comparing 2 quants, about 12 hours when comparing 3 quants, and up to 24 hours of compute when comparing all quants.
 
  The resulting JSON file can be fed to https://github.com/turboderp-org/exllamav3/blob/v0.0.14/util/optimize.py with a target `bpw` to output an optimized quant.
 
- Please note that, from experimentation, manual tuning using the heuristics below can achieve better KL-divergence than optimizing by mixing only 3 quants, and is less likely to overfit the calibration set. Having `shared experts` or `self_attn` layers use 6-bit or even 8-bit provides a very large improvement in KL-divergence. Even a measurement with all available quants currently doesn't match manual tuning results.
+ Please note that, from experimentation, manual tuning using the heuristics below can achieve better KL-divergence than optimizing by mixing only 3 quants, and is less likely to overfit the calibration set. Having `shared experts` or `self_attn` layers use 6-bit or even 8-bit provides a very large improvement in KL-divergence. Even a measurement with all available quants currently doesn't match manual tuning results. Hence, for now, I don't plan to add measurements.
  </details>
 
  ## Quantization theory and heuristics for manual tuning
 
74
  > [!TIP]
75
  > 🛈 HuggingFace reports file sizes in GB while VRAM is in GiB, there is a factor (1024/1000)³ = 1.0734 between both.
76
 
77
+ > [!WARNING]
78
+ > ⚠️ For the 2.57bpw weights, the max number of tokens in context represents a very tight fit, there is only ~250MiB of spare space, and if you have a graphical environment ...
79
+
80
  | Quant | Size | Context / VRAM | KL-div (quant, FP16) | KL-div (FP16, quant) | Perplexity | Top-1 | Top-2 | Top-3 | Top-4 | Top-5 |
81
  | -------------------------------------------------------------------------------- | ---------- | ----------------------------------------- | -------------------- | -------------------- | ---------- | ------ | ------ | ------ | ------ | ------ |
82
  | [2.10bpw-tuned🂱](https://huggingface.co/mratsim/GLM-4.7-EXL3/tree/2.10bpw-tuned)| 86 GiB | 131072 tokens, k5v4 for 96 GiB VRAM | 0.54398251 | 0.61162654 | 7.15544606 | 0.7584 | 0.4237 | 0.1948 | 0.0801 | 0.0306 |
83
+ | [2.57bpw-tuned🂱](https://huggingface.co/mratsim/GLM-4.7-EXL3/tree/2.57bpw-tuned)| 105 GiB | 160000 tokens, k5v4 for 128 GiB VRAM</br>102400 tokens, k5v4 for 120 GiB VRAM| 0.41910998 | 0.44874423 | 6.63182633 | 0.7903 | 0.4787 | 0.2463 | 0.1132 | 0.0482 |
84
  | [3.15bpw-tuned🂱](https://huggingface.co/mratsim/GLM-4.7-EXL3/tree/3.15bpw-tuned)| 129 GiB | 102400 tokens, k5v4 for 144 GiB VRAM | 0.21854555 | 0.21465828 | 6.35729832 | 0.8573 | 0.6119 | 0.3776 | 0.2107 | 0.1071 |
85
  | [3.84bpw-tuned🂱](https://huggingface.co/mratsim/GLM-4.7-EXL3/tree/3.84bpw-tuned)| 158 GiB | 202752 tokens (max), k6v5 for 192GiB VRAM | 0.15823333 | 0.15401253 | 6.41935951 | 0.8854 | 0.6743 | 0.4587 | 0.2832 | 0.1638 |
86
 
 
159
  ### Detailed measurements of KL-div improvements
160
 
161
  <details>
162
+ <summary>Measurements for autotuning</summary>
163
  Exllamav3 offers tools to measure per layer (with `-l2`) or even per-tensor (with `-l3`) contributions to KL-div improvements.
164
  They might take 2 hours to 5 hours, if comparing 2 quants -- to 12 hours if comparing 3 quants -- to 24h of compute if comparing all quants.
165
 
166
  The json file can be fed to https://github.com/turboderp-org/exllamav3/blob/v0.0.14/util/optimize.py with a target `bpw` to output an optimized quant.
167
 
168
+ Please note that from experimentations, manual tuning using the heuristics below can achieve better KL-divergence than optimizing by only mixing 3 quants and is less likely to overfit the calibration set. Having `shared experts` or `self_attn` layers use 6 or even 8-bit provide a very large improvement to KL-divergence. Even a measurement with all available quants currently doesn't achieve manual tuning results. Hence for now, I don't plan to add measurements.
169
  </details>
170
 
171
  ## Quantization theory and heuristics for manual tuning