add usage example with new `-ger`
README.md
CHANGED
|
@@ -109,7 +109,40 @@ numactl -N ${SOCKET} -m ${SOCKET} \
|
|
| 109 |
|
| 110 |
## Quick Start
|
| 111 |
```bash
|
| 112 |
-
|
| 113 |
```
|
| 114 |
|
| 115 |
## References
|
|
|
|
| 109 |
|
| 110 |
## Quick Start
|
| 111 |
```bash
|
| 112 |
+
# Clone and checkout
|
| 113 |
+
$ git clone https://github.com/ikawrakow/ik_llama.cpp
|
| 114 |
+
$ cd ik_llama.cpp
|
| 115 |
+
|
| 116 |
+
# Build for hybrid CPU+CUDA
|
| 117 |
+
$ cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON
|
| 118 |
+
$ cmake --build build --config Release -j $(nproc)
|
| 119 |
+
|
| 120 |
+
# CPU-Only Inference
|
| 121 |
+
# `-ger` is still fresh:
|
| 122 |
+
# https://github.com/ikawrakow/ik_llama.cpp/pull/836
|
| 123 |
+
# Omit numactl and `--numa ...` if you have only a single NUMA node
|
| 124 |
+
# set batches/threads/kv cache as desired
|
| 125 |
+
# NOTE: multiple slots (e.g. `--parallel 2`) may currently cause an error when a generation is canceled and a new one is started
|
| 126 |
+
SOCKET=0
|
| 127 |
+
numactl -N "$SOCKET" -m "$SOCKET" \
|
| 128 |
+
./build/bin/llama-server \
|
| 129 |
+
--model "$model" \
|
| 130 |
+
--alias ubergarm/Ling-1T-GGUF \
|
| 131 |
+
--ctx-size 65536 \
|
| 132 |
+
-fa -fmoe -ger \
|
| 133 |
+
-ctk q8_0 -ctv q8_0 \
|
| 134 |
+
-ub 4096 -b 4096 \
|
| 135 |
+
--parallel 1 \
|
| 136 |
+
--threads 128 \
|
| 137 |
+
--threads-batch 192 \
|
| 138 |
+
--numa numactl \
|
| 139 |
+
--host 127.0.0.1 \
|
| 140 |
+
--port 8080 \
|
| 141 |
+
--no-mmap \
|
| 142 |
+
--no-display-prompt
|
| 143 |
+
|
| 144 |
+
# Optional: append this flag to the command above once after downloading to confirm the files are good
|
| 145 |
+
# --validate-quants
|
| 146 |
```
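The comment about omitting `numactl` and `--numa ...` on single-node machines can be turned into a small check. The following is a minimal sketch, assuming a Linux host that exposes NUMA nodes under `/sys/devices/system/node`. The helper names `count_numa_nodes`, `numa_prefix`, and `numa_flag` are hypothetical and not part of ik_llama.cpp.

```bash
#!/usr/bin/env bash
# Sketch only: these helper names are hypothetical, not part of ik_llama.cpp.

# Count NUMA nodes via Linux sysfs; report 1 when no node entries are found.
count_numa_nodes() {
  local n=0 d
  for d in /sys/devices/system/node/node[0-9]*; do
    if [ -d "$d" ]; then n=$((n + 1)); fi
  done
  if [ "$n" -lt 1 ]; then n=1; fi
  echo "$n"
}

# Print the numactl launcher prefix, or nothing on a single-node machine.
numa_prefix() {
  local nodes="$1" socket="$2"
  if [ "$nodes" -gt 1 ]; then
    printf 'numactl -N %s -m %s' "$socket" "$socket"
  fi
}

# Print the matching llama-server flag, or nothing on a single-node machine.
numa_flag() {
  local nodes="$1"
  if [ "$nodes" -gt 1 ]; then
    printf '%s' '--numa numactl'
  fi
}
```

One possible use is to set `NODES=$(count_numa_nodes)` and then launch with `$(numa_prefix "$NODES" "$SOCKET") ./build/bin/llama-server --model "$model" $(numa_flag "$NODES")` followed by the remaining flags. The command substitutions are left unquoted there on purpose, so that an empty result disappears from the command line instead of becoming an empty argument.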
|
| 147 |
|
| 148 |
## References
|