ubergarm
/

Ling-1T-GGUF

 ## Quick Start
 ```bash
+# Clone and checkout
+$ git clone https://github.com/ikawrakow/ik_llama.cpp
+$ cd ik_llama.cpp
+# Build for hybrid CPU+CUDA
+$ cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON
+$ cmake --build build --config Release -j $(nproc)
+# CPU-Only Inference
+# `-ger` is still fresh:
+# https://github.com/ikawrakow/ik_llama.cpp/pull/836
+# Omit numactl and `--numa ...` if you have only a single NUMA node
+# set batches/threads/kv cache as desired
+# NOTE: multiple slots e.g. `--parallel 2` may case error after canceling generation then starting a new one at the moment
+SOCKET=0
+numactl -N "$SOCKET" -m "$SOCKET" \
+./build/bin/llama-server \
+    --model "$model"\
+    --alias ubergarm/Ling-1T-GGUF \
+    --ctx-size 65536 \
+    -fa -fmoe -ger \
+    -ctk q8_0 -ctv q8_0 \
+    -ub 4096 -b 4096 \
+    --parallel 1 \
+    --threads 128 \
+    --threads-batch 192 \
+    --numa numactl \
+    --host 127.0.0.1 \
+    --port 8080 \
+    --no-mmap \
+    --no-display-prompt
+# optional use this once after downloading to confirm good files
+    --validate-quants
 ```
 ## References