---
license: other
base_model: nvidia/NVIDIA-Nemotron-Nano-12B-v2
library_name: llama.cpp
tags:
- gguf
- quantized
- 4-bit
- Q4_K_M
- nemotron
- 12B
- tool-calling
- thinking
- 128k
- multilingual
- llama.cpp
- ollama
---

# NVIDIA Nemotron Nano 12B v2 - GGUF Q4_K_M (7GB)

This repository provides a 4-bit quantized GGUF build of NVIDIA Nemotron Nano 12B v2 using Q4_K_M, reducing the on-disk size from roughly 23GB for the original full-precision weights to approximately 7GB while preserving core capabilities.

**Upstream base model:** [nvidia/NVIDIA-Nemotron-Nano-12B-v2](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2)

**SHA256:** `82ea4805d2f9f37e3c67b06768141ff58e43fb0dcd3983a82e9c2f481eb7fea8`

## What's included

- `model-q4.gguf` (7.0GB)
- `tokenizer.json`
- `tokenizer_config.json`
- `special_tokens_map.json`
- `config.json`
- `generation_config.json`
- `configuration_nemotron_h.py`
- `modeling_nemotron_h.py`
- `nemotron_toolcall_parser_no_streaming.py`
- `bias.md`, `explainability.md`, `privacy.md`, `safety.md`
- `acc-vs-budget.png`
- `README.md`

## Capabilities

- ✓ Tool calling support via preserved special tokens and the bundled parser script
- ✓ Thinking-mode tokens for structured reasoning
- ✓ Long context up to a 128k token window
- ✓ Multilingual general-purpose LLM behavior

**Note:** GGUF inference backends vary in their native support for tool-calling integrations; use the included parser or your own orchestration as needed (a tool-calling sketch appears under Usage below).

## Hardware notes

- **Disk space:** 8GB free recommended for the quantized file and metadata
- **CPU inference:** 16GB RAM recommended for 4k contexts; 32GB suggested for comfortable operation. For 128k contexts, memory usage grows significantly and 64 to 128GB of system RAM may be required
- **GPU offload:** 8 to 16GB VRAM can accelerate decoding with llama.cpp `-ngl` offloading; very long contexts may require 24 to 48GB VRAM or hybrid CPU/GPU offload
- **Throughput:** Depends on backend, thread count, and offload settings

## Usage

### llama.cpp

Build llama.cpp, then run:

**Generate:**

```bash
./llama-cli -m model-q4.gguf -p "Hello, Nemotron." -n 128 -t 8 -c 4096 -ngl 35
```

**Server:**

```bash
./llama-server -m model-q4.gguf -c 4096 -ngl 35
```

A sample request against the running server appears under "Querying the server" below. For very long contexts, increase `-c` accordingly and ensure sufficient RAM or VRAM for the KV cache.

### Python via llama-cpp-python

```bash
pip install llama-cpp-python
```

```python
from llama_cpp import Llama

llm = Llama(model_path="model-q4.gguf", n_ctx=4096, n_threads=8)
out = llm("Write a short greeting.", max_tokens=128)
print(out)
```

### Ollama

Create a Modelfile referencing this repo, then create and run:

**Modelfile:**

```
FROM hf.co/Avarok/nvidia-nemotron-nano-12b-v2-q4_k_m
PARAMETER num_ctx 4096
```

**Commands:**

```bash
ollama create nemotron-nano-12b-q4km -f Modelfile
ollama run nemotron-nano-12b-q4km
```

**Note:** Ollama versions and syntax may evolve; consult the Ollama docs if the Modelfile interface changes.
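### Querying the server

The `llama-server` command above exposes an HTTP API; recent llama.cpp builds include an OpenAI-compatible `/v1/chat/completions` endpoint. A minimal request, assuming the default address `127.0.0.1:8080`:

```bash
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello, Nemotron."}],
    "max_tokens": 128
  }'
```

Any OpenAI-style client library can also be pointed at this local base URL.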
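### Tool calling with llama-cpp-python (sketch)

As noted under Capabilities, tool-calling behavior varies by backend. Below is a minimal sketch using llama-cpp-python's OpenAI-style `create_chat_completion` API; `get_weather` is a hypothetical example function, and depending on the library version and the model's chat template, tool calls may come back as structured `tool_calls` entries or as raw tagged text that you parse yourself (for example with the bundled `nemotron_toolcall_parser_no_streaming.py`):

```python
from llama_cpp import Llama

llm = Llama(model_path="model-q4.gguf", n_ctx=4096, n_threads=8)

# OpenAI-style tool schema; "get_weather" is a hypothetical example function
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)

# Depending on backend support, the tool call may appear as a structured
# "tool_calls" entry or as raw tagged text in the message content
print(out["choices"][0]["message"])
```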
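### Verifying the download

To confirm the file downloaded intact, compare against the SHA256 listed at the top of this card:

```bash
sha256sum model-q4.gguf
# Expected: 82ea4805d2f9f37e3c67b06768141ff58e43fb0dcd3983a82e9c2f481eb7fea8
```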
## License and attribution

- **Base model:** NVIDIA Nemotron Nano 12B v2
- **License:** This GGUF quantized derivative is subject to the original model's license and terms. See the [upstream model card](https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2) and license. By using this repository, you agree to comply with NVIDIA's licensing for Nemotron models
- **Attribution:** If you use this model, please credit NVIDIA for the base model and this repository for the quantized packaging

## Reproducibility

This artifact was produced by converting the upstream weights to GGUF and quantizing with Q4_K_M. An equivalent workflow with llama.cpp tools is:

```bash
# Conversion script ships with llama.cpp; exact flags may differ by version
python convert_hf_to_gguf.py ./NVIDIA-Nemotron-Nano-12B-v2 --outfile input.gguf
llama-quantize input.gguf model-q4.gguf Q4_K_M
```

Exact commands may differ based on the conversion workflow for the upstream model version.

## Safety

Review the included bias, explainability, privacy, and safety documents. As with all LLMs, outputs may be inaccurate or unsafe without proper safeguards and human oversight.

![Accuracy vs Budget](acc-vs-budget.png)