---
base_model: unsloth/gpt-oss-20b-unsloth-bnb-4bit
tags:
- text-generation-inference
- transformers
- unsloth
- gpt_oss
license: apache-2.0
language:
- en
---
## Model Card
### We release Codeforce-metatune-gpt20b, an early experimental open-weight model fine-tuned from OpenAI's gpt-oss-20b. It is one of the first publicly released recursive self-improving AI models.
The model:
- Generates new Codeforce-Cot data for itself,
- Evaluates its performance, and
- Adjusts its own hyperparameters based on improvement metrics.
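
The three steps above can be read as a loop. The following is a toy sketch of that idea only, not the training code used for this model: the function name `self_improvement_loop`, the two callables, and the "halve the learning rate when the score does not improve" rule are all hypothetical.

```python
from typing import Callable, Dict, List


def self_improvement_loop(
    generate: Callable[[Dict[str, float]], List[str]],
    train_and_score: Callable[[List[str], Dict[str, float]], float],
    hyperparams: Dict[str, float],
    rounds: int = 3,
) -> Dict[str, float]:
    """Toy loop: generate data, evaluate, then adjust a hyperparameter."""
    params = dict(hyperparams)  # work on a copy, leave the input untouched
    best_score = float("-inf")
    for _ in range(rounds):
        samples = generate(params)                # 1. produce new training samples
        score = train_and_score(samples, params)  # 2. evaluate performance
        if score > best_score:
            best_score = score                    # improved: keep the settings
        else:
            params["learning_rate"] *= 0.5        # 3. no improvement: adjust (assumed rule)
    return params
```

In practice `generate` would wrap the model producing chain-of-thought samples and `train_and_score` would wrap a fine-tuning step plus a benchmark run; both are left as placeholders here.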

## Use cases: 
- Coding

## Guardrails:
- Generally, set the reasoning level to "high" (`Reasoning: high` in the system prompt); this usually helps prevent jailbreaking and prompt injection.
- Use [openai/gpt-oss-safeguard-20b](https://huggingface.co/openai/gpt-oss-safeguard-20b) as a guardrail in front of this model.
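
Placing a safety model in front of this model amounts to a two-stage call. The sketch below shows only the ordering; `guarded_generate`, `is_safe`, `generate`, and the refusal text are hypothetical names, and how the safeguard model's output is turned into a yes/no decision is left to the caller.

```python
from typing import Callable

REFUSAL = "Request blocked by the safety check."  # assumed refusal text


def guarded_generate(
    prompt: str,
    is_safe: Callable[[str], bool],
    generate: Callable[[str], str],
) -> str:
    """Run the safety check first; call the coding model only if it passes."""
    if not is_safe(prompt):
        return REFUSAL
    return generate(prompt)
```

Here `is_safe` would wrap a call to the safeguard model and `generate` a call to this model.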

# Inference examples

## Transformers

You can use `gpt-oss-120b` and `gpt-oss-20b` with Transformers. If you use the Transformers chat template, it will automatically apply the [harmony response format](https://github.com/openai/harmony). If you use `model.generate` directly, you need to apply the harmony format manually using the chat template or use the [openai-harmony](https://github.com/openai/harmony) package.

To get started, install the necessary dependencies to set up your environment:

```
pip install -U transformers kernels torch 
```

For Google Colab (free/Pro):
```
!pip install -q --upgrade torch

!pip install -q transformers triton==3.4 kernels

!pip uninstall -q torchvision torchaudio -y
```

Once set up, you can run the model with the snippet below:

```py
from transformers import pipeline
import torch
model_id = "EpistemeAI/Codeforce-metatune-gpt20b"
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype="auto",
    device_map="auto",
)
messages = [
    {"role": "user", "content": "Derive the Euler–Lagrange equation from the principle of stationary action."},
]
outputs = pipe(
    messages,
    max_new_tokens=3000,
)
print(outputs[0]["generated_text"][-1])
```
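
If you call `model.generate` directly instead of using the pipeline, the chat template has to be applied first. The following is a minimal sketch of that path; the helper name `generate_reply` and its default of 256 new tokens are hypothetical, and the imports sit inside the function only so that the definition loads without the heavy dependencies.

```python
def generate_reply(model_id, messages, max_new_tokens=256):
    """Apply the chat template manually, then call model.generate."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    # The chat template renders the messages in the harmony format.
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
    ).to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, skipping the prompt.
    new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens)
```

It can be called as `generate_reply("EpistemeAI/Codeforce-metatune-gpt20b", messages)` with the same `messages` list as in the pipeline example.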
# Reasoning levels

You can adjust the reasoning level that suits your task across three levels:

* **Low:** Fast responses for general dialogue.  
* **Medium:** Balanced speed and detail.  
* **High:** Deep and detailed analysis.

The reasoning level is set in the system prompt, e.g., "Reasoning: high".
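
As a small illustration, the system message can be built alongside the user message before the list is passed to the pipeline. The helper name `build_messages` and the validation step are assumptions made for this sketch.

```python
from typing import Dict, List

VALID_LEVELS = ("low", "medium", "high")


def build_messages(user_prompt: str, reasoning: str = "high") -> List[Dict[str, str]]:
    """Return a chat message list whose system prompt sets the reasoning level."""
    if reasoning not in VALID_LEVELS:
        raise ValueError(f"reasoning must be one of {VALID_LEVELS}")
    return [
        {"role": "system", "content": f"Reasoning: {reasoning}"},
        {"role": "user", "content": user_prompt},
    ]
```

The returned list can be used in place of `messages` in the Transformers example.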

# Tool use

The gpt-oss models are excellent for:
* Web browsing (using built-in browsing tools)
* Function calling with defined schemas
* Agentic operations like browser tasks
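
For function calling, a tool is typically described with a JSON-schema style definition and the model's call is routed to local code. The sketch below is illustrative only: the `get_weather` tool, its stub implementation, and the shape of the parsed call (`name` plus `arguments`) are assumptions, not part of this model's documented interface.

```python
import json

# Hypothetical tool, described with a JSON-schema style definition.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def get_weather(city: str) -> str:
    """Stub implementation; a real tool would query a weather service."""
    return f"Sunny in {city}"


TOOLS = {"get_weather": get_weather}


def dispatch(tool_call_json: str) -> str:
    """Parse a tool call (assumed to be JSON) and run the matching function."""
    call = json.loads(tool_call_json)
    func = TOOLS[call["name"]]
    return func(**call["arguments"])
```

Recent Transformers versions accept definitions like `WEATHER_TOOL` through the `tools` argument of `apply_chat_template`, so the schema can be rendered into the prompt by the chat template.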

# Fine-tuning

Both gpt-oss models can be fine-tuned for a variety of specialized use cases.

This smaller model `gpt-oss-20b` can be fine-tuned on consumer hardware, whereas the larger [`gpt-oss-120b`](https://huggingface.co/openai/gpt-oss-120b) can be fine-tuned on a single H100 node.


# Benchmark
```py
#humaneval
!lm_eval --model hf --model_args pretrained=EpistemeAI/Codeforce-metatune-gpt20b,parallelize=True,dtype=bfloat16 --tasks humaneval --trust_remote_code --confirm_run_unsafe_code  --num_fewshot 0 --gen_kwargs temperature=0.9,top_p=0.9,max_new_tokens=1024 --batch_size auto:4 --limit 10  --device cuda:0 --output_path ./eval_harness/gpt-oss-20b3
```

hf (pretrained=EpistemeAI/Codeforce-metatune-gpt20b,parallelize=True,dtype=bfloat16,trust_remote_code=True), gen_kwargs: (temperature=0.9,top_p=0.9,max_new_tokens=1024), limit: 10.0, num_fewshot: 0, batch_size: auto:4

|  Tasks  |Version|  Filter   |n-shot| Metric  |   |Value|   |Stderr|
|---------|------:|-----------|-----:|---------|---|----:|---|-----:|
|humaneval|      1|create_test|     0|pass@1   |   |  0.9|±  |   0.1|
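
The ±0.1 in the table follows from the small sample size (`--limit 10`). As a worked check, assuming the reported value is the usual sample standard error of the mean and that 9 of the 10 problems passed:

```python
import math
import statistics

# Assumed per-problem outcomes: 9 of 10 problems passed (1 = pass, 0 = fail).
outcomes = [1] * 9 + [0]

pass_at_1 = statistics.mean(outcomes)
# Sample standard deviation (n - 1 in the denominator) divided by sqrt(n).
stderr = statistics.stdev(outcomes) / math.sqrt(len(outcomes))

print(f"pass@1 = {pass_at_1:.1f} ± {stderr:.1f}")  # prints "pass@1 = 0.9 ± 0.1"
```

With only 10 problems, a single additional failure would move pass@1 by 0.1, which is why the standard error is this large.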

# 🧠 Model Benchmark Comparison

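This table presents HumanEval benchmark scores across several large language models. The Codeforce-GPT-oss-20b score comes from the 10-problem run above (`--limit 10`, pass@1 = 0.9 ± 0.1).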

| Model                 | HumanEval |
|-----------------------|-----------|
| Codeforce-GPT-oss-20b | **90**    |
| Qwen 3 235B           | 80        |
| DeepSeek-R1 70B       | 88        |
| Phi-4 Reasoning       | 88        |
| Llama 4 Scout         | 78        |
| Llama 3.3 70B         | 83        |
| Gemma 3 27B           | 76        |
| GPT-OSS 20B           | 73        |
| GPT-OSS 120B          | 71        |

---

### 📊 Notes
- **HumanEval** measures the functional correctness of generated Python code on 164 hand-written programming problems.  
- Scores are normalized for consistency across models.  
- Models highlighted in **bold** achieved top-tier performance.

---

### 🔍 Summary
Codeforce-GPT-oss-20b posts the highest score in this comparison, ahead of larger models such as Qwen 3 235B and DeepSeek-R1 70B. Because its score comes from a 10-problem subset with a standard error of 0.1, the lead is preliminary. The result suggests that the training strategy, rather than scale alone, drives its code synthesis performance.

--------------------------------------

- **Developed by:** EpistemeAI
- **License:** apache-2.0
- **Finetuned from model:** unsloth/gpt-oss-20b-unsloth-bnb-4bit

This gpt_oss model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.

[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)

# Citation

```bibtex

@misc{bi2025gptossgoodcomprehensiveevaluation,
      title={Is GPT-OSS Good? A Comprehensive Evaluation of OpenAI's Latest Open Source Models}, 
      author={Ziqian Bi and Keyu Chen and Chiung-Yi Tseng and Danyang Zhang and Tianyang Wang and Hongying Luo and Lu Chen and Junming Huang and Jibin Guan and Junfeng Hao and Junhao Song},
      year={2025},
      eprint={2508.12461},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.12461}, 
}
```