This model is a fine-tuned version of UCL-CSSB/PlasmidGPT-SFT using Group Relative Policy Optimization (GRPO).
PlasmidGPT-RL is trained to generate functional plasmid DNA sequences. It was fine-tuned using reinforcement learning with a reward model that scores the generated sequences.
This model was trained with GRPO using the TRL library.
Training run: Weights & Biases
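The exact reward criteria for this checkpoint are not documented here. As a rough sketch of what GRPO fine-tuning with TRL's GRPOTrainer looks like, the snippet below uses a hypothetical placeholder reward (gc_content_reward, which favors ~50% GC content); the reward function, prompt set, and hyperparameters are illustrative assumptions, not the actual configuration behind PlasmidGPT-RL.
# Minimal GRPO training sketch with TRL. The reward function is a toy
# placeholder, NOT the reward model used to train PlasmidGPT-RL.
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

def gc_content_reward(completions, **kwargs):
    # Toy reward: 1.0 when GC content is exactly 50%, falling to 0.0 at the extremes.
    rewards = []
    for seq in completions:
        seq = seq.upper()
        gc = sum(base in "GC" for base in seq) / max(len(seq), 1)
        rewards.append(1.0 - 2.0 * abs(gc - 0.5))
    return rewards

# GRPOTrainer samples a group of completions per prompt and pushes the
# policy toward completions with above-average reward within the group.
train_dataset = Dataset.from_dict({"prompt": ["ATG", "GGATCC", "GAATTC"]})

training_args = GRPOConfig(
    output_dir="PlasmidGPT-RL",
    num_generations=8,          # group size: completions sampled per prompt
    max_completion_length=256,
)

trainer = GRPOTrainer(
    model="UCL-CSSB/PlasmidGPT-SFT",  # the SFT base checkpoint
    reward_funcs=gc_content_reward,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()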
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the fine-tuned model and tokenizer from the Hub
tokenizer = AutoTokenizer.from_pretrained("McClain/PlasmidGPT-RL")
model = AutoModelForCausalLM.from_pretrained("McClain/PlasmidGPT-RL")

# Generate a plasmid sequence from a start-codon prompt
prompt = "ATG"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,               # passes input_ids and attention_mask
    max_new_tokens=256,
    do_sample=True,         # stochastic sampling for diverse sequences
    temperature=0.95,
    top_p=0.9,
)
sequence = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(sequence)
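Sampling with temperature=0.95 and top_p=0.9 keeps generation stochastic, so repeated calls produce different candidate sequences; lowering the temperature makes outputs more deterministic.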
If you use this model, please cite the GRPO paper:
@article{shao2024deepseekmath,
  title={{DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}},
  author={Shao, Zhihong and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Song, Junxiao and Zhang, Mingchuan and Li, Y. K. and Wu, Y. and Guo, Daya},
  journal={arXiv preprint arXiv:2402.03300},
  year={2024}
}
Base model: UCL-CSSB/PlasmidGPT-SFT