Question about LiveCodeBench evaluation setup and code RL

#2
by s2580 - opened

In the report, the LiveCodeBench score for IF-RL is listed as 65.9%, but when we run the evaluation ourselves we can only reproduce around 37.7%. Could you please share the exact evaluation configuration used for the reported number, such as the timeout (per test / per problem)?
In addition, could you share the prompt setup used for evaluation? If possible, could you also open-source the evaluation code (or provide a script/config/command) so the results can be reproduced reliably?
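For context on why the timeout setting matters: whether the limit is applied per test or per problem can noticeably shift pass rates. A minimal sketch of a judging loop with a per-test wall-clock timeout (a hypothetical harness for illustration, not the official evaluation code; the `run_tests` name and default timeout are assumptions):

```python
import subprocess

def run_tests(solution_path, tests, per_test_timeout=6.0):
    """Judge a candidate solution file: each test case gets its own
    wall-clock timeout. A timeout or wrong output counts as a failure.
    (Hypothetical harness, not the official Nemotron eval code.)"""
    passed = 0
    for stdin_data, expected in tests:
        try:
            result = subprocess.run(
                ["python", solution_path],
                input=stdin_data,
                capture_output=True,
                text=True,
                timeout=per_test_timeout,  # per-test, not per-problem
            )
        except subprocess.TimeoutExpired:
            continue  # timed out -> this test fails
        if result.returncode == 0 and result.stdout.strip() == expected.strip():
            passed += 1
    return passed, len(tests)
```

With a per-problem budget instead, slow-but-correct solutions that exhaust the budget on early tests would fail the remaining ones, which is one way reproduced scores can diverge from reported ones.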

Also, I noticed that the code RL training data has not been released yet. Are there any plans to release (or partially release) the code RL training dataset?

NVIDIA org

Hi @s2580, everything needed to reproduce our eval results can be found here: https://huggingface.co/nvidia/Nemotron-Cascade-14B-Thinking/blob/main/evaluation/README.md. Could you please check it carefully? Everything you asked about (prompt / eval config / scripts / command) is in that subfolder.

Regarding the release of the Code-RL dataset: it contains data that we purchased internally from official competitive programming (CP) platforms. We are working to release part of the data in the future.
