Question about LiveCodeBench evaluation setup and code RL

#2
by s2580 - opened

In the report, the LiveCodeBench score for IF-RL is listed as 65.9%, but when we run the evaluation ourselves we can only reproduce around 37.7%. Could you please share the exact evaluation configuration used for the reported number, such as the timeout (per test / per problem)?
In addition, could you share the prompt setup used for evaluation? If possible, could you also open-source the evaluation code (or provide a script/config/command) so the results can be reproduced reliably?
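For context on why the timeout setting matters: whether the limit is applied per test or per problem can noticeably shift pass rates. A minimal sketch of a judging loop with a per-test wall-clock timeout (a hypothetical harness for illustration, not the official evaluation code; the `run_tests` name and default timeout are assumptions):

```python
import subprocess

def run_tests(solution_path, tests, per_test_timeout=6.0):
    """Judge a candidate solution file: each test case gets its own
    wall-clock timeout. A timeout or wrong output counts as a failure.
    (Hypothetical harness, not the official Nemotron eval code.)"""
    passed = 0
    for stdin_data, expected in tests:
        try:
            result = subprocess.run(
                ["python", solution_path],
                input=stdin_data,
                capture_output=True,
                text=True,
                timeout=per_test_timeout,  # per-test, not per-problem
            )
        except subprocess.TimeoutExpired:
            continue  # timed out -> this test fails
        if result.returncode == 0 and result.stdout.strip() == expected.strip():
            passed += 1
    return passed, len(tests)
```

With a per-problem budget instead, slow-but-correct solutions that exhaust the budget on early tests would fail the remaining ones, which is one way reproduced scores can diverge from reported ones.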

Also, I noticed that the code RL training data has not been released yet. Are there any plans to release (or partially release) the code RL training dataset?

NVIDIA org

Hi @s2580, everything needed to reproduce our eval results can be found here: https://huggingface.co/nvidia/Nemotron-Cascade-14B-Thinking/blob/main/evaluation/README.md. Could you please check it carefully? Everything you asked about (prompt / eval config / scripts / command) is in that subfolder.

Regarding the release of the Code-RL dataset: it contains data that we purchased internally from official competitive programming (CP) platforms. We are working to release part of the data in the future.
