There are roughly two approaches.
At a high level the “common approach” is:
Treat your local / GCS dataset as normal files, and point TRL (or your script) at those file paths via datasets.load_dataset or the TRL datasets: config.
You are not forced to use a Hugging Face Hub dataset name.
Concretely, there are two standard patterns:
- Use dataset_name as a local path (simple case).
- Use a YAML config with datasets: + data_files (more explicit, ideal for JSONL/CSV on GCS).
And on Vertex AI, “local path” just means “path under /gcs/<BUCKET>” because Cloud Storage is mounted into the container. (Google Cloud)
1. Background: TRL and Hugging Face Datasets
1.1 TRL’s dataset_name is “path or name”
In the TRL docs for the CLI/script utilities, the key line is:
dataset_name (str, optional) — Path or name of the dataset to load. (Hugging Face)
This means:
- If you pass dataset_name=timdettmers/openassistant-guanaco, it loads from the Hugging Face Hub.
- If you pass dataset_name=/path/to/mycorpus, it treats it as a path and calls datasets.load_dataset(path="/path/to/mycorpus", ...).
You can see a real example of using a local path with TRL in a forum thread:
python examples/scripts/sft.py \
    --model_name google/gemma-7b \
    --dataset_name path/to/mycorpus \
    ...
and the same script works with a Hub dataset name like OpenAssistant/oasst_top1_2023-08-25. (Hugging Face Forums)
So the CLI is designed to handle both.
1.2 Hugging Face datasets handles local and remote files
Hugging Face Datasets supports:
- Datasets from the Hub
- Local datasets
- Remote datasets (HTTP, S3/GCS/… via URLs or storage options)
The canonical docs say:
“Datasets can be loaded from local files stored on your computer and from remote files… CSV, JSON, TXT, parquet… load_dataset() can load each of these file types.” (Hugging Face)
Typical local example:
from datasets import load_dataset

ds = load_dataset(
    "json",
    data_files={
        "train": "path/to/train.jsonl",
        "validation": "path/to/val.jsonl",
    },
)
You can also pass lists of paths or multiple splits. (Hugging Face Forums)
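For example, a minimal sketch with several shard files per split (the file names here are hypothetical):

from datasets import load_dataset

# Each split can map to a single file or a list of files;
# the "json" builder also reads .jsonl
ds = load_dataset(
    "json",
    data_files={
        "train": ["shard-000.jsonl", "shard-001.jsonl"],
        "validation": "val.jsonl",
    },
)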
TRL just delegates to this API under the hood.
2. Vertex AI detail: GCS buckets are mounted under /gcs
For Vertex AI custom training jobs, Google uses Cloud Storage FUSE so that Cloud Storage looks like a normal filesystem inside the container:
“When you start a custom training job, the job sees a directory /gcs, which contains all your Cloud Storage buckets as subdirectories.” (Google Cloud)
So if you have data at:
gs://my-bucket/drug-herg/train.jsonl
gs://my-bucket/drug-herg/eval.jsonl
then inside the training container you see:
/gcs/my-bucket/drug-herg/train.jsonl
/gcs/my-bucket/drug-herg/eval.jsonl
From TRL / datasets.load_dataset perspective, these are just normal local paths.
That’s the key: GCS → /gcs/<BUCKET> → treat as local files.
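If you want to sanity-check the mount from inside the training container before wiring up TRL, a tiny sketch (the bucket and paths are the illustrative ones from above):

import os

# Cloud Storage FUSE: every bucket shows up as a directory under /gcs
print(os.listdir("/gcs/my-bucket/drug-herg"))
# expected for this example: ['train.jsonl', 'eval.jsonl']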
3. Pattern 1 (simplest): use dataset_name as a path
If your data directory is something datasets can detect automatically (e.g., Parquet or a saved HF dataset), you can often just pass the directory itself as dataset_name.
3.1 Local machine
Assume:
/home/you/data/drug-herg/
    train.jsonl
    eval.jsonl
You could save this as a HF dataset first (optional):
from datasets import load_dataset

ds = load_dataset(
    "json",
    data_files={
        "train": "/home/you/data/drug-herg/train.jsonl",
        "validation": "/home/you/data/drug-herg/eval.jsonl",
    },
)
ds.save_to_disk("/home/you/data/drug-herg-hf")
Then run TRL:
trl sft \
    --model_name_or_path google/gemma-2b-it \
    --dataset_name=/home/you/data/drug-herg-hf \
    ...
Here dataset_name is a path, and TRL will internally call datasets.load_from_disk / load_dataset as appropriate. The StackOverflow/GeeksforGeeks posts show exactly this pattern for local paths. (Stack Overflow)
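To confirm the saved directory round-trips before pointing TRL at it, a quick sketch:

from datasets import load_from_disk

# Reload the DatasetDict written by save_to_disk above
ds = load_from_disk("/home/you/data/drug-herg-hf")
print(ds)               # should list the train/validation splits
print(ds["train"][0])   # inspect one record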
3.2 Vertex AI
Upload your HF-saved dataset directory to GCS:
gs://my-bucket/drug-herg-hf/ (containing the files written by save_to_disk)
Inside the container, that is /gcs/my-bucket/drug-herg-hf.
Then in your CustomContainerTrainingJob args:
args = [
    "--model_name_or_path=google/gemma-2b-it",
    "--dataset_name=/gcs/my-bucket/drug-herg-hf",
    # other TRL args...
]
This is the simplest approach when you want to reuse a pre-saved HF dataset. But it requires you to create that HF dataset once (either locally and upload, or directly on GCS).
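If you would rather skip the separate upload step, datasets can also write straight to GCS through fsspec/gcsfs. This is a sketch assuming gcsfs is installed and GCP credentials are available in your environment:

from datasets import load_dataset

ds = load_dataset(
    "json",
    data_files={
        "train": "/home/you/data/drug-herg/train.jsonl",
        "validation": "/home/you/data/drug-herg/eval.jsonl",
    },
)

# save_to_disk accepts fsspec URIs such as gs:// (requires gcsfs)
ds.save_to_disk("gs://my-bucket/drug-herg-hf")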
4. Pattern 2 (more flexible, common with JSONL/CSV): YAML datasets: with data_files
This is the pattern most people use when they have raw JSONL/CSV files and want full control, especially on Vertex AI.
4.1 Why use datasets: instead of dataset_name?
The TRL script-utils docs explicitly support a datasets mixture config:
dataset_name (str, optional) - Path or name of the dataset to load. If datasets is provided, this will be ignored. (Hugging Face)
That is, if you define datasets in the YAML:
- TRL ignores dataset_name.
- TRL uses your datasets entries (each mapping more or less directly to datasets.load_dataset).
This is the cleanest way to tell TRL:
- “Use the JSON builder”
- “Here are my train/validation files”
- “Use only the prompt and completion columns”
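For intuition, this is roughly what such an entry corresponds to in plain datasets code; it is an illustration of the mapping, not TRL's exact internals (column selection shown here with select_columns):

from datasets import load_dataset

# datasets entry: path=json, data_files={...}, split=train, columns=[prompt, completion]
ds = load_dataset(
    "json",
    data_files={
        "train": "/gcs/my-bucket/drug-herg/train.jsonl",
        "validation": "/gcs/my-bucket/drug-herg/eval.jsonl",
    },
    split="train",
)
ds = ds.select_columns(["prompt", "completion"])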
4.2 Example dataset on GCS
Say you have:
gs://my-bucket/drug-herg/train.jsonl
gs://my-bucket/drug-herg/eval.jsonl
With prompt–completion records (your current SFT format):
{"prompt": "Instructions... SMILES: O=C(...)\nAnswer:", "completion": " (B)<eos>"}
{"prompt": "Instructions... SMILES: CCN(...)\nAnswer:", "completion": " (A)<eos>"}
...
Inside the Vertex container:
/gcs/my-bucket/drug-herg/train.jsonl
/gcs/my-bucket/drug-herg/eval.jsonl
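For reference, a minimal sketch of producing files in this format yourself (the two records are placeholders; real prompts come from your own pipeline):

import json

records = [
    {"prompt": "Instructions... SMILES: O=C(...)\nAnswer:", "completion": " (B)<eos>"},
    {"prompt": "Instructions... SMILES: CCN(...)\nAnswer:", "completion": " (A)<eos>"},
]

# JSONL = one JSON object per line; the "json" builder reads this directly
with open("train.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")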
4.3 YAML config for TRL CLI
trl sft can be driven by a config like:
# sft_config.yaml
# ---------- Model ----------
model_name_or_path: google/gemma-2b-it
# ---------- Output ----------
output_dir: /gcs/my-bucket/outputs/txgemma-herg
overwrite_output_dir: true
# ---------- Training ----------
max_seq_length: 1024
per_device_train_batch_size: 2
per_device_eval_batch_size: 2
gradient_accumulation_steps: 8
num_train_epochs: 3
learning_rate: 5e-5
warmup_ratio: 0.05
weight_decay: 0.01
bf16: true
# ---------- LoRA / PEFT ----------
use_peft: true
lora_r: 8
lora_alpha: 16
lora_dropout: 0.1
lora_target_modules: all-linear
# ---------- Dataset(s) ----------
datasets:
  - path: json                          # use the HF "json" dataset builder
    data_files:
      train: /gcs/my-bucket/drug-herg/train.jsonl
      validation: /gcs/my-bucket/drug-herg/eval.jsonl
    split: train                        # the split used for training
    columns: [prompt, completion]       # keep only these columns

# These are ignored when datasets: is defined
dataset_name: null
dataset_text_field: null
# ---------- SFT options ----------
completion_only_loss: true # train only on completion tokens
Key points:
- path: json tells datasets.load_dataset("json", ...) to use the JSON builder. (Hugging Face)
- data_files uses the GCS-mounted paths under /gcs/my-bucket.
- columns trims the dataset to exactly the fields SFTTrainer needs.
- completion_only_loss: true ensures the loss is applied only on the completion, not the prompt. (Hugging Face)
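If you drive training from Python instead of the CLI, here is a hedged sketch of an equivalent setup (argument names shift a bit between TRL versions, so treat this as a starting point rather than a drop-in script):

from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

ds = load_dataset(
    "json",
    data_files={
        "train": "/gcs/my-bucket/drug-herg/train.jsonl",
        "validation": "/gcs/my-bucket/drug-herg/eval.jsonl",
    },
)

training_args = SFTConfig(
    output_dir="/gcs/my-bucket/outputs/txgemma-herg",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    learning_rate=5e-5,
    bf16=True,
    completion_only_loss=True,   # loss on completion tokens only
)

peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules="all-linear",
)

trainer = SFTTrainer(
    model="google/gemma-2b-it",   # SFTTrainer also accepts a model id string
    args=training_args,
    train_dataset=ds["train"],
    eval_dataset=ds["validation"],
    peft_config=peft_config,
)
trainer.train()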
4.4 Running this locally vs Vertex
Locally (for testing), you can point the CLI directly at the config file, e.g. trl sft --config sft_config.yaml, swapping the /gcs/... paths for local ones.
On Vertex AI:
- Upload sft_config.yaml itself to GCS, e.g. gs://my-bucket/configs/sft_config.yaml.
- Inside the container: /gcs/my-bucket/configs/sft_config.yaml.
- In CustomContainerTrainingJob:
args = ["--config=/gcs/my-bucket/configs/sft_config.yaml"]
job = aiplatform.CustomContainerTrainingJob(
display_name="txgemma-herg-lora-sft",
container_uri=CONTAINER_URI,
command=[
"sh",
"-c",
'exec trl sft "$@"',
"--",
],
)
job.run(
args=args,
# machine_type, accelerator, etc.
)
From TRL’s perspective, this is indistinguishable from local training with a JSON dataset; the only difference is the /gcs/... prefix.
5. Summary: “common approach” in one place
Putting it all together, the standard practice to point TRL (and TRL CLI on Vertex) to local or GCS data instead of a Hub dataset is:
- Store the dataset as normal files (JSONL/CSV/Parquet) either:
  - on local disk for local runs, or
  - in a Cloud Storage bucket for Vertex.
- Treat the GCS paths as local paths under /gcs/<BUCKET> inside the Vertex container. (Google Cloud)
- Use one of:
  - --dataset_name=/gcs/<BUCKET>/path/to/hf-saved-dataset if you’re using a dataset saved with save_to_disk, or
  - a YAML datasets: config that calls datasets.load_dataset("json"/"csv", data_files={...}) on those paths.
- Avoid thinking of dataset_name as “must be from the Hub”: per TRL’s own docs, it is “path or name.” (Hugging Face)
That is the common and recommended approach when you want to keep data off the Hub and inside your own filesystem or GCS environment.