I am trying to fine-tune a pretrained model for NER, and I have multiple datasets from different sources. They are a mishmash of different types, i.e. some are in standoff format and some use the IOB format.
The problem is that even among the datasets that use the IOB format, the tokens seem to be tokenized in different ways. For example, the word “U.S.A” is tokenized as:
dataset_A = ['U', '.', 'S', '.', 'A']
dataset_B = ['U.', 'S.', 'A']
As you can see, there is a big discrepancy here. How do I deal with this if I want to use both datasets to fine-tune my model without any conflicts?
You’ve identified the core issue correctly: IOB labels are defined over tokens, not text, so once tokenization diverges, the labels are no longer comparable.
There are essentially three safe strategies, depending on how much control you want.
1. Canonicalize tokenization first (recommended)
Pick one tokenizer (ideally the pretrained model’s tokenizer) as the source of truth and re-tokenize all datasets with it.
Practical approach:
- Convert all datasets (IOB + standoff) into a common character-span representation first.
- Then re-tokenize the raw text using your chosen tokenizer.
- Re-project labels from spans → tokens.
This avoids heuristics like splitting/merging tokens after the fact, which tends to silently corrupt labels.
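Here is a minimal sketch of that pipeline, assuming a Hugging Face fast tokenizer (`AutoTokenizer` with `return_offsets_mapping` is the standard API; the helper functions, the `bert-base-cased` checkpoint, and the `LOC` label are only illustrative, and joining pre-tokenized IOB tokens with spaces is a simplification for when the raw text isn’t available):

```python
from transformers import AutoTokenizer

def iob_to_spans(tokens, tags):
    """Convert one pre-tokenized IOB sentence into (text, [(start, end, label), ...]).

    Joining tokens with spaces is a simplification; prefer the original raw text
    and offsets if the dataset provides them.
    """
    starts, text = [], ""
    for tok in tokens:
        if text:
            text += " "
        starts.append(len(text))
        text += tok
    spans, i = [], 0
    while i < len(tags):
        if tags[i].startswith("B-"):
            label = tags[i][2:]
            start, end = starts[i], starts[i] + len(tokens[i])
            i += 1
            while i < len(tags) and tags[i] == f"I-{label}":
                end = starts[i] + len(tokens[i])
                i += 1
            spans.append((start, end, label))
        else:
            i += 1
    return text, spans

def spans_to_token_labels(text, spans, tokenizer):
    """Re-tokenize with the model tokenizer and project character spans onto its tokens."""
    enc = tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)
    labels = ["O"] * len(enc["offset_mapping"])
    for start, end, label in spans:
        first = True
        for idx, (ts, te) in enumerate(enc["offset_mapping"]):
            if ts < end and te > start:  # token overlaps the entity span
                labels[idx] = ("B-" if first else "I-") + label
                first = False
    return enc, labels

# hypothetical usage with dataset_B's tokenization from the question
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
text, spans = iob_to_spans(["U.", "S.", "A"], ["B-LOC", "I-LOC", "I-LOC"])
enc, labels = spans_to_token_labels(text, spans, tokenizer)
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]), labels)
```

The key point is that the character spans, not the old token boundaries, become the source of truth, so both datasets end up labelled against the same tokenizer.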
2. Use alignment with offsets (acceptable, but brittle)
If you already have offset mappings:
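You can carry each old token’s tag onto whichever new subwords overlap it by character range. A rough sketch, with function and example offsets of my own (not from any library):

```python
def align_labels_by_offsets(old_offsets, old_labels, new_offsets):
    """Carry IOB tags from an old tokenization to a new one via character overlap.

    old_offsets: [(start, end), ...] for the dataset's original tokens
    old_labels:  IOB tag per original token
    new_offsets: [(start, end), ...] from the model tokenizer's offset mapping
    """
    new_labels, last_old_idx = [], None
    for ns, ne in new_offsets:
        tag = "O"
        for idx, ((os_, oe), old_tag) in enumerate(zip(old_offsets, old_labels)):
            if ns < oe and ne > os_:  # character ranges overlap
                # demote B- to I- when one old token splits into several subwords
                if old_tag.startswith("B-") and idx == last_old_idx:
                    tag = "I-" + old_tag[2:]
                else:
                    tag = old_tag
                last_old_idx = idx
                break
        new_labels.append(tag)
    return new_labels

# e.g. dataset_B's offsets over the raw text "U.S.A", re-aligned to a
# character-level split like dataset_A's
print(align_labels_by_offsets(
    [(0, 2), (2, 4), (4, 5)], ["B-LOC", "I-LOC", "I-LOC"],
    [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]))
# -> ['B-LOC', 'I-LOC', 'I-LOC', 'I-LOC', 'I-LOC']
```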
This works, but you’ll still need to handle edge cases like punctuation splits (U.S.A vs U . S . A) carefully.
3. Don’t mix tokenizations at all (simplest)
If the datasets are small or very heterogeneous, keep them separate: fine-tune a separate model per dataset (or per tokenization scheme) instead of merging them into one training set. This avoids conflicts but limits cross-dataset learning.
Important rule of thumb:
If two datasets disagree on token boundaries, fix it before training. Trying to “average it out” during training usually just teaches the model inconsistent supervision.
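One cheap safeguard along those lines: count entities before and after re-projection and flag any sentence where labels were silently dropped. A hypothetical helper, assuming the span representation from strategy 1:

```python
def count_entities(iob_labels):
    """Number of entities = number of B- tags in an IOB sequence."""
    return sum(1 for tag in iob_labels if tag.startswith("B-"))

def check_projection(original_spans, projected_labels):
    """Flag sentences where re-tokenization silently dropped entities."""
    before, after = len(original_spans), count_entities(projected_labels)
    if after != before:
        print(f"warning: {before - after} entity(ies) lost during re-projection")
    return after == before
```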
Your instinct is right — this isn’t a modeling problem, it’s a data contract problem.
Happy to clarify further if you want to share which model/tokenizer you’re targeting.