Training a tokenizer clarification question

Hi, I’m building a custom tokenizer using tokenizers and want to confirm that my pre-tokenizer pipeline is behaving correctly.

My setup:

tokenizer.pre_tokenizer = Sequence([
   Whitespace(),
   PreTokenizer.custom(CustomTokenizer())
])

A few clarifications I’d like to confirm:

  1. Whitespace() creates separate segments, so these are hard boundaries, meaning BPE will never merge across words, even though the flat printed output doesn’t show the segment breaks.

  2. The chunks from my custom pre-tokenizer are soft boundaries, meaning BPE can merge them during training and build larger subword tokens, and the final vocabulary will be built from these rule-based chunks instead of raw bytes/characters.

  3. The fact that .pre_tokenize_str() prints a flat list without explicit space tokens is expected; internally, Whitespace has already set the segmentation boundaries, and these are preserved in training.

Can you please confirm that this behavior is correct and that BPE will indeed treat:

  • word boundaries as hard, and
  • my rule-based chunks as mergeable units that form the starting point for building the vocabulary?

Hmm…?


Short version up front:

  1. Yes, Whitespace() creates hard word/punctuation boundaries. BPE will never produce a token that crosses those splits.
  2. No, your custom pre-tokenizer’s chunks are not soft boundaries. They are also hard boundaries: BPE can only merge within each chunk, not across chunks.
  3. Yes, a flat list from .pre_tokenize_str() is expected; internal segmentation (from both Whitespace and your custom pre-tokenizer) is preserved and used during training.

So BPE will treat:

  • Word boundaries from Whitespace as hard,
  • Your rule-based chunks also as hard, not as mergeable units that can be glued together.

Below is the detailed explanation and context.


1. How the HF tokenizers pipeline is structured

The Hugging Face tokenizers library conceptually does:

  1. Normalizer → clean / canonicalize text.
  2. Pre-tokenizer → split text into “pretokens” (segments).
  3. Model (BPE, WordPiece, Unigram, etc.) → operate inside each segment to produce subword tokens.
  4. Post-processor / decoder → add special tokens, map IDs back to text, etc. (Hugging Face)

The official docs describe pre-tokenization as:

Pre-tokenization is the act of splitting a text into smaller objects that give an upper bound to what your tokens will be at the end of training. A good way to think of this is that the pre-tokenizer will split your text into “words” and then, your final tokens will be parts of those words. (Hugging Face)

And explicitly for pre-tokenizers:

The PreTokenizer takes care of splitting the input according to a set of rules. This pre-processing lets you ensure that the underlying Model does not build tokens across multiple “splits”. (Stack Overflow)

That “upper bound” / “does not build tokens across splits” is the core invariant you care about.

So:

  • Pre-tokenization determines where tokens may never cross.
  • BPE then decides how to merge characters/bytes inside each segment.
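
As a minimal sketch of how those stages are wired together in code (the specific normalizer and decoder choices below are illustrative, not taken from your setup):

from tokenizers import Tokenizer, normalizers, pre_tokenizers, decoders
from tokenizers.models import BPE

# Illustrative wiring of the four stages; post-processing is omitted here.
tok = Tokenizer(BPE(unk_token="[UNK]"))                  # 3. model
tok.normalizer = normalizers.Sequence(                   # 1. normalizer
    [normalizers.NFC(), normalizers.Lowercase()]
)
tok.pre_tokenizer = pre_tokenizers.Whitespace()          # 2. pre-tokenizer
tok.decoder = decoders.BPEDecoder()                      # 4. decoder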

2. What Whitespace() actually does

The built-in Whitespace pre-tokenizer is documented as:

This pre-tokenizer splits on word boundaries according to the \w+|[^\w\s]+ regex pattern. It splits on word characters or characters that aren’t words or whitespaces (punctuation such as hyphens, apostrophes, commas, etc.). (Hugging Face)

Example from the docs (simplified):

from tokenizers.pre_tokenizers import Whitespace
Whitespace().pre_tokenize_str("Hello, how are you?")
# → [("Hello", (0,5)), (",", (5,6)), ("how", (7,10)), ("are", (11,14)),
#    ("you", (15,18)), ("?", (18,19))]

So after Whitespace:

  • The input is represented as a PreTokenizedString consisting of segments like "Hello", ",", "how", etc.
  • Each segment is now a hard upper bound: BPE can only merge characters/bytes inside a segment, never across two segments. (Hugging Face)

That directly confirms your point 1:

1. Whitespace() creates separate segments → yes, they are hard boundaries, and BPE will not merge across words or punctuation splits.


3. How Sequence([Whitespace(), CustomTokenizer]) behaves

You defined:

tokenizer.pre_tokenizer = Sequence([
   Whitespace(),
   PreTokenizer.custom(CustomTokenizer())
])

The Sequence pre-tokenizer simply applies each component in order to the same PreTokenizedString. (Hugging Face)

Internally, a custom pre-tokenizer is implemented by:

  • Receiving a mutable PreTokenizedString.
  • Calling pretokenized.split(...) to further split existing segments.
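
For concreteness, here is a minimal sketch of that shape (the class name and the camel-case rule are hypothetical stand-ins for your CustomTokenizer):

from tokenizers import NormalizedString, PreTokenizedString
from tokenizers.pre_tokenizers import PreTokenizer, Sequence, Whitespace

class CamelCaseSplitter:
    """Hypothetical rule: cut each existing segment before every uppercase letter."""

    def split_segment(self, i: int, segment: NormalizedString):
        text = str(segment)
        cuts = [0] + [j for j, ch in enumerate(text) if j > 0 and ch.isupper()] + [len(text)]
        # split(...) may only return sub-slices of the segment it was given.
        return [segment[a:b] for a, b in zip(cuts, cuts[1:])]

    def pre_tokenize(self, pretok: PreTokenizedString):
        pretok.split(self.split_segment)

# Mirrors your setup: Whitespace first, then the custom rule on each word.
pre_tokenizer = Sequence([Whitespace(), PreTokenizer.custom(CamelCaseSplitter())])

Note that split_segment never gets a chance to join two segments: it is called once per existing segment and can only return pieces of that segment.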

Key property from the design (and from both docs and related papers):

  • Pre-tokenization creates tokenization bounds. The tokenization algorithm (BPE, WordPiece, etc.) is applied within pretokens. “Pre-tokenization governs the maximum possible token length that can be learned from these pretokens.” (ACL Anthology)

Critically:

  • PreTokenizedString.split can only subdivide an existing segment.
  • There is no API to merge segments or to “soften” a boundary.

So in your exact pipeline:

  1. Whitespace() splits the normalized string into initial segments (words and punctuation).
  2. CustomTokenizer runs on each of those segments and can only split them into smaller pieces.

Result:

  • Boundaries from Whitespace are hard.
  • Boundaries you introduce in CustomTokenizer are also hard.
  • BPE only operates inside the smallest resulting segments.

So your point 2 as written:

2. The chunks from my custom pre-tokenizer are soft boundaries…

is not correct in the semantics of HF tokenizers. They behave just like the whitespace boundaries, only “nested” inside them.

BPE cannot:

  • Take segment A = "foo" and segment B = "bar" (produced by your custom pre-tokenizer)
  • And then learn a token "foobar" that spans A+B.

It will only ever see "foo" and "bar" in isolation during training.
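
A self-contained sketch of that behavior (the splitter here is a toy rule that cuts each pretoken after its first three characters, so "foobar" becomes exactly the two segments "foo" and "bar"):

from tokenizers import Tokenizer, NormalizedString, PreTokenizedString
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import PreTokenizer

class ThreeCharSplitter:
    """Toy rule: cut every segment after its first three characters."""
    def _split(self, i: int, seg: NormalizedString):
        n = len(str(seg))
        return [seg[0:3], seg[3:n]] if n > 3 else [seg[0:n]]

    def pre_tokenize(self, pretok: PreTokenizedString):
        pretok.split(self._split)

tok = Tokenizer(BPE(unk_token="[UNK]"))
tok.pre_tokenizer = PreTokenizer.custom(ThreeCharSplitter())
tok.train_from_iterator(["foobar"] * 1000,
                        trainer=BpeTrainer(vocab_size=50, special_tokens=["[UNK]"]))

print(tok.encode("foobar").tokens)   # expected ['foo', 'bar'], never a single 'foobar' token
print("foobar" in tok.get_vocab())   # expected False: no merge crossed the chunk boundary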

This limitation (BPE cannot cross pretokens) is exactly what the BoundlessBPE paper calls the pre-tokenization barrier, and it is what that paper modifies BPE to relax. (arXiv)


4. What BPE actually learns from your setup

During training with tokenizers:

  1. Text → normalized.

  2. Pre-tokenizer (Whitespace + CustomTokenizer) → PreTokenizedString with segments S1, …, Sn.

  3. For each segment Si:

    • Initialize as a sequence of characters or bytes (depending on whether you use byte-level or normal BPE).
    • Count frequency of adjacent symbol pairs inside Si only.
    • Iteratively merge frequent pairs to form subwords.

Because of the pre-tokenization bounds:

  • BPE never sees a pair (x, y) where x is the last symbol of one segment and y is the first symbol of another.
  • Therefore, it never learns merges that cross segment boundaries. (Hugging Face)
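
A pure-Python illustration of that counting step (not the library's actual implementation, just the idea):

from collections import Counter

segments = [list("foo"), list("bar")]       # as produced by a rule-based pre-tokenizer
pair_counts = Counter()
for seg in segments:
    for left, right in zip(seg, seg[1:]):   # pairs are counted inside one segment only
        pair_counts[(left, right)] += 1

print(pair_counts)
# Counter({('f', 'o'): 1, ('o', 'o'): 1, ('b', 'a'): 1, ('a', 'r'): 1})
# ('o', 'b') never appears, so a merge toward 'foobar' can never be learned.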

So in practice:

  • The vocabulary is built from sequences of characters/bytes inside the segments produced by your pre-tokenizer.
  • Your rule-based chunks do not become “mergeable units” at the segment boundary level; they instead define the regions inside which merges are allowed.

Rephrasing your point 2 correctly:

The chunks from my custom pre-tokenizer define upper bounds. BPE can merge characters/bytes within each chunk, but cannot merge across chunks. The final vocabulary is made of character/byte n-grams that never cross those rule-based boundaries.


5. Why .pre_tokenize_str() looks flat but segmentation is preserved

The helper:

tokenizer.pre_tokenizer.pre_tokenize_str(text)

returns a flat list of (piece, (start, end)) pairs. This is just a convenience API for inspection.
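
You can check that the flat view still carries the segmentation information through its offsets (assuming no normalizer is rewriting the text, as in this direct call):

from tokenizers.pre_tokenizers import Whitespace

text = "Hello, how are you?"
for piece, (start, end) in Whitespace().pre_tokenize_str(text):
    assert text[start:end] == piece   # every piece points back into the original string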

Internally, the library works with a richer structure (PreTokenizedString + NormalizedString) that:

  • Tracks segments and sub-segments,
  • Tracks offsets back to the original text,
  • Feeds those segments into the model (BPE). (Hugging Face)

So your point 3:

3. The fact that .pre_tokenize_str() prints a flat list without explicit space tokens is expected; internally, Whitespace has already set the segmentation boundaries, and these are preserved in training.

is correct:

  • Flat output is just a pretty-printed view.
  • Actual training uses the segmented structure, with all boundaries (from both Whitespace and CustomTokenizer) enforced.

6. Direct answers to your final question

You asked whether you can rely on:

  1. Word boundaries as hard
  2. Rule-based chunks as mergeable starting units

Given the above:

6.1 Will BPE treat word boundaries as hard?

Yes.

  • Whitespace() splits at word/punctuation boundaries. (Hugging Face)
  • Pre-tokenization gives an upper bound on tokens; the model does not build tokens across splits. (Hugging Face)
  • So BPE will never create a token whose span crosses a Whitespace split.

Example: "hello world":

  • After Whitespace: ["hello", "world"].
  • BPE can create "hello", "hel", "lo", "world", etc., but it cannot create "hello world" as one token.
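
That example as a runnable check (the tiny corpus and vocab_size are made up for illustration):

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tok = Tokenizer(BPE(unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()
tok.train_from_iterator(["hello world"] * 1000,
                        trainer=BpeTrainer(vocab_size=50, special_tokens=["[UNK]"]))

print(tok.encode("hello world").tokens)        # expected ['hello', 'world'], always at least two tokens
print(any(" " in t for t in tok.get_vocab()))  # expected False: no learned token contains a space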

6.2 Will BPE treat your rule-based chunks as mergeable units across their boundaries?

No.

  • Your custom pre-tokenizer is implemented by splitting existing segments in PreTokenizedString.
  • These splits are just as hard as whitespace splits.
  • BPE will only merge inside each chunk, not across the chunk boundaries.

So BPE:

  • Does not see your chunks as smaller word-level “units” it can glue together.
  • Instead, it sees characters/bytes inside each chunk and builds tokens from those, constrained by your rules.

6.3 Is .pre_tokenize_str()’s flat output compatible with all of this?

Yes.

  • The flat list is only a visualization.
  • Internal segmentation (needed for model training) is preserved; splits from both Whitespace and CustomTokenizer are used to bound BPE.

7. If you actually want “soft” rule-based boundaries

Current HF tokenizers + standard BPE don’t support soft boundaries at the pre-tokenizer level. To get behavior closer to “mergeable rule-based units”, you typically:

  • Use coarser pre-tokenization (e.g. ByteLevel, sentence-level; see the sketch after this list),
  • Encode your rules as markers inside the text rather than as splits,
  • Or switch to a model like Unigram/SentencePiece with minimal pre-tokenization. (Hugging Face)
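
For the first option, a small sketch (this assumes your tokenizers version exposes ByteLevel's use_regex flag; treat the exact printed output as indicative):

from tokenizers.pre_tokenizers import ByteLevel

# With the default regex split turned off, the whole string stays one segment,
# so a BPE trained behind this pre-tokenizer may merge across spaces.
pre = ByteLevel(add_prefix_space=False, use_regex=False)
print(pre.pre_tokenize_str("hello world"))
# e.g. [('helloĠworld', (0, 11))]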

Research like BoundlessBPE explicitly modifies BPE to allow merges across pretokens (“superwords”), but that is not what HF tokenizers does today. (arXiv)


Concise summary

  • Point 1: Correct. Whitespace() defines hard word/punctuation boundaries; BPE cannot cross them. (Hugging Face)
  • Point 2: Incorrect as stated. Your custom pre-tokenizer’s chunks are also hard bounds. BPE merges characters/bytes within each chunk, but cannot merge across chunk boundaries. (Hugging Face)
  • Point 3: Correct. .pre_tokenize_str() is just a flat view; internal segmentation (from both Whitespace and CustomTokenizer) is preserved and used during BPE training. (Hugging Face)