How to build a tokenizer from a vocab subset of a BPE tokenizer

Hi community,

I want to distill a pretrained BPE tokenizer for my domain-specific corpus. Is there anything I should pay attention to?

What I have in mind is to first tokenize all sentences of the corpus with the pretrained tokenizer (I already did this), find the tokens that are actually used, and drop the unused ones from the vocabulary. Should I also take care of the merges and make the new tokenizer a BPE tokenizer again, or should I just use the vocabulary subset to build a WordLevel tokenizer? Has anyone already done the same thing?
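
For context, the tokenize-and-count step I already did looks roughly like this (a sketch, not my exact script; the model name and corpus are placeholders):

from collections import Counter
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder pretrained BPE tokenizer
corpus = ["\\frac{a}{b} + c", "x^{2} + y^{2}"]      # placeholder domain sentences

# Tokenize every sentence and count which token ids actually occur.
used_ids = Counter()
for ids in tokenizer(corpus)["input_ids"]:
    used_ids.update(ids)

keep = {tokenizer.convert_ids_to_tokens(i) for i in used_ids}
print(f"{len(keep)} of {tokenizer.vocab_size} tokens are used on this corpus")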

Thanks!

alephpi


It seems more stable to avoid modifying the existing BPE tokenizer as much as possible. Well, maybe because the core part of the Tokenizers library is written in Rust…


I see, let me check your solution, since I really need to distill the vocabulary: it will enormously reduce my model size (from 50,000 tokens to fewer than 1,000).


Unless we change it to a WordLevel tokenizer, the distillation itself seems possible without touching the Rust-written parts.
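
For illustration, a minimal sketch of rebuilding a pruned BPE backend with the Python API only (not the exact script from the linked thread; the vocabulary, merges, and ByteLevel settings below are placeholders and would have to match the original pipeline):

from tokenizers import Tokenizer, decoders, pre_tokenizers
from tokenizers.models import BPE

# Placeholder pruned vocabulary and merges; in practice these come from the
# original tokenizer's vocab/merges, filtered down to the tokens kept for the corpus.
new_vocab = {"f": 0, "r": 1, "a": 2, "c": 3, "fr": 4, "fra": 5, "frac": 6}
new_merges = [("f", "r"), ("fr", "a"), ("fra", "c")]

pruned = Tokenizer(BPE(vocab=new_vocab, merges=new_merges, unk_token=None))
# The normalizer / pre-tokenizer / decoder must mirror the original pipeline
# (ByteLevel here, as in GPT-2-style tokenizers), otherwise behavior changes.
pruned.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
pruned.decoder = decoders.ByteLevel()
pruned.save("pruned-tokenizer.json")

print(pruned.encode("frac").tokens)   # ['frac']

Note that trimming a ByteLevel BPE's byte alphabet changes its coverage, and a pruned vocabulary also means the model's embedding rows and token ids need remapping.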

Hi John, I’m following your pruning script. The new tokenizer can be constructed and loaded, but it doesn’t behave the same as the original one, especially for merged tokens (the original merges them but the new one doesn’t).

Is there a debug mode where we can see how tokens get merged during tokenization?


I see, there are some nuances in the merging procedure. In my case I have f, r, a, c, and frac as tokens, but there is no merge path from f, r, a, c up to frac, since none of the intermediate combinations (such as fr or fra) exists in my keep-vocab file.
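
To make it concrete, here’s a tiny standalone simulation of greedy BPE merging (pure Python, just for illustration, not the tokenizers library itself):

def apply_bpe(word: str, merges: list[tuple[str, str]]) -> list[str]:
    # Greedy BPE: repeatedly merge the adjacent pair with the best (lowest) rank.
    rank = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while True:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        candidates = [p for p in pairs if p in rank]
        if not candidates:
            return symbols
        a, b = min(candidates, key=lambda p: rank[p])
        merged = []
        i = 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged

# Keeping only the final merge leaves "frac" unreachable from characters:
print(apply_bpe("frac", [("fra", "c")]))                               # ['f', 'r', 'a', 'c']
# With the intermediate merges kept, the chain goes through:
print(apply_bpe("frac", [("f", "r"), ("fr", "a"), ("fra", "c")]))      # ['frac']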


Aha, I found a way to include the minimal merge closure so that every token in my keep vocab can be reached by merges: just slightly modify the function below. I’ve validated that such a closure provides exactly the same behavior as the original tokenizer (at least on my corpus).

def filter_merges_to_subset(merges: list[tuple[str, str]], keep: set[str]) -> list[tuple[str, str]]:
    # Keep merge (a, b) when a+b belongs to `keep`, and add a and b back into `keep`
    # so that an accessible merge path up to a+b exists.
    # Repeat until no new merge path can be found (a fixed point).
    # BPE merges are greedy and ordered, so the original order must be preserved.
    filtered_raw: list[tuple[str, str]] = []
    new_keep: set[str] = set()
    while True:
        keep |= new_keep
        for a, b in merges:
            merged = a + b
            if merged in keep and (a, b) not in filtered_raw:
                filtered_raw.append((a, b))
                new_keep.update((a, b))
        if not (new_keep - keep):
            break

    # Re-sort the kept merges into their original order: collecting them over
    # several passes of the while loop breaks the ordering.
    filtered = [merge for merge in merges if merge in filtered_raw]
    return filtered
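
For example, on a toy merge list (made-up tokens, just to show the closure being pulled in):

merges = [("f", "r"), ("r", "a"), ("fr", "a"), ("a", "c"), ("fra", "c")]
keep = {"f", "r", "a", "c", "frac"}

filtered = filter_merges_to_subset(merges, keep)
print(filtered)                                       # [('f', 'r'), ('fr', 'a'), ('fra', 'c')]
print(sorted(keep - {"f", "r", "a", "c", "frac"}))    # ['fr', 'fra'] -- keep is grown in place with the closure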

To give some idea of the numbers:

Before debugging: ~950 tokens + 741 merges

After debugging: 1264 tokens + 1004 merges (some intermediate tokens are added for merge paths, though they never occur in the final tokenized output)

Original: 50000 tokens + 49721 merges

But after all, it’s still worth distilling.

(Refined a little bit; the previous version worked but contained repetitive merges.)


BTW, thank you so much for your very detailed answer. I’m so grateful that you added so many references. Would you give me a reading list for learning Transformers or Tokenizers? I saw you refer to a Transformers notebook blog, but perhaps you know of more helpful materials than that? Sometimes I find the chat AIs are not so intelligent when I ask them about the Transformers/Tokenizers APIs.


I saw you refer to a Transformers notebook blog, but perhaps you know of more helpful materials than that?

About Transformers… (by me)

The reading list below is by GPT.

Start here

  • Tokenizers quicktour. Build and train BPE end-to-end; inspect tokenizer.json. (Hugging Face)

  • Transformers tokenizer API. Fast vs. slow, specials, saving, resizing. (Hugging Face)

  • LLM Course: train a new tokenizer from an old one (train_new_from_iterator); see the sketch after this list. (Hugging Face)

  • Transformers quicktour for full workflow context. (Hugging Face)

  • Your earlier outline, consolidated.
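
A minimal sketch of the train_new_from_iterator route mentioned above (model name and corpus are placeholders; note this retrains the vocabulary, so it will not reproduce the original tokenizer’s ids or segmentation the way pruning does):

from transformers import AutoTokenizer

old_tok = AutoTokenizer.from_pretrained("gpt2")     # placeholder base tokenizer
corpus = ["\\frac{a}{b} + c", "x^{2} + y^{2}"]      # placeholder domain sentences

# Reuses the old pipeline (normalization, pre-tokenization, special tokens) but
# learns a fresh, small BPE vocabulary from the domain corpus.
new_tok = old_tok.train_new_from_iterator(corpus, vocab_size=1000)
new_tok.save_pretrained("domain-tokenizer")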

Distillation and pruning (practical)

  • “Tokenizer shrinking recipes.” Multiple working scripts and caveats. (Hugging Face Forums)

  • Removing tokens from GPT/BPE tokenizers: why simple deletion fails; recreate backend. (Hugging Face Forums)

  • Tokenizers issue on vocab reduction pitfalls and current guidance. (GitHub)

SentencePiece / Unigram

  • Trim down SentencePiece vocabulary by editing ModelProto.pieces (step-by-step). (Hugging Face)

  • SentencePiece training options, including hard_vocab_limit.

Tokenizer types and behavior

  • Summary of tokenizers: BPE vs WordPiece vs Unigram, pros and trade-offs. (Hugging Face)

  • Fast tokenizers docs: offsets, alignment, performance notes. (Hugging Face)

  • Building a tokenizer from scratch (mix and match normalizers, pre-tokenizers, models). (Hugging Face)

Pitfalls to avoid

  • Cleaning or changing ByteLevel BPE alphabets alters coverage; know consequences. (Hugging Face Forums)

  • Keep config.vocab_size synced when resizing embeddings; common failure mode (see the sketch after this list). (Hugging Face)

  • Space handling in BPE tokenizers (add_prefix_space) affects segmentation. (Hugging Face Forums)
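
A minimal sketch of the vocab_size sync check from the pitfalls above, assuming a GPT-2 checkpoint (note that resize_token_embeddings only truncates or appends rows at the end; a pruned vocabulary with remapped ids needs its kept embedding rows gathered manually):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in for the shrunken tokenizer
model = AutoModelForCausalLM.from_pretrained("gpt2")

model.resize_token_embeddings(len(tokenizer))       # resizes input (and tied output) embeddings
assert model.config.vocab_size == len(tokenizer)    # config must stay in sync with the tokenizer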

Performance tips

  • Use fast tokenizers; confirm is_fast; batch properly; multiprocessing guidance. (Hugging Face Forums)

  • Tokenizers Python docs for API surface and saving formats. (Hugging Face)

Research for principled pruning

  • BPE-Knockout: pruning merges out of a pre-existing BPE tokenizer in a principled way. (paper)

Use order: quicktour → tokenizer API → LLM course train-new → shrinking threads/issues → SP trimming if Unigram → pitfalls/perf → BPE-Knockout.

