How to build a tokenizer from a vocab subset of a BPE tokenizer

Hi community,

I want to distill a pretrained BPE tokenizer for my domain-specific corpus. Is there anything I should pay attention to?

What I have in mind is to first tokenize all sentences of the corpus with the pretrained tokenizer (I already did this), find the tokens that are actually used, and drop the unused ones from the vocabulary. Should I also take care of the merges and make the new tokenizer a BPE tokenizer again, or should I just use the vocabulary subset to build a WordLevel tokenizer? Has anyone already done the same thing?
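
For context, the tokenize-and-count step I already did looks roughly like this (a sketch, not my exact script; the model name and corpus are placeholders):

from collections import Counter
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder pretrained BPE tokenizer
corpus = ["\\frac{a}{b} + c", "x^{2} + y^{2}"]      # placeholder domain sentences

# Tokenize every sentence and count which token ids actually occur.
used_ids = Counter()
for ids in tokenizer(corpus)["input_ids"]:
    used_ids.update(ids)

keep = {tokenizer.convert_ids_to_tokens(i) for i in used_ids}
print(f"{len(keep)} of {tokenizer.vocab_size} tokens are used on this corpus")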

Thanks!

alephpi


It seems more stable to avoid modifying the existing BPE tokenizer as much as possible. Well, maybe because the core part of the Tokenizers library is written in Rust…


I see, let me check your solution, since I really need to distill the vocabulary: it will enormously reduce my model size (from 50,000 tokens to fewer than 1,000).


Unless we change it to a WordLevel tokenizer, the distillation itself seems possible without touching the Rust-written parts.
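
For illustration, a minimal sketch of rebuilding a pruned BPE backend with the Python API only (not the exact script from the linked thread; the vocabulary, merges, and ByteLevel settings below are placeholders and would have to match the original pipeline):

from tokenizers import Tokenizer, decoders, pre_tokenizers
from tokenizers.models import BPE

# Placeholder pruned vocabulary and merges; in practice these come from the
# original tokenizer's vocab/merges, filtered down to the tokens kept for the corpus.
new_vocab = {"f": 0, "r": 1, "a": 2, "c": 3, "fr": 4, "fra": 5, "frac": 6}
new_merges = [("f", "r"), ("fr", "a"), ("fra", "c")]

pruned = Tokenizer(BPE(vocab=new_vocab, merges=new_merges, unk_token=None))
# The normalizer / pre-tokenizer / decoder must mirror the original pipeline
# (ByteLevel here, as in GPT-2-style tokenizers), otherwise behavior changes.
pruned.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
pruned.decoder = decoders.ByteLevel()
pruned.save("pruned-tokenizer.json")

print(pruned.encode("frac").tokens)   # ['frac']

Note that trimming a ByteLevel BPE's byte alphabet changes its coverage, and a pruned vocabulary also means the model's embedding rows and token ids need remapping.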

Hi John, I’m following your pruning script. The new tokenizer can be constructed and loaded, but it doesn’t behave the same as the original one, especially for merged tokens (the original merges them but the new one doesn’t).

Is there a debug mode where we can see how tokens get merged during tokenization?


I see, there are some nuances in the merging procedure. In my case I have f, r, a, c, and frac as tokens, but there is no merge path from f, r, a, c up to frac, since none of the intermediate combinations (such as fr or fra) exists in my keep-vocab file.
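
To make it concrete, here’s a tiny standalone simulation of greedy BPE merging (pure Python, just for illustration, not the tokenizers library itself):

def apply_bpe(word: str, merges: list[tuple[str, str]]) -> list[str]:
    # Greedy BPE: repeatedly merge the adjacent pair with the best (lowest) rank.
    rank = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while True:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        candidates = [p for p in pairs if p in rank]
        if not candidates:
            return symbols
        a, b = min(candidates, key=lambda p: rank[p])
        merged = []
        i = 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged

# Keeping only the final merge leaves "frac" unreachable from characters:
print(apply_bpe("frac", [("fra", "c")]))                               # ['f', 'r', 'a', 'c']
# With the intermediate merges kept, the chain goes through:
print(apply_bpe("frac", [("f", "r"), ("fr", "a"), ("fra", "c")]))      # ['frac']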


Aha, I found a way to include the minimal merge closure so that every token in my keep vocab can be reached by merges: just slightly modify the function below. I’ve validated that such a closure provides exactly the same behavior as the original tokenizer (at least on my corpus).

def filter_merges_to_subset(merges: list[tuple[str, str]], keep: set[str]) -> list[tuple[str, str]]:
    # Keep merge (a, b) when a+b belongs to `keep`, and add a and b back into `keep`
    # so that an accessible merge path up to a+b exists.
    # Repeat until no new merge path can be found (a fixed point).
    # BPE merges are greedy and ordered, so the original order must be preserved.
    filtered_raw: list[tuple[str, str]] = []
    new_keep: set[str] = set()
    while True:
        keep |= new_keep
        for a, b in merges:
            merged = a + b
            if merged in keep and (a, b) not in filtered_raw:
                filtered_raw.append((a, b))
                new_keep.update((a, b))
        if not (new_keep - keep):
            break

    # Re-sort the kept merges into their original order: collecting them over
    # several passes of the while loop breaks the ordering.
    filtered = [merge for merge in merges if merge in filtered_raw]
    return filtered
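
For example, on a toy merge list (made-up tokens, just to show the closure being pulled in):

merges = [("f", "r"), ("r", "a"), ("fr", "a"), ("a", "c"), ("fra", "c")]
keep = {"f", "r", "a", "c", "frac"}

filtered = filter_merges_to_subset(merges, keep)
print(filtered)                                       # [('f', 'r'), ('fr', 'a'), ('fra', 'c')]
print(sorted(keep - {"f", "r", "a", "c", "frac"}))    # ['fr', 'fra'] -- keep is grown in place with the closure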

To give some idea of the numbers:

Before debugging: ~950 tokens + 741 merges

After debugging: 1264 tokens + 1004 merges (some intermediate tokens are added for merge paths, though they never occur in the final tokenized output)

Original: 50000 tokens + 49721 merges

But after all, it’s still worth distilling.

(Refined a little bit; the previous version worked but contained repetitive merges.)


BTW, thank you so much for your very detailed answer. I’m so grateful that you added so many references. Would you give me a reading list for learning Transformers or Tokenizers? I saw you refer to a Transformers notebook blog, but perhaps you know of more helpful materials than that? Sometimes I find the chat AIs are not so intelligent when I ask them about the Transformers/Tokenizers APIs.


I saw you refer to a Transformers notebook blog, but perhaps you know of more helpful materials than that?

About Transformers… (by me)

The reading list below is by GPT.

Start here

  • Tokenizers quicktour. Build and train BPE end-to-end; inspect tokenizer.json. (Hugging Face)

  • Transformers tokenizer API. Fast vs. slow, specials, saving, resizing. (Hugging Face)

  • LLM Course: train a new tokenizer from an old one (train_new_from_iterator); see the sketch after this list. (Hugging Face)

  • Transformers quicktour for full workflow context. (Hugging Face)

  • Your earlier outline, consolidated.
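
A minimal sketch of the train_new_from_iterator route mentioned above (model name and corpus are placeholders; note this retrains the vocabulary, so it will not reproduce the original tokenizer’s ids or segmentation the way pruning does):

from transformers import AutoTokenizer

old_tok = AutoTokenizer.from_pretrained("gpt2")     # placeholder base tokenizer
corpus = ["\\frac{a}{b} + c", "x^{2} + y^{2}"]      # placeholder domain sentences

# Reuses the old pipeline (normalization, pre-tokenization, special tokens) but
# learns a fresh, small BPE vocabulary from the domain corpus.
new_tok = old_tok.train_new_from_iterator(corpus, vocab_size=1000)
new_tok.save_pretrained("domain-tokenizer")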

Distillation and pruning (practical)

  • “Tokenizer shrinking recipes.” Multiple working scripts and caveats. (Hugging Face Forums)

  • Removing tokens from GPT/BPE tokenizers: why simple deletion fails; recreate backend. (Hugging Face Forums)

  • Tokenizers issue on vocab reduction pitfalls and current guidance. (GitHub)

SentencePiece / Unigram

  • Trim down SentencePiece vocabulary by editing ModelProto.pieces (step-by-step). (Hugging Face)

  • SentencePiece training options, including hard_vocab_limit.

Tokenizer types and behavior

  • Summary of tokenizers: BPE vs WordPiece vs Unigram, pros and trade-offs. (Hugging Face)

  • Fast tokenizers docs: offsets, alignment, performance notes. (Hugging Face)

  • Building a tokenizer from scratch (mix and match normalizers, pre-tokenizers, models). (Hugging Face)

Pitfalls to avoid

  • Cleaning or changing ByteLevel BPE alphabets alters coverage; know consequences. (Hugging Face Forums)

  • Keep config.vocab_size synced when resizing embeddings; common failure mode (see the sketch after this list). (Hugging Face)

  • Space handling in BPE tokenizers (add_prefix_space) affects segmentation. (Hugging Face Forums)
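
A minimal sketch of the vocab_size sync check from the pitfalls above, assuming a GPT-2 checkpoint (note that resize_token_embeddings only truncates or appends rows at the end; a pruned vocabulary with remapped ids needs its kept embedding rows gathered manually):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in for the shrunken tokenizer
model = AutoModelForCausalLM.from_pretrained("gpt2")

model.resize_token_embeddings(len(tokenizer))       # resizes input (and tied output) embeddings
assert model.config.vocab_size == len(tokenizer)    # config must stay in sync with the tokenizer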

Performance tips

  • Use fast tokenizers; confirm is_fast; batch properly; multiprocessing guidance. (Hugging Face Forums)

  • Tokenizers Python docs for API surface and saving formats. (Hugging Face)

Research for principled pruning

  • BPE-Knockout: pruning merges out of a pre-existing BPE tokenizer in a principled way. (paper)

Use order: quicktour → tokenizer API → LLM course train-new → shrinking threads/issues → SP trimming if Unigram → pitfalls/perf → BPE-Knockout.

