How to build a tokenizer from a vocab subset of a BPE tokenizer

I saw you refer to a Transformers notebook blog post, but perhaps you know of more helpful materials than that?

by Me.


by GPT.

Start here

  • Tokenizers quicktour. Build and train BPE end-to-end; inspect tokenizer.json. (Hugging Face)

  • Transformers tokenizer API. Fast vs. slow, specials, saving, resizing. (Hugging Face)

  • LLM Course: train a new tokenizer from an old one with train_new_from_iterator (see the sketch after this list). (Hugging Face)

  • Transformers quicktour for full workflow context. (Hugging Face)

  • Your earlier outline, consolidated.
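
For the train_new_from_iterator route above, a minimal sketch; the corpus, vocab_size, and model names are placeholders, not part of any of the linked guides:

```python
from transformers import AutoTokenizer

# Load the tokenizer whose design (normalizer, pre-tokenizer, specials) you
# want to keep; "gpt2" is a stand-in for your actual model.
old_tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Placeholder corpus; in practice, stream your in-domain text.
corpus = ["some in-domain text", "more in-domain text"]

def batch_iterator(batch_size=1000):
    for i in range(0, len(corpus), batch_size):
        yield corpus[i : i + batch_size]

# vocab_size is the size of the *new* vocabulary you want.
new_tokenizer = old_tokenizer.train_new_from_iterator(
    batch_iterator(), vocab_size=8000
)
new_tokenizer.save_pretrained("my-retrained-tokenizer")
```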

Distillation and pruning (practical)

  • “Tokenizer shrinking recipes.” Multiple working scripts and caveats. (Hugging Face Forums)

  • Removing tokens from GPT/BPE tokenizers: why simple deletion fails and why you must recreate the backend (sketched after this list). (Hugging Face Forums)

  • Tokenizers issue on vocab reduction pitfalls and current guidance. (GitHub)
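
On the "recreate the backend" point: a fast tokenizer's Rust backend cannot be edited in place, so the forum recipes dump its state, filter vocab and merges, and rebuild. A hedged sketch under those assumptions; keep_ids is a hypothetical subset, and for ByteLevel BPE you must keep the full byte alphabet:

```python
import json
from tokenizers import Tokenizer

tok = Tokenizer.from_pretrained("gpt2")
state = json.loads(tok.to_str())

keep_ids = set(range(5000))  # placeholder: the ids you decided to keep

vocab = state["model"]["vocab"]
kept = {t for t, i in vocab.items() if i in keep_ids}

# Drop any merge whose parts or product no longer exist, so every merge the
# pruned model can still apply yields a known token. Merges are "a b" strings
# in older tokenizer.json files and ["a", "b"] pairs in newer ones.
merges = []
for m in state["model"]["merges"]:
    a, b = m.split(" ") if isinstance(m, str) else m
    if a in kept and b in kept and (a + b) in kept:
        merges.append(m)

# Reassign contiguous ids, preserving the original order. Note that entries
# in state["added_tokens"] also carry ids: keep your specials in keep_ids,
# or fix those entries up as well.
state["model"]["vocab"] = {t: i for i, t in enumerate(sorted(kept, key=vocab.get))}
state["model"]["merges"] = merges
pruned = Tokenizer.from_str(json.dumps(state))
```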

SentencePiece / Unigram

  • Trim down a SentencePiece vocabulary by editing ModelProto.pieces, step by step (see the sketch after this list). (Hugging Face)

  • SentencePiece training options, including hard_vocab_limit.
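
For the ModelProto trimming route, a sketch assuming a Unigram .model file at a placeholder path; the keep() policy here is hypothetical and only illustrates the filtering step:

```python
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

m = sp_pb2.ModelProto()
with open("spm.model", "rb") as f:  # placeholder path
    m.ParseFromString(f.read())

def keep(piece):
    # Hypothetical policy: always keep non-NORMAL pieces (control, byte,
    # user-defined); otherwise keep only ASCII pieces.
    if piece.type != sp_pb2.ModelProto.SentencePiece.NORMAL:
        return True
    return piece.piece.isascii()

# Copy the surviving pieces (scores travel with them), then rewrite the list.
kept = []
for p in m.pieces:
    if keep(p):
        q = sp_pb2.ModelProto.SentencePiece()
        q.CopyFrom(p)
        kept.append(q)
del m.pieces[:]
m.pieces.extend(kept)

with open("spm-trimmed.model", "wb") as f:
    f.write(m.SerializeToString())
```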

Tokenizer types and behavior

  • Summary of tokenizers: BPE vs WordPiece vs Unigram, pros and trade-offs. (Hugging Face)

  • Fast tokenizers docs: offsets, alignment, performance notes. (Hugging Face)

  • Building a tokenizer from scratch: mix and match normalizers, pre-tokenizers, and models (see the sketch after this list). (Hugging Face)
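
The from-scratch, mix-and-match pattern looks roughly like this: a byte-level BPE assembled from stock components, with placeholder sizes and specials:

```python
from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, trainers

# Pick each component independently: model, normalizer, pre-tokenizer, decoder.
tok = Tokenizer(models.BPE())
tok.normalizer = normalizers.NFKC()
tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tok.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=8000,  # placeholder
    special_tokens=["<s>", "</s>"],
    # Seed with the 256 byte symbols so any input stays representable.
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
tok.train_from_iterator(["placeholder training text", "more text"], trainer=trainer)
tok.save("scratch-tokenizer.json")
```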

Pitfalls to avoid

  • Cleaning or changing the ByteLevel BPE base alphabet alters byte coverage; know the consequences before removing any of it. (Hugging Face Forums)

  • Keep config.vocab_size in sync when resizing embeddings; a common failure mode (see the check after this list). (Hugging Face)

  • Space handling in BPE tokenizers (add_prefix_space) affects segmentation. (Hugging Face Forums)
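
On the vocab_size pitfall, a sketch of the check; model and tokenizer paths are placeholders, and note that if your kept ids are not a contiguous prefix of the old vocabulary, you must also remap the surviving embedding rows yourself:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder
tokenizer = AutoTokenizer.from_pretrained("my-retrained-tokenizer")  # placeholder

# resize_token_embeddings updates config.vocab_size; verify nothing drifted.
model.resize_token_embeddings(len(tokenizer))
assert model.config.vocab_size == len(tokenizer)
assert model.get_input_embeddings().weight.shape[0] == len(tokenizer)
```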

Performance tips

  • Use fast tokenizers; confirm is_fast; batch properly; multiprocessing guidance (see the check after this list). (Hugging Face Forums)

  • Tokenizers Python docs for API surface and saving formats. (Hugging Face)
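
A quick sanity check for the fast-tokenizer tip; the batch contents are placeholders:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
assert tokenizer.is_fast  # otherwise you got the slow Python fallback

# GPT-2 has no pad token; reuse EOS so padded batches work.
tokenizer.pad_token = tokenizer.eos_token

texts = ["first example", "second, longer example"]  # placeholder batch
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
```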

Research for principled pruning

  • BPE-Knockout: prune merges from an existing BPE tokenizer guided by morphological boundaries, keeping the result backwards-compatible. (ACL Anthology)

Use order: quicktour → tokenizer API → LLM course train-new → shrinking threads/issues → SP trimming if Unigram → pitfalls/perf → BPE-Knockout.
