How to build a tokenizer from a vocab subset of a BPE tokenizer

I saw you refer to a Transformers notebook blog post, but perhaps you know of more helpful materials than that?

by Me.


by GPT.

Start here

  • Tokenizers quicktour. Build and train BPE end-to-end; inspect tokenizer.json. (Hugging Face)

  • Transformers tokenizer API. Fast vs. slow, specials, saving, resizing. (Hugging Face)

  • LLM Course: train a new tokenizer from an old one with train_new_from_iterator (see the sketch after this list). (Hugging Face)

  • Transformers quicktour for full workflow context. (Hugging Face)

  • Your earlier outline, consolidated.
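
For the train_new_from_iterator route above, a minimal sketch; the corpus, vocab_size, and model names are placeholders, not part of any of the linked guides:

```python
from transformers import AutoTokenizer

# Load the tokenizer whose design (normalizer, pre-tokenizer, specials) you
# want to keep; "gpt2" is a stand-in for your actual model.
old_tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Placeholder corpus; in practice, stream your in-domain text.
corpus = ["some in-domain text", "more in-domain text"]

def batch_iterator(batch_size=1000):
    for i in range(0, len(corpus), batch_size):
        yield corpus[i : i + batch_size]

# vocab_size is the size of the *new* vocabulary you want.
new_tokenizer = old_tokenizer.train_new_from_iterator(
    batch_iterator(), vocab_size=8000
)
new_tokenizer.save_pretrained("my-retrained-tokenizer")
```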

Distillation and pruning (practical)

  • “Tokenizer shrinking recipes.” Multiple working scripts and caveats. (Hugging Face Forums)

  • Removing tokens from GPT/BPE tokenizers: why simple deletion fails and why you must recreate the backend (sketched after this list). (Hugging Face Forums)

  • Tokenizers issue on vocab reduction pitfalls and current guidance. (GitHub)
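
On the "recreate the backend" point: a fast tokenizer's Rust backend cannot be edited in place, so the forum recipes dump its state, filter vocab and merges, and rebuild. A hedged sketch under those assumptions; keep_ids is a hypothetical subset, and for ByteLevel BPE you must keep the full byte alphabet:

```python
import json
from tokenizers import Tokenizer

tok = Tokenizer.from_pretrained("gpt2")
state = json.loads(tok.to_str())

keep_ids = set(range(5000))  # placeholder: the ids you decided to keep

vocab = state["model"]["vocab"]
kept = {t for t, i in vocab.items() if i in keep_ids}

# Drop any merge whose parts or product no longer exist, so every merge the
# pruned model can still apply yields a known token. Merges are "a b" strings
# in older tokenizer.json files and ["a", "b"] pairs in newer ones.
merges = []
for m in state["model"]["merges"]:
    a, b = m.split(" ") if isinstance(m, str) else m
    if a in kept and b in kept and (a + b) in kept:
        merges.append(m)

# Reassign contiguous ids, preserving the original order. Note that entries
# in state["added_tokens"] also carry ids: keep your specials in keep_ids,
# or fix those entries up as well.
state["model"]["vocab"] = {t: i for i, t in enumerate(sorted(kept, key=vocab.get))}
state["model"]["merges"] = merges
pruned = Tokenizer.from_str(json.dumps(state))
```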

SentencePiece / Unigram

  • Trim down a SentencePiece vocabulary by editing ModelProto.pieces, step by step (see the sketch after this list). (Hugging Face)

  • SentencePiece training options, including hard_vocab_limit.
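
For the ModelProto trimming route, a sketch assuming a Unigram .model file at a placeholder path; the keep() policy here is hypothetical and only illustrates the filtering step:

```python
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

m = sp_pb2.ModelProto()
with open("spm.model", "rb") as f:  # placeholder path
    m.ParseFromString(f.read())

def keep(piece):
    # Hypothetical policy: always keep non-NORMAL pieces (control, byte,
    # user-defined); otherwise keep only ASCII pieces.
    if piece.type != sp_pb2.ModelProto.SentencePiece.NORMAL:
        return True
    return piece.piece.isascii()

# Copy the surviving pieces (scores travel with them), then rewrite the list.
kept = []
for p in m.pieces:
    if keep(p):
        q = sp_pb2.ModelProto.SentencePiece()
        q.CopyFrom(p)
        kept.append(q)
del m.pieces[:]
m.pieces.extend(kept)

with open("spm-trimmed.model", "wb") as f:
    f.write(m.SerializeToString())
```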

Tokenizer types and behavior

  • Summary of tokenizers: BPE vs WordPiece vs Unigram, pros and trade-offs. (Hugging Face)

  • Fast tokenizers docs: offsets, alignment, performance notes. (Hugging Face)

  • Building a tokenizer from scratch: mix and match normalizers, pre-tokenizers, and models (see the sketch after this list). (Hugging Face)
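
The from-scratch, mix-and-match pattern looks roughly like this: a byte-level BPE assembled from stock components, with placeholder sizes and specials:

```python
from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, trainers

# Pick each component independently: model, normalizer, pre-tokenizer, decoder.
tok = Tokenizer(models.BPE())
tok.normalizer = normalizers.NFKC()
tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tok.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=8000,  # placeholder
    special_tokens=["<s>", "</s>"],
    # Seed with the 256 byte symbols so any input stays representable.
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
tok.train_from_iterator(["placeholder training text", "more text"], trainer=trainer)
tok.save("scratch-tokenizer.json")
```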

Pitfalls to avoid

  • Cleaning or changing the ByteLevel BPE base alphabet alters byte coverage; know the consequences before removing any of it. (Hugging Face Forums)

  • Keep config.vocab_size in sync when resizing embeddings; a common failure mode (see the check after this list). (Hugging Face)

  • Space handling in BPE tokenizers (add_prefix_space) affects segmentation. (Hugging Face Forums)
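
On the vocab_size pitfall, a sketch of the check; model and tokenizer paths are placeholders, and note that if your kept ids are not a contiguous prefix of the old vocabulary, you must also remap the surviving embedding rows yourself:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder
tokenizer = AutoTokenizer.from_pretrained("my-retrained-tokenizer")  # placeholder

# resize_token_embeddings updates config.vocab_size; verify nothing drifted.
model.resize_token_embeddings(len(tokenizer))
assert model.config.vocab_size == len(tokenizer)
assert model.get_input_embeddings().weight.shape[0] == len(tokenizer)
```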

Performance tips

  • Use fast tokenizers; confirm is_fast; batch properly; multiprocessing guidance (see the check after this list). (Hugging Face Forums)

  • Tokenizers Python docs for API surface and saving formats. (Hugging Face)
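
A quick sanity check for the fast-tokenizer tip; the batch contents are placeholders:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
assert tokenizer.is_fast  # otherwise you got the slow Python fallback

# GPT-2 has no pad token; reuse EOS so padded batches work.
tokenizer.pad_token = tokenizer.eos_token

texts = ["first example", "second, longer example"]  # placeholder batch
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
```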

Research for principled pruning

  • BPE-Knockout: prune merges from an existing BPE tokenizer guided by morphological boundaries, keeping the result backwards-compatible. (ACL Anthology)

Use order: quicktour → tokenizer API → LLM course train-new → shrinking threads/issues → SP trimming if Unigram → pitfalls/perf → BPE-Knockout.
