I saw you refer to a Transformers notebook blog, but perhaps you know of more helpful materials than that?
About Transformers…
by Me.
- smol course
- The Large Language Model Course
- So You Want to Learn LLMs? Here’s the Roadmap
- Transformers-Tutorials
- Triton: Tutorials
- AI Study Group
by GPT.
Start here
- Tokenizers quicktour. Build and train a BPE tokenizer end-to-end; inspect tokenizer.json. (Hugging Face)
- Transformers tokenizer API. Fast vs. slow tokenizers, special tokens, saving, resizing. (Hugging Face)
- LLM Course: train a new tokenizer from an old one with train_new_from_iterator; see the sketch after this list. (Hugging Face)
- Transformers quicktour for full workflow context. (Hugging Face)
- Your earlier outline, consolidated.
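A minimal sketch of the train_new_from_iterator recipe from the LLM Course. The gpt2 checkpoint, the toy corpus, and the vocab size are placeholder assumptions:

```python
from transformers import AutoTokenizer

# Retrain an existing fast tokenizer's algorithm (here BPE) on a new corpus.
corpus = ["def add(a, b):\n    return a + b", "print(add(1, 2))"]  # placeholder data

def batch_iterator(batch_size=1000):
    for i in range(0, len(corpus), batch_size):
        yield corpus[i : i + batch_size]

old_tok = AutoTokenizer.from_pretrained("gpt2")  # must be a fast tokenizer
new_tok = old_tok.train_new_from_iterator(batch_iterator(), vocab_size=8_000)
new_tok.save_pretrained("my-new-tokenizer")
```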
Distillation and pruning (practical)
- “Tokenizer shrinking recipes.” Multiple working scripts and caveats. (Hugging Face Forums)
- Removing tokens from GPT/BPE tokenizers: why simple deletion fails; recreate the backend instead (sketch after this list). (Hugging Face Forums)
- Tokenizers issue on vocab-reduction pitfalls and current guidance. (GitHub)
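A hedged sketch of the shape those shrinking recipes take for byte-level BPE: keep a prefix of the vocab (base alphabet plus the earliest merges), filter the merge list to match, and rebuild the backend from the edited tokenizer.json rather than deleting entries in place. File names and the target size are assumptions, and special/added tokens need extra remapping not shown here:

```python
import json
from transformers import PreTrainedTokenizerFast

TARGET_VOCAB = 16_000                                  # assumed target size

with open("tokenizer.json") as f:                      # from tokenizer.save_pretrained(...)
    data = json.load(f)

# Byte-level BPE vocabs are ordered: base alphabet first, then tokens in merge
# order, so keeping the first TARGET_VOCAB ids keeps a self-consistent prefix.
vocab = data["model"]["vocab"]                         # token -> id
kept = {tok: i for tok, i in vocab.items() if i < TARGET_VOCAB}

new_merges = []
for merge in data["model"]["merges"]:
    # Merges are "left right" strings or [left, right] pairs depending on version.
    left, right = merge if isinstance(merge, list) else merge.split(" ")
    if left in kept and right in kept and (left + right) in kept:
        new_merges.append(merge)

data["model"]["vocab"] = kept
data["model"]["merges"] = new_merges

with open("tokenizer_small.json", "w") as f:
    json.dump(data, f, ensure_ascii=False)

small = PreTrainedTokenizerFast(tokenizer_file="tokenizer_small.json")
```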
SentencePiece / Unigram
- Trim down a SentencePiece vocabulary by editing ModelProto.pieces (step-by-step); see the sketch below. (Hugging Face)
- SentencePiece training options, including hard_vocab_limit.
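A hedged sketch of the ModelProto.pieces approach, assuming a Unigram model file named spiece.model and a keep-set computed from your target corpus; control, unknown, and byte pieces are preserved so special ids keep working:

```python
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

m = sp_pb2.ModelProto()
with open("spiece.model", "rb") as f:                  # assumed model file
    m.ParseFromString(f.read())

keep = {"▁the", "▁and"}                                # placeholder: pieces seen on your corpus
NORMAL = sp_pb2.ModelProto.SentencePiece.NORMAL

# Copy everything except the pieces, then re-add only what we want to keep.
trimmed = sp_pb2.ModelProto()
trimmed.CopyFrom(m)
del trimmed.pieces[:]
for p in m.pieces:
    if p.type != NORMAL or p.piece in keep:
        trimmed.pieces.add().CopyFrom(p)

with open("spiece_small.model", "wb") as f:
    f.write(trimmed.SerializeToString())

sp = spm.SentencePieceProcessor(model_file="spiece_small.model")
print(sp.encode("the and", out_type=str))
```

(When training from scratch instead, hard_vocab_limit=False lets SentencePiece fall short of the requested vocab size rather than erroring out.)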
Tokenizer types and behavior
- Summary of tokenizers: BPE vs. WordPiece vs. Unigram, pros and trade-offs. (Hugging Face)
- Fast tokenizers docs: offsets, alignment, performance notes. (Hugging Face)
- Building a tokenizer from scratch (mix and match normalizers, pre-tokenizers, models). (Hugging Face)
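A small mix-and-match sketch in the spirit of that chapter; the normalizer choices, corpus file, and vocab size are illustrative assumptions:

```python
from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, trainers

# Assemble the pipeline piece by piece: normalizer -> pre-tokenizer -> model -> decoder.
tok = Tokenizer(models.BPE())
tok.normalizer = normalizers.Sequence([normalizers.NFKC(), normalizers.Lowercase()])
tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tok.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=8_000,                                          # placeholder size
    special_tokens=["<pad>", "<unk>", "<s>", "</s>"],          # placeholder specials
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),      # cover all 256 bytes
)
tok.train(files=["corpus.txt"], trainer=trainer)               # assumed local corpus
tok.save("scratch_tokenizer.json")
```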
Pitfalls to avoid
- Cleaning or changing ByteLevel BPE alphabets alters coverage; know the consequences. (Hugging Face Forums)
- Keep config.vocab_size synced when resizing embeddings; a common failure mode (first sketch below). (Hugging Face)
- Space handling in BPE tokenizers (add_prefix_space) affects segmentation (second sketch below). (Hugging Face Forums)
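On the vocab_size pitfall, a minimal sketch assuming a gpt2 placeholder checkpoint: resize the embeddings from len(tokenizer) and check that the config stays in sync:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")        # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained("gpt2")

tokenizer.add_tokens(["<doc>", "</doc>"])                # illustrative additions
model.resize_token_embeddings(len(tokenizer))            # also updates model.config.vocab_size

# The classic failure mode is these three numbers drifting apart.
assert model.config.vocab_size == len(tokenizer) == model.get_input_embeddings().num_embeddings
```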
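For the space-handling pitfall, a quick comparison, assuming roberta-base as a placeholder byte-level BPE checkpoint:

```python
from transformers import AutoTokenizer

default_tok = AutoTokenizer.from_pretrained("roberta-base")
prefixed_tok = AutoTokenizer.from_pretrained("roberta-base", add_prefix_space=True)

# Compare how the first word is segmented: with add_prefix_space=True the leading
# word is tokenized as if preceded by a space (note the "Ġ" marker).
print(default_tok.tokenize("hello world"))
print(prefixed_tok.tokenize("hello world"))
```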
Performance tips
- Use fast tokenizers; confirm is_fast; batch calls properly; multiprocessing guidance (sketch below). (Hugging Face Forums)
- Tokenizers Python docs for the API surface and saving formats. (Hugging Face)
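A quick check of those tips, with bert-base-uncased as a placeholder checkpoint:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
assert tok.is_fast                                       # Rust-backed fast tokenizer

texts = ["first example", "second example", "third example"]
# One batched call lets the Rust backend parallelize internally; avoid a Python loop.
batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
print(batch["input_ids"].shape)
```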
Research for principled pruning
- BPE-Knockout: theoretically grounded pruning of BPE merges; paper + overview. (Hugging Face Forums)
Use order: quicktour → tokenizer API → LLM course train-new → shrinking threads/issues → SP trimming if Unigram → pitfalls/perf → BPE-Knockout.