I want to distill a pretrained BPE tokenizer for my domain-specific corpus. Is there anything I should pay attention to?
What I have in mind is to use the pretrained tokenizer to tokenize all sentences of the corpus (I already did this), find the tokens that are actually used, and remove the unused ones from the vocabulary. Should I also take care of the merges and make the new tokenizer a BPE tokenizer again, or should I just use the vocabulary subset to build a WordLevel tokenizer? Has anyone done the same thing before?
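To be concrete, the first step looks roughly like this (a minimal sketch, not my actual script; "gpt2" and the `corpus` list are placeholders for the pretrained tokenizer and my domain data):

```python
# Collect the set of subword tokens actually produced on the corpus.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")        # any pretrained BPE tokenizer
corpus = ["\\frac{1}{2} of the domain-specific text"]    # placeholder for the real corpus

used_tokens: set[str] = set()
for line in corpus:
    used_tokens.update(tokenizer.tokenize(line))          # subword strings actually produced

print(f"{len(used_tokens)} of {tokenizer.vocab_size} vocabulary tokens are used")
```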
Hi John, I’m following your pruning script. The pruned tokenizer can be constructed and loaded, but it doesn’t behave the same as the original one, especially for merged tokens (the original merges them but the new one doesn’t).
Is there a debug mode where we can see how tokens get merged during tokenization?
I see, there are some nuances to the merging procedure. In my case I have f, r, a, c, and frac as tokens, but there is no merge path from f, r, a, c to frac, since none of the intermediate combinations (fr, fra) exists in my keep-vocab file.
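To make it concrete, a rough replay of the greedy merge procedure (just my own sketch, not the library’s internal code) shows why frac is unreachable without the intermediate merges:

```python
# Replay BPE merges on one pre-tokenized word, printing every merge that fires.
def trace_bpe(word: str, merges: list[tuple[str, str]]) -> list[str]:
    ranks = {pair: i for i, pair in enumerate(merges)}    # lower rank = applied earlier
    symbols = list(word)                                  # start from single characters
    while len(symbols) > 1:
        # find the adjacent pair with the best (lowest) merge rank
        pairs = [(ranks.get((a, b), float("inf")), i, (a, b))
                 for i, (a, b) in enumerate(zip(symbols, symbols[1:]))]
        rank, i, pair = min(pairs)
        if rank == float("inf"):                          # no applicable merge left
            break
        print(f"apply merge {pair!r} -> {pair[0] + pair[1]!r}")
        symbols[i:i + 2] = [pair[0] + pair[1]]
    return symbols

# Without the ('f', 'r') merge, "frac" can never be formed:
print(trace_bpe("frac", [("fr", "a"), ("fra", "c")]))               # stays ['f', 'r', 'a', 'c']
print(trace_bpe("frac", [("f", "r"), ("fr", "a"), ("fra", "c")]))   # -> ['frac']
```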
Aha, I found a way to include the minimal merge closure so that every token in my keep vocab is still reachable by merges: just slightly modify the function below. I’ve validated that this closure gives exactly the same behavior as the original tokenizer (at least on my corpus).
```python
def filter_merges_to_subset(merges: list[tuple[str, str]], keep: set[str]) -> list[tuple[str, str]]:
    """Keep a merge (a, b) whenever a+b is in `keep`, and add a and b back to `keep`
    so that an accessible merge path to a+b exists. Repeat until no new tokens show up,
    i.e. take the minimal merge closure of `keep` (mutated in place)."""
    kept_pairs: set[tuple[str, str]] = set()
    new_keep: set[str] = set()
    while True:
        keep |= new_keep
        for a, b in merges:
            if a + b in keep:
                kept_pairs.add((a, b))
                new_keep.update((a, b))   # a and b must themselves stay reachable
        if not (new_keep - keep):         # no new intermediate tokens found -> done
            break
    # BPE merges are greedy and ordered; re-emit the kept merges in their original order,
    # since collecting them over several passes would otherwise scramble it.
    return [m for m in merges if m in kept_pairs]
```
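For anyone following along, here is roughly how I plug the filtered merges into a new tokenizer and sanity-check it. This is only a sketch: "gpt2" is a placeholder model name, and `old_merges` (the original ordered merge list, e.g. parsed from tokenizer.json), `used_tokens`, and `corpus` are assumed to exist from the earlier steps.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

original = Tokenizer.from_pretrained("gpt2")             # placeholder model name
old_vocab = original.get_vocab()

keep = set(used_tokens)
new_merges = filter_merges_to_subset(old_merges, keep)   # mutates `keep`: it now holds the closure

# re-index the kept tokens, preserving the original id order
new_vocab = {tok: i for i, tok in enumerate(sorted(keep, key=old_vocab.get))}
pruned = Tokenizer(BPE(vocab=new_vocab, merges=new_merges))
pruned.pre_tokenizer = original.pre_tokenizer            # copy normalizer/decoder too if present

# the pruned tokenizer should produce exactly the same token sequences on the corpus
for line in corpus:
    assert pruned.encode(line).tokens == original.encode(line).tokens
```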
To give some idea of the resulting sizes:
Before the fix: ~950 tokens + 741 merges
After the fix: 1264 tokens + 1004 merges (some intermediate tokens needed for the merge paths are added, even though they never occur in the final tokenization)
Original: 50000 tokens + 49721 merges
But all in all, the distillation is worth it.
(Refined it a little: the previous version worked but contained duplicate merges.)
BTW, thank you so much for your very detailed answer. I’m really grateful that you added so many references. Could you give me a reading list for learning about Transformers or Tokenizers? I saw you refer to a Transformers notebook blog, but perhaps you know of more helpful materials than that? Sometimes I find the chat AIs are not so intelligent when I ask about the Transformers/Tokenizers APIs.