MacBook Pro M2 Max 96gb macOS 13.3 tensorflow-macos 2.9.0 tensorflow-metal 0.5.0
Repro Code:
from transformers import AutoTokenizer, TFDistilBertForSequenceClassification
from datasets import load_dataset
from tqdm import tqdm
import numpy as np

imdb = load_dataset('imdb')
sentences = imdb['train']['text'][:500]
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-cased')

for i, sentence in tqdm(enumerate(sentences)):
    # One sentence at a time, truncated but not padded, so every call has a different input shape
    inputs = tokenizer(sentence, truncation=True, return_tensors='tf')
    output = model(inputs).logits
    pred = np.argmax(output.numpy(), axis=1)
    if i % 100 == 0:
        print(f"len(input_ids): {inputs['input_ids'].shape[-1]}")
It becomes excruciatingly slow after roughly the 300th-400th record. GPU utilization even dropped below 2% (lower than the WindowServer process). Here are the prints:
Metal device set to: Apple M2 Max
systemMemory: 96.00 GB
maxCacheSize: 36.00 GB
3it [00:00, 10.87it/s]
len(input_ids): 391
101it [00:13, 6.38it/s]
len(input_ids): 215
201it [00:34, 4.78it/s]
len(input_ids): 237
301it [00:55, 4.26it/s]
len(input_ids): 256
401it [01:54, 1.12it/s]
len(input_ids): 55
500it [03:40, 2.27it/s]
I am aware this loop looks wrong:
- Use batches for the GPU (a batched sketch follows below).
- Use the CPU if you want to run one example at a time.
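For what it's worth, a batched version of the same inference (reusing tokenizer, model, and sentences from the repro above) would look roughly like this. This is a minimal sketch; the batch_size of 16 and padding every batch to a fixed 512 tokens are my assumptions, not tested settings:

import numpy as np
from tqdm import tqdm

batch_size = 16  # assumption; tune for your memory budget
preds = []
for start in tqdm(range(0, len(sentences), batch_size)):
    batch = sentences[start:start + batch_size]
    # Pad every batch to the same fixed length so the GPU sees one stable input shape
    inputs = tokenizer(batch, truncation=True, padding='max_length',
                       max_length=512, return_tensors='tf')
    logits = model(inputs).logits
    preds.extend(np.argmax(logits.numpy(), axis=1))

Fixed shapes avoid re-specializing kernels for every new sequence length, which is my guess at why the per-sentence loop degrades.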
But it is still unsettling to watch the GPU utilization decay, because I don't think this happens on Colab (or on plain Linux with CUDA), so it seems to have something to do with Apple Silicon / tensorflow-metal.
I just wonder what the root cause could be. If a bug is indeed lurking around, it may rear its head when I do longer, bigger real training runs.
The sad news is that I did a real training trial, fine-tuning DistilBERT (with TF). I used batch_size=128 (which could be the culprit); the first ~200 steps went fine, but then it started hitting this error:
Error: command buffer exited with error status.
The Metal Performance Shaders operations encoded on it may not have completed.
Error:
(null)
Internal Error (0000000e:Internal Error)
<AGXG14XFamilyCommandBuffer: 0xf259be4d0>
label = <none>
device = <AGXG14CDevice: 0x1196d8200>
name = Apple M2 Max
commandQueue = <AGXG14XFamilyCommandQueue: 0x2d9e5e800>
label = <none>
device = <AGXG14CDevice: 0x1196d8200>
name = Apple M2 Max
retainedReferences = 1
These errors seemed quite ominous. I waited until it completed one epoch, which was >3x slower than a T4 (on Colab), with quite bad accuracy (possibly due to the large batch_size, which means fewer steps per epoch).
I plan to reduce the batch_size and see whether the error goes away. Even if a small batch_size is what my dataset size and fine-tuning call for, it is still worrying to see it fail at "only" batch_size=128. I got 96 GB precisely to explore larger batch sizes… if I have to reduce that, it may be cheaper to just go with Nvidia (at the cost of losing mobility and paying more in power).
Update: Again, I tracked this down to what looks like the unequal input lengths during training. I switched to padding to max_length so that every batch is exactly 512 tokens, and now I am getting performance on par with a T4 and >90% GPU utilization. This really points at a TF-Metal-specific bug with variable input shapes. I will try a larger batch next and get the $$'s worth; the M2 Max is not cheap.
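For reference, the fixed-shape setup that recovers utilization looks roughly like the sketch below. This is not my exact training script; the learning rate, the use of only the first 500 examples, and the Keras compile/fit wiring are assumptions for illustration:

import tensorflow as tf

# Assumption: labels taken from the same imdb slice as the texts above
labels = imdb['train']['label'][:500]

# Pad every example to a constant 512 tokens so tensorflow-metal sees one stable shape
enc = tokenizer(sentences, truncation=True, padding='max_length',
                max_length=512, return_tensors='tf')

train_ds = tf.data.Dataset.from_tensor_slices((dict(enc), labels)).batch(128)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),  # assumed hyperparameter
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
)
model.fit(train_ds, epochs=1)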