`decoder_start_token_id` discrepancy
Hey,
thanks for the model!
I have noticed a discrepancy: both your `mikr/whisper-large-v3-czech-cv13/generation_config.json` and the original `openai/whisper-large-v3/config.json` state

`"decoder_start_token_id": 50258,`

while your `mikr/whisper-large-v3-czech-cv13/config.json` states

`"decoder_start_token_id": 50257,`

Is that intentional? Which token should be used as the `decoder_start_token_id`? I believe the `decoder_start_token_id` values in the two configs should match.

If `50257` is the correct value, maybe it should also be fixed somewhere in the tokenizer, because:
```python
from transformers import WhisperTokenizerFast

tokenizer = WhisperTokenizerFast.from_pretrained("mikr/whisper-large-v3-czech-cv13")
ids = tokenizer("Hello")["input_ids"]
print(ids)
print(tokenizer.decode(ids))
```
prints:

```
[50258, 50283, 50360, 50364, 15947, 50257]
<|startoftranscript|><|cs|><|transcribe|><|notimestamps|>Hello<|endoftext|>
```
This means the ids are in the wrong format for use e.g. with teacher forcing. I have checked, and `tokenizer.build_inputs_with_special_tokens` also prepends the `50258` token.
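To illustrate why this matters (a plain-Python sketch using the token ids printed above): for teacher forcing, the labels should not begin with `50258`, because the model prepends the start token itself when building the decoder inputs, so labels and decoder inputs should be offset by one position.

```python
ids = [50258, 50283, 50360, 50364, 15947, 50257]  # tokenizer output from above

# Labels for teacher forcing should begin with the language token,
# not <|startoftranscript|> (50258), which the model adds itself:
labels = ids[1:] if ids[0] == 50258 else ids
print(labels)
# [50283, 50360, 50364, 15947, 50257]

# The decoder inputs built from these labels (with the correct start
# token 50258) are the labels shifted one step to the right:
decoder_input_ids = [50258] + labels[:-1]
print(decoder_input_ids)
# [50258, 50283, 50360, 50364, 15947]
```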
Hi,

I did some more testing, and apparently `50257` is the wrong `decoder_start_token_id`. The following code:
```python
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_path = "mikr/whisper-large-v3-czech-cv13"
processor = WhisperProcessor.from_pretrained(model_path)
processor.tokenizer.set_prefix_tokens(task="transcribe", language="cs", predict_timestamps=False)
model = WhisperForConditionalGeneration.from_pretrained(model_path)

# `dataset` is a loaded speech dataset with "audio" and "normalized" columns
audio = dataset[1]["audio"]
in_features = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features
labels = processor.tokenizer(dataset[1]["normalized"], return_tensors="pt", add_special_tokens=True).input_ids
labels = labels[:, 1:]  # remove the leading 50258 added by the tokenizer
res = model(input_features=in_features, labels=labels)
print(torch.argmax(res.logits, dim=-1))
```
produces just a sequence of `50258` (`<|startoftranscript|>`) tokens:

```
tensor([[50258, 50258, 50258, 50258, 50258, 50258, 50258, 50258]])
```
When Whisper is provided with `labels`, it prepends `self.config.decoder_start_token_id` to build the decoder input (snippet from the HF `transformers` source on GitHub):

```python
decoder_input_ids = shift_tokens_right(labels, self.config.pad_token_id, self.config.decoder_start_token_id)
```

So in this case it prepends `50257` as per your config and produces wrong output.
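For reference, here is a minimal pure-Python sketch of what `shift_tokens_right` does (the real helper in `transformers` operates on tensors; the label ids below match the example above):

```python
def shift_tokens_right(labels, pad_token_id, decoder_start_token_id):
    # Shift each row of labels one position to the right and put the
    # decoder start token in front; -100 (the loss ignore index) is
    # replaced with the pad token, mirroring the transformers helper.
    shifted = [[decoder_start_token_id] + row[:-1] for row in labels]
    return [[pad_token_id if t == -100 else t for t in row] for row in shifted]

labels = [[50283, 50360, 50364, 15947, 50257]]

# With the wrong start token (50257, <|endoftext|>) from config.json:
print(shift_tokens_right(labels, 50257, 50257))
# [[50257, 50283, 50360, 50364, 15947]]

# With the correct start token (50258, <|startoftranscript|>):
print(shift_tokens_right(labels, 50257, 50258))
# [[50258, 50283, 50360, 50364, 15947]]
```

With `50257` in front, the decoder is conditioned on `<|endoftext|>` instead of `<|startoftranscript|>`, which is consistent with the degenerate output above.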
If I manually set the `decoder_start_token_id` first:

```python
model.config.decoder_start_token_id = 50258
```

and run the same code, I get the correct output.
I believe it would be desirable to fix the `decoder_start_token_id` to `50258` in the model config.