Replies: 1 comment
-
Hi, I haven't verified this exact case empirically, but from past experience the token ID sometimes changes based on context. Have you verified that the token IDs of the 'uh'/'um' in the transcript are the same IDs as the suppressed ones? I've found token suppression to work quite well, actually.
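For example, something like this shows the effect (a sketch; it assumes the openai/whisper-large-v2 tokenizer is a fair proxy for the one faster-whisper loads for large-v2):

```python
from transformers import WhisperTokenizer

# Assumption: the openai/whisper-large-v2 tokenizer matches the one
# faster-whisper uses for large-v2.
tok = WhisperTokenizer.from_pretrained("openai/whisper-large-v2")

# The same surface form maps to different token IDs depending on casing
# and on whether it follows a space (i.e. appears mid-sentence).
for form in ["um", " um", "Um", " Um", "uh", " uh"]:
    print(repr(form), tok.encode(form, add_special_tokens=False))
```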
-
My understanding is that there's a default set of tokens to ignore, defined in the config.json file for the Whisper model.
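For reference, those defaults can be read straight from the Hub. This is a sketch; it assumes the key in that config.json is named "suppress_tokens", which is what I see in my cached copy:

```python
import json

from huggingface_hub import hf_hub_download

# Assumption: this config.json is the one faster-whisper uses, and it
# carries the "suppress_tokens" list described below.
config_path = hf_hub_download("guillaumekln/faster-whisper-large-v2", "config.json")
with open(config_path) as f:
    default_suppress = json.load(f).get("suppress_tokens", [])
print(default_suppress)
```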
I am looking to suppress 'um' and 'uh' from transcriptions. I thought Whisper in general was set up to ignore speech disfluencies, but I'm constantly finding them in the generated transcripts.
I'm using
from faster_whisper import WhisperModel
and setting model_size_or_path='large-v2' as defined in this repo. In my Hugging Face .cache I see models--guillaumekln--faster-whisper-large-v2, which I'm assuming is the model actually used by WhisperModel.
In the config.json file there is a set of "suppress_tokens". I created a Python list that exactly duplicates that set, and then added the tokens for the various forms of 'Um' and 'Uh'. I also added two very common tokens, 'first' and 'about', which I know occur in the piece of audio I tested against.
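In code, it was roughly this (a sketch: the tokenizer repo and the exact variant spellings are a reconstruction, and default_suppress comes from the snippet above):

```python
from transformers import WhisperTokenizer

# Assumption: the openai/whisper-large-v2 tokenizer matches the model's.
tok = WhisperTokenizer.from_pretrained("openai/whisper-large-v2")

# Variants of the disfluencies plus the two common control words; the
# leading spaces matter because mid-sentence words are tokenized with them.
extra_words = ["um", " um", "Um", " Um", "uh", " uh", "Uh", " Uh",
               "first", " first", "about", " about"]
extra_ids = [tid for w in extra_words
             for tid in tok.encode(w, add_special_tokens=False)]

# default_suppress is the list read from config.json above; words that
# tokenize into several pieces contribute several IDs each.
config_suppressed = sorted(set(default_suppress) | set(extra_ids))
```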
I pass those to the transcribe call, roughly like this (the audio path below is a placeholder for my test file):
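```python
from faster_whisper import WhisperModel

model = WhisperModel(model_size_or_path="large-v2")

# "test_audio.wav" stands in for the actual file I tested with.
segments, info = model.transcribe(
    "test_audio.wav",
    beam_size=8,
    suppress_tokens=config_suppressed,
)
transcript = "".join(segment.text for segment in segments)
```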
And it proceeds to create the transcript with beam_size=8.
However, when I open the transcript, the 'um's and 'uh's are all still there, as are the common words 'first' and 'about'. I had expected them all to be 100% suppressed.
Here are some snippets of the transcript using the config_suppressed list above.
Am I doing something wrong to suppress tokens? I was expecting it to omit all the ums and uhs, and also to make grammatical errors by forcing it to omit the two common test words 'first' and 'about'.
That didn't happen: it happily transcribed all four tokens that I specifically excluded. Furthermore, with respect to the 'um's and 'uh's, it didn't even reduce the overall count versus when I ran the same transcription without the suppress list. In other words, the number of 'um's in the suppressed-token document is roughly the same as in the document where I didn't try to suppress them.