Replies: 1 comment
-
Hi, I haven't verified this exact case empirically, but from past experience the token ID sometimes changes based on context. Have you verified that the token IDs of the 'uh'/'um' in the transcript are the same IDs as the suppressed ones? I've found token suppression to work quite well, actually.
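For example, something like this shows the effect (a sketch; it assumes the openai/whisper-large-v2 tokenizer is a fair proxy for the one faster-whisper loads for large-v2):

```python
from transformers import WhisperTokenizer

# Assumption: the openai/whisper-large-v2 tokenizer matches the one
# faster-whisper uses for large-v2.
tok = WhisperTokenizer.from_pretrained("openai/whisper-large-v2")

# The same surface form maps to different token IDs depending on casing
# and on whether it follows a space (i.e. appears mid-sentence).
for form in ["um", " um", "Um", " Um", "uh", " uh"]:
    print(repr(form), tok.encode(form, add_special_tokens=False))
```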
-
My understanding is that there's a default set of tokens to ignore, defined in the config.json file for the Whisper model.
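For reference, those defaults can be read straight from the Hub. This is a sketch; it assumes the key in that config.json is named "suppress_tokens", which is what I see in my cached copy:

```python
import json

from huggingface_hub import hf_hub_download

# Assumption: this config.json is the one faster-whisper uses, and it
# carries the "suppress_tokens" list described below.
config_path = hf_hub_download("guillaumekln/faster-whisper-large-v2", "config.json")
with open(config_path) as f:
    default_suppress = json.load(f).get("suppress_tokens", [])
print(default_suppress)
```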
I am looking to suppress 'um' and 'uh' from transcriptions. I thought Whisper in general was set up to ignore speech disfluencies, but I'm constantly finding them in the generated transcripts.
I'm using
from faster_whisper import WhisperModel
and setting model_size_or_path='large-v2' as defined in this repo. In my Hugging Face .cache I see models--guillaumekln--faster-whisper-large-v2, which I'm assuming is the model actually used by WhisperModel.
In the config.json file there is a set of "suppress_tokens". I created a Python list that exactly duplicates that set, and then added the tokens for the various forms of 'Um' and 'Uh'. I also added two very common tokens, 'first' and 'about', which I know occur in the piece of audio I tested against.
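In code, it was roughly this (a sketch: the tokenizer repo and the exact variant spellings are a reconstruction, and default_suppress comes from the snippet above):

```python
from transformers import WhisperTokenizer

# Assumption: the openai/whisper-large-v2 tokenizer matches the model's.
tok = WhisperTokenizer.from_pretrained("openai/whisper-large-v2")

# Variants of the disfluencies plus the two common control words; the
# leading spaces matter because mid-sentence words are tokenized with them.
extra_words = ["um", " um", "Um", " Um", "uh", " uh", "Uh", " Uh",
               "first", " first", "about", " about"]
extra_ids = [tid for w in extra_words
             for tid in tok.encode(w, add_special_tokens=False)]

# default_suppress is the list read from config.json above; words that
# tokenize into several pieces contribute several IDs each.
config_suppressed = sorted(set(default_suppress) | set(extra_ids))
```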
I pass those to the transcribe call, roughly like this (the audio path below is a placeholder for my test file):
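```python
from faster_whisper import WhisperModel

model = WhisperModel(model_size_or_path="large-v2")

# "test_audio.wav" stands in for the actual file I tested with.
segments, info = model.transcribe(
    "test_audio.wav",
    beam_size=8,
    suppress_tokens=config_suppressed,
)
transcript = "".join(segment.text for segment in segments)
```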
And it proceeds to create the transcript with beam_size=8.
However, when I open the transcript, the 'um's and 'uh's are all still there, as are the common words 'first' and 'about'. I had expected them all to be 100% suppressed.
Here are some snippets of the transcript using the config_suppressed list above.
Am I doing something wrong to suppress tokens? I was expecting it to omit all the ums and uhs, and also to make grammatical errors by forcing it to omit the two common test words 'first' and 'about'.
That didn't happen: it happily transcribed all four tokens that I specifically excluded. Furthermore, with respect to the 'um's and 'uh's, it didn't even reduce the overall count versus when I ran the same transcription without the suppress list. In other words, the number of 'um's in the suppressed-token document is roughly the same as in the document where I didn't try to suppress them.