whisper : mark speakers/voices (diarization) #64
Comments
I think this is not an easy task, especially when it comes to quality. |
Yeah, I also saw this. It seems they do it in two runs: one for the spoken text and one for the speakers, and then they merge the results. |
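The "two runs, then merge" idea can be sketched without any particular library. This is a minimal illustration, not how any of the tools mentioned in this thread actually do it: it assumes the transcription pass yields `(start, end, text)` tuples and the diarization pass yields `(start, end, speaker)` tuples, both in seconds. The names `Segment`, `Turn`, `overlap` and `assign_speakers` are hypothetical.

```python
from typing import List, Optional, Tuple

# Assumed shape of the transcription output: (start_seconds, end_seconds, text)
Segment = Tuple[float, float, str]
# Assumed shape of the diarization output: (start_seconds, end_seconds, speaker_label)
Turn = Tuple[float, float, str]


def overlap(a0: float, a1: float, b0: float, b1: float) -> float:
    """Length of the intersection of [a0, a1] and [b0, b1], or 0.0 if they are disjoint."""
    return max(0.0, min(a1, b1) - max(a0, b0))


def assign_speakers(segments: List[Segment], turns: List[Turn]) -> List[Tuple[Optional[str], str]]:
    """Label each text segment with the speaker whose turn overlaps it the most."""
    labelled: List[Tuple[Optional[str], str]] = []
    for s0, s1, text in segments:
        best_label: Optional[str] = None
        best_overlap = 0.0
        for t0, t1, label in turns:
            ov = overlap(s0, s1, t0, t1)
            if ov > best_overlap:
                best_label, best_overlap = label, ov
        # best_label stays None when no diarization turn touches the segment
        labelled.append((best_label, text))
    return labelled
```

A segment that spans two speakers still gets a single label here, so this kind of merge is only as good as the segment boundaries it is given.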
Personally, I'd be more than happy for whisper to just do speaker detection based on the left and right channels of a stereo audio file. But I can achieve this by just running it twice. |
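For the "run it twice" workaround, the stereo file first has to be split into one mono stream per channel. A minimal sketch, assuming interleaved 16-bit PCM frames (for example as returned by `readframes()` from Python's standard `wave` module); the function name `split_stereo_pcm16` is hypothetical.

```python
import array
from typing import Tuple


def split_stereo_pcm16(frames: bytes) -> Tuple[bytes, bytes]:
    """Split interleaved 16-bit stereo PCM (L R L R ...) into two mono byte strings."""
    samples = array.array("h")  # "h" = signed 16-bit integers
    samples.frombytes(frames)
    if len(samples) % 2 != 0:
        raise ValueError("expected an even number of samples for stereo input")
    left = samples[0::2]   # every even-indexed sample belongs to the left channel
    right = samples[1::2]  # every odd-indexed sample belongs to the right channel
    return left.tobytes(), right.tobytes()
```

Each returned byte string can then be written to its own single-channel WAV file and transcribed separately, one run per speaker.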
@jaybinks |
One option would be to use pyannote.audio to diarize first --> then run whisper on each recognized section @abelbabel |
Not tested - I don't have stereo dialog audio
@jaybinks |
Does this approach assume that there are only two speakers and that each one is well separated on a single channel? From my point of view, this is a special case that applies only to recordings made in an audio studio. Or am I wrong? |
This absolutely is a special case, but it's also simple to implement and allows the problem to be broken up.
I'm lucky that in my scenario I have a separate mic per speaker in the conversation, so it's perfectly isolated.
|
I've done some limited testing and was able to achieve reasonable split via |
Interestingly, in a mono recording with two speakers, where the first speaker says three words and the second speaker repeats those three words, the transcript contains only three words, stretched over the time of both speakers, as though a kind of
WEBVTT
00:00:00.000 --> 00:00:04.000
00:00:04.000 --> 00:00:08.000
|
@chris-english Results with OpenAI Whisper
|
I'm so sorry this took ages for me to test for you... but the detection seems to work PERFECTLY!
Sorry, I can't comment on the output file formats for multi-speaker (srt, vtt, etc.) as I don't know these file formats.
I'm assuming that the speaker is available in the segment callback?
On Sat, 26 Nov 2022 at 06:11, Georgi Gerganov wrote:
@jaybinks Added support for stereo-channel diarization - add the --diarize argument to main.
Not sure if it works, because I don't have any data to test with
--
Sincerely
Jay
|
Great to hear! By the way, a failure case was identified earlier, when multiple speakers end up in the same segment: #216 (comment). Overall, this is a pretty basic approach and probably not worth investing too much time in. |
@savchenko Could you give a small how-to on how you used |
In my testing pyannote.audio is extremely slow on CPU. Very interested if anyone finds a way to make it work. |
@ggerganov When running whisper.cpp, I get the speaker information only on the stdout result (I think it is VTT format), but the output JSON file does not include this. Is there a way to show the speaker information in the JSON format? |
I am not into the technical specifics; I'm just a user of an AI transcription tool that uses this library. For me it would be perfect if the system could detect different speakers and just label the lines where a new speaker starts, similar to the timestamps. Fingers crossed that this will work sometime soon :-) |
Hi @ggerganov (and other maintainers of this awesome project!) - you might be interested in an early prototype that covers @SpusellaLo's use case over at https://github.com/akashmjn/tinydiarize This was designed keeping in mind ease of integration into whisper.cpp as the model structure is exactly the same, inference requires no extra dependencies (beyond the original repo), and it has marginal extra runtime cost. It can be run as Let me know what you think!
|
@akashmjn Great work!! I converted the small.en-trdz.pt to ggml using the whisper.cpp python script. I used the newly generated ggml model with whisper.cpp using the -m option but it doesn't seem to work. Maybe there is something else that I am missing besides converting it to ggml? |
Thanks for the effort @pratikmohanty. The However to surface Here's a high-level implementation plan:
I'm wrapping up some things on my original repo, after which I'll open a draft PR shortly. In the meantime @ggerganov - how does this sound? Feel free to add any other code pointers in case there's something I've missed! |
@akashmjn that looks amazing! Can't wait to see how this performs! |
For anyone keen to give it a spin, I have an early hack over at https://github.com/akashmjn/whisper.cpp/tree/tdrz-hack-1
After running the above, you should see this: (tried to pick a sample keeping with the historical vibe of the others 😉 ) Will open a PR after some cleanup. In the meantime if you have any suggestions - feel free to drop comments directly on the branch! |
Awesome stuff! Looked at the branch - seems super clean |
+1, it would be great if the speaker details were present in the JSON output. Currently it's hard to make use of them. |
I assume you are referring to previous comment pertaining to the For Example
For the rest of the output types (txt/vtt/srt/lrc/wts/csv) - it will only be present in the text transcription as you saw in the apollo example above. Hope that works. |
@akashmjn Yes indeed, thanks for the pointer! |
@ggerganov - just opened an initial PR at #1058. Need some comments on how best to expose / integrate this. |
should this issue be closed now? |
Are there plans to include a speaker number instead of a "speaker turn"? One use case could be audio files with more than two speakers. |
https://github.com/akashmjn/tinydiarize#gotchas indicates that tinydiarize does not support speaker clustering, which is what you are referring to. A different diarization implementation would be needed to solve that problem, or to wait for this feature to be added to tinydiarize. |
I noticed that but I believe I also saw speaker followed by a number in the docs. Thank you |
Two diarization strategies are implemented so far. The first is stereo diarization, which allows for speaker numbers: #1031. You enable it with --diarize. It requires stereo audio because it essentially determines the location of the speakers' voices. The second, tinydiarize, is a different approach and is enabled with -tdrz. It works with mono audio because it uses a whisper model fine-tuned to determine speakers by their voice timbre, not just location. Both strategies have their flaws and serve different purposes, but both are available in the master branch. |
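To make the stereo idea concrete, here is a minimal sketch of deciding which channel dominates a segment by comparing summed absolute amplitudes. It is an illustration under stated assumptions rather than a description of the code behind --diarize: the function name `estimate_speaker`, the `margin` of 1.1, and the labels "0", "1" and "?" are all chosen here for the example.

```python
from typing import Sequence


def estimate_speaker(left: Sequence[float], right: Sequence[float],
                     start: int, end: int, margin: float = 1.1) -> str:
    """Guess which channel dominates samples [start, end) by summed absolute amplitude."""
    energy_left = sum(abs(x) for x in left[start:end])
    energy_right = sum(abs(x) for x in right[start:end])
    if energy_left > margin * energy_right:
        return "0"  # left channel is clearly louder (assumed label)
    if energy_right > margin * energy_left:
        return "1"  # right channel is clearly louder (assumed label)
    return "?"      # neither channel dominates, e.g. both speakers talk in the segment
```

The "?" branch corresponds to the failure case mentioned earlier in the thread, where multiple speakers end up in the same segment and neither channel clearly wins.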
Yes, @bachittle I get that tinydiarize is more recently added and different from separated audio tracks. I was referring to this when I made the comment about the speaker identification. I do see where it may be added later as you previously stated. I probably should have asked this in the tinydiarize project also. I appreciate your time and explanations. |
Hi. I saw earlier discussions mentioning pyannote.audio, but my understanding is that this is not integrated, right?
--diarize: depends on stereo channels
So I suppose this ticket could remain open, since there's still a chance to improve for the multilingual use case? |
@wzxu yes, insanely-fast-whisper uses pyannote.audio, as do lots of other libraries for whisper diarization, like WhisperX. The ticket can remain open until we get quality as good as pyannote.audio for the multilingual use case, or we can make that a separate issue. |
Thanks for the great work on this. Is it straightforward to use tinydiarize with the larger models, not just the tiny one? |
Is there a python version of this? |
Looking at github.com/akashmjn/tinydiarize |
It would also be of some help if the diarization info appeared in the subtitle output when -osrt is given. Currently I have to parse the stdout data. |
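Until the marker shows up in the subtitle files, parsing stdout can be kept fairly small. The sketch below rests on two assumptions that are not confirmed in this thread: that each transcript line looks like `[HH:MM:SS.mmm --> HH:MM:SS.mmm]  text`, and that a speaker change is flagged by a marker string inside the text (the default `[SPEAKER_TURN]` is a placeholder, pass whatever your build prints). The names `LINE_RE` and `group_by_turn` are hypothetical.

```python
import re
from typing import List, Tuple

Cue = Tuple[str, str, str]  # (start, end, text)

# Assumed line layout: "[HH:MM:SS.mmm --> HH:MM:SS.mmm]  text"
LINE_RE = re.compile(
    r"^\[(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})\]\s*(.*)$"
)


def group_by_turn(lines: List[str], marker: str = "[SPEAKER_TURN]") -> List[List[Cue]]:
    """Group timestamped cues into blocks, starting a new block after each marker."""
    blocks: List[List[Cue]] = [[]]
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip anything that is not a timestamped cue
        start, end, text = m.groups()
        turn_ends_here = marker in text
        text = text.replace(marker, "").strip()
        blocks[-1].append((start, end, text))
        if turn_ends_here:
            blocks.append([])  # the next cue belongs to a different speaker
    return [b for b in blocks if b]
```

Each returned block is a run of cues from one speaker, which can then be written out as SRT or VTT with whatever label scheme suits the use case.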
Hi,
I'm not very familiar with the details of whisper or whisper.cpp, and I don't know whether it is even possible with the current foundation, but it would be nice if speakers, or at least speaker/voice changes, could be marked.
This would be very handy when processing interviews, radio/tv shows, films, etc.
Kind regards,
abelbabel