OpenAi whisper transcription workflow.

Questions and answers on how to get the most out of FFAStrans
emcodem
Posts: 1692
Joined: Wed Sep 19, 2018 8:11 am

Re: OpenAi whisper transcription workflow.

Post by emcodem »

Hey @taner :D
Yeah, it takes a while until one gets familiar with the concepts of Python packaging...

faster-whisper seems to be the better choice, mainly because they follow the development of the original whisper project, so you get the best of both worlds. As far as I can see they still suffer from the repeat-forever issue now and then, but I feel it is only a matter of time until all manifestations of this error are dealt with. What I am not 100% sure about is whether the reduced VRAM usage of faster-whisper comes from them quantizing the model, which would mean they could never reach the accuracy of the original model (i.e. if they used the original model, the output would be even better).
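
To illustrate what that trade-off looks like in practice, here is a minimal sketch of loading faster-whisper with a full-precision versus a quantized compute type; the model size, device and file name are just placeholders (in practice you would load only one of them):

from faster_whisper import WhisperModel

# Closest to the original weights: best fidelity, most VRAM.
model_fp16 = WhisperModel("large-v2", device="cuda", compute_type="float16")

# Quantized weights: considerably less VRAM, possibly a small accuracy cost.
model_int8 = WhisperModel("large-v2", device="cuda", compute_type="int8_float16")

segments, info = model_int8.transcribe("interview.wav")
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")

Comparing the two outputs on the same file would be a quick way to see whether the quantization actually costs accuracy.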

Const-me, in contrast, "only" shows off how to use the ggml-format models very performantly on Windows. That is VERY interesting for developers of Windows apps that want a native API, but he doesn't seem to want to follow the development of the others, nor is he very responsive when others send code to his repository, so even I am turning away from experimenting with it. It does not pay off for me to dive much deeper into Const's version because it needs too much trickery to run on Linux, and I always prefer cross-OS compatibility when possible.
However, the repeat-forever stuff was relatively easy for me to overcome, and the concept could be applied in all the other whisper projects too (a rough sketch of the idea follows the link):
https://github.com/Const-me/Whisper/issues/26
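
To make the recovery idea concrete, here is a minimal sketch of one way to catch a repetition loop in the decoded segments and re-run the chunk with different decoding settings; the heuristic and the retry temperature are assumptions of mine, not code from the linked issue:

from collections import Counter

def looks_like_repeat_loop(texts, max_repeats=4):
    """Heuristic: flag output where the same normalized line keeps recurring."""
    normalized = [t.strip().lower() for t in texts if t.strip()]
    if not normalized:
        return False
    _, count = Counter(normalized).most_common(1)[0]
    return count >= max_repeats

def transcribe_with_recovery(model, audio, **options):
    """Transcribe once; if the output looks stuck in a loop, retry with a higher temperature."""
    segments, info = model.transcribe(audio, **options)
    texts = [segment.text for segment in segments]
    if looks_like_repeat_loop(texts):
        retry_options = {**options, "temperature": 0.4}  # illustrative value to break the loop
        segments, info = model.transcribe(audio, **retry_options)
        texts = [segment.text for segment in segments]
    return texts, info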

The thing is, it is not a very good idea to recover on error (as I do); we need to find ways to prevent the errors up front when talking to the model. Lots of research is being done around whisper and I am sure it is just a matter of time until they figure it out ;-) The question is whether OpenAI will keep sharing their results or stop at some point because they recognize that others provide cheaper and better whisper cloud services...
emcodem, wrapping since 2009 you got the rhyme?
taner
Posts: 204
Joined: Sun Jun 19, 2016 6:36 pm

Re: OpenAi whisper transcription workflow.

Post by taner »

Thanks for the detailed information!
Great that you have a workaround for repeated sentences!
I will look into it.
And yes, in the long term it is definitely better to use faster-whisper, where the development follows the original.

Concerning the quantized models: I haven't tested faster-whisper against the original whisper so far.
Are there differences in precision?
Apart from that I will use the large-v1 model myself, because the difference to medium is often more than slight when it comes to transcription precision.
Luckily it is still fast.

As far as I can see, beam size is set to 5 by default.
Have you played around with it?
Does it really affect precision?

The Windows executable of faster-whisper seems to lack some of the word_timestamps refinement parameters (max characters per line, max lines etc.).
I mean, it's not really necessary, but it would be nice when it comes to burning the subtitles into a video file and thus having more control over the max characters per line.
taner
Posts: 204
Joined: Sun Jun 19, 2016 6:36 pm

Re: OpenAi whisper transcription workflow.

Post by taner »

And you know what would be great?
If the system could detect all supported spoken languages within the file and transcribe each part accordingly.
I mean... sometimes they speak in more than one language.
And: it only analyzes the first 30 seconds, which in some cases leads to a wrong language detection and thus a wrong transcription either way.
Aaaand: I don't want to deal with probability thresholds or such things, it should just work.
Aaaaaaand: that one can translate to a specific target language and not only to English.
German would be a good start :)
Alternatively one could translate to English and then have the English translation automatically translated to German using DeepL or something similar.
In short: why can't we have a system where our dreams come true right from the start...
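
For what it's worth, the two-step idea is already scriptable today; here is a minimal sketch assuming faster-whisper plus the official deepl Python client, with a placeholder API key and file name:

import deepl
from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")
translator = deepl.Translator("YOUR_DEEPL_API_KEY")  # placeholder key

# Step 1: whisper's built-in "translate" task always targets English.
segments, _ = model.transcribe("interview.wav", task="translate")
english_text = " ".join(segment.text.strip() for segment in segments)

# Step 2: hand the English text to DeepL for the actual target language.
german = translator.translate_text(english_text, target_lang="DE")
print(german.text)
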
emcodem
Posts: 1692
Joined: Wed Sep 19, 2018 8:11 am

Re: OpenAi whisper transcription workflow.

Post by emcodem »

taner wrote: Sat Aug 05, 2023 8:25 am Concerning the quantized models I haven't tested so far faster-whisper against original whisper.
Are there differences in precision?
Well, so far I did not really compare anything to anything in terms of precision. It is very important, and my next goal, to come up with a method that allows verifying transcription quality in an automated way (e.g. using the WER rate), otherwise I will never be able to judge the impact of an update or similar.
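
As a rough idea of what such an automated check could look like, here is a minimal sketch that compares a human-made reference transcript against a whisper output with the jiwer package; the file names and the normalization are placeholders:

import string
import jiwer

def normalize(text: str) -> str:
    """Lowercase and strip punctuation so formatting does not dominate the score."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

# Reference transcript (human-made) and hypothesis (whisper output); placeholder paths.
with open("reference.txt", encoding="utf-8") as f:
    reference = normalize(f.read())
with open("whisper_output.txt", encoding="utf-8") as f:
    hypothesis = normalize(f.read())

error_rate = jiwer.wer(reference, hypothesis)
print(f"WER: {error_rate:.2%}")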

So far I use Const-me, cpp and faster-whisper in production. The results were always pleasing for the users, except when looping repetitions occurred, e.g. forever repeating, forever counting up, forever lowercase with no punctuation.
Const-me shows the baseline. It has the fewest features of all implementations (no beam, compression threshold, whatever) but the results are still of great quality. Currently I have the impression that most of the stuff is somewhat over-engineered; they seem to apply the lessons learned from GPT-style (text in, text out) models, but that might not fit the whisper model ("sound in, text out") 100%. Not sure about that, but it's the state of my current knowledge/feeling.
taner wrote: Sat Aug 05, 2023 8:25 am Apart from that I will use for myself large-v1..
Interesting, why not v2?
taner wrote: Sat Aug 05, 2023 8:25 am As far as I can see beam size is set to 5 per default.
Have you played around with it?
Does it really affect precision?
In Const-me the beam size is fixed to 1 (= greedy) but best_of = 4. From my experiments so far I believe I could see that with beam_size=5 we get much better error correction; e.g. where Const-me does not output anything at all, beam_size=5 still gives us something meaningful.
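
For illustration, a minimal faster-whisper call with those decoding options spelled out; the model size and file name are placeholders:

from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

# beam_size=5 enables beam search; best_of only matters when sampling with temperature > 0.
segments, info = model.transcribe(
    "interview.wav",
    beam_size=5,
    best_of=5,
    word_timestamps=True,  # handy later if you want to build subtitles word by word
)
for segment in segments:
    print(f"[{segment.start:7.2f} -> {segment.end:7.2f}] {segment.text}")
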
taner wrote: Sat Aug 05, 2023 8:25 am The windows executable of faster-whisper seems to lack of some word_timestamps refine parameters (max characters per line, max lines ect.)
I mean, it's not really necessary, but would be nice when it comes to burn in as subtitles into a videofile and thus having more control about the max characters per line.
Stay tuned...
emcodem, wrapping since 2009 you got the rhyme?
emcodem
Posts: 1692
Joined: Wed Sep 19, 2018 8:11 am

Re: OpenAi whisper transcription workflow.

Post by emcodem »

... to follow up my Stay tuned above, @taner :
The whole timestamp thing is unfortunately just guesswork; it is not really related to the AI part, and there are lots of problems to work around when guessing the timestamps. Especially the start of the first sentence in the current 30-second segment is hard or even impossible to guess.
So the programs need to do a lot of coding around the timestamp stuff; one program can do it better, another worse. (WhisperX concentrates a lot on timestamp work, I think.)
Anyway, even if they get the timestamps correct, I don't think the max-characters-per-line and max-lines stuff belongs to the speech-to-text part at all, because the requirements are really different and often pretty custom/random.
Actually it has already been done; SubtitleEdit, for example, can clean up titles and convert to max characters per line, minimum gap between titles and what not.

However, it might pay off for you to work on a scripted solution that converts the text output to SRT on your own, so you have full control over it. The reason I say this is that I don't think you will be really happy with any existing or future open-source solution.
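
As a starting point, here is a minimal sketch of such a script: it takes faster-whisper word timestamps and greedily packs them into SRT cues with a max-characters-per-line cap. The 42-character limit, the file names and the packing logic are illustrative choices, not a recommendation:

from datetime import timedelta
from faster_whisper import WhisperModel

MAX_CHARS_PER_LINE = 42  # illustrative broadcast-style limit

def fmt(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(timedelta(seconds=seconds).total_seconds() * 1000)
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

model = WhisperModel("large-v2", device="cuda", compute_type="float16")
segments, _ = model.transcribe("interview.wav", word_timestamps=True)

cues, words = [], []
for segment in segments:
    for word in segment.words:
        words.append(word)
        line = "".join(w.word for w in words).strip()
        if len(line) >= MAX_CHARS_PER_LINE:
            # Close the cue once the line limit is reached (greedy packing).
            cues.append((words[0].start, words[-1].end, line))
            words = []
if words:
    cues.append((words[0].start, words[-1].end, "".join(w.word for w in words).strip()))

with open("interview.srt", "w", encoding="utf-8") as f:
    for i, (start, end, text) in enumerate(cues, 1):
        f.write(f"{i}\n{fmt(start)} --> {fmt(end)}\n{text}\n\n")

SubtitleEdit can then still be used on the result for the fine cleanup (minimum gap, merging short cues and so on).
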
emcodem, wrapping since 2009 you got the rhyme?
emcodem
Posts: 1692
Joined: Wed Sep 19, 2018 8:11 am

Re: OpenAi whisper transcription workflow.

Post by emcodem »

taner wrote: Sat Aug 05, 2023 10:58 am And you know what would be great?
...
Yes, those are all great ideas, and we will see a lot of them in the future, be it in commercial programs, services or open-source stuff. But beware: it does not make much sense for someone to implement a small tool and put it open source. At the current state of whisper it is important that any tool is constantly maintained, so it would not make sense for e.g. me to write some small open-source helper tool, because I would not maintain it forever.

That is one reason why I don't want to see too much functionality built into any of the root tools (e.g. faster-whisper, Const-me, cpp and such). From my perspective, additional functionality should be as modular as possible, so you can e.g. exchange the root tool and still benefit from the enhanced functionality.
emcodem, wrapping since 2009 you got the rhyme?
taner
Posts: 204
Joined: Sun Jun 19, 2016 6:36 pm

Re: OpenAi whisper transcription workflow.

Post by taner »

Heyho Emcodem,

Thanks for your detailed reply.
To be clear:
I didn't want to push you in any direction of enhancing this stuff.
I know that you are a prodigy among programmers and I'm a great fan of you.
Really am!
Really like your enthusiasm and professional help!

My wishes were rather aimed at the general hope that the original developers will overcome the limitations and add more features.
I would like to have all possible features/parameters ;)

Concerning models:
I've tested medium and large-v1 so far and the difference is massive.
Both in Const-me and faster-whisper.
I still have to test large-v2 before using it in the production environment because I've read mixed reviews.

Concerning max characters:
I already built a kind of all-in-one workflow where I also get subtitle files converted by SubtitleEdit, e.g. to Avid SubCap.
Hmmm, that reminds me that I wanted to have a look into the formatting style and encoding of an Avid ScriptSync file.
Anyway.
Or automatically burned into a video file.
I will have a look into the command line parameters of SubtitleEdit to see if it provides what I want.
But that is actually a nice-to-have feature.
I just want to be prepared for the moment I need it.

But what „the Redaktion“ (the editors, not the video editors but „the other ones“ :)) requires is automatic language detection.
Transcription at my company is mainly done with „Trint“, an online service.
There you have to set the language in advance.
For several reasons I will not and can't abandon Trint in favor of an in-house solution.
But maybe in the long term.
Anyway, I mean it's easy to provide the appropriate language to whisper manually, but the Redaktion would like to have things simplified.
Which would also make it easier for my operators when it comes to mass transcription.
So I think that proper automatic language detection is one of the crucial things those apps should provide, commercial or open source.

But the most annoying thing was repeated sentences.
Luckily, by switching to faster-whisper I get much more reliable output.

And you know what would be great?
A voice isolation app that compares quality-wise to the one in DaVinci Resolve Studio.
The reason is that I would like to keep things simple.
I mean, the audio we get is clean in 90% of the cases.
External audio.
But as you know it has to be synced with the cameras first and then exported for transcription.
What we do to avoid that is add a separate mic to the cameras so that we can use those files directly.
Submit -> bäm!
In my newer workflow, voice isolation is performed automatically before transcription.
Htdemucs is good.
But.
DaVinci is better.
And having a good quality audio source makes a huge huge huge difference when it comes to transcription quality.
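
To make that isolation step concrete, here is a minimal sketch of how it could be scripted around the demucs command line (two-stem vocal split) before handing the vocals file to whisper; the paths and model name are illustrative:

import subprocess
from pathlib import Path

def isolate_voice(input_file: str, out_dir: str = "separated") -> Path:
    """Run demucs' two-stem vocal separation and return the path to the vocals file."""
    subprocess.run(
        ["demucs", "-n", "htdemucs", "--two-stems", "vocals", "-o", out_dir, input_file],
        check=True,
    )
    # demucs writes <out_dir>/<model>/<track>/vocals.wav by default
    return Path(out_dir) / "htdemucs" / Path(input_file).stem / "vocals.wav"

vocals = isolate_voice("interview.wav")
print(f"Feed this into whisper: {vocals}")
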
Anyway.
Next stop is testing the band-splitter model.
Maybe it is better than Htdemucs.
Sadly those apps are function-wise more geared towards music separation.

Best
Taner
emcodem
Posts: 1692
Joined: Wed Sep 19, 2018 8:11 am

Re: OpenAi whisper transcription workflow.

Post by emcodem »

@taner no worries, and thanks for the flowers a.l. :P (@momocampo might happily explain to you what an a.l. is :D )
Language detection is basically pretty simple: just split into 30-second chunks (which is what most of the programs do internally, as you might know), run language detection on all the chunks and generate a simple list of detected language codes along with a probability value. The tiny model might suffice.

Note: this uses whisper.cpp (not Const-me).
Attachment: emcodem_whisper_lang_detect.json (4.7 KiB)
The thing is, now we have a list of languages (each line covers 30 seconds)... You then need to go through the list and check the time and duration of the changes. You'll also need to mitigate small discrepancies; e.g. a single chunk with a different language and a low probability probably just means nothing was spoken in those 30 seconds.
Of course it would be better to have all of this done with VAD by a single running instance, but on the other hand this approach is simple, and simple means solid, which is always a good thing :D
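
If you'd rather do the same thing in Python than with the attached whisper.cpp workflow, here is a minimal sketch of the per-chunk detection with faster-whisper; the chunk length, model size and the low-probability filter are assumptions:

from faster_whisper import WhisperModel
from faster_whisper.audio import decode_audio

CHUNK_SECONDS = 30
SAMPLE_RATE = 16000
MIN_PROBABILITY = 0.5  # ignore chunks where detection is too uncertain (likely silence)

model = WhisperModel("tiny", device="cpu", compute_type="int8")
audio = decode_audio("interview.wav", sampling_rate=SAMPLE_RATE)

chunk_len = CHUNK_SECONDS * SAMPLE_RATE
detections = []
for i in range(0, len(audio), chunk_len):
    chunk = audio[i : i + chunk_len]
    # transcribe() runs language detection on the chunk before decoding any text
    _, info = model.transcribe(chunk)
    detections.append((i / SAMPLE_RATE, info.language, info.language_probability))

for start, lang, prob in detections:
    marker = "" if prob >= MIN_PROBABILITY else "  (low confidence, probably silence)"
    print(f"{start:7.1f}s  {lang}  {prob:.2f}{marker}")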

Let me know your voice isolation test results ;-)
emcodem, wrapping since 2009 you got the rhyme?