This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Amgadoz on 2024-03-30 21:23:22.


Hey everyone!

I recently compared all the open-source Whisper-based packages that support long-form transcription.

Long-form transcription means transcribing audio files longer than 30 seconds, which is the length of audio Whisper processes in a single pass.
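To make the 30-second limit concrete, here is a minimal sketch of the most naive way to handle longer audio: cutting it into fixed 30-second windows. It assumes mono audio already resampled to 16 kHz and held as a plain sequence of samples; `split_into_windows` is a hypothetical helper for illustration, not the method any of the packages below actually uses (they rely on more careful strategies, which is part of why their results differ).

```python
SAMPLE_RATE = 16_000     # assumed: audio resampled to 16 kHz mono
WINDOW_SECONDS = 30      # length of one window in seconds


def split_into_windows(samples, sample_rate=SAMPLE_RATE, window_seconds=WINDOW_SECONDS):
    """Split a sequence of audio samples into consecutive fixed-size windows.

    The last window may be shorter than the others. No overlap is used,
    so words that straddle a boundary can be cut in half.
    """
    window = sample_rate * window_seconds
    return [samples[i:i + window] for i in range(0, len(samples), window)]


# Example: 75 seconds of silence becomes two full windows and one partial one.
audio = [0.0] * (75 * SAMPLE_RATE)
windows = split_into_windows(audio)
print([len(w) / SAMPLE_RATE for w in windows])
```

Each window would then be transcribed separately and the text joined, which is exactly where boundary errors creep in.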

This can be useful if you want to chat with a YouTube video, a podcast, etc.

I compared the following packages:

  1. OpenAI’s official whisper package
  2. Hugging Face Transformers
  3. Hugging Face BetterTransformer
  4. FasterWhisper
  5. WhisperX
  6. Whisper.cpp

I compared them in the following areas:

  1. Accuracy - using word error rate (WER) and character error rate (CER)
  2. Efficiency - using VRAM usage and latency
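For anyone unfamiliar with the accuracy metrics, here is a rough sketch of how WER and CER are usually defined: the edit distance between the reference transcript and the model output, divided by the reference length (in words for WER, in characters for CER). The function names are my own for illustration, and this version does no text normalization (casing, punctuation), so it is not necessarily how the numbers in the blog post were computed.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Counts the minimum number of substitutions, insertions, and deletions
    needed to turn `ref` into `hyp`.
    """
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the reference
                prev[j - 1] + cost,  # substitution (or match when cost is 0)
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One dropped word out of six reference words.
print(wer("the cat sat on the mat", "the cat sat on mat"))
# One wrong character out of four reference characters.
print(cer("abcd", "abxd"))
```

Lower is better for both, and both can exceed 1.0 when the output contains many extra words or characters.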

I’ve written a detailed blog post about this. If you just want the results, here they are:

I hope you find it useful!