Changelog

2025/06/08: The installation process was updated after the new PEP 668 protection was introduced in Python with Debian; the formatting script now only produces a TXT file, not a DOCX file. You can copy the content of the TXT file into any text processor.

A Dalmatian in business clothes thinking about automating audio-to-text transcription.

This Is For You If …

You need to transcribe multi-speaker audio to text.
You are fine with a transcript containing errors, as long as it gets you some of the way (don’t expect a perfect transcript ready for analysis).
You don’t want to spend money and/or want to keep data processing local to comply with privacy regulations.
You have access to a computer with either Linux or Windows and administrative privileges (the process should also work on Mac, but I don’t go through the steps).

Motivation For This Post

Studying law can be quite isolating. Curricula often force students to cram extensive knowledge to pass exams, leaving little room for collaboration. After years of studying, students face all-or-nothing exams, determining whether they earn a law degree or nothing, often leaving them with significant student loan debt. It’s no surprise that law students experience high levels of psychological distress (Larcombe & Fethers, 2013).

I collaborated on a project aimed at identifying the needs of law students at my university to reduce psychological distress and promote collaboration. We conducted several group interviews lasting 1.5 to 2 hours each. With no budget, I needed a way to transcribe these long interviews with 5-7 speakers without spending a lot of time. This type of problem is common, especially among students working on theses or dissertations involving qualitative research.

After some research, I found a workflow that produces a decent transcript providing a head start that saves time and money. Following along, you will (hopefully) end up with a formatted transcript that identifies speakers and is ready for quality control.

What Are We Going To Do?

We will use WhisperX, an open source project that combines the transcription power of OpenAI’s Whisper model, with Pyannote’s speaker diarization capabilities. Whisper’s transcription abilities are impressive, especially for widely spoken international languages like English and German, but it’s important to check for errors.

Whisper transcribes everything as a continuous text block, ignoring different speakers. To identify who said what, we use Pyannote, another open-source model known for its speaker diarization capabilities. While Pyannote is effective, speaker diarization is still an active research field, so expect some inaccuracies in speaker assignments.

After generating the final transcript, you may want to import it into qualitative research software. For our project, we used MAXQDA. In the following sections, I’ll guide you through setting up WhisperX and Pyannote, running the transcription and diarization process, and preparing your final transcript for use in qualitative research software.

If you are reading this before conducting your interviews, you can improve the quality of the final WhisperX transcript by considering the following audio quality factors. In general, the better the audio quality, the better the transcript Whisper X will produce:

Use good microphones. Place them firmly in the direction of and at a reasonable distance from the speakers. Make sure there are no obstructions between the speakers and the microphones.
Keep background noise to a minimum.
If possible, conduct a calm, well-articulated interview. Rapid speaker changes, overlapping speech, very soft or loud speech, and thick dialect all create problems for the transcription models. Other factors you probably can’t control include voice similarity and the number of speakers.

The Workflow

Tip: You can find an example of a transcription using the following workflow on my Nextcloud. Have a look at the ‘output_formatted.txt’ file and compare with the audio file to see what you can expect to get out of it. I recommend using the provided short MP4 audio file to test the workflow on your system before using it on your actual audio files. Transcribing larger audio files can take take a long time, and encountering an error midway is no fun.

Step 0: Preparations for Windows Users

In this tutorial, we will use Bash code that works in Ubuntu/Debian-based Linux distributions. Windows users can easily follow along by installing the Windows Subsystem for Linux (WSL). It lets you use Linux applications, utilities, and Bash command-line tools directly on Windows.

Installing WSL

Open PowerShell or Windows Command Prompt in Administrator mode. If you don’t know where to find this, search for PowerShell in your task bar, right-click on it, and choose “Run as Administrator”.
Enter the following command:

wsl --install

This will install WSL based on Ubuntu (a major Linux distribution). Reboot your system after the installation.

After rebooting, search for “WSL” and open a Linux terminal, where you will enter the following commands.

Step 1: Install Software

First, you need to install Python, a popular programming language, if you don’t already have it on your system. Open a terminal window (search for “terminal” on your system if you can’t find it). Start by entering the following command. You can click on the clipboard icon to copy commands and then paste them with Ctrl+Shift+V in the terminal. Press Enter to execute.

sudo apt update

1: This just updates which software packages can be upgraded. Use “sudo apt upgrade”, if you want to upgrade to the latest versions.

Then install python, pipx, and ffmpeg (to convert your audio into the WAV format), entering:

sudo apt install python3 python3-pip python3-venv pipx ffmpeg

After the installation has finished, install WhisperX. Note that we use pipx, so WhisperX will be installed in its own isolated environment, which is necessary due to a new protection rule in Python for Debian (makes sure python stuff does not interfere with your system libraries):

pipx install --verbose git+https://github.com/m-bain/whisperx.git

This will take a while (~10-20 minutes). We use the –verbose flag to see the progress. It may seem like nothing is happening, but you just have to be patient.

After the installation finished, try the following to confirm it is working:

whisperx --help

All the commands and options of whisperx should appear. If you receive a symlink error or something similar, try entering ‘pipx reinstall whisperx’ This will recreate the symlink without reinstalling everything.

Step 2: Convert Your Audio File to WAV Format

Make sure your terminal operates in the directory (folder) where your audio file is located. For example, if it is on the Desktop, you would use:

cd Desktop

1: cd means “current directory”; this command let’s you specify, where on your system you want to operate right now. Tip: If you don’t want to bother too much with commands, just place your file on the Desktop and use “cd Desktop”.

Another example: If your audio file is in a folder called “Transcriptions” on your Desktop, use:

cd Desktop/Transcriptions

Note for Windows Users

If you want to specify a filepath on your Windows system in WSL, you have to add a /mnt/ before the actual file path. For example: cd /mnt/c/Users/YourUsername/Desktop/Transcriptions.

Note that using the Windows file system will result in slower performance than using the WSL (Ubuntu) file system.

Convert your audio file to WAV format using the following command. Replace “your_audio_file.mp4” and “your_audio_file.wav” with your actual file name and desired file name of the WAV file:

ffmpeg -i your_audio_file.mp4 your_audio_file.wav

1: ffmpeg is the command, using the -i flag will provide some information when finished. You need to specify the source, followed by the desired file name of the to-be-created file. Needs to be .wav!

Step 3: Get Access to the Pyannote Model on Hugging Face

WhisperX uses Pyannote for speaker diarization. The model is free, but you need to request access and accept the terms of use. The model is stored on Hugging Face.

Create an account on Hugging Face.
Go to Hugging Face Tokens and generate a token with “read” privileges. Copy the token to your clipboard.
Accept the terms of usage for the Pyannote Speaker Diarization 3.1 and Pyannote Segmentation 3.0 models. These URLs may change with updates, so check the WhisperX GitHub for the latest information.

Step 4: Use WhisperX on Your WAV File

Ensure you have some time on your hands for the transcription process. Depending on the length of the audio file and your system’s performance, this can take several hours. Aborting the process by closing the terminal will lead to a complete loss of progress. Expect your machine to be quite busy and not able to perform many other tasks simultaneously.

In your terminal, enter the following command. Replace your_audio_file.wav with the actual file name, --min/max_speakers with the number of speakers, de with the language you need, and your_token_here with the generated token from Hugging Face. Officially supported languages are: en, fr, de, es, it, ja, zh, nl, uk, pt (consult the WhisperX site or the troubleshooting section below if you need another language).

whisperx your_audio_file.wav --model large-v3 --diarize --min_speakers 7 --max_speakers 7 --compute_type int8 --language de --hf_token your_token_here

1: “–model large-v3” = Use the largest and most capable Whisper model at the time of writing; “–diarize” = Utilize speaker diarization; “–min_speakers” and “max_speakers” = If you have prior knowledge about how many speakers are in your audio, specify that here (specific value or range). If not, delete both parameters. Pyannote-Audio will estimate the number of speakers for you. “–compute_type int8” = use CPU only; “–language de” = Specify the language of the audio and transcription (here “de” for German); –hf_token your_token_here = your Hugging Face token (looks something like hf_sdiahiSOjifniasdFSADkfajawQDdaomj).

Step 5: Formatting the Output (Optional)

I hope that worked for you! There should be some output files in your directory now, e.g. a .srt file. Have look, what WhisperX produced.

Troubleshooting

If you encountered an error, paste the error note into your favorite LLM (such as ChatGPT), or your favorite search tool. Often, it’s just a simple issue such as a missing package or a typo in your command. If you’re stuck or found an error in this tutorial, write me an email and I’ll try to help.

If the process worked but the quality of the transcription is poor, there isn’t much you can do without getting technical. Search the “issues” section on the WhisperX GitHub page; you might find that someone has asked a similar question. You can also experiment with the parameters of the command (e.g., using other model versions). However, adjusting parameters can be time-consuming and may not yield better results.

Use whisperx –help in your terminal, to get a list of all command parameters you can use, including possible values (e.g. languages) and a short description.

The final step involves formatting the output file to make it easier to read and import into software like MAXQDA for analysis. Use the file on my Nextcloud, or create a new file in your directory and name it “output_formatting.py”. Copy the following Python code into that file:

import re

def process_srt(srt_file, speaker_names=None):
    if speaker_names is None:
        speaker_names = {}

    with open(srt_file, 'r', encoding='utf-8') as file:
        content = file.read()

    blocks = content.split('\n\n')
    results = []
    current_speaker = None
    current_text = []
    end_time = None

    for block in blocks:
        lines = block.strip().split('\n')
        if len(lines) < 3:
            continue

        times = lines[1].split(' --> ')
        end_time = times[1].replace(',', '.')

        text_line = ' '.join(lines[2:])
        speaker_match = re.match(r'\[(SPEAKER_\d+)\]:', text_line)

        if speaker_match:
            speaker = speaker_names.get(speaker_match.group(1), speaker_match.group(1))
            text = text_line[len(speaker_match.group(0)):].strip()

            if current_speaker == speaker:
                current_text.append(text)
            else:
                if current_speaker:
                    results.append(f"{current_speaker}:\n{' '.join(current_text)}\n[{end_time}]\n")
                current_speaker = speaker
                current_text = [text]
        else:
            if current_text:
                current_text[-1] += ' ' + text_line.strip()

    if current_speaker and current_text:
        results.append(f"{current_speaker}:\n{' '.join(current_text)}\n[{end_time}]\n")

    return results

def write_txt(output_file, processed_text):
    with open(output_file, 'w', encoding='utf-8') as file:
        file.write('\n'.join(processed_text))

# Customize speaker labels here
speaker_names = {
    'SPEAKER_00': 'Abed Nadir',
    'SPEAKER_01': 'Annie Edison',
    'SPEAKER_02': 'Britta Perry',
    'SPEAKER_03': 'Dean Pelton',
    'SPEAKER_04': 'Pierce Hawthorne',
    'SPEAKER_05': 'Shirley Bennett',
    'SPEAKER_06': 'Troy Barnes',
    'SPEAKER_07': 'Jeff Winger'
}

srt_file = 'test.srt'
processed_text = process_srt(srt_file, speaker_names)

txt_output_file = 'output_formatted.txt'
write_txt(txt_output_file, processed_text)

print("Processing complete. Check the output file.")

1: Crucial: Here you can change the speaker names. You have to figure out which “SPEAKER_XX” corresponds to which voice in your audio manually. The order can be anything. You don’t have to delete superfluous “SPEAKER_XX” values (if you have less than eight speakers). If there is no match, they won’t appear in the formatted output. Also, you can add more “SPEAKER_XX” if you need them.
2: Crucial: Change to the actual name of your WhisperX SRT output file.
3: Specify the desired name of the formatted output file (TXT). The default is “output_formatted.txt”. Keep the .txt extension.

Review the Code

Make sure to review the code and update it as necessary. For example, you need to specify your actual output file and might want to change the names of the speakers. Compare your .srt output file with your audio to identify which speaker (e.g., SPEAKER_03) corresponds to which person. Note that speakers are not necessarily listed in order of appearance.

Run the Formatting Script

Run the script on your transcription output by entering the following command in your terminal:

python3 output_formatting.py

Hopefully, you now have a nicely formatted transcript of your audio file with speakers identified. You can copy the content in your text processor of choice. Remember to check it thoroughly for errors, especially ensuring that sentences are correctly matched to each speaker. There will likely be some mixups! After that, you can import the transcript into software like MAXQDA. You can add text such as a header and a description at the beginning of your file; MAXQDA ignores everything until the first “SPEAKER_NAME:” appears. You can reuse this workflow for as many audio files as you want: Convert to WAV - Use WhisperX - Format the output. Note that the provided timestamps may be slightly off due to errors or after your corrections.

If you have any questions, feel free to contact me. Happy (less painful) transcribing!

Giving Back

Pretty awesome tools, right? If they were useful to you, consider buying the developers a coffee. It will motivate them to keep working on the project. You’ll find the info on how to donate on the project’s webpage (panel on the right): WhisperX; Pyannote-Audio