Whisper, from OpenAI, for Transcribing Audio

Whisper is a trained neural network model for transcribing audio. You can read about it at https://openai.com/blog/whisper/. I can see this being enormously useful for public historians, oral historians, and anyone else who deals with recorded speech. I’ve tested it on English audio recordings from the 1920s and on more recent recordings in French. A very cool feature of the model is its ability to translate speech in another language into an English transcript. But first, here’s how to get started.

$ conda create -n py39 python=3.9
$ conda activate py39
$ conda install pytorch torchvision torchaudio -c pytorch-nightly
$ brew install rust
$ pip install git+https://github.com/openai/whisper.git

I’m using Miniconda on an M1 Mac mini.
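Before running anything, it can help to confirm that the pieces are in place. The sketch below is not part of Whisper; `check_whisper_setup` is a hypothetical helper name. It only checks that the `torch` and `whisper` packages can be found in the active environment and that `ffmpeg` is on the PATH. Whisper relies on ffmpeg to read audio files, which the commands above do not install, so this assumes you have installed it separately (for example with Homebrew).

```python
import importlib.util
import shutil


def check_whisper_setup():
    """Report which prerequisites are visible from this Python environment.

    Hypothetical helper for illustration; it only looks things up and
    never imports or runs Whisper itself.
    """
    return {
        # Whisper shells out to ffmpeg to decode audio files.
        "ffmpeg on PATH": shutil.which("ffmpeg") is not None,
        # find_spec returns None when a top-level package is not installed.
        "torch installed": importlib.util.find_spec("torch") is not None,
        "whisper installed": importlib.util.find_spec("whisper") is not None,
    }


if __name__ == "__main__":
    for item, ok in check_whisper_setup().items():
        print(f"{item}: {'yes' if ok else 'MISSING'}")
```

Run it inside the activated `py39` environment; anything reported as missing points to the step that needs repeating.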

Here’s the result on a recording of Eamon de Valera’s Saint Patrick’s Day address of 1920:

$ whisper 'Eamon_de_Valera-Saint_Patricks_Day_address_(03.04.1920).mp3'
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: english
[00:00.000 --> 00:06.960]  sons and daughters of the gale wherever you be today in the name of the motherland greeting
[00:07.920 --> 00:15.120]  whatever flag be the flag you guard and cherish it is consistent with your highest duty to link
[00:15.120 --> 00:23.360]  yourself together to use your united strength to break the chains that bind our sweet sad mother
[00:23.360 --> 00:30.480]  and never before have the scattered children of era had put an opportunity for noble service
[00:31.760 --> 00:39.760]  today you can serve not only Ireland but the world a cruel war and a more cruel
[00:39.760 --> 00:48.320]  peak have shattered the generous of souls apathy marks the high minded and heartless cynicism
[00:48.320 --> 00:56.960]  points the way of selfishness we the children of a race that has endured for ages the blight

_and so on_.
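The same transcription can also be driven from Python rather than the command line. What follows is a minimal sketch, assuming the `whisper` package installed above: `whisper.load_model` and `model.transcribe` are the package's own calls, while `format_timestamp`, `format_segments`, and `transcribe_file` are hypothetical helper names introduced here to mimic the look of the CLI output.

```python
def format_timestamp(seconds):
    """Render seconds as MM:SS.mmm, like the timestamps printed above.

    Simplification: recordings longer than an hour just show minutes
    above 59 instead of an hours field.
    """
    total_ms = round(seconds * 1000)
    minutes, remainder_ms = divmod(total_ms, 60_000)
    secs, ms = divmod(remainder_ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def format_segments(segments):
    """Turn Whisper's segment dicts (start, end, text) into printable lines."""
    return [
        f"[{format_timestamp(seg['start'])} --> {format_timestamp(seg['end'])}]"
        f"  {seg['text'].strip()}"
        for seg in segments
    ]


def transcribe_file(path, model_name="small"):
    """Transcribe one audio file and return (detected language, lines)."""
    import whisper  # imported here so the helpers above work without it

    model = whisper.load_model(model_name)
    result = model.transcribe(path)
    return result["language"], format_segments(result["segments"])


# Example use (needs the audio file and a downloaded model):
# language, lines = transcribe_file(
#     "Eamon_de_Valera-Saint_Patricks_Day_address_(03.04.1920).mp3")
# print("\n".join(lines))
```

Working from Python makes it easy to save the lines to a file or to loop over a whole folder of recordings.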

When I asked it to translate a French recording using the default small model, things got stuck, with the model outputting the same line over and over. I wasn’t surprised; the recording is not very clear. But I re-ran the command with the medium model, which is larger and also multilingual (Whisper offers a still larger multilingual model, large, as well). The results were much better, and quite impressive:

$ whisper RL10059-CS-1751_02.mp3 --language French --task translate --model medium
[00:00.000 --> 00:09.000]  We will stop in the framework of the investigation that will maintain good relations with the American embassy.
[00:09.000 --> 00:12.000]  Listen to the explanations of Senator Danny Toussaint.
[00:12.000 --> 00:34.000]  But think about the name of Philip MacKenton, that Ana Rana told you that he was linked with Daniel Whitman.
[00:34.000 --> 00:42.000]  He said that during the arrest, the American embassy came to look for him.
[00:42.000 --> 00:46.000]  The judge asked him to sign, and he did.
[00:46.000 --> 00:49.000]  He asked him who was coming to look for him.
[00:49.000 --> 01:03.000]  It's clear that during the trial, the judge, the father who was in prison,
[01:03.000 --> 01:07.000]  who was authorized by the American embassy to look for Philip MacKenton,
[01:07.000 --> 01:14.000]  Ana Rana said that he was linked with Daniel Whitman.
[01:14.000 --> 01:17.000]  He said that it was higher authorities that authorized him.
[01:17.000 --> 01:21.000]  So the judge declared that during the arrest, the embassy came to look for him.

_and so on_.
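The retry I did by hand (the small model got stuck repeating a line, so I moved up to medium) could be sketched in Python along these lines. This is an illustration under stated assumptions, not a Whisper feature: `longest_repeat_run` and `translate_with_fallback` are hypothetical names, and the threshold of three identical lines in a row is an arbitrary choice. The `language` and `task="translate"` arguments to `model.transcribe` correspond to the `--language` and `--task` flags used above.

```python
from itertools import groupby


def longest_repeat_run(lines):
    """Length of the longest run of identical consecutive lines (0 if empty)."""
    return max((len(list(group)) for _, group in groupby(lines)), default=0)


def translate_with_fallback(path, language="fr", models=("small", "medium"),
                            max_repeats=3, load_model=None):
    """Translate to English, moving to the next model if the output looks stuck.

    max_repeats is an assumed threshold: a result counts as stuck when the
    same line appears that many times in a row. load_model can be swapped
    out (for testing, say); by default it is whisper.load_model.
    """
    if not models:
        raise ValueError("at least one model name is required")
    if load_model is None:
        import whisper  # imported here so the helper above works without it
        load_model = whisper.load_model

    for name in models:
        model = load_model(name)
        result = model.transcribe(path, language=language, task="translate")
        lines = [seg["text"].strip() for seg in result["segments"]]
        if longest_repeat_run(lines) < max_repeats:
            break  # output does not look stuck, so keep this result
    # If every model looked stuck, the last attempt is returned anyway.
    return name, result


# Example use (needs the audio file and downloaded models):
# model_used, result = translate_with_fallback("RL10059-CS-1751_02.mp3")
# print(model_used, result["text"])
```

Each larger model takes longer to download and run, so trying the smaller one first only pays off when it succeeds reasonably often on your recordings.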

If you have the computing power and you work with oral history or other recorded speech, Whisper is well worth the investment of time and energy.