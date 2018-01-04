

This tutorial walks through using the Google Cloud Speech API to transcribe a large audio file.

All code and sample files can be found in the speech-to-text GitHub repo.

Sample Results

This approach works, but I found that results vary greatly based on the quality of the input.

Transcribing a Reading by My Wife

I asked my wife to read something out loud, as if she were dictating to Siri, for about 1.5 minutes. She is a native English speaker, and we recorded using the microphone on an iPhone 6s.

https://www.alexkras.com/wp-content/uploads/genevieve.mp3



Which resulted in the following transcript:

00:00:00 this Dynamic Workshop aims to provide up to date information on pharmacological approaches, issues, and treatment in the geriatric population to assist in preventing medication-related problems, appropriately and effectively managing medications and compliance. The concept of polypharmacy parentheses taking multiple types of drugs parentheses will also be discussed, as the

00:00:30 is a common issue that can impact adverse side effects in the geriatric population. Participants will leave with a knowledge and considerations of common drug interaction and how to minimize the effects that limit function. Summit professional education is approved provider of continuing education. This course is offered for 6

00:01:00 . this course contains a Content classified under the both the domain of occupational therapy and professional issues.

I think that the Google Cloud Speech API did an amazing job, getting over 95% of the content right, especially considering that this was not a professional recording and that you can hear my kid saying something in the background 🙂

Transcribing a Speech by Winston Churchill

I wanted to challenge the script further, so I decided to run it on a famous speech by Winston Churchill, titled The Threat of Nazi Germany.

Here is the audio file:

https://www.alexkras.com/wp-content/uploads/winston-churchill-the-threat-of-germany.mp3



Which resulted in the following transcript:

00:00:00 many people think that the best way to escape War if the dwelling and then print them DVD for the younger generation they plump the grizzly photographs Before Their Eyes they feel that they dilate of generals and admirals they do not fit the crime I didn’t think they’d father

00:00:30 human strife how old is teaching in preventing us from attacking or invading any other country with the do so how would it help if we were attacked or invaded on stove that is a question we have to ask what did they does contempt of the Lord Beaverbrook

00:01:00 I’ll listen to the impassioned the field by George would they agree to meet that famous South African general identity I have bone responsibilities for the safety of this country in grievance time

00:01:30 we could convince and persuade them to go back play my play it seems to me you are rich we are what we are hungry it would be in Victoria’s we have been defeated you have valuable, we have not you have your name you have had the phone

00:02:00 set up pencil future about all I see are they would say you are weak and we are strong after all my friend your nephew all the way by that railing for nation of nearly 70 million the most educated industrial scientific discipline people in the world loving cup from childhood

00:02:30 all Epic Gloria Texas iron and death in battle at the noblest face for men yeah I need the nation we could have been done in order to augment its Collective Strength yeah definition of a group of preaching a gospel of intolerance and unrestrained by the wall by Parliament

00:03:00 public opinion in that country all packages speeches or morbid Wahlberg off of getting off the press I’m down you cable of Columbus they have a meeting dial shalt not kill it is the plenty of photos and or both now

00:03:30 play Ariana me with the upload speed I’m ready to that end lamentable weapon Javier against which all Navy is no defense and before which women and children so weak and frail capacity of the warriors on the front-line trenches all live equal adding partial patio

00:04:00 play with you but with the new weapon, new method of compelling the submission of racing bike terrorizing and torturing population and worst of all the more

00:04:30 the ball in cricket the structure of its social and economic life some more of those who may make it there praying love you too fat Grim despicable fact and invasive affect ionic again what are we to do

The result is an order of magnitude worse than my wife’s recording. Most likely it is caused by poor audio quality. In addition, Churchill used a lot of words that are no longer commonly used.

If you are still reading, let’s get started.

1. Sign Up for a Free Tier Account

Google Cloud offers a Free Tier plan, which will be used in this tutorial. An account is required to get an API key.

2. Generate an API Key

Follow these steps to generate an API key:

- Sign in to the Google Cloud Console
- Click “API Manager”
- Click “Credentials”
- Click “Create Credentials”
- Select “Service Account Key”
- Under “Service Account” select “New service account”
- Name the service (whatever you’d like)
- Select Role: “Project” -> “Owner”
- Leave the “JSON” option selected
- Click “Create”
- Save the generated API key file
- Rename the file to api-key.json

Make sure to move the key into the cloned speech-to-text repo if you plan to test this code.
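Before calling the API, it can save some head-scratching to confirm that the file you saved really is a service account key. Here is a minimal sketch of such a check. The function name check_key_file and the list of expected fields are my own choices for illustration; this is a sanity check, not an official validation step.

```python
import json

# Fields that a service-account key file normally contains.
# Checking for them is this sketch's own sanity test, not an official validation step.
EXPECTED_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_key_file(path="api-key.json"):
    """Return the raw JSON text if the file looks like a service-account key."""
    with open(path) as f:
        raw = f.read()
    data = json.loads(raw)  # raises ValueError if the file is not valid JSON
    missing = [field for field in EXPECTED_FIELDS if field not in data]
    if missing:
        raise ValueError("key file is missing fields: " + ", ".join(missing))
    if data["type"] != "service_account":
        raise ValueError("expected a service account key, got: " + str(data["type"]))
    return raw
```

The returned string is the same raw JSON text that the scripts below read from api-key.json.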

3. Convert Audio File to Wav format

I ran into issues when trying to convert my audio file via command-line tools. Instead, I used Audacity (an open source audio editing tool) to convert my file to WAV format. Audacity is great and I highly recommend it.

The steps to convert:

- Open the file in Audacity
- Click the “File” menu
- Click “Save other”
- Click “Export as Wav”
- Export it with the default settings
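If you want to double-check the exported file before splitting it, Python's built-in wave module can read the basic properties. This is a small sketch, assuming the export is an uncompressed PCM WAV (which is what the wave module can read); wav_info is a name I made up for illustration.

```python
import wave


def wav_info(path):
    """Return (channels, sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as w:
        channels = w.getnchannels()
        rate = w.getframerate()
        # Number of frames divided by frames per second gives the length in seconds
        duration = w.getnframes() / float(rate)
    return channels, rate, duration
```

For example, wav_info("source/genevieve.wav") should report a duration of roughly 90 seconds for the first recording above.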

4. Break up audio file into smaller parts

The Google Cloud Speech API only accepts audio no longer than 60 seconds in a single synchronous request. To be on the safe side, I broke my files into 30-second chunks. To do that I used an open source command-line tool called ffmpeg. It can be downloaded from its site. On Mac, I installed it with Homebrew via brew install ffmpeg.

Here is the command I used to break up my file:

    # Clean out old parts if needed via rm -rf parts/*
    ffmpeg -i source/genevieve.wav -f segment -segment_time 30 -c copy parts/out%09d.wav

Here, source/genevieve.wav is the name of the input file, and parts/out%09d.wav is the format for output files. %09d means that the file number is zero-padded to 9 digits (e.g. out000000001.wav), so the file names sort alphabetically in the same order as the audio. This way the ls command returns files sorted in the right order.
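To see why the zero padding matters, here is a small sketch that builds the file names in Python with the same %09d pattern and compares them with unpadded names. The chunk_names helper is made up for illustration, and starting the numbering at 0 is an assumption of this sketch.

```python
import math


def chunk_names(duration_seconds, chunk_seconds=30, pattern="out%09d.wav"):
    """List the file names a split into fixed-length chunks would produce.

    Numbering starts at 0 here, which is an assumption of this sketch.
    """
    count = int(math.ceil(duration_seconds / float(chunk_seconds)))
    return [pattern % i for i in range(count)]


names = chunk_names(330)  # a 5.5-minute recording -> 11 chunks
unpadded = ["out%d.wav" % i for i in range(11)]

print(names[0], names[-1])           # out000000000.wav out000000010.wav
print(sorted(names) == names)        # True: padded names sort in numeric order
print(sorted(unpadded) == unpadded)  # False: "out10.wav" sorts before "out2.wav"
```

Without the padding, an alphabetical sort would put chunk 10 ahead of chunk 2, and the transcript would come out scrambled.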

5. Install required Python modules

I added a requirements.txt file to the example repo with all the needed libraries. It can be used to install everything via:

    pip3 install -r requirements.txt

The real hero on this list is SpeechRecognition. It does most of the heavy lifting.

The rest of the libraries came with the official google-api-python-client package.

I also used the tqdm module to show progress in the slower version of the script.

6. Running the Code

Finally, we can run the Python script to get the transcript.

The slow version

Here is the GitHub link.

This script:

- Loads the API key from step 2 into memory
- Gets a list of files (chunks)
- For every file, calls the speech-to-text API endpoint
- Adds results to a list
- Combines all results and adds a timestamp (every 30 seconds)
- Saves results to transcript.txt

    import os

    import speech_recognition as sr
    from tqdm import tqdm

    with open("api-key.json") as f:
        GOOGLE_CLOUD_SPEECH_CREDENTIALS = f.read()

    r = sr.Recognizer()
    # os.listdir returns names in arbitrary order, so sort to keep chunks in sequence
    files = sorted(os.listdir('parts/'))

    all_text = []

    for f in tqdm(files):
        name = "parts/" + f
        # Load audio file
        with sr.AudioFile(name) as source:
            audio = r.record(source)
        # Transcribe audio file
        text = r.recognize_google_cloud(audio, credentials_json=GOOGLE_CLOUD_SPEECH_CREDENTIALS)
        all_text.append(text)

    transcript = ""
    for i, t in enumerate(all_text):
        total_seconds = i * 30
        # Cool shortcut from:
        # https://stackoverflow.com/questions/775049/python-time-seconds-to-hms
        # to get hours, minutes and seconds
        m, s = divmod(total_seconds, 60)
        h, m = divmod(m, 60)

        # Format time as h:m:s - 30 seconds of text
        transcript = transcript + "{:0>2d}:{:0>2d}:{:0>2d} {}\n".format(h, m, s, t)

    print(transcript)

    with open("transcript.txt", "w") as f:
        f.write(transcript)

The code works, but it does take a while on longer source files.
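The timestamp logic is the one part of the script that can be tried without an API key. Here is a sketch that pulls it out into two small functions. The names format_timestamp and build_transcript are mine and do not appear in the repo; the divmod arithmetic is the same as in the script.

```python
def format_timestamp(total_seconds):
    """Format a whole number of seconds as HH:MM:SS."""
    m, s = divmod(total_seconds, 60)  # split off the seconds
    h, m = divmod(m, 60)              # split the remaining minutes into hours
    return "{:0>2d}:{:0>2d}:{:0>2d}".format(h, m, s)


def build_transcript(chunks, chunk_seconds=30):
    """Join per-chunk texts, prefixing each one with its start time."""
    return "".join(
        "{} {}\n".format(format_timestamp(i * chunk_seconds), text)
        for i, text in enumerate(chunks)
    )


print(format_timestamp(3690))  # 01:01:30
```

Because every chunk is 30 seconds long, the start time of chunk number i is simply i * 30 seconds.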

Faster version

To speed things up, I added threading to my slow version. I describe the method in detail in my Simple Python Threading Example post.

Here is the GitHub Link.

The main difference is that I moved the processing into a function and added logic at the end to sort the processed results into the right order.

    import os

    import speech_recognition as sr
    from tqdm import tqdm
    from multiprocessing.dummy import Pool

    pool = Pool(8)  # Number of concurrent threads

    with open("api-key.json") as f:
        GOOGLE_CLOUD_SPEECH_CREDENTIALS = f.read()

    r = sr.Recognizer()
    # os.listdir returns names in arbitrary order, so sort before numbering the chunks
    files = sorted(os.listdir('parts/'))

    def transcribe(data):
        idx, file = data
        name = "parts/" + file
        print(name + " started")
        # Load audio file
        with sr.AudioFile(name) as source:
            audio = r.record(source)
        # Transcribe audio file
        text = r.recognize_google_cloud(audio, credentials_json=GOOGLE_CLOUD_SPEECH_CREDENTIALS)
        print(name + " done")
        return {
            "idx": idx,
            "text": text
        }

    all_text = pool.map(transcribe, enumerate(files))
    pool.close()
    pool.join()

    transcript = ""
    for t in sorted(all_text, key=lambda x: x['idx']):
        total_seconds = t['idx'] * 30
        # Cool shortcut from:
        # https://stackoverflow.com/questions/775049/python-time-seconds-to-hms
        # to get hours, minutes and seconds
        m, s = divmod(total_seconds, 60)
        h, m = divmod(m, 60)

        # Format time as h:m:s - 30 seconds of text
        transcript = transcript + "{:0>2d}:{:0>2d}:{:0>2d} {}\n".format(h, m, s, t['text'])

    print(transcript)

    with open("transcript.txt", "w") as f:
        f.write(transcript)
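The threading pattern is easier to follow in isolation. Here is a sketch that swaps the real API call for a stand-in called fake_transcribe (a made-up function that just sleeps), so it runs without an API key or audio files. The file names are placeholders too.

```python
import time
from multiprocessing.dummy import Pool  # thread-based Pool with the multiprocessing API


def fake_transcribe(data):
    """Stand-in for the real API call: sleeps briefly, then returns a result."""
    idx, name = data
    # Later files finish first, to mimic network calls completing out of order
    time.sleep(0.05 * (3 - idx))
    return {"idx": idx, "text": "text of " + name}


files = ["out000000000.wav", "out000000001.wav", "out000000002.wav"]

pool = Pool(3)  # three worker threads
results = pool.map(fake_transcribe, enumerate(files))
pool.close()
pool.join()

print([r["idx"] for r in results])  # [0, 1, 2]
```

Note that pool.map returns results in the order of its input even when the workers finish out of order, so sorting by idx afterwards is a safeguard rather than a strict requirement.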

Conclusion

Results may vary, but there is utility even in poor transcriptions. For example, I had a 1.5-hour audio recording from a hand-over meeting with my former co-worker. I remembered that he mentioned something at some point, but I was dreading listening through the whole file to find it. I ran the recording through this script and was able to quickly find the keywords I needed, and the timestamps pointed me to the right part of the audio file.

For native English speakers like my wife, the Google Cloud Speech API can easily replace a professional transcribing service, at a fraction of the cost.
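That kind of keyword lookup is easy to script as well. Here is a minimal sketch, assuming the transcript uses the "HH:MM:SS text" line format that the scripts above write. The find_keyword function and the sample transcript are made up for illustration.

```python
def find_keyword(transcript, keyword):
    """Return the timestamps of transcript lines that mention the keyword."""
    hits = []
    for line in transcript.splitlines():
        # Each line is "HH:MM:SS text", so split at the first space
        timestamp, _, text = line.partition(" ")
        if keyword.lower() in text.lower():
            hits.append(timestamp)
    return hits


# Made-up sample data in the same format as transcript.txt
sample = (
    "00:00:00 welcome to the hand-over meeting\n"
    "00:00:30 the deploy script lives on the build server\n"
    "00:01:00 remember to rotate the Deploy keys\n"
)
print(find_keyword(sample, "deploy"))  # ['00:00:30', '00:01:00']
```

Each timestamp marks the start of a 30-second chunk, so the keyword is somewhere in the half minute that follows it.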
