This tutorial walks through using the Google Cloud Speech API to transcribe a large audio file.
All code and sample files can be found in the speech-to-text GitHub repo.
Transcribe large audio files using Python & our Cloud Speech API. @akras14 shows how https://t.co/dY56lmE0TD
— Google Cloud (@googlecloud) January 11, 2018
Sample Results
This approach works, but I found that the results vary greatly based on the quality of the input.
Transcribing a Reading by My Wife
I asked my wife to read something out loud, as if she were dictating to Siri, for about 1.5 minutes. She is a native English speaker, and we recorded using a microphone on an iPhone 6s.
This resulted in the following transcript:
00:00:00 this Dynamic Workshop aims to provide up to date information on pharmacological approaches, issues, and treatment in the geriatric population to assist in preventing medication-related problems, appropriately and effectively managing medications and compliance. The concept of polypharmacy parentheses taking multiple types of drugs parentheses will also be discussed, as the
00:00:30 is a common issue that can impact adverse side effects in the geriatric population. Participants will leave with a knowledge and considerations of common drug interaction and how to minimize the effects that limit function. Summit professional education is approved provider of continuing education. This course is offered for 6
00:01:00 . this course contains a Content classified under the both the domain of occupational therapy and professional issues.
I think that Google Cloud Speech API did an amazing job, getting over 95% of the content right, especially considering that this was not a professional recording and that you can hear my kid saying something in the background 🙂
Transcribing a Radio Broadcast with a Few Different Voices
A reader sent me the following audio file, recorded from 98.5 Sports Hub radio (broadcast on January 26th, 2018), the Toucher & Rich morning show. This, too, turned out better than I expected.
00:00:00 announced that there was going to be a new XXX FL it was going to start in two years and here’s what he had to say that you accept kickoff in 2020 quite frankly we’re going to give the game of football back to fans I’m sure everyone has a lot of questions for me but I also have a lot of questions for you in fact we’re going to ask a lot of questions and listen to players coaches
00:00:30 call experts technology executive members of the media and anyone else who understands and loves the game of football but most importantly we’re going to be listening to someone ask that the will the question of what would you do if you can reimagine the game of professional football would you frenchtons eliminate halftime would you have if you were commercial breaks but the game of foot
00:01:00 I’ll be faster when the rules be simpler can you ask Chef elevated fan Centric with all the things you like to see in the last of the things you don’t and no doubt a lot of Innovations along the way we will put you at a shorter faster-paced family-friendly and easier to understand game don’t get me wrong it’s still football but it’s professional football reimagined Sims 4 launching a 20
00:01:30 hey we have two years which is plenty of time to really get it right so aside from family friendly which I just think means that you have to stand for the national anthem I have no idea because the other one was very sex. That’s why is it either it was the cheerleaders with the super tight outfits and stuff cheerleaders were dressed and I stripped it sounds like a very good idea sounds like he has he has no plan no he does he’s taking everything he does have
00:02:00 and it said all the teams are going to be owned by the same entity he knows that they’re starting with a team and that they’re going to be shorter games with maybe no halftime with inferior Talent no not necessarily interior Town there’s already a saturation of football as is that is the biggest thing that people been complaining about the game what is he thinking you know what he said you ate yesterday you said we’re going to make it short and then we want your ideas no gimmicks all the things that God was just playing around
00:02:30 this does feel like a guy who’s had enormous prefer
Transcribing a Speech by Winston Churchill
I wanted to challenge the script further, so I decided to run it on a famous speech by Winston Churchill, titled The Threat of Nazi Germany.
Running the script on this audio resulted in the following transcript:
00:00:00 many people think that the best way to escape War if the dwelling and then print them DVD for the younger generation they plump the grizzly photographs Before Their Eyes they feel that they dilate of generals and admirals they do not fit the crime I didn’t think they’d father
00:00:30 human strife how old is teaching in preventing us from attacking or invading any other country with the do so how would it help if we were attacked or invaded on stove that is a question we have to ask what did they does contempt of the Lord Beaverbrook
00:01:00 I’ll listen to the impassioned the field by George would they agree to meet that famous South African general identity I have bone responsibilities for the safety of this country in grievance time
00:01:30 we could convince and persuade them to go back play my play it seems to me you are rich we are what we are hungry it would be in Victoria’s we have been defeated you have valuable, we have not you have your name you have had the phone
00:02:00 set up pencil future about all I see are they would say you are weak and we are strong after all my friend your nephew all the way by that railing for nation of nearly 70 million the most educated industrial scientific discipline people in the world loving cup from childhood
00:02:30 all Epic Gloria Texas iron and death in battle at the noblest face for men yeah I need the nation we could have been done in order to augment its Collective Strength yeah definition of a group of preaching a gospel of intolerance and unrestrained by the wall by Parliament
00:03:00 public opinion in that country all packages speeches or morbid Wahlberg off of getting off the press I’m down you cable of Columbus they have a meeting dial shalt not kill it is the plenty of photos and or both now
00:03:30 play Ariana me with the upload speed I’m ready to that end lamentable weapon Javier against which all Navy is no defense and before which women and children so weak and frail capacity of the warriors on the front-line trenches all live equal adding partial patio
00:04:00 play with you but with the new weapon, new method of compelling the submission of racing bike terrorizing and torturing population and worst of all the more
00:04:30 the ball in cricket the structure of its social and economic life some more of those who may make it there praying love you too fat Grim despicable fact and invasive affect ionic again what are we to do
The result is an order of magnitude worse than for my wife's recording. Most likely this is caused by the poor audio quality; in addition, Churchill used a lot of words that are no longer commonly used.
If you are still reading, let’s get started.
1. Sign Up for a Free Tier Account
Google Cloud offers a Free Tier plan, which will be used in this tutorial. An account is required to get an API key. Note that Google asks for credit card details at sign-up even for the Free Tier, but you should not be charged as long as you stay within the free monthly limits.
2. Generate an API Key
Follow these steps to generate an API key:
- Sign in to the Google Cloud Console
- Click “APIs & Services”
- Click “Credentials”
- Click “Create Credentials”
- Select “Service Account Key”
- Under “Service Account” select “New service account”
- Name the service (whatever you'd like)
- Select Role: “Project” -> “Owner”
- Leave “JSON” option selected
- Click “Create”
- Save generated API key file
- Rename the file to api-key.json
Make sure to move the key into the cloned speech-to-text repo if you plan to test this code.
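A quick sanity check can save some debugging later: a malformed or empty key file surfaces as a confusing AssertionError about credentials_json deep inside the library (a few readers hit this in the comments below). A minimal sketch to verify the file parses:
import json
# A service account key is plain JSON; confirm it parses and
# looks like the right kind of key before going any further.
with open("api-key.json") as f:
    key = json.load(f)
print(key.get("type"))  # should print "service_account"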
3. Convert Audio File to Wav format
I ran into issues when trying to convert my audio file via command-line tools, so instead I used Audacity (an open source audio editing tool) to convert my file to wav format. Audacity is great and I highly recommend it. (A command-line alternative that readers later found is noted after the steps below.)
The steps to convert:
- Open file in Audacity
- Click “File” menu
- Click “Save other”
- Click “Export as Wav”
- Export it with the default settings
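If you would rather stay on the command line after all, readers in the comments report that forcing the uncompressed PCM codec avoids the conversion issues I hit. A hedged example (the file names are placeholders, and the sample rate is an assumption you should match to your source):
ffmpeg -i input.mp3 -acodec pcm_s16le -ar 44100 output.wav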
4. Break up audio file into smaller parts
The Google Cloud Speech API only accepts synchronous requests with audio no longer than 60 seconds (longer files require the asynchronous LongRunningRecognize method, as several readers point out in the comments). To be on the safe side, I broke my files into 30-second chunks. To do that I used an open source command-line tool called ffmpeg, which can be downloaded from its site. On a Mac, I installed it with Homebrew via brew install ffmpeg.
Here is the command I used to break up my file:
# Clean out old parts if needed via rm -rf parts/*
ffmpeg -i source/genevieve.wav -f segment -segment_time 30 -c copy parts/out%09d.wav
Here source/genevieve.wav is the name of the input file, and parts/out%09d.wav is the name pattern for the output files. %09d indicates that the file number will be zero-padded to nine digits (i.e. out000000001.wav), so the files sort alphabetically and the ls command returns them in the right order. Note that the parts directory must already exist: ffmpeg will not create it, and a missing directory produces the "Failed to open segment ... Could not write header for output file #0 ... No such file or directory" error that several readers report in the comments.
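If you want to script this step, for example to split many source files in one go, here is a minimal sketch that shells out to ffmpeg from Python; the paths and the 30-second chunk length are assumptions to adjust:
import os
import subprocess

def split_wav(source_path, parts_dir="parts", chunk_seconds=30):
    # ffmpeg will not create the output directory for us
    os.makedirs(parts_dir, exist_ok=True)
    subprocess.run([
        "ffmpeg", "-i", source_path,
        "-f", "segment", "-segment_time", str(chunk_seconds),
        "-c", "copy", os.path.join(parts_dir, "out%09d.wav"),
    ], check=True)

split_wav("source/genevieve.wav")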
5. Install required Python modules
I added a requirements.txt to the example repo with all the needed libraries. It can be used to install everything via:
pip3 install -r requirements.txt
The real hero on this list is the SpeechRecognition library. It does most of the heavy lifting.
The rest of the libraries came with the official google-api-python-client package.
I also used the tqdm module to show progress in the slower version of the script.
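For reference, the file boils down to something like this (a sketch; the exact contents and pinned versions live in the repo):
SpeechRecognition
google-api-python-client
tqdm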
6. Running the Code
Finally, we can run the Python script to get the transcript, for example: python3 fast.py.
The slow version
Here is the GitHub link.
This script:
- Loads the API key from step 2 into memory
- Gets a list of files (chunks)
- For every file, calls the speech-to-text API endpoint
- Adds the result to a list
- Combines all results, adding a timestamp every 30 seconds
- Saves the result to transcript.txt
import os
import speech_recognition as sr
from tqdm import tqdm

with open("api-key.json") as f:
    GOOGLE_CLOUD_SPEECH_CREDENTIALS = f.read()

r = sr.Recognizer()
files = sorted(os.listdir('parts/'))

all_text = []

for f in tqdm(files):
    name = "parts/" + f
    # Load audio file
    with sr.AudioFile(name) as source:
        audio = r.record(source)
    # Transcribe audio file
    text = r.recognize_google_cloud(audio, credentials_json=GOOGLE_CLOUD_SPEECH_CREDENTIALS)
    all_text.append(text)

transcript = ""
for i, t in enumerate(all_text):
    total_seconds = i * 30
    # Cool shortcut from:
    # https://stackoverflow.com/questions/775049/python-time-seconds-to-hms
    # to get hours, minutes and seconds
    m, s = divmod(total_seconds, 60)
    h, m = divmod(m, 60)
    # Format time as h:m:s - 30 seconds of text
    transcript = transcript + "{:0>2d}:{:0>2d}:{:0>2d} {}\n".format(h, m, s, t)

print(transcript)

with open("transcript.txt", "w") as f:
    f.write(transcript)
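One caveat several readers hit (see the comments below): recognize_google_cloud raises speech_recognition.UnknownValueError when a chunk contains no recognizable speech, such as trailing music. A hedged tweak that skips such chunks instead of crashing:
import speech_recognition as sr

def safe_transcribe(recognizer, audio, credentials):
    # Return an empty string instead of crashing when a chunk
    # (e.g. trailing music) contains no recognizable speech.
    try:
        return recognizer.recognize_google_cloud(audio, credentials_json=credentials)
    except sr.UnknownValueError:
        return ""

# In the loop above, replace the transcription line with:
# text = safe_transcribe(r, audio, GOOGLE_CLOUD_SPEECH_CREDENTIALS)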
The code works, but it does take a while on longer source files.
Faster version
To speed things up, I added threading to my slow version. I describe the method used in detail in my Simple Python Threading Example post.
Here is the GitHub link.
The main difference is that I moved the processing into a function and added logic at the end to sort the processed results back into the right order.
import os
import speech_recognition as sr
from tqdm import tqdm
from multiprocessing.dummy import Pool

pool = Pool(8)  # Number of concurrent threads

with open("api-key.json") as f:
    GOOGLE_CLOUD_SPEECH_CREDENTIALS = f.read()

r = sr.Recognizer()
files = sorted(os.listdir('parts/'))

def transcribe(data):
    idx, file = data
    name = "parts/" + file
    print(name + " started")
    # Load audio file
    with sr.AudioFile(name) as source:
        audio = r.record(source)
    # Transcribe audio file
    text = r.recognize_google_cloud(audio, credentials_json=GOOGLE_CLOUD_SPEECH_CREDENTIALS)
    print(name + " done")
    return {
        "idx": idx,
        "text": text
    }

all_text = pool.map(transcribe, enumerate(files))
pool.close()
pool.join()

transcript = ""
for t in sorted(all_text, key=lambda x: x['idx']):
    total_seconds = t['idx'] * 30
    # Cool shortcut from:
    # https://stackoverflow.com/questions/775049/python-time-seconds-to-hms
    # to get hours, minutes and seconds
    m, s = divmod(total_seconds, 60)
    h, m = divmod(m, 60)
    # Format time as h:m:s - 30 seconds of text
    transcript = transcript + "{:0>2d}:{:0>2d}:{:0>2d} {}\n".format(h, m, s, t['text'])

print(transcript)

with open("transcript.txt", "w") as f:
    f.write(transcript)
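If your audio is not in English, readers in the comments confirm that the library accepts a language parameter (see the SpeechRecognition library reference linked in the comments). For example, the transcription call becomes:
text = r.recognize_google_cloud(audio, credentials_json=GOOGLE_CLOUD_SPEECH_CREDENTIALS, language="fr-FR")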
Conclusion
Results may vary, but there is utility even in poor transcriptions. For example, I had an hour-and-a-half audio recording from a hand-over meeting with my former co-worker. I remembered that he had mentioned something at some point, but I was dreading listening through the 1.5-hour audio file to find it. I ran the recording through this script and was able to quickly find the keywords I needed, and the timestamp pointed me to the right part of the audio file.
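As a rough illustration of that keyword-search workflow, a few lines like these scan the generated transcript and print the timestamped chunks that mention a term (the keyword here is just a placeholder):
keyword = "deployment"  # placeholder search term
with open("transcript.txt") as f:
    for line in f:
        # Each chunk's line starts with its hh:mm:ss offset into the audio
        if keyword.lower() in line.lower():
            print(line.strip())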
For native English speakers like my wife, the Google Cloud Speech API can easily replace a professional transcribing service, at a fraction of the cost.
Comments

The Google API is not free: you still need to enter credit card details in order to use the 60 free minutes per month. You should have mentioned this at the beginning, so no one tries it only to find out it won't work without a card.
What if the file contains 4 minutes of audio? I think it's going to be a bit messy; instead of breaking them into smaller parts, is there any way to break an 8-minute audio file into 4-minute chunks? Why does the Google Cloud Speech API only accept files no longer than 60 seconds? If the Google Cloud Speech API worked on a large audio file in one shot instead of splitting it, that would be easier for us, since we are not tech geeks, though we have the caliber to learn a bit of coding.
Does this API really help to transcribe both small and large audio files into text format? I ask since I am a transcriber.
Hi Alex! Thank you for this article, excellent!
I tried to run the script to slice the audio and got the following error:
SyntaxError: invalid syntax
[Finished in 0.9s with exit code 1]
[shell_cmd: python3 -OO -u "/Users/SilvinoDiaz/Desktop/speech-to-text-master/untitled.py"]
[dir: /Users/SilvinoDiaz/Desktop/speech-to-text-master]
[path: /Users/SilvinoDiaz/opt/anaconda3/bin:/Users/SilvinoDiaz/opt/anaconda3/condabin:/Users/SilvinoDiaz/anaconda3/bin:/Library/Frameworks/Python.framework/Versions/3.7/bin:/Library/Frameworks/Python.framework/Versions/3.7/bin:/Library/Frameworks/Python.framework/Versions/3.7/bin:/Library/Frameworks/Python.framework/Versions/3.7/bin:/anaconda3/bin:/Library/Frameworks/Python.framework/Versions/3.6/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/share/dotnet:/opt/X11/bin:~/.dotnet/tools:/Library/Frameworks/Mono.framework/Versions/Current/Commands]
The IDE is ST3 (Sublime Text 3).
I don't know if it has to do with the installation of 'anaconda', which causes the failure.
Any idea?
Thank you very much.
Hi, thanks for this code. For audio longer than 10 minutes, chunks number 11 and 12 appear as the second paragraph, and this part of the text becomes misplaced. My question is: why is this happening?
Alex, when I try to run ffmpeg to break up the audio file, it keeps giving me an error saying that it couldn't segment and write the headers. How would I change the command so that ffmpeg creates each wav file as it goes?
Alex, I am getting this error when I try and use ffmpeg to break up my audio file:
C:\Users\hmkur\Desktop\Python\Transcribing_Audio>ffmpeg -i source/valve.wav -f segment -segment_time 30 -c copy parts/out%09d.wav
ffmpeg version 4.2.1 Copyright (c) 2000-2019 the FFmpeg developers
built with gcc 9.1.1 (GCC) 20190807
configuration: --enable-gpl --enable-version3 --enable-sdl2 --enable-fontconfig --enable-gnutls --enable-iconv --enable-libass --enable-libdav1d --enable-libbluray --enable-libfreetype --enable-libmp3lame --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopenjpeg --enable-libopus --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libtheora --enable-libtwolame --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxml2 --enable-libzimg --enable-lzma --enable-zlib --enable-gmp --enable-libvidstab --enable-libvorbis --enable-libvo-amrwbenc --enable-libmysofa --enable-libspeex --enable-libxvid --enable-libaom --enable-libmfx --enable-amf --enable-ffnvcodec --enable-cuvid --enable-d3d11va --enable-nvenc --enable-nvdec --enable-dxva2 --enable-avisynth --enable-libopenmpt
libavutil 56. 31.100 / 56. 31.100
libavcodec 58. 54.100 / 58. 54.100
libavformat 58. 29.100 / 58. 29.100
libavdevice 58. 8.100 / 58. 8.100
libavfilter 7. 57.100 / 7. 57.100
libswscale 5. 5.100 / 5. 5.100
libswresample 3. 5.100 / 3. 5.100
libpostproc 55. 5.100 / 55. 5.100
[wav @ 0000015fe3028d80] Discarding ID3 tags because more suitable tags were found.
Guessed Channel Layout for Input Stream #0.0 : stereo
Input #0, wav, from 'source/valve.wav':
Metadata:
title : valve
encoder : Lavf58.20.100 (libsndfile-1.0.24)
Duration: 00:06:47.20, bitrate: 1411 kb/s
Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 44100 Hz, stereo, s16, 1411 kb/s
[segment @ 0000015fe3461640] Opening 'parts/out000000000.wav' for writing
[segment @ 0000015fe3461640] Failed to open segment 'parts/out000000000.wav'
Could not write header for output file #0 (incorrect codec parameters ?): No such file or directory
Stream mapping:
Stream #0:0 -> #0:0 (copy)
Last message repeated 1 times
How can I change the code so that it creates a new wav file every time it needs to?
Hi Alex,
I am using your code to convert some voice commands to text, but I run into this error when I run the fast.py script.
—
File "/Users/Tony/anaconda3/lib/python3.7/site-packages/speech_recognition/__init__.py", line 937, in recognize_google_cloud
if "results" not in response or len(response["results"]) == 0: raise UnknownValueError()
UnknownValueError
I think I've followed all the steps correctly, except for step 4, as my files are already shorter than 30 seconds. I have very little coding experience, so any insight on this would be greatly appreciated! 🙂
Kind regards,
Tony
Hi, have you thought about implementing a self-hosted audio transcription server? This would be a great addition to the community, as I agree that many of the professional services cost too much for individuals who use them occasionally (like me!). Thanks for the insightful article.
I have; it would still need Google Cloud auth, unless I wanted to pay for it myself. I think it would be fairly simple for somebody to build using the Google Cloud API as outlined in this article, but ultimately I didn't feel like making a business out of it and didn't have time to work on it as a side project (my free time is fairly limited since I have two little kids).
Alex, probably a duplicate reply here; I didn't save the first one, my bad. I have made a fork and a couple of enhancements without over-engineering, and didn't know if you wanted forks or contributions to a new branch or master. Sent a tweet as well.
Hi Alex,
FYI – First, love it, great example of how to get off the ground! Thank you so much for what you have produced and shared!
QUESTION / ACTION REQUESTED: I have a couple of DCRs/issues I found, and I have made changes to address them; I wanted to know how you would propose integrating them?
My proposals
2a. A new GitHub project branched from yours, since yours is the reference for the article
2b. You determine and establish collaboration guidelines on your GitHub project, and I and others like MP below create issues and code check-ins against them (with maybe dev tests 🙂 ) on a separate branch, which you can review and decide whether they warrant inclusion based on your goals and scope, then release as a new version
2c. Something better that you, MP, or others come up with.
Cheers!
Sorry, I don’t think I ever got notified of this. I just changed jobs, and it’s possible that I overlooked it.
I think it’s a great idea and I am happy to make you a co-owner of that if you are interested. Can you ping me on Twitter again or drop me a line here https://www.alexkras.com/contact/ and we can continue the discussion via email.
Now that I think about it, I can just move the article version into a branch and make master a living thing. The repo already has 69 stars, so it would be a shame to give it up 🙂
I also faced the same error. It’s because of the ‘google-api-python-client’ version. Install the google-api-python-client as:
pip install google-api-python-client==1.6.4
Since my previous post I've solved all the issues that came up, and after reading over the comments, the following function may help others too. I found that reducing the silence blocks, much like what would be useful for podcasts, solved all the issues with returning null transcripts.
Silence how-to https://digitalcardboard.com/blog/2009/08/25/the-sox-of-silence/
remove_silence () {
    tempfile=`date '+%Y%m%d%H%M%S'`
    # Removes short periods of silence
    sox $1 $tempfile.wav silence -l 1 0.1 1% -1 2.0 1%
    # Shortens long periods of silence, ignoring noise bursts
    sox $1 $tempfile.wav silence -l 1 0.3 1% -1 2.0 1%
    mv -v $1 $tempfile'_original_'$1
    mv -v $tempfile.wav $1
}
Hi Alex, I've been updating the components for processing larger files, and the fast and slow scripts are pausing on seemingly kosher wav files; the fast script also seems to bring down the network even when I reduce the threads. I was wondering if there are any thoughts on writing out the transcription files more often so that the whole batch of queries is not lost? And has anyone updated the script to work a little more failsafe over, say, a 10-hour audio chunk? Thanks a bunch; it's nice to have something to use to bring down the cost of online transcription services!
Hi Alex,
I am using a shorter version of the code on a single file:
##############
import speech_recognition as sr

r = sr.Recognizer()
with open("api-key.json") as f:
    GOOGLE_CLOUD_SPEECH_CREDENTIALS = f.read()

test_audio = sr.AudioFile('C://users//me//desktop//page2.wav')
with test_audio as source:
    audio = r.record(source)

r.recognize_google_cloud(audio, language='es-MX',
                         credentials_json=GOOGLE_CLOUD_SPEECH_CREDENTIALS)
##############
but I am getting two error messages for this snippet. The first is ModuleNotFoundError: No module named 'oauth2client'. I have pip-installed oauth2client as well as oauthlib and google-auth.
The second related error is:
RequestError: missing google-api-python-client module: ensure that google-api-python-client is set up correctly.
I haven’t been able to solve these issues despite troubleshooting at length. Do you have any idea how to fix this?
Sorry, no idea. Try using a virtual environment if you haven't already, and maybe Python 2 instead of 3. You can control that with the virtual environment as well.
https://www.alexkras.com/how-to-use-virtualenv-in-python-to-install-packages-locally/
This post is getting kind of old; maybe it's also a good time to check out Google's official Python client and see if it works better.
Hi Alex,
First off, thank you so much for this code! Now, I don't know if the below error is an issue on my side or GCloud being messy, but I would love any help you and this community can provide. Here is my error:
Traceback (most recent call last):
File "C:\Python36\lib\site-packages\speech_recognition\__init__.py", line 930, in recognize_google_cloud
response = request.execute()
File "C:\Python36\lib\site-packages\oauth2client\_helpers.py", line 133, in positional_wrapper
return wrapped(*args, **kwargs)
File "C:\Python36\lib\site-packages\googleapiclient\http.py", line 842, in execute
raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "fast.py", line 28, in <module>
all_text = pool.map(transcribe, enumerate(files))
File "C:\Python36\lib\multiprocessing\pool.py", line 260, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "C:\Python36\lib\multiprocessing\pool.py", line 608, in get
raise self._value
File "C:\Python36\lib\multiprocessing\pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "C:\Python36\lib\multiprocessing\pool.py", line 44, in mapstar
return list(map(*args))
File "fast.py", line 21, in transcribe
text = r.recognize_google_cloud(audio, credentials_json=GOOGLE_CLOUD_SPEECH_CREDENTIALS)
File "C:\Python36\lib\site-packages\speech_recognition\__init__.py", line 932, in recognize_google_cloud
raise RequestError(e)
speech_recognition.RequestError:
I’ve waited for 10 minutes after enabling the API and tried again, but no luck.
Thanks in advance.
Regards,
Rashmil.
Not sure, could be file formatting. Have you tried with sample files?
Hi Alex and Rashmil,
Have you found any solution to this issue? I have the same issue and don't know how to proceed.
Thanks in advance
Best
Ali
Hi Alex,
After changing the sound file I had better results. Still, if Google Cloud could not recognize some parts of the audio, an error pops up. So is there any way to tell the Google client to ignore parts of the audio that are not clear?
Thank you so much for providing this code. I would like to run the code on 100 audio files. How would that be possible?
Not sure, I think if you look at the pull requests in the repo somebody automated file conversion (although I haven’t merged that in yet). From there you may be able to automate it further.
Hi Alex, thanks for sharing your code. I managed to run it as-is and also used different mp3 audio files, which I converted to wav using Audacity. Works perfectly! I will try using a microphone as an audio source.
Once more many thanks.
Gideon
Thank you for this great work. I followed your steps, but I ran into this error:
"C:\Program Files (x86)\Python37-32\python.exe" C:/Users/hudad/PycharmProjects/speech-to-text-master/slow.py
0%| | 0/3 [00:00<?, ?it/s]
Traceback (most recent call last):
File "C:\Users\hudad\AppData\Roaming\Python\Python37\site-packages\speech_recognition\__init__.py", line 885, in recognize_google_cloud
try: json.loads(credentials_json)
File "C:\Program Files (x86)\Python37-32\lib\json\__init__.py", line 348, in loads
return _default_decoder.decode(s)
File "C:\Program Files (x86)\Python37-32\lib\json\decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "C:\Program Files (x86)\Python37-32\lib\json\decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:/Users/hudad/PycharmProjects/speech-to-text-master/slow.py", line 19, in <module>
text = r.recognize_google_cloud(audio, credentials_json=GOOGLE_CLOUD_SPEECH_CREDENTIALS)
File "C:\Users\hudad\AppData\Roaming\Python\Python37\site-packages\speech_recognition\__init__.py", line 886, in recognize_google_cloud
except Exception: raise AssertionError("``credentials_json`` must be ``None`` or a valid JSON string")
AssertionError: ``credentials_json`` must be ``None`` or a valid JSON string
Process finished with exit code 1
Please help
Luke, your last audio file is crashing the code because there is no speech to transcribe. Listen to your last file; if it is just music and no voice, delete it and it should work.
Hey Alex,
Thanks for putting together the comprehensive tutorial and code – I’ve managed to transcribe some of my own audio but am running into problems with other files.
I have a collection of files, all of which I’m converting to mono @ 48000hz (doing this to remove variables for debugging) and then running through fast.py.
The problem I’m encountering appears to occur when attempting to process the final 30s audio chunk in the ‘parts’ folder. For example, my current file has been split into 74 parts – all of which were successfully processed apart from #74.
This is the traceback I’m getting:
Traceback (most recent call last):
File "fast.py", line 28, in <module>
all_text = pool.map(transcribe, enumerate(files))
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/pool.py", line 253, in map
return self.map_async(func, iterable, chunksize).get()
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/pool.py", line 572, in get
raise self._value
speech_recognition.UnknownValueError
Do you have any suggestions why this might be the case?
Unsure why it’s working fine for some files, but not for others.
Thanks
Luke
Thanks again Alex for this code and your guide.
I am having the same problem as Luke,
Some files just keep getting the same error ^^
Try listening to the last track. If there is no speech and just music, or audio without words, delete that track and try again.
Very good job. Thank you.
I tried your code for my country France (World champion ;=)). Excellent
Change in fast.py
1/ text = r.recognize_google_cloud(audio_data=audio, credentials_json=GOOGLE_CLOUD_SPEECH_CREDENTIALS, language="fr-FR")
2/ transcript = transcript + "{:0>2d}:{:0>2d}:{:0>2d} {}\n".format(h, m, s, t['text'].encode('utf8'))
and it works: the text comes out in the French language.
Hi Alex,
Your code is very helpful. Can you tell me what the code would be to add punctuation at the end of each line?
Please share.
Regards,
Milan
Hello, any update on my problem? Please share.
Hi,
I am getting the below error:
"Sync input too long. For audio longer than 1 min use LongRunningRecognize with a 'uri' parameter."
Which I understand is due to the length of the audio file (more than 1 min). I googled the error and got the suggestion mentioned in this link:
https://stackoverflow.com/questions/44835522/why-does-my-python-script-not-recognize-speech-from-audio-file?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa
The above link ultimately takes me to the below sample code:
=======================
def transcribe_gcs(gcs_uri):
    """Asynchronously transcribes the audio file specified by the gcs_uri."""
    from google.cloud import speech
    from google.cloud.speech import enums
    from google.cloud.speech import types
    client = speech.SpeechClient()
    print('Confidence: {}'.format(result.alternatives[0].confidence))
So does this mean I will have to rewrite the code using a different set of modules, or can we adjust the .long_running_recognize function somewhere in your code?
amitesh
Hi Alex, does the Google Speech to Text API support multi-speaker recognition while transcribing? Also, does it output timestamps for each word or sentence as well? Sorry for shooting so many questions, but my final question is: does it have an offline version that one can use? Thanks.
I don't know of a way to do this. There is an open GitHub issue if somebody wants to pitch in.
https://github.com/akras14/speech-to-text/issues/1
Hello Alex, I tried to generate an API key and it says that I have to create a billing account, which requires credit card information. So how does it work? Is that free? Do I need to pay to get the script to work? Thanks.
Yes, unfortunately credit card is required to register, but they do offer a free tier, so you shouldn’t be charged anything.
How can we use this Google API to convert streaming speech to text? What should our code look like?
Hello Alex,
I am at the very early stage of this activity, i.e. I have installed all the libraries mentioned by you. I am using Windows 10 to perform the activity.
I wanted to generate the API key, but I guess I need to pay for that, right? Second, I couldn't locate "API Manager" in the Google Cloud console. All I could see was 3 tiles.
I am not sure. You should be able to do it under the free trial. Re the UI, maybe they redesigned it. Seems like other people were able to get it to work. I'll have to check it out later. If anybody knows, please comment.
Hi, finally, I got the API key generated. I just had to browse around the website a bit more. Thank you.
The ffmpeg command "ffmpeg -i source/genevieve.wav -f segment -segment_time 30 -c copy parts/out%09d.wav" doesn't work when I try to run it:
Guessed Channel Layout for Input Stream #0.0 : mono
Input #0, wav, from 'source/genevieve.wav':
Duration: 00:01:10.33, bitrate: 768 kb/s
Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 48000 Hz, mono, s16, 768 kb/s
[segment @ 0000021e48be0dc0] Opening 'parts/out000000000.wav' for writing
[segment @ 0000021e48be0dc0] Failed to open segment 'parts/out000000000.wav'
Could not write header for output file #0 (incorrect codec parameters ?): No such file or directory
Stream mapping:
Stream #0:0 -> #0:0 (copy)
Last message repeated 1 times
I don't know how to fix it or what I am doing wrong.
I’m seeing the same problem… did you find a solution?
Hey José,
The -c copy parts/out%09d.wav part of the code expects there to be a folder in the speech-to-text-master folder called “parts”.
Create this and the parts will be saved there!
Found a way to avoid breaking up a long audio file:
1. Convert the audio file to FLAC (downmix from stereo to mono) — Audacity can export to FLAC, make note of the bitrate
2. Upload FLAC file to Google Cloud Storage — create new bucket if need be, no need to make it public
3. Edit transcribe_async.py: find the bitrate for the FLAC and change it accordingly; also update the timeout value to 600 (10 min)
4. Run command: python transcribe_async.py gs://bucketname/filename.flac
Hello Alex, thank you very much for your collaboration.
Alex, if I wanted to change the language of the API, for example with the parameter language_code='es-CO', where should I do it? Thank you.
I didn't have this use case, and I am not sure that the third-party library I used supports this option.
This example from Google might be helpful, but I did not try it myself: https://github.com/GoogleCloudPlatform/python-docs-samples/blob/c6fe72714517e6660bc758e6358623eea0a48608/speech/cloud-client/quickstart.py
did you manage to make it work?
The tutorial is great, and it is working for me. Nevertheless, my audio files are non-English. Have you found a solution for setting the language?
I managed to set the language. If you use the slow.py version, you could modify line 19 where the “recognize_google_cloud” function of the library is used like this:
text = r.recognize_google_cloud(audio, credentials_json=GOOGLE_CLOUD_SPEECH_CREDENTIALS, language="de-DE")
See the documentation here: https://github.com/Uberi/speech_recognition/blob/master/reference/library-reference.rst#recognizer_instancerecognize_google_cloudaudio_data-audiodata-credentials_json-unionstr-none–none-language-str–en-us-preferred_phrases-unioniterablestr-none–none-show_all-bool–false—unionstr-dictstr-any
Seems to work for me 🙂
Here’s something I tried. I already had WAV recordings I obtained from an MP3 Player.
Hence, I decided to skip the MP3->WAV conversion step.
I ran into multiple errors, mainly due to format inconsistency with the native WAV type.
And so, I’m posting this.
I’ve used “VOICE001.wav” as an example. It works well with MP3 inputs as well.
For MP3, skip step 1.
Converting to the right WAV format
Check for your WAV file’s properties.
ffprobe VOICE001.wav
# Input #0, wav, from 'VOICE001.wav':
Duration: 00:01:16.54, bitrate: 128 kb/s
Stream #0:0: Audio: adpcm_ima_wav ([17][0][0][0] / 0x0011), 32000 Hz, 1 channels, s16p, 128 kb/s
Convert & Replace the WAV file to native type using Audacity.
Again
ffprobe VOICE001.wav
# Input #0, wav, from 'VOICE001.wav':
Duration: 00:01:16.28, bitrate: 512 kb/s
Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 32000 Hz, 1 channels, s16, 512 kb/s
For remaining WAV files, use the native format details for conversion using ffmpeg.
ffmpeg -i VOICE001.wav -acodec pcm_s16le -ar 32000 VOICE001-win.wav
# Output #0, wav, to 'VOICE001-win.wav':
Metadata:
ISFT : Lavf58.3.100
Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 32000 Hz, mono, s16, 512 kb/s
Metadata:
encoder : Lavc58.9.100 pcm_s16le
size= 4768kB time=00:01:16.28 bitrate= 512.0kbits/s speed= 246x
video:0kB audio:4768kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.001598%
* Here, the Audio Codec and Sampling Rate fields have been altered to fit the native format settings.
I tried using the code with the source files that you provided (genevieve.wav), however I get the following error:
ValueError: Audio file could not be read as PCM WAV, AIFF/AIFF-C, or Native FLAC; check if file is corrupted or in another format
I did not change any code. Any ideas on what I’m doing wrong here?
Did you generate parts with ffmpeg?
I just re-ran it fresh and it worked for me. I am using Python 3 on macOS.
What system are you on, at what point does it fail?
I’m not sure exactly what I was doing wrong, but it works now. Sorry for the inconvenience.
Hi,
Like @Jamshed, I’m getting that same error when I run on genevieve.wav :
ValueError: Audio file could not be read as PCM WAV, AIFF/AIFF-C, or Native FLAC; check if file is corrupted or in another format
It also includes this in the result:
wave.Error: file does not start with RIFF id
I checked the file:
$ file out000000002.wav
out000000002.wav: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 48000 Hz
$ file -i out000000002.wav
out000000002.wav: regular file
$ mediainfo out000000002.wav
General
Complete name : out000000002.wav
Format : Wave
File size : 966 KiB
Duration : 10 s 302 ms
Overall bit rate mode : Constant
Overall bit rate : 768 kb/s
Writing application : Lavf56.36.100
Audio
Format : PCM
Format settings : Little / Signed
Codec ID : 1
Duration : 10 s 302 ms
Bit rate mode : Constant
Bit rate : 768 kb/s
Channel(s) : 1 channel
Sampling rate : 48.0 kHz
Bit depth : 16 bits
Stream size : 966 KiB (100%)
So I’m wondering if something is wrong with my ffmpeg install? Any advice appreciated, and thank you for sharing all this.
Sorry, not sure. I did mine on Mac OS, with ffmpeg installed via Homebrew. What is your set up?
I solved it. It seemed to be conflicting packages in my python install. I set up a fresh python3 environment, re-installed ffmpeg etc, and it works really really well now. Thanks!
Hi Alex
One issue I found is that if the number of files in the parts folder exceeds the pool workers (say you have 20 files in the parts folder and pool = Pool(8)), only the first 8 files are processed IN order, and after that all remaining files in the parts folder are processed OUT of sequence. I tried a few things but it's still not working. It seems like even though the map function is supposed to keep the sort order, for some reason the order is only kept for the first 8 files.
Strange, what platform/python version are you using?
Using AWS EC2 Amazon Linux, Python 3.6.
I have a wav file of about 60 MB. I partition the file into 55- or 60-second chunks, which generates about 57 files in the parts folder. Using a pool size of 8, the first 8 files are in order; the remaining are all in mixed order.
I tried sorting the list first and confirmed that it's in order, but after the first 8 files the order is lost. Trying the Google async API, but it's not working yet.
Reading over the code, I see that I am already taking an extra step to sort by idx. So the only thing I can think of is that those ids come in the wrong order.
Can you confirm that when you call os.listdir files show up in the right order?
No, they are not, and what I had done was to apply a sort like: files = sorted(os.listdir('parts/')). If I don't use the sort, the entire transcript is all over the place, meaning the beginning of the wav file could be transcribed in the middle of the text, and so on. Next I applied sorted(os.listdir('parts/')) and confirmed in the shell that all the files are sorted. Next I ran the script and confirmed that ONLY the first batch of the pool (in this case only the first 8 files) is ordered correctly; the next pool worker loses the sort again. Do you know what I mean?
Here is the list dir without the sort:
Here is the list dir with sort
But still for some reason only the first batch of the pool workers are in the right order in the transcribe file, starting 0009.wav on wards the transcribe file is no longer in order.
Even though the map function is supposed to keep the order.
Strange
Even if map doesn't keep them sorted, this line, sorted(all_text, key=lambda x: x['idx']), should re-sort them back into order.
Try to debug this sort/idx and see if something funky happens around there.
I am having the same problem as daz. I added the sort also and it is not sorting correctly (on the fast version).
I am testing the slow (unthreaded) version to see if it is the threading that is causing the ordering problem.
files = sorted(os.listdir('parts/'))
parts/out0000.wav started
parts/out0002.wav started
parts/out0006.wav started
parts/out0010.wav started
parts/out0014.wav started
parts/out0008.wav started
parts/out0004.wav started
The limitation for 60 seconds only applies to synchronous requests (https://cloud.google.com/speech/quotas). Is there a reason you didn’t use an asynchronous request rather than splitting up the file?
I just didn’t know that was an option. Thanks for the tip, I’ll have to investigate. May be it was just a limitation of the library I was using.
I tried the Google async example, but it fails halfway through. Do you have a working example using the Google async API to convert a wav file to text?
Thanks
Is there a way to overcome the 30-second limitation so I can do the whole file in one try? Or, if I have to break the file up, would it be possible to have the transcripts numbered? Like, if the input wave files are wave01.wav and wave02.wav, the output would be transcript0102.txt? Thanks for the great script.
Sorry I don’t think I follow. I believe it already does both, final transcript is one text file.
Here is the use case: I have multiple wav files: Alex.wav, Vida.wav, Jim.wav. I would like to modify the program so that it reads an inputwav folder containing all the wav files (alex.wav, vida.wav, jim.wav) and runs them through the Python program to output alex_transcript.txt, vida_transcript.txt, jim_transcript.txt. But I am having difficulty getting it to work, so I ran each file individually. Thanks, Alex.
Ah, I see. Yes, then it goes back to figuring out a way to convert the files to proper wav programmatically, and then calling the split-files command (and probably adding a clean-up step later).
I didn't get this far.
Another idea that I didn't get to is splitting the file by silence around the 30-second mark, instead of a hard 30-second split, which can cut mid-sentence or mid-word.
Good luck! Let me know if you figure any of this out.
"ffmpeg -i input.mp3 output.wav" converts the mp3 file to a wav file without any compression.
It is better to have a command do the task instead of new software if we are automating the task.
Unfortunately something was off about this type of wav, which I did not dig in to figure out. Transcription did not work with a wav created like this. Maybe it was just something local to my Mac.
I tried the same thing, but for some reason I think it read the wav file backwards, meaning it started transcribing from the end of the file. Thanks, Alex, for pointing this out. I'll go back to using Audacity.
Thanks for writing all this up! It's been super helpful. Not sure if it's still an issue, but I had the same problem. It seems like ffmpeg ignores the format when you're doing the segmentation... Running it in two lines works for me, though there was probably a better way to actually fix the problem:
ffmpeg -i db/foo.m4a -c:a pcm_s16le db/stage1.wav