
Transcribing Speech to Text with Python and Google Cloud Speech API

January 4, 2018 by Alex Kras

This tutorial will walk through using Google Cloud Speech API to transcribe a large audio file.

All code and sample files can be found in the speech-to-text GitHub repo.

Transcribe large audio files using Python & our Cloud Speech API. @akras14 shows how https://t.co/dY56lmE0TD

— Google Cloud (@googlecloud) January 11, 2018

Sample Results

This approach works, but I found that the results vary greatly based on the quality of the input.

Transcribing a Reading by My Wife

I asked my wife to read something out loud for about 1.5 minutes, as if she were dictating to Siri. She is a native English speaker, and we recorded using the microphone on an iPhone 6s.

https://www.alexkras.com/wp-content/uploads/genevieve.mp3

Which resulted in the following transcript:

00:00:00 this Dynamic Workshop aims to provide up to date information on pharmacological approaches, issues, and treatment in the geriatric population to assist in preventing medication-related problems, appropriately and effectively managing medications and compliance. The concept of polypharmacy parentheses taking multiple types of drugs parentheses will also be discussed, as the
00:00:30 is a common issue that can impact adverse side effects in the geriatric population. Participants will leave with a knowledge and considerations of common drug interaction and how to minimize the effects that limit function. Summit professional education is approved provider of continuing education. This course is offered for 6
00:01:00 . this course contains a Content classified under the both the domain of occupational therapy and professional issues.

I think that Google Cloud Speech API did an amazing job, getting over 95% of the content right, especially considering that this was not a professional recording and that you can hear my kid saying something in the background 🙂

Transcribing a Radio Broadcast with a Few Different Voices

A reader sent me the following audio file, recorded from the 95.5 Sports Hub radio station (the Toucher & Rich morning show, broadcast on January 26th, 2018). This, too, turned out better than I expected.

https://www.alexkras.com/wp-content/uploads/radio_sample.mp3

00:00:00 announced that there was going to be a new XXX FL it was going to start in two years and here’s what he had to say that you accept kickoff in 2020 quite frankly we’re going to give the game of football back to fans I’m sure everyone has a lot of questions for me but I also have a lot of questions for you in fact we’re going to ask a lot of questions and listen to players coaches
00:00:30 call experts technology executive members of the media and anyone else who understands and loves the game of football but most importantly we’re going to be listening to someone ask that the will the question of what would you do if you can reimagine the game of professional football would you frenchtons eliminate halftime would you have if you were commercial breaks but the game of foot
00:01:00 I’ll be faster when the rules be simpler can you ask Chef elevated fan Centric with all the things you like to see in the last of the things you don’t and no doubt a lot of Innovations along the way we will put you at a shorter faster-paced family-friendly and easier to understand game don’t get me wrong it’s still football but it’s professional football reimagined Sims 4 launching a 20
00:01:30 hey we have two years which is plenty of time to really get it right so aside from family friendly which I just think means that you have to stand for the national anthem I have no idea because the other one was very sex. That’s why is it either it was the cheerleaders with the super tight outfits and stuff cheerleaders were dressed and I stripped it sounds like a very good idea sounds like he has he has no plan no he does he’s taking everything he does have
00:02:00 and it said all the teams are going to be owned by the same entity he knows that they’re starting with a team and that they’re going to be shorter games with maybe no halftime with inferior Talent no not necessarily interior Town there’s already a saturation of football as is that is the biggest thing that people been complaining about the game what is he thinking you know what he said you ate yesterday you said we’re going to make it short and then we want your ideas no gimmicks all the things that God was just playing around
00:02:30 this does feel like a guy who’s had enormous prefer

Transcribing a Speech by Winston Churchill

I wanted to challenge the script further, so I decided to run it on a famous speech by Winston Churchill, titled The Threat of Nazi Germany.

Here is the audio file:

https://www.alexkras.com/wp-content/uploads/winston-churchill-the-threat-of-germany.mp3

Which resulted in the following transcript:

00:00:00 many people think that the best way to escape War if the dwelling and then print them DVD for the younger generation they plump the grizzly photographs Before Their Eyes they feel that they dilate of generals and admirals they do not fit the crime I didn’t think they’d father
00:00:30 human strife how old is teaching in preventing us from attacking or invading any other country with the do so how would it help if we were attacked or invaded on stove that is a question we have to ask what did they does contempt of the Lord Beaverbrook
00:01:00 I’ll listen to the impassioned the field by George would they agree to meet that famous South African general identity I have bone responsibilities for the safety of this country in grievance time
00:01:30 we could convince and persuade them to go back play my play it seems to me you are rich we are what we are hungry it would be in Victoria’s we have been defeated you have valuable, we have not you have your name you have had the phone
00:02:00 set up pencil future about all I see are they would say you are weak and we are strong after all my friend your nephew all the way by that railing for nation of nearly 70 million the most educated industrial scientific discipline people in the world loving cup from childhood
00:02:30 all Epic Gloria Texas iron and death in battle at the noblest face for men yeah I need the nation we could have been done in order to augment its Collective Strength yeah definition of a group of preaching a gospel of intolerance and unrestrained by the wall by Parliament
00:03:00 public opinion in that country all packages speeches or morbid Wahlberg off of getting off the press I’m down you cable of Columbus they have a meeting dial shalt not kill it is the plenty of photos and or both now
00:03:30 play Ariana me with the upload speed I’m ready to that end lamentable weapon Javier against which all Navy is no defense and before which women and children so weak and frail capacity of the warriors on the front-line trenches all live equal adding partial patio
00:04:00 play with you but with the new weapon, new method of compelling the submission of racing bike terrorizing and torturing population and worst of all the more
00:04:30 the ball in cricket the structure of its social and economic life some more of those who may make it there praying love you too fat Grim despicable fact and invasive affect ionic again what are we to do

The result is an order of magnitude worse than for my wife’s recording. Most likely this is caused by the poor audio quality. In addition, Churchill used a lot of words that are no longer common.

If you are still reading, let’s get started.

1. Sign Up for a Free Tier Account

Google Cloud offers a Free Tier plan, which will be used in this tutorial. An account is required to get an API key.

2. Generate an API Key

Follow these steps to generate an API key:

  1. Sign in to Google Cloud Console
  2. Click “APIs & Services”
  3. Click “Credentials”
  4. Click “Create Credentials”
  5. Select “Service Account Key”
  6. Under “Service Account” select “New service account”
  7. Name the service (whatever you’d like)
  8. Select Role: “Project” -> “Owner”
  9. Leave “JSON” option selected
  10. Click “Create”
  11. Save generated API key file
  12. Rename file to api-key.json

If you plan to test this code, make sure to move the key into the cloned speech-to-text repo.

3. Convert the Audio File to WAV Format

I ran into issues when trying to convert my audio file with command line tools. Instead, I used Audacity (an open source audio editing tool) to convert my file to WAV format. Audacity is great, and I highly recommend it.

The steps to convert:

  1. Open file in Audacity
  2. Click “File” menu
  3. Click “Save other”
  4. Click “Export as WAV”
  5. Export with the default settings

4. Break Up the Audio File into Smaller Parts

Google Cloud Speech API only accepts files no longer than 60 seconds. To be on the safe side, I broke my files into 30-second chunks. To do that, I used an open source command line tool called ffmpeg. It can be downloaded from its site. On a Mac, I installed it with Homebrew via brew install ffmpeg.

Here is the command I used to break up my file:

# Clean out old parts if needed via rm -rf parts/*
ffmpeg -i source/genevieve.wav -f segment -segment_time 30 -c copy parts/out%09d.wav

Here, source/genevieve.wav is the name of the input file, and parts/out%09d.wav is the format for the output files. %09d indicates that the file number will be zero-padded to 9 digits (e.g. out000000001.wav), allowing the files to be sorted alphabetically. This way the ls command returns the files in the right order.
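As a quick sketch of why the zero padding matters: zero-padded names sort alphabetically in the same order as their chunk numbers, while unpadded names do not.

```python
# Zero-padded chunk names (as produced by parts/out%09d.wav) sort
# alphabetically in numeric order, so ls and Python's sorted() both
# return them in playback order.
numbers = [0, 1, 2, 10, 100]
padded = ["out%09d.wav" % i for i in numbers]
unpadded = ["out%d.wav" % i for i in numbers]

assert sorted(padded) == padded      # alphabetical order == numeric order
assert sorted(unpadded) != unpadded  # "out10.wav" sorts before "out2.wav"
```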

5. Install Required Python Modules

I added a requirements.txt to the example repo with all the needed libraries. It can be used to install them all via:

pip3 install -r requirements.txt

The real hero on this list is SpeechRecognition. It does most of the heavy lifting.

The rest of the libraries come with the official google-api-python-client package.

I also used the tqdm module to show progress in the slower version of the script.
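Putting the above together, a minimal requirements.txt for this setup might look like the following (the exact contents and pins in the repo may differ, so treat this as a sketch):

```
SpeechRecognition
google-api-python-client
tqdm
```

One commenter below reports that older setups needed google-api-python-client pinned to 1.6.4, so pinning versions is worth considering if you hit import errors.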

6. Run the Code

Finally, we can run the Python script to get the transcript, for example: python3 fast.py.

The slow version

Here is the GitHub link.

This script:

  1. Loads the API key from step 2 into memory
  2. Gets a list of files (chunks)
  3. For every file, calls the speech-to-text API endpoint
  4. Adds the result to a list
  5. Combines all results and adds a timestamp (every 30 seconds)
  6. Saves the results to transcript.txt

import os
import speech_recognition as sr
from tqdm import tqdm

with open("api-key.json") as f:
    GOOGLE_CLOUD_SPEECH_CREDENTIALS = f.read()

r = sr.Recognizer()
files = sorted(os.listdir('parts/'))

all_text = []

for f in tqdm(files):
    name = "parts/" + f
    # Load audio file
    with sr.AudioFile(name) as source:
        audio = r.record(source)
    # Transcribe audio file
    text = r.recognize_google_cloud(audio, credentials_json=GOOGLE_CLOUD_SPEECH_CREDENTIALS)
    all_text.append(text)

transcript = ""
for i, t in enumerate(all_text):
    total_seconds = i * 30
    # Cool shortcut from:
    # https://stackoverflow.com/questions/775049/python-time-seconds-to-hms
    # to get hours, minutes and seconds
    m, s = divmod(total_seconds, 60)
    h, m = divmod(m, 60)

    # Format time as h:m:s - 30 seconds of text
    transcript = transcript + "{:0>2d}:{:0>2d}:{:0>2d} {}\n".format(h, m, s, t)

print(transcript)

with open("transcript.txt", "w") as f:
    f.write(transcript)

The code works, but it does take a while on longer source files.

Faster version

To speed things up, I added threading to the slow version. I describe the method in detail in my Simple Python Threading Example post.

Here is the GitHub link.

The main difference is that I moved the processing into a function and added logic at the end to sort the processed results back into the right order.

import os
import speech_recognition as sr
from tqdm import tqdm
from multiprocessing.dummy import Pool
pool = Pool(8) # Number of concurrent threads

with open("api-key.json") as f:
    GOOGLE_CLOUD_SPEECH_CREDENTIALS = f.read()

r = sr.Recognizer()
files = sorted(os.listdir('parts/'))

def transcribe(data):
    idx, file = data
    name = "parts/" + file
    print(name + " started")
    # Load audio file
    with sr.AudioFile(name) as source:
        audio = r.record(source)
    # Transcribe audio file
    text = r.recognize_google_cloud(audio, credentials_json=GOOGLE_CLOUD_SPEECH_CREDENTIALS)
    print(name + " done")
    return {
        "idx": idx,
        "text": text
    }

all_text = pool.map(transcribe, enumerate(files))
pool.close()
pool.join()

transcript = ""
for t in sorted(all_text, key=lambda x: x['idx']):
    total_seconds = t['idx'] * 30
    # Cool shortcut from:
    # https://stackoverflow.com/questions/775049/python-time-seconds-to-hms
    # to get hours, minutes and seconds
    m, s = divmod(total_seconds, 60)
    h, m = divmod(m, 60)

    # Format time as h:m:s - 30 seconds of text
    transcript = transcript + "{:0>2d}:{:0>2d}:{:0>2d} {}\n".format(h, m, s, t['text'])

print(transcript)

with open("transcript.txt", "w") as f:
    f.write(transcript)

Conclusion

Results may vary, but there is utility even in poor transcriptions. For example, I had an hour-and-a-half audio recording from a hand-over meeting with a former co-worker. I remembered that he had mentioned something at some point, but I was dreading listening through the whole 1.5-hour file to find it. I ran the recording through this script, quickly found the keywords I needed, and the timestamps pointed me to the right part of the audio file.
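The keyword lookup described above is easy to script as well. This is a minimal sketch, assuming the transcript format the scripts produce (each line starts with an HH:MM:SS timestamp followed by roughly 30 seconds of text); find_keyword is a hypothetical helper, not part of the repo.

```python
def find_keyword(transcript, keyword):
    """Return the timestamp of every 30-second chunk that mentions keyword."""
    hits = []
    for line in transcript.splitlines():
        if not line.strip():
            continue
        # Each transcript line looks like: "00:01:30 some transcribed text"
        timestamp, _, text = line.partition(" ")
        if keyword.lower() in text.lower():
            hits.append(timestamp)
    return hits

# Example:
# with open("transcript.txt") as f:
#     print(find_keyword(f.read(), "migration"))
```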

For native English speakers like my wife, Google Cloud Speech API can easily replace a professional transcribing service, at a fraction of the cost.


Comments


  1. Praveen says

    July 14, 2020 at 7:12 am

    Google API is not free you still need to enter the CC details in order to use the 60mins for free/month. You should’ve mentioned it in the beginning so no one will try to find out it’s not gonna work.

  2. Vivi says

    June 1, 2020 at 8:29 am

    Hey there, I really appreciate this post. I would like to recommend this service – https://audext.com/speech-to-text/ – which helped me a lot in hard times 🙂 Thanks again.

  3. vishwanath reddy says

    January 11, 2020 at 3:16 am

    What if the file contains 4 minutes of audio? I think its gonna be bit messy; instead breaking them in to smaller parts, is there anyway to break them 4 minutes each for 8 minutes audio file? Why does Google Cloud Speech API only accepts files no longer than 60 seconds?If Google Cloud Speech API works to transcribe a large audio file in one shot instead splitting them then it could have been easier for us. Since we are not a tech geek though we have caliber to learn a bit of coding.

    Does this API app really help me to transcribe both small n larger audio files into the text format? Since I am a Transcriber

  4. silvino diaz carreras says

    December 20, 2019 at 12:52 am

    Hi Alex! Thank you for this article, excelent!!!;

    I tried to run the script to slice the audio and got the following error:
    SyntaxError: invalid syntax
    [Finished in 0.9s with exit code 1]
    [shell_cmd: python3 -OO -u “/Users/SilvinoDiaz/Desktop/speech-to-text-master/untitled.py”]
    [dir: /Users/SilvinoDiaz/Desktop/speech-to-text-master]
    [path: /Users/SilvinoDiaz/opt/anaconda3/bin:/Users/SilvinoDiaz/opt/anaconda3/condabin:/Users/SilvinoDiaz/anaconda3/bin:/Library/Frameworks/Python.framework/Versions/3.7/bin:/Library/Frameworks/Python.framework/Versions/3.7/bin:/Library/Frameworks/Python.framework/Versions/3.7/bin:/Library/Frameworks/Python.framework/Versions/3.7/bin:/anaconda3/bin:/Library/Frameworks/Python.framework/Versions/3.6/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/share/dotnet:/opt/X11/bin:~/.dotnet/tools:/Library/Frameworks/Mono.framework/Versions/Current/Commands]

    The IDLE is ST3
    I don’t know if it has to do with the installation of ‘anconda’ which causes the failure.
    Any idea?
    Thank you very much.

  5. Mary says

    December 3, 2019 at 1:22 pm

    Hi, Thanks for this code. For more than 10 minutes, the chunk number 11 and 12 appears as the second oaragraph and this part of the text becomes misplaced. My question is why is this happening?

  6. Herman Kurrelmeier says

    November 6, 2019 at 10:49 pm

    Alex, when I try and run ffmpeg to break up the audio file, it keeps giving me an error saying that it couldn’t segment and write the headers, how would I change the command so that ffmpeg creates each wav file as it goes??

  7. Herman Kurrelmeier says

    November 6, 2019 at 10:21 pm

    Alex, I am getting this error when I try and use ffmpeg to break up my audio file:

    C:\Users\hmkur\Desktop\Python\Transcribing_Audio>ffmpeg -i source/valve.wav -f segment -segment_time 30 -c copy parts/out%09d.wav
    ffmpeg version 4.2.1 Copyright (c) 2000-2019 the FFmpeg developers
    built with gcc 9.1.1 (GCC) 20190807
    configuration: –enable-gpl –enable-version3 –enable-sdl2 –enable-fontconfig –enable-gnutls –enable-iconv –enable-libass –enable-libdav1d –enable-libbluray –enable-libfreetype –enable-libmp3lame –enable-libopencore-amrnb –enable-libopencore-amrwb –enable-libopenjpeg –enable-libopus –enable-libshine –enable-libsnappy –enable-libsoxr –enable-libtheora –enable-libtwolame –enable-libvpx –enable-libwavpack –enable-libwebp –enable-libx264 –enable-libx265 –enable-libxml2 –enable-libzimg –enable-lzma –enable-zlib –enable-gmp –enable-libvidstab –enable-libvorbis –enable-libvo-amrwbenc –enable-libmysofa –enable-libspeex –enable-libxvid –enable-libaom –enable-libmfx –enable-amf –enable-ffnvcodec –enable-cuvid –enable-d3d11va –enable-nvenc –enable-nvdec –enable-dxva2 –enable-avisynth –enable-libopenmpt
    libavutil 56. 31.100 / 56. 31.100
    libavcodec 58. 54.100 / 58. 54.100
    libavformat 58. 29.100 / 58. 29.100
    libavdevice 58. 8.100 / 58. 8.100
    libavfilter 7. 57.100 / 7. 57.100
    libswscale 5. 5.100 / 5. 5.100
    libswresample 3. 5.100 / 3. 5.100
    libpostproc 55. 5.100 / 55. 5.100
    [wav @ 0000015fe3028d80] Discarding ID3 tags because more suitable tags were found.
    Guessed Channel Layout for Input Stream #0.0 : stereo
    Input #0, wav, from ‘source/valve.wav’:
    Metadata:
    title : valve
    encoder : Lavf58.20.100 (libsndfile-1.0.24)
    Duration: 00:06:47.20, bitrate: 1411 kb/s
    Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 44100 Hz, stereo, s16, 1411 kb/s
    [segment @ 0000015fe3461640] Opening ‘parts/out000000000.wav’ for writing
    [segment @ 0000015fe3461640] Failed to open segment ‘parts/out000000000.wav’
    Could not write header for output file #0 (incorrect codec parameters ?): No such file or directory
    Stream mapping:
    Stream #0:0 -> #0:0 (copy)
    Last message repeated 1 times

    How can I change the code so that it creates a new wav file everytime it needs to??

  8. Herman Kurrelmeier says

    November 6, 2019 at 10:18 pm

    Alex, when I run ffmpeg to try and break up my audio file, it is giving me this error:

    C:\Users\hmkur\Desktop\Python\Transcribing_Audio>ffmpeg -i source/valve.wav -f segment -segment_time 30 -c copy parts/out%09d.wav
    ffmpeg version 4.2.1 Copyright (c) 2000-2019 the FFmpeg developers
    built with gcc 9.1.1 (GCC) 20190807
    configuration: –enable-gpl –enable-version3 –enable-sdl2 –enable-fontconfig –enable-gnutls –enable-iconv –enable-libass –enable-libdav1d –enable-libbluray –enable-libfreetype –enable-libmp3lame –enable-libopencore-amrnb –enable-libopencore-amrwb –enable-libopenjpeg –enable-libopus –enable-libshine –enable-libsnappy –enable-libsoxr –enable-libtheora –enable-libtwolame –enable-libvpx –enable-libwavpack –enable-libwebp –enable-libx264 –enable-libx265 –enable-libxml2 –enable-libzimg –enable-lzma –enable-zlib –enable-gmp –enable-libvidstab –enable-libvorbis –enable-libvo-amrwbenc –enable-libmysofa –enable-libspeex –enable-libxvid –enable-libaom –enable-libmfx –enable-amf –enable-ffnvcodec –enable-cuvid –enable-d3d11va –enable-nvenc –enable-nvdec –enable-dxva2 –enable-avisynth –enable-libopenmpt
    libavutil 56. 31.100 / 56. 31.100
    libavcodec 58. 54.100 / 58. 54.100
    libavformat 58. 29.100 / 58. 29.100
    libavdevice 58. 8.100 / 58. 8.100
    libavfilter 7. 57.100 / 7. 57.100
    libswscale 5. 5.100 / 5. 5.100
    libswresample 3. 5.100 / 3. 5.100
    libpostproc 55. 5.100 / 55. 5.100
    [wav @ 0000015fe3028d80] Discarding ID3 tags because more suitable tags were found.
    Guessed Channel Layout for Input Stream #0.0 : stereo
    Input #0, wav, from ‘source/valve.wav’:
    Metadata:
    title : valve
    encoder : Lavf58.20.100 (libsndfile-1.0.24)
    Duration: 00:06:47.20, bitrate: 1411 kb/s
    Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 44100 Hz, stereo, s16, 1411 kb/s
    [segment @ 0000015fe3461640] Opening ‘parts/out000000000.wav’ for writing
    [segment @ 0000015fe3461640] Failed to open segment ‘parts/out000000000.wav’
    Could not write header for output file #0 (incorrect codec parameters ?): No such file or directory
    Stream mapping:
    Stream #0:0 -> #0:0 (copy)
    Last message repeated 1 times

    It is saying that it failed to open segment, it seems that though this might mean that and empty .wav needs to be waiting for each segment??? How can I change the code so that it creates a .wav file when it needs to?

  9. Tony says

    September 25, 2019 at 2:21 am

    Hi Alex,

    I am using your code to convert some voice commands to text, but run into this error when I run the ‘fast.py’ script.

    —
    File “/Users/Tony/anaconda3/lib/python3.7/site-packages/speech_recognition/init.py”, line 937, in recognize_google_cloud
    if “results” not in response or len(response[“results”]) == 0: raise UnknownValueError()

    UnknownValueError

    I think I’ve followed all the steps correctly, except for step 4, as my files are already smaller than 30 seconds. Have very little coding experience, any insight on this would be greatly appreciated! 🙂

    Kind regards,

    Tony

  10. Dennis Lee says

    September 6, 2019 at 4:06 am

    Hi, have you thought about implementing a self-hosted audio transcribe server. This would be a great addition to the community as I agree that many of the professional services costs too much for individuals who uses it occasionally (like me!). Thanks for the insightful article.

    • Alex Kras says

      September 6, 2019 at 7:21 am

      I have, it would still need Google Cloud Auth, unless I wanted to pay for it myself. I think it would be fairly simple for somebody to do using Google Cloud API as outlined in this article, but ultimately I didn’t feel like I wanted to make a business out of it and didn’t have time to work on it as a side project (my free time is fairly limited since I have two little kids).

  11. David W. Grigsby says

    June 28, 2019 at 11:08 am

    Alex, probably a duplicate reply here, didn’t save first, my bad. I have made a fork and a couple of enhancements without over engineering and didn’t know if you want “forks” or “contributions to new branch or master. Sent a Tweet as well.

  12. David W. Grigsby says

    June 24, 2019 at 11:50 am

    Hi Alex,

    FYI – First, love it, great example of how to get off the ground! Thank you so much for what you have produced and shared!
    QUESTION / ACTION REQUESTED: I have a couple of DCR/Issues I found and I have made changes to address them and wanted to know how you would propose integrating them?

    My proposals
    2a. a new git hub project branched from yours since it is reference for the article
    2b. You determine and establish collaboration guidelines on your github project and I and others like MP below create issues and code check-ins against them (with maybe dev tests 🙂 ) on a separate branch which you can review and decide if they warrant inclusion in your project based on your goal and scope and release as a new version
    2c. Something better you or MP or others come up with.

    Cheers!

    • Alex Kras says

      July 3, 2019 at 9:03 am

      Sorry, I don’t think I ever got notified of this. I just changed jobs, and it’s possible that I overlooked it.

      I think it’s a great idea and I am happy to make you a co-owner of that if you are interested. Can you ping me on Twitter again or drop me a line here https://www.alexkras.com/contact/ and we can continue the discussion via email.

      • Alex Kras says

        July 3, 2019 at 9:06 am

        Now that I think about it, I can just move the article version into a branch and make master a living thing. The repo already has 69 stars, so it would be a shame to give it up 🙂

        • neha says

          July 21, 2019 at 11:38 pm

          I also faced the same error. It’s because of the ‘google-api-python-client’ version. Install the google-api-python-client as:

          pip install google-api-python-client==1.6.4

  13. MP says

    May 18, 2019 at 2:07 pm

    So my previous post I’ve solved all the issues that came about and reading over the comments the following function may help others too. I found that reducing the silence blocks much like what would be useful for podcasts solved all issues with returning null transcripts.

    Silence how-to https://digitalcardboard.com/blog/2009/08/25/the-sox-of-silence/

    remove_silence () {
        tempfile=$(date '+%Y%m%d%H%M%S')

        # Removes short periods of silence
        sox $1 $tempfile.wav silence -l 1 0.1 1% -1 2.0 1%

        # Shortens long periods of silence, ignoring noise bursts
        sox $1 $tempfile.wav silence -l 1 0.3 1% -1 2.0 1%

        mv -v $1 $tempfile'_original_'$1
        mv -v $tempfile.wav $1
    }

  14. MP says

    May 17, 2019 at 1:16 pm

    Hi Alex, I’ve been updating the components of processing larger files and the fast and slow scripts are pausing on seemingly kosher wav files, and the fast script seems to bring down the network even when I bring down the threads, I was wondering if there were any thoughts on writing out the transcription files more often so that the whole batch of queries is not lost? And has anyone updated the script to work a little more failsafey over a say 10 hour audio chunk? Thanks a bunch its nice to have something to use to bring down the cost of online transcription services!

  15. walter says

    April 29, 2019 at 11:30 am

    Hi Alex,

    I am using a shorter version of the code on a single file:

    ##############
    import speech_recognition as sr

    r = sr.Recognizer()

    with open("api-key.json") as f:
        GOOGLE_CLOUD_SPEECH_CREDENTIALS = f.read()

    test_audio = sr.AudioFile('C://users//me//desktop//page2.wav')
    with test_audio as source:
        audio = r.record(source)

    r.recognize_google_cloud(audio, language='es-MX',
        credentials_json=GOOGLE_CLOUD_SPEECH_CREDENTIALS)
    ##############

    but I am getting two error messages for this snippet. The first is ModuleNotFoundError: No module named ‘oauth2client’. I have pip installed oauth2client as well as oauthlib and google auth.

    The second related error is:
    RequestError: missing google-api-python-client module: ensure that google-api-python-client is set up correctly.

    I haven’t been able to solve these issues despite troubleshooting at length. Do you have any idea how to fix this?

    • Alex Kras says

      April 29, 2019 at 11:35 am

      Sorry, no idea. Try using virtual environment if you haven’t already, and may be Python 2 instead of 3. You can control that with virtual environment as well.

      https://www.alexkras.com/how-to-use-virtualenv-in-python-to-install-packages-locally/

      This post is getting kind of old, may be it’s also a good time to check out Google official Python client, and see if it works better.

  16. Rashmil says

    April 12, 2019 at 11:30 am

    Hi Alex,

    First off, thank you so much for this code! Now, I don’t know if the below error is an issue from my side or GCloud is being messy, but I would love any help you and this community can provide. Here is my error –

    Traceback (most recent call last):
    File “C:\Python36\lib\site-packages\speech_recognition__init__.py”, line 930, in recognize_google_cloud
    response = request.execute()
    File “C:\Python36\lib\site-packages\oauth2client_helpers.py”, line 133, in positional_wrapper
    return wrapped(*args, **kwargs)
    File “C:\Python36\lib\site-packages\googleapiclient\http.py”, line 842, in execute
    raise HttpError(resp, content, uri=self.uri)
    googleapiclient.errors.HttpError:

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
    File “fast.py”, line 28, in
    all_text = pool.map(transcribe, enumerate(files))
    File “C:\Python36\lib\multiprocessing\pool.py”, line 260, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
    File “C:\Python36\lib\multiprocessing\pool.py”, line 608, in get
    raise self._value
    File “C:\Python36\lib\multiprocessing\pool.py”, line 119, in worker
    result = (True, func(*args, **kwds))
    File “C:\Python36\lib\multiprocessing\pool.py”, line 44, in mapstar
    return list(map(*args))
    File “fast.py”, line 21, in transcribe
    text = r.recognize_google_cloud(audio, credentials_json=GOOGLE_CLOUD_SPEECH_CREDENTIALS)
    File “C:\Python36\lib\site-packages\speech_recognition__init__.py”, line 932, in recognize_google_cloud
    raise RequestError(e)
    speech_recognition.RequestError:

    I’ve waited for 10 minutes after enabling the API and tried again, but no luck.

    Thanks in advance.
    Regards,
    Rashmil.

    • Alex Kras says

      April 12, 2019 at 5:34 pm

      Not sure, could be file formatting. Have you tried with sample files?

    • Ali Karaduman says

      December 7, 2019 at 6:59 am

      Hi Alex and Rashmil,
      Have you found any solution to this issue. I have the same issue and dont know how to proceed.
      Thanks in advance
      Best
      Ali

      • Ali Karaduman says

        December 7, 2019 at 7:19 am

        Hi Alex,
        After changing the sound file I had better results. Still if google.cloud could not recognize some parts of the audio an error pops. So is there any way to tell google client to ignore if some parts of the audio not clear.

  17. Helia says

    April 11, 2019 at 5:10 pm

    Thank you so much for providing this code. I would like to run the code for 100 audio file. How would that be possible?

    • Alex Kras says

      April 12, 2019 at 5:35 pm

      Not sure, I think if you look at the pull requests in the repo somebody automated file conversion (although I haven’t merged that in yet). From there you may be able to automate it further.

  18. Gideon Aswani says

    January 30, 2019 at 4:25 am

    Hi Alex, thanks for sharing your code. I managed to run it as it is and also used different mp3 audio files, which I converted to wav using Audacity. Works perfectly! I will trying using a microphone as an audio source.

    Once more many thanks.

    Gideon

  19. Huda says

    November 12, 2018 at 10:55 am

    Thank you for this grate work. I follow your steps, but I faced this error:
    “C:\Program Files (x86)\Python37-32\python.exe” C:/Users/hudad/PycharmProjects/speech-to-text-master/slow.py
    0%| | 0/3 [00:00<?, ?it/s]
    Traceback (most recent call last):
    File "C:\Users\hudad\AppData\Roaming\Python\Python37\site-packages\speech_recognition\__init__.py", line 885, in recognize_google_cloud
    try: json.loads(credentials_json)
    File "C:\Program Files (x86)\Python37-32\lib\json\__init__.py", line 348, in loads
    return _default_decoder.decode(s)
    File "C:\Program Files (x86)\Python37-32\lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    File "C:\Program Files (x86)\Python37-32\lib\json\decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
    json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
    File "C:/Users/hudad/PycharmProjects/speech-to-text-master/slow.py", line 19, in
    text = r.recognize_google_cloud(audio, credentials_json=GOOGLE_CLOUD_SPEECH_CREDENTIALS)
    File "C:\Users\hudad\AppData\Roaming\Python\Python37\site-packages\speech_recognition\__init__.py", line 886, in recognize_google_cloud
    except Exception: raise AssertionError("credentials_json must be None or a valid JSON string")
    AssertionError: credentials_json must be None or a valid JSON string

    Process finished with exit code 1

    Please help
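    For what it's worth, this particular assertion usually means credentials_json received something that is not the JSON text itself, for example a file path. A small helper, with a hypothetical file name, that reads and sanity-checks the service-account file first:

```python
import json

def load_credentials(path):
    """Return the service-account file's contents as a string.

    speech_recognition's credentials_json parameter expects the JSON
    text itself, not a path to the file; json.loads fails fast here
    if the file isn't valid JSON.
    """
    with open(path) as f:
        contents = f.read()
    json.loads(contents)  # raises json.JSONDecodeError on bad input
    return contents

# GOOGLE_CLOUD_SPEECH_CREDENTIALS = load_credentials("service-account.json")
```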

  20. Phillip says

    July 31, 2018 at 1:54 pm

    Luke, your last audio file is crashing the code because there is no speech to transcribe. Listen to your last file; if it is just music and no voice, delete it and it should work.

  21. theendurancehub says

    July 19, 2018 at 6:21 am

    Hey Alex,

    Thanks for putting together the comprehensive tutorial and code – I’ve managed to transcribe some of my own audio but am running into problems with other files.

    I have a collection of files, all of which I’m converting to mono @ 48000hz (doing this to remove variables for debugging) and then running through fast.py.

    The problem I’m encountering appears to occur when attempting to process the final 30s audio chunk in the ‘parts’ folder. For example, my current file has been split into 74 parts – all of which were successfully processed apart from #74.

    This is the traceback I’m getting:

    Traceback (most recent call last):
    File "fast.py", line 28, in
    all_text = pool.map(transcribe, enumerate(files))
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/pool.py", line 253, in map
    return self.map_async(func, iterable, chunksize).get()
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/multiprocessing/pool.py", line 572, in get
    raise self._value
    speech_recognition.UnknownValueError

    Do you have any suggestions why this might be the case?

    Unsure why it’s working fine for some files, but not for others.

    Thanks
    Luke

    • Phillip Hardy says

      July 31, 2018 at 1:32 pm

      Thanks again Alex for this code and your guide.

      I am having the same problem as Luke,

      Some files just keep getting the same error ^^

    • Phillip Hardy says

      July 31, 2018 at 1:47 pm

      Try listening to the last track. If there is no speech and just music, or audio without words, delete that track and try again.

  22. Norbert CORDIER says

    July 17, 2018 at 12:23 pm

    Very good job. Thank you.
    I tried your code for my country, France (World Champion ;=)). Excellent!
    Changes in fast.py:
    1/ text = r.recognize_google_cloud(audio_data=audio, credentials_json=GOOGLE_CLOUD_SPEECH_CREDENTIALS, language="fr-FR")
    2/ transcript = transcript + "{:0>2d}:{:0>2d}:{:0>2d} {}\n".format(h, m, s, t['text'].encode('utf8'))
    and it's OK to have the text in French.

  23. Milan Hazra says

    June 25, 2018 at 5:31 am

    Hi Alex,
    Your code is very helpful… can you tell me what the code would be to add punctuation at the end of each line?

    Please share…

    Regards,
    Milan
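    As far as I can tell, the speech_recognition wrapper used in this tutorial does not expose a punctuation option, but Google's own client library does: RecognitionConfig accepts an enable_automatic_punctuation flag. A sketch using the google.cloud.speech client follows; exact names vary by client version, so treat it as a pointer rather than drop-in code:

```python
from google.cloud import speech

client = speech.SpeechClient()
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    enable_automatic_punctuation=True,  # insert periods, commas, question marks
)
# response = client.recognize(config=config, audio=speech.RecognitionAudio(uri=...))
```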

    • Milan Hazra says

      June 28, 2018 at 3:31 am

      Hello… any update on my problem? Please share…

  24. Amitesh says

    June 12, 2018 at 5:36 am

    Hi,

    I am getting below error::

    "Sync input too long. For audio longer than 1 min use LongRunningRecognize with a 'uri' parameter."

    Which I understand is due to the length of the audio file (more than 1 min). I googled the error, and I got the suggestion mentioned in this link:

    https://stackoverflow.com/questions/44835522/why-does-my-python-script-not-recognize-speech-from-audio-file?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa

    The above link ultimately takes me to the sample code below.

    =======================
    def transcribe_gcs(gcs_uri):
        """Asynchronously transcribes the audio file specified by the gcs_uri."""
        from google.cloud import speech
        from google.cloud.speech import enums
        from google.cloud.speech import types
        client = speech.SpeechClient()

        audio = types.RecognitionAudio(uri=gcs_uri)
        config = types.RecognitionConfig(
            encoding=enums.RecognitionConfig.AudioEncoding.FLAC,
            sample_rate_hertz=16000,
            language_code='en-US')

        operation = client.long_running_recognize(config, audio)

        print('Waiting for operation to complete...')
        response = operation.result(timeout=90)

        # Each result is for a consecutive portion of the audio. Iterate through
        # them to get the transcripts for the entire audio file.
        for result in response.results:
            # The first alternative is the most likely one for this portion.
            print(u'Transcript: {}'.format(result.alternatives[0].transcript))
            print('Confidence: {}'.format(result.alternatives[0].confidence))
    =======================

    So does this mean I will have to re-write the code using a different set of modules, or can we adjust the ".long_running_recognize" function somewhere in your code?

    amitesh

  25. BW2013 says

    June 11, 2018 at 3:12 pm

    Hi Alex, does the Google Speech-to-Text API support multi-speaker recognition while transcribing? Also, does it output timestamps for each word or sentence as well? Sorry for shooting so many questions, but my final question is: does it have an offline version that one can use? Thanks.

    • Alex Kras says

      June 14, 2018 at 2:18 pm

      I don't know of a way to do this. There is an open GitHub issue if somebody wants to pitch in:
      https://github.com/akras14/speech-to-text/issues/1

  26. 123gamer110 says

    June 5, 2018 at 6:18 am

    Hello Alex, I tried to generate an API key and it says that I have to create a billing account, which requires credit card information. So, how does it work? Is it free? Do I need to pay to get the script to work? Thanks.

    • Alex Kras says

      June 5, 2018 at 8:08 am

      Yes, unfortunately a credit card is required to register, but they do offer a free tier, so you shouldn't be charged anything.

  27. amitesh sahay says

    April 26, 2018 at 2:36 am

    How can we use this Google API to convert streaming speech to text? What should our code look like?

  28. amitesh sahay says

    April 10, 2018 at 4:17 am

    Hello Alex,

    I am at the very early stage of this activity, i.e. I have installed all the libraries mentioned by you. I am using Windows 10 to perform the activity.

    I wanted to generate the API key, but I guess I need to pay for that, right? Second, I couldn't locate "API Manager" in the Google Cloud console. All I could see was 3 tiles.

    • Alex Kras says

      April 10, 2018 at 8:43 am

      I am not sure. You should be able to do it under the free trial. Re: the UI, maybe they redesigned it. Seems like other people were able to get it to work. I'll have to check it out later. If anybody knows, please comment.

      • Amitesh says

        April 10, 2018 at 10:52 pm

        Hi, finally I got the API key generated. I just had to browse around the website a bit more. Thank you.

  29. José Jara says

    March 15, 2018 at 6:37 pm

    The ffmpeg command "ffmpeg -i source/genevieve.wav -f segment -segment_time 30 -c copy parts/out%09d.wav" doesn't work when I try to run it:
    Guessed Channel Layout for Input Stream #0.0 : mono
    Input #0, wav, from 'source/genevieve.wav':
    Duration: 00:01:10.33, bitrate: 768 kb/s
    Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 48000 Hz, mono, s16, 768 kb/s
    [segment @ 0000021e48be0dc0] Opening 'parts/out000000000.wav' for writing
    [segment @ 0000021e48be0dc0] Failed to open segment 'parts/out000000000.wav'
    Could not write header for output file #0 (incorrect codec parameters ?): No such file or directory
    Stream mapping:
    Stream #0:0 -> #0:0 (copy)
    Last message repeated 1 times

    I don't know how to fix it or what I'm doing wrong.

    • theendurancehub says

      June 7, 2018 at 12:57 pm

      I’m seeing the same problem… did you find a solution?

    • theendurancehub says

      June 7, 2018 at 1:01 pm

      Hey José,

      The -c copy parts/out%09d.wav part of the command expects there to be a folder in the speech-to-text-master folder called "parts".

      Create this and the parts will be saved there!

  30. JohnDoe says

    February 11, 2018 at 9:56 pm

    Found a way to avoid breaking up a long audio file:
    1. Convert the audio file to FLAC (downmix from stereo to mono). Audacity can export to FLAC; make a note of the bitrate.
    2. Upload the FLAC file to Google Cloud Storage. Create a new bucket if need be; no need to make it public.
    3. Edit transcribe_async.py: find the bitrate setting for the FLAC and change it accordingly; also update the timeout value to 600 (10 min).
    4. Run the command: python transcribe_async.py gs://bucketname/filename.flac

  31. Juan Oliveros says

    February 5, 2018 at 5:26 pm

    Hello Alex, thank you very much for your collaboration.
    Alex, if I wanted to change the language of the API, for example, the parameter language_code='es-CO', where should I do it? Thank you

    • Alex Kras says

      February 5, 2018 at 9:41 pm

      I didn't have this use case, and I'm not sure that the 3rd party library I used supports this option.

      This example from Google might be helpful, but I did not try it myself: https://github.com/GoogleCloudPlatform/python-docs-samples/blob/c6fe72714517e6660bc758e6358623eea0a48608/speech/cloud-client/quickstart.py

    • José Jara says

      March 16, 2018 at 7:32 am

      did you manage to make it work?

    • Hans Maiers says

      May 9, 2018 at 6:36 am

      The tutorial is great, and it is working for me. Nevertheless, my audio files are also non-English. Have you found a solution for setting the language?

      • Jonas Verhoelen (@jverhoelen) says

        June 5, 2018 at 12:11 pm

        I managed to set the language. If you use the slow.py version, you could modify line 19, where the "recognize_google_cloud" function of the library is used, like this:

        text = r.recognize_google_cloud(audio, credentials_json=GOOGLE_CLOUD_SPEECH_CREDENTIALS, language="de-DE")

        See the documentation here: https://github.com/Uberi/speech_recognition/blob/master/reference/library-reference.rst#recognizer_instancerecognize_google_cloudaudio_data-audiodata-credentials_json-unionstr-none–none-language-str–en-us-preferred_phrases-unioniterablestr-none–none-show_all-bool–false—unionstr-dictstr-any

        Seems to work for me 🙂

  32. Chaitanya Téjaswi says

    January 27, 2018 at 2:05 pm

    Here’s something I tried. I already had WAV recordings I obtained from an MP3 Player.
    Hence, I decided to skip the MP3->WAV conversion step.
    I ran into multiple errors, mainly due to format inconsistency with the native WAV type.
    And so, I’m posting this.
    I’ve used “VOICE001.wav” as an example. It works well with MP3 inputs as well.
    For MP3, skip step 1.

    Converting to the right WAV format:

    1. Check your WAV file's properties:
    ffprobe VOICE001.wav
    # Input #0, wav, from 'VOICE001.wav':
    # Duration: 00:01:16.54, bitrate: 128 kb/s
    # Stream #0:0: Audio: adpcm_ima_wav ([17][0][0][0] / 0x0011), 32000 Hz, 1 channels, s16p, 128 kb/s

    2. Convert & replace the WAV file to the native type using Audacity, then check again:
    ffprobe VOICE001.wav
    # Input #0, wav, from 'VOICE001.wav':
    # Duration: 00:01:16.28, bitrate: 512 kb/s
    # Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 32000 Hz, 1 channels, s16, 512 kb/s

    3. For the remaining WAV files, use the native format details for conversion with ffmpeg:
    ffmpeg -i VOICE001.wav -acodec pcm_s16le -ar 32000 VOICE001-win.wav
    # Output #0, wav, to 'VOICE001-win.wav':
    # Metadata:
    #   ISFT : Lavf58.3.100
    # Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 32000 Hz, mono, s16, 512 kb/s
    # Metadata:
    #   encoder : Lavc58.9.100 pcm_s16le
    # size= 4768kB time=00:01:16.28 bitrate= 512.0kbits/s speed= 246x
    # video:0kB audio:4768kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.001598%

    * Here, the Audio Codec & Sampling Rate fields have been altered to fit the native format settings.

  33. Jamshed Kaikaus says

    January 23, 2018 at 10:39 am

    I tried using the code with the source files that you provided (genevieve.wav), however I get the following error:

    ValueError: Audio file could not be read as PCM WAV, AIFF/AIFF-C, or Native FLAC; check if file is corrupted or in another format

    I did not change any code. Any ideas on what I’m doing wrong here?

    • Alex Kras says

      January 23, 2018 at 1:06 pm

      Did you generate parts with ffmpeg?

      I just re-ran it fresh and it worked for me. I am using Python 3 on macOS.

      What system are you on, at what point does it fail?

      • Jamshed Kaikaus says

        January 23, 2018 at 1:41 pm

        I’m not sure exactly what I was doing wrong, but it works now. Sorry for the inconvenience.

    • Shawn says

      November 26, 2018 at 8:55 am

      Hi,

      Like @Jamshed, I’m getting that same error when I run on genevieve.wav : ValueError: Audio file could not be read as PCM WAV, AIFF/AIFF-C, or Native FLAC; check if file is corrupted or in another format

      It also includes this in the result: wave.Error: file does not start with RIFF id.

      I check the file:

      $ file out000000002.wav
      out000000002.wav: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 48000 Hz

      $ file -i out000000002.wav
      out000000002.wav: regular file

      $ mediainfo out000000002.wav
      General
      Complete name : out000000002.wav
      Format : Wave
      File size : 966 KiB
      Duration : 10 s 302 ms
      Overall bit rate mode : Constant
      Overall bit rate : 768 kb/s
      Writing application : Lavf56.36.100

      Audio
      Format : PCM
      Format settings : Little / Signed
      Codec ID : 1
      Duration : 10 s 302 ms
      Bit rate mode : Constant
      Bit rate : 768 kb/s
      Channel(s) : 1 channel
      Sampling rate : 48.0 kHz
      Bit depth : 16 bits
      Stream size : 966 KiB (100%)

      So I’m wondering if something is wrong with my ffmpeg install? Any advice appreciated, and thank you for sharing all this.

      • Alex Kras says

        November 26, 2018 at 8:58 am

        Sorry, not sure. I did mine on Mac OS, with ffmpeg installed via Homebrew. What is your set up?

        • Shawn says

          November 28, 2018 at 8:08 am

          I solved it. It seemed to be conflicting packages in my python install. I set up a fresh python3 environment, re-installed ffmpeg etc, and it works really really well now. Thanks!

  34. Daz says

    January 19, 2018 at 3:38 am

    Hi Alex

    One issue I found is that if the number of files in the parts folder exceeds the pool workers (say you have 20 files in the parts folder and you have pool = Pool(8)), only the first 8 files are processed in ORDER, and after that all remaining files in the parts folder are processed OUT of sequence. I tried a few things but it's still not working. It seems like even though the map function is supposed to keep the sort order, for some reason the order is only kept for the first 8 files.

    • Alex Kras says

      January 19, 2018 at 6:43 am

      Strange, what platform/python version are you using?

      • Daz says

        January 19, 2018 at 10:19 am

        Using AWS EC2 Amazon Linux, Python 3.6.
        I have a wav file of about 60 MB. I partition the file into 55 or 60 second chunks, which generates about 57 files in the parts folder. Using a pool size of 8, the first 8 files are in order; the remaining are all in mixed order.
        I tried to sort the list first and confirmed that it's in order, but after the first 8 files, the order is lost. Trying the Google async approach, but it's not working yet.

        • Alex Kras says

          January 19, 2018 at 10:44 am

          Reading over the code, I see that I am taking an extra step to sort by idx. So the only thing I can think of is that those ids come in the wrong order.

          Can you confirm that when you call os.listdir the files show up in the right order?

          • daz says

            January 19, 2018 at 11:10 am

            No, they are not, and what I had done was to apply a sort like: files = sorted(os.listdir('parts/')). If I don't use the sort, the entire transcript is all over the place, meaning the beginning of the wav file could be transcribed in the middle of the text and so on. Next I applied sorted(os.listdir('parts/')) and confirmed in the shell that all the files are sorted. Next I ran the script, and I confirmed that ONLY the first batch of the pool (in this case only the first 8 files) is ordered correctly; the next pool worker loses the sort again. Do you know what I mean?

            Here is the list dir without the sort:

            import os
            files = os.listdir('parts/')
            files
            ['0039.wav', '0048.wav', '0029.wav', '0007.wav', '0025.wav', '0013.wav', '0030.wav', '0020.wav', '0041.wav', '0016.wav', '0010.wav', '0037.wav', '0012.wav', '0017.wav', '0028.wav', '0044.wav', '0038.wav', '0009.wav', '0000.wav', '0024.wav', '0031.wav', '0022.wav', '0023.wav', '0045.wav', '0043.wav', '0036.wav', '0026.wav', '0018.wav', '0014.wav', '0003.wav', '0008.wav', '0005.wav', '0046.wav', '0002.wav', '0033.wav', '0042.wav', '0027.wav', '0011.wav', '0004.wav', '0040.wav', '0019.wav', '0001.wav', '0021.wav', '0032.wav', '0006.wav', '0015.wav', '0047.wav', '0034.wav', '0035.wav']

            Here is the list dir with the sort:

            import os
            files = sorted(os.listdir('parts/'))
            files
            ['0000.wav', '0001.wav', '0002.wav', '0003.wav', '0004.wav', '0005.wav', '0006.wav', '0007.wav', '0008.wav', '0009.wav', '0010.wav', '0011.wav', '0012.wav', '0013.wav', '0014.wav', '0015.wav', '0016.wav', '0017.wav', '0018.wav', '0019.wav', '0020.wav', '0021.wav', '0022.wav', '0023.wav', '0024.wav', '0025.wav', '0026.wav', '0027.wav', '0028.wav', '0029.wav', '0030.wav', '0031.wav', '0032.wav', '0033.wav', '0034.wav', '0035.wav', '0036.wav', '0037.wav', '0038.wav', '0039.wav', '0040.wav', '0041.wav', '0042.wav', '0043.wav', '0044.wav', '0045.wav', '0046.wav', '0047.wav', '0048.wav']

            But still, for some reason, only the first batch of the pool workers is in the right order in the transcript file; from 0009.wav onwards the transcript file is no longer in order.
            Even though the map function is supposed to keep the order.
            Strange

          • Alex Kras says

            January 19, 2018 at 11:29 am

            Even if map doesn't keep them sorted, this line: sorted(all_text, key=lambda x: x['idx']) should re-sort them back into order.

            Try to debug this sort/idx and see if something funky happens around there.
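            To see the re-sorting step in isolation, here is the same idea with made-up chunk results: even if results arrive out of order, sorting on the saved index restores the original sequence.

```python
# Simulated per-chunk results, deliberately out of order.
all_text = [
    {"idx": 2, "text": "third chunk"},
    {"idx": 0, "text": "first chunk"},
    {"idx": 1, "text": "second chunk"},
]

# Same idea as fast.py: sort on the index each worker carried along.
ordered = [t["text"] for t in sorted(all_text, key=lambda x: x["idx"])]
print(ordered)  # ['first chunk', 'second chunk', 'third chunk']
```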

          • ric says

            April 24, 2018 at 10:51 am

            I am having the same problem as daz… I added the sort also and it is not sorting correctly (on the fast version).

            I am testing the slow (unthreaded) version to see if it is the threading that is causing the ordering problem.

            files = sorted(os.listdir('parts/'))

            parts/out0000.wav started
            parts/out0002.wav started
            parts/out0006.wav started
            parts/out0010.wav started
            parts/out0014.wav started
            parts/out0008.wav started
            parts/out0004.wav started

  35. Stephen Hopkins says

    January 17, 2018 at 9:35 pm

    The limitation for 60 seconds only applies to synchronous requests (https://cloud.google.com/speech/quotas). Is there a reason you didn’t use an asynchronous request rather than splitting up the file?

    • Alex Kras says

      January 18, 2018 at 9:06 am

      I just didn't know that was an option. Thanks for the tip, I'll have to investigate. Maybe it was just a limitation of the library I was using.

    • Daz says

      January 19, 2018 at 4:49 am

      Tried the Google async example but it fails halfway through. Do you have a working example with the Google async to convert a wav file to text?

      Thanks

  36. daz says

    January 15, 2018 at 8:35 pm

    Is there a way to overcome the 30 sec limitation where I can do the whole file in one try? Or, if I have to break up the file, would it be possible to have the transcript numbered? Like if the input wave files are wave01.wav and wave02.wav, the output would be transcript0102.txt? Thanks for the great script.

    • Alex Kras says

      January 15, 2018 at 8:56 pm

      Sorry, I don't think I follow. I believe it already does both; the final transcript is one text file.

      • daz says

        January 16, 2018 at 7:00 am

        Here is the use case: I have multiple wav files: Alex.wav, Vida.wav, Jim.wav. I would like to modify the program such that it reads the inputwav folder containing all the wav files (alex.wav, vida.wav, jim.wav) and runs them through the Python program to output alex_transcript.txt, vida_transcript.txt, jim_transcript.txt. But I am having difficulty getting it to work, so I ran each file individually. Thanks Alex

        • Alex Kras says

          January 16, 2018 at 7:59 am

          Ah, I see. Yes, then it goes back to figuring out a way to convert the file to a proper wav programmatically and then calling the split files command (and probably adding a clean-up step later).

          I didn’t get this far.

          Another idea that I didn't get to is splitting the file by silence around the 30 second mark, instead of a hard 30 second split, which can cut mid-sentence/word.

          Good luck! Let me know if you figure any of this out.

  37. Anirudh says

    January 10, 2018 at 2:28 am

    "ffmpeg -i input.mp3 output.wav" converts the mp3 file to a wav file without any compression.
    It is better to have a command do the task instead of a new piece of software if we are automating a task.

    • Alex Kras says

      January 10, 2018 at 5:57 am

      Unfortunately something was off about this type of wav, which I did not dig into. Transcription did not work with a wav created like this. Maybe it was just something local to my Mac.

      • daz says

        January 16, 2018 at 7:01 am

        I tried the same thing, but for some reason I think it read the wav file backwards, meaning it starts transcribing from the end of the file. Thanks Alex for pointing this out. I'll go back to using Audacity.

      • xartle says

        July 22, 2019 at 3:48 pm

        Thanks for writing all this up! It's been super helpful. Not sure if it's still an issue, but I had the same problem. It seems like ffmpeg ignores the format when you're doing the segmentation. Running it in two steps works for me, though there was probably a better way to actually fix the problem: ffmpeg -i db/foo.m4a -c:a pcm_s16le db/stage1.wav

