One Day Builds: Phase Vocoder
Introduction:
I'm a fan of Adam Savage's "One Day Builds" over at Tested.com [1], and often binge-watch multiple episodes on weekend mornings. I realized that I have a habit of my own "one day builds" in the computer-coding world, and figured I'd go ahead and write the phase vocoder I'd been thinking about, and catalogue the journey for the benefit of whomever...
Motivation:
I have some students working on a project to better understand how pitch-correction utilities such as Melodyne or Auto-Tune work. From some cursory checking around on the internet [2]-[4], it seems that the "phase vocoder" algorithm is the bread and butter of these sorts of utilities; the basics are not hard to implement, but the fine points of "making it sound good" are where the "special proprietary sauce" comes in.
Purpose:
"Why write your own code, when existing utilities already do this?" Why build anything of your own? There is a joy in making. Furthermore, I find that building my own really helps me to understand the workings of something in a way which merely reading about it can't. An example of this is the compressor model I built [4], which in the process of building, cleared up many of my misperceptions about how compressors work.
Standards:
I don't own Melodyne, but I do have Logic and Pro Tools, both of which do a "great" job of time-stretching and/or pitch-correction. "Great" in comparison to what I'm likely to build, anyway. There is also the phase vocoder in the librosa library [6] of Python audio utilities, which I'll be testing against. And yes, we'll be using Python. It's not hard to find MATLAB code out there for phase vocoding [7], but again, I don't have MATLAB. If I stick with Python, I get a host of sound libraries & FFT libraries, and could even include the result in my SHAART utility [8]. So, I intend to compare my results with librosa's; I expect librosa will produce cleaner results, but I'm curious how close I can get.
Procedure:
The phase vocoder algorithm is a combination of time-stretching and resampling, which together can achieve time-stretching by itself, pitch-shifting by itself, or both at once. We'll do the time-stretching first; pitch-shifting can then be achieved by resampling the time-stretched signal.
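To make the division of labor concrete, here's a little sketch of my own (the helper names time_stretch() and resample_to() are hypothetical placeholders for what we build below, not real library calls). Raising the pitch by a factor f without changing the duration amounts to time-stretching by f and then resampling the result back down to its original number of samples:

    # Illustration only -- time_stretch() and resample_to() are hypothetical
    # placeholders for the routines developed later in this post.
    f = 1.5                                          # e.g. pitch up a perfect fifth
    stretched = time_stretch(x, f)                   # ~1.5x as long, same pitch
    shifted = resample_to(stretched, int(len(stretched) / f))
    # 'shifted' has (roughly) the original length again; played back at the
    # original sample rate, every frequency comes out a factor of 1.5 higher.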
We're going to read in a WAV file, run it through a Short-Time Fourier Transform (STFT), and then regenerate the time-series data using an Inverse STFT (ISTFT), only with some extra spacing between the resynthesis frames.
For the STFT, we'll turn to the wondrous place that is StackExchange, and check out user Basj's code [9]...
    def basj_stft(x, fftsize=2048, overlap=4):
        hop = fftsize / overlap
        w = scipy.hanning(fftsize+1)[:-1]    # better reconstruction with this trick +1)[:-1]
        return np.array([np.fft.rfft(w*x[i:i+fftsize])
                         for i in range(0, len(x)-fftsize, hop)])

    def basj_istft(X, overlap=4):
        fftsize = (X.shape[1]-1)*2
        hop = fftsize / overlap
        w = scipy.hanning(fftsize+1)[:-1]
        x = scipy.zeros(X.shape[0]*hop)
        wsum = scipy.zeros(X.shape[0]*hop)
        for n, i in enumerate(range(0, len(x)-fftsize, hop)):
            x[i:i+fftsize] += scipy.real(np.fft.irfft(X[n])) * w   # overlap-add
            wsum[i:i+fftsize] += w ** 2.
        pos = np.where(wsum != 0)
        x[pos] /= wsum[pos]
        return x
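As a quick sanity check (my addition, not part of Basj's post), we can round-trip a test tone through these two routines; wherever the summed window is nonzero, the reconstruction should match the input to within round-off:

    # Quick round-trip test of basj_stft / basj_istft
    # (assumes the code above and its numpy/scipy imports are loaded)
    import numpy as np
    fs = 44100
    t = np.arange(fs) / float(fs)
    x = np.sin(2*np.pi*440.0*t)             # one second of A440
    y = basj_istft(basj_stft(x))
    n = len(y) - 2048                       # ignore the uncovered tail of the output
    print(np.max(np.abs(x[:n] - y[:n])))    # should be round-off-level tiny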
Ok, so all we'll do now is modify the Inverse Short-Time Fourier Transform (ISTFT) so that it spaces out its resynthesis frames in time...
    def my_istft(X, scale_t, overlap=4):
        fftsize = (X.shape[1]-1)*2
        hop = int(fftsize / overlap * scale_t)   # space the resynthesis frames out by scale_t
        w = scipy.hanning(fftsize+1)[:-1]
        x = scipy.zeros(X.shape[0]*hop)
        wsum = scipy.zeros(X.shape[0]*hop)
        for n, i in enumerate(range(0, len(x)-fftsize, hop)):
            x[i:i+fftsize] += scipy.real(np.fft.irfft(X[n])) * w   # overlap-add
            wsum[i:i+fftsize] += w ** 2.
        pos = np.where(wsum != 0)
        x[pos] /= wsum[pos]
        return x
With all that said, here's our first time-stretching code, and believe me, it sounds TERRIBLE...
    #!/usr/bin/env python
    # vocoder.py:  Scale an input waveform by amounts in time & frequency (independently)

    import sys
    import scipy.io.wavfile as wavfile
    import scipy.signal as signal
    import scipy
    import numpy as np
    import pylab as pl
    import math

    # stft and istft code taken from user Basj's post at http://stackoverflow.com/a/20409020
    def basj_stft(x, fftsize=2048, overlap=4):
        hop = fftsize / overlap
        w = scipy.hanning(fftsize+1)[:-1]    # better reconstruction with this trick +1)[:-1]
        return np.array([np.fft.rfft(w*x[i:i+fftsize])
                         for i in range(0, len(x)-fftsize, hop)])

    def basj_istft(X, overlap=4):
        fftsize = (X.shape[1]-1)*2
        hop = fftsize / overlap
        w = scipy.hanning(fftsize+1)[:-1]
        x = scipy.zeros(X.shape[0]*hop)
        wsum = scipy.zeros(X.shape[0]*hop)
        for n, i in enumerate(range(0, len(x)-fftsize, hop)):
            x[i:i+fftsize] += scipy.real(np.fft.irfft(X[n])) * w   # overlap-add
            wsum[i:i+fftsize] += w ** 2.
        pos = np.where(wsum != 0)
        x[pos] /= wsum[pos]
        return x

    def my_istft(X, scale_t, overlap=4):
        fftsize = (X.shape[1]-1)*2
        hop = int(fftsize / overlap * scale_t)   # space the resynthesis frames out by scale_t
        w = scipy.hanning(fftsize+1)[:-1]
        x = scipy.zeros(X.shape[0]*hop)
        wsum = scipy.zeros(X.shape[0]*hop)
        for n, i in enumerate(range(0, len(x)-fftsize, hop)):
            x[i:i+fftsize] += scipy.real(np.fft.irfft(X[n])) * w   # overlap-add
            wsum[i:i+fftsize] += w ** 2.
        pos = np.where(wsum != 0)
        x[pos] /= wsum[pos]
        return x

    #----------- parse command line arguments and read wav file into memory -----
    if len(sys.argv) < 4:
        sys.exit('Usage: %s <wav_filename> <time_scaling> <freq_scaling>' % sys.argv[0])
    wav_filename = sys.argv[1]
    scale_t = np.float(sys.argv[2])
    scale_f = np.float(sys.argv[3])

    print "reading file ", wav_filename, "..."
    samplerate, data = wavfile.read(wav_filename)
    if (len(data.shape) > 1):        # take left channel of stereo track
        data = data[:,0]

    # convert integer data to floats, and scale by max val
    sig = 1.0*data
    maxval = np.amax(sig)
    print "maxval = ", maxval
    sig = sig / maxval
    #--------------------- end of preparing input file --------------------

    sigSTFT = basj_stft(sig)
    sig2 = my_istft(sigSTFT, scale_t)

    # save output file
    newname = wav_filename.replace(".wav", "_out.wav")
    print "saving output file ", newname
    wavfile.write(newname, samplerate, sig2)
The output sounds really "phasey." A sample Led Zeppelin clip shifted in frequency by a multiplicative factor of 1.5 (by running "./vocoder.py theocean_short.wav 1.0 1.5") yields this audio clip.
The above code is a "vocoder," but it's not a "phase vocoder," because we made no attempt to make the phases line up between the resynthesized frames.
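For reference, here is a minimal sketch of my own (loosely following the standard phase-propagation step described by Laroche & Dolson [10], not the code we end up with below) of what "making the phase line up" involves: each bin keeps a running phase, which is advanced every frame by that bin's expected phase increment plus its measured deviation, and the frames are rebuilt from the original magnitudes and the accumulated phase.

    import numpy as np

    def phase_propagate(X, hop_a, hop_s):
        """Sketch: rebuild STFT frames X (analysis hop hop_a) so that phases stay
        coherent when the frames are overlap-added with a different hop hop_s."""
        fftsize = (X.shape[1] - 1) * 2                               # rfft bins -> FFT size
        omega = 2*np.pi * np.arange(X.shape[1]) * hop_a / fftsize    # expected advance per analysis hop
        phase = np.angle(X[0])
        X2 = np.empty_like(X)
        X2[0] = np.abs(X[0]) * np.exp(1j*phase)
        for n in range(1, X.shape[0]):
            dphi = np.angle(X[n]) - np.angle(X[n-1]) - omega         # deviation from the expected advance
            dphi -= 2*np.pi * np.round(dphi / (2*np.pi))             # wrap to [-pi, pi]
            phase += (omega + dphi) * (float(hop_s) / hop_a)         # true per-hop advance, rescaled to the synthesis hop
            X2[n] = np.abs(X[n]) * np.exp(1j*phase)
        return X2

In principle, frames rebuilt this way can be overlap-added at the new hop without the "phasey" smearing heard above.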
At this point it's probably worth it to try out the librosa example to hear a 'good' phase vocoder. So we'll add the following bits of code in the appropriate place(s)...
    import librosa

    vocoder_type = 'librosa'          # 'mine' or 'librosa'

    if ('librosa' == vocoder_type):
        sig2 = librosa.effects.time_stretch(sig, 1.0/scale_t)   # time_stretch actually time-shrinks!
        half_steps = 12 * np.log2(scale_f)
        sig2 = librosa.effects.pitch_shift(sig2, samplerate, n_steps=half_steps)
    else:
        # ...do all my stuff
...Yeah, that works great, as expected.
It turns out that it's better to stretch the STFT in time first and fill in the missing parts via interpolation, as described in the aforementioned MATLAB code [7]. With that, we get much cleaner results, provided we take care to increment the phase. (This is the "phase" part of "phase vocoder.")
In the following "full" piece of code, we can run my original vocoder, librosa's version, or my "new" vocoder:
    #!/usr/bin/env python
    # vocoder.py:  Scale an input waveform by amounts in time & frequency (independently)

    import sys
    import scipy.io.wavfile as wavfile
    import scipy.signal as signal
    import scipy
    import numpy as np
    import pylab as pl
    import math
    import librosa

    # stft and istft code taken from user Basj's post at http://stackoverflow.com/a/20409020
    def basj_stft(x, fftsize=2048, overlap=4):
        hop = fftsize / overlap
        w = scipy.hanning(fftsize+1)[:-1]    # better reconstruction with this trick +1)[:-1]
        return np.array([np.fft.rfft(w*x[i:i+fftsize])
                         for i in range(0, len(x)-fftsize, hop)])

    def basj_istft(X, overlap=4):
        fftsize = (X.shape[1]-1)*2
        hop = fftsize / overlap
        w = scipy.hanning(fftsize+1)[:-1]
        x = scipy.zeros(X.shape[0]*hop)
        wsum = scipy.zeros(X.shape[0]*hop)
        for n, i in enumerate(range(0, len(x)-fftsize, hop)):
            x[i:i+fftsize] += scipy.real(np.fft.irfft(X[n])) * w   # overlap-add
            wsum[i:i+fftsize] += w ** 2.
        pos = np.where(wsum != 0)
        x[pos] /= wsum[pos]
        return x

    def my_istft(X, scale_t, overlap=4):
        fftsize = (X.shape[1]-1)*2
        hop = int(fftsize / overlap * scale_t)   # space the resynthesis frames out by scale_t
        w = scipy.hanning(fftsize+1)[:-1]
        x = scipy.zeros(X.shape[0]*hop)
        wsum = scipy.zeros(X.shape[0]*hop)
        for n, i in enumerate(range(0, len(x)-fftsize, hop)):
            x[i:i+fftsize] += scipy.real(np.fft.irfft(X[n])) * w   # overlap-add
            wsum[i:i+fftsize] += w ** 2.
        pos = np.where(wsum != 0)
        x[pos] /= wsum[pos]
        return x

    def stretch_stft(X, scale_t):
        # time stretching: stretches the stft via linear interp.
        #    X = input STFT
        #    scale_t = time-stretching scale factor
        n = len(X)
        n2 = int(round(n*scale_t))
        X2 = np.zeros((n2, len(X[0])), dtype=complex)
        phase_counter = np.angle(X[0])
        for i2 in range(n2-2):                    # i2 counts along new spectrogram
            i = 1.0 * i2 / scale_t                # i is the "fractional index" on the original stft
            ibgn = int(i)
            di = i - ibgn                         # goes between 0 and 1
            mag = (1.0-di)*np.abs(X[ibgn,:]) + di*np.abs(X[ibgn+1,:])   # linear interpolation
            X2[i2,:] = mag * np.exp(1.j * phase_counter)                # resynthesize
            dphi = np.angle(X[ibgn+1,:]) - np.angle(X[ibgn,:])          # phase change per frame
            #dphi = dphi * scale_t   # <--- Adding this makes it worse!  (scale phase diff with time stretch)
            phase_counter = phase_counter + dphi  # compute phase for next frame
        X2[n2-1] = X[n-1]
        return X2

    #----------- parse command line arguments and read wav file into memory -----
    if len(sys.argv) < 4:
        sys.exit('Usage: %s <wav_filename> <time_scaling> <freq_scaling>' % sys.argv[0])
    wav_filename = sys.argv[1]
    scale_t = np.float(sys.argv[2])
    scale_f = np.float(sys.argv[3])

    print "reading file ", wav_filename, "..."
    samplerate, data = wavfile.read(wav_filename)
    if (len(data.shape) > 1):        # take left channel of stereo track
        data = data[:,0]

    # convert integer data to floats, and scale by max val
    sig = 1.0*data
    maxval = np.amax(sig)
    print "maxval = ", maxval
    sig = sig / maxval
    #--------------------- end of preparing input file --------------------

    vocoder_type = 'mynew'           # 'librosa', 'mynew', 'mine'
    if ('librosa' == vocoder_type):
        print 'Using librosa vocoder'
        sig2 = librosa.effects.time_stretch(sig, 1.0/scale_t)
        half_steps = 12 * np.log2(scale_f)
        sig2 = librosa.effects.pitch_shift(sig2, samplerate, n_steps=half_steps)
    elif ('mynew' == vocoder_type):
        print 'Using my new vocoder'
        sigSTFT = basj_stft(sig)
        print '   Stretching the STFT....'
        stretched_STFT = stretch_stft(sigSTFT, scale_t)
        print "resynthesizing via stft..."
        sig2 = basj_istft(stretched_STFT)
    else:
        print 'Using my vocoder'
        sigSTFT = basj_stft(sig)
        sig2 = my_istft(sigSTFT, scale_t)

    # save output file
    newname = wav_filename.replace(".wav", "_out.wav")
    print "saving output file ", newname
    wavfile.write(newname, samplerate, sig2)
The librosa output sounds the best, and I wish I understood why scaling "dphi" (in my stretch_stft code) by the time-scaling amount actually makes things worse instead of better...
Moving on, we need to resample "sig2" if we want pitch-shifting. Now, scipy has a resample() routine (scipy.signal.resample), but in my experience it's horrendously slow unless you're careful to make sure the number of samples is a power of 2. Hence the routine my_resample, which zero-pads the input array out to a power-of-two length as needed:
    def my_resample(x, newlen):
        method = 0
        if (0 == method):
            # pad signal such that its length is a power of 2 = much faster
            orig_len = len(x)
            p2_len = int(math.pow(2, math.ceil(math.log(orig_len)/math.log(2))))
            x3 = np.zeros(p2_len)
            x3[0:orig_len-1] = x[0:orig_len-1]
            x2 = signal.resample(x3, int(newlen*p2_len/orig_len))
            x2 = x2[0:newlen-1]
        else:
            # crude fallback: decimate by striding through the input
            num = len(x)
            stride = int(num / newlen)
            x2 = np.zeros(newlen)
            for i2 in range(0, newlen):
                i = i2*stride
                x2[i2] = x[i]
        return x2
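As a quick check (my addition, assuming the script's imports and my_resample are in scope): resampling a tone to 2/3 of its length should raise its apparent pitch by a factor of 3/2 when played back at the original rate.

    import numpy as np
    fs = 44100
    t = np.arange(fs) / float(fs)
    tone = np.sin(2*np.pi*440.0*t)                   # one second of A440
    shorter = my_resample(tone, int(len(tone)*2/3))
    # played back at fs, 'shorter' lasts ~0.67 s and sounds like ~660 Hz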
With that, we just add a final call to my_resample, and include the pitch scaling in the call to the time-stretching routine, and we're done. Here's the relevant section:
    elif ('mynew' == vocoder_type):
        print 'Using my new vocoder'
        sigSTFT = basj_stft(sig)
        print '   Stretching the STFT....'
        stretched_STFT = stretch_stft(sigSTFT, scale_t * scale_f)   # stretch by the product of both factors
        print "resynthesizing via stft..."
        sig2 = basj_istft(stretched_STFT)
        print "resampling as per frequency scaling"
        newlen = int(len(sig2) / scale_f)
        sig2 = my_resample(sig2, newlen)
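So, for example, running "./vocoder.py theocean_short.wav 2.0 1.0" should give a clip twice as long at the same pitch, while "./vocoder.py theocean_short.wav 1.0 1.5" should keep the original length and shift everything up by a factor of 1.5 (a perfect fifth).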
If you want to download the full code, here's a link.
Enjoy. My day's over. :-)
-Dr H
References:
1. Adam Savage's "One Day Builds" at Tested: http://www.tested.com
2. Vague: http://theproaudiofiles.com/vocal-tuning-pitch-correction/
3. I like this one; describes the steps via pictures & non-math: https://en.m.wikipedia.org/wiki/Audio_time-scale/pitch_modification#Untangling_phase_and_time
4. Another nice description without calculus, has pictures: http://www.guitarpitchshifter.com/algorithm.html
5. My compressor model: http://www.scotthawley.com/compressor/
6. librosa: https://github.com/bmcfee/librosa
7. MATLAB phase vocoder: http://www.ee.columbia.edu/ln/rosa/matlab/pvoc/
8. SHAART: http://hedges.belmont.edu/~shawley/SHAART/
9. Basj's STFT/ISTFT Python code: http://stackoverflow.com/a/20409020
10. Laroche & Dolson phase vocoder article: http://www.ee.columbia.edu/~dpwe/papers/LaroD99-pvoc.pdf