Sunday, April 03, 2016

One Day Builds: Phase Vocoder

Introduction:
I'm a fan of Adam Savage's "One Day Builds" over at Tested.com [1], and often binge-watch multiple episodes on weekend mornings.  I realized that I have a habit of my own "one day builds" in the computer-coding world, and figured I'd go ahead and write the phase vocoder I'd been thinking about, and catalogue the journey for the benefit of whomever...

Motivation:
I have some students working on a project to better understand how pitch-correction utilities such as Melodyne or AutoTune work.  From some cursory checking around on the internet [2]-[4], it seems that the "phase vocoder" algorithm is the bread and butter of these sorts of utilities; the basics are not hard to implement, but the fine points of making it sound good are where the proprietary "special sauce" comes in.


Purpose:
"Why write your own code, when existing utilities already do this?"   Why build anything of your own?   There is a joy in making.  Furthermore, I find that building my own really helps me to understand the workings of something in a way which merely reading about it can't.  An example of this is the compressor model I built [4], which in the process of building, cleared up many of my misperceptions about how compressors work.

Standards:
I don't own Melodyne, but I do have Logic and Pro Tools, both of which do a "great" job of time-stretching and/or pitch-shifting.  "Great" in comparison to what I'm likely to build.  There is also the phase vocoder in the librosa library [6] of Python audio utilities, which I'll be using to test against.  And yes, we'll be using Python.  It's not hard to find MATLAB code out there for phase vocoding [7], but again, I don't have MATLAB.  If I stick with Python, I get a host of sound libraries & FFT libraries, and could even include the result in my SHAART utility [8].  So, I intend to compare my results with librosa's; I expect librosa will produce cleaner results, but I'm curious how close I can get.


Procedure:
The phase vocoder algorithm is a combination of time-stretching and resampling, which can achieve either time-stretching by itself, pitch-shifting by itself, or both together.  We'll do the time-stretching first; pitch-shifting can then be achieved by resampling the time-stretched signal.  (For example: stretch a clip to 1.5x its duration at constant pitch, then resample it so it plays back 1.5x faster, and you're back at the original duration with every frequency raised by a factor of 1.5.)
We're going to read in a WAV file, run it through a Short-Time Fourier Transform (STFT), and then regenerate the time-series data using an Inverse STFT (ISTFT), only with some extra spacing between the resynthesis frames.

For the STFT, we'll turn to the wondrous place that is StackExchange, and check out user Basj's code [9]...

def basj_stft(x, fftsize=2048, overlap=4):   
    hop = fftsize / overlap
    w = scipy.hanning(fftsize+1)[:-1]      # periodic window via the +1)[:-1] trick, for better reconstruction
    return np.array([np.fft.rfft(w*x[i:i+fftsize]) for i in range(0, len(x)-fftsize, hop)])

def basj_istft(X, overlap=4):   
    fftsize=(X.shape[1]-1)*2
    hop = fftsize / overlap
    w = scipy.hanning(fftsize+1)[:-1]
    x = scipy.zeros(X.shape[0]*hop)
    wsum = scipy.zeros(X.shape[0]*hop) 
    for n,i in enumerate(range(0, len(x)-fftsize, hop)): 
        x[i:i+fftsize] += scipy.real(np.fft.irfft(X[n])) * w   # overlap-add
        wsum[i:i+fftsize] += w ** 2.
    pos = np.where( wsum != 0 )
    x[pos] /= wsum[pos]
    return x
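As a quick sanity check (my own test, not from Basj's post): a round trip through these two functions should reproduce the interior of the signal almost exactly, with only the very ends differing, since the frames don't fully cover them.  Assuming a mono float signal sig at least a few seconds long:

X = basj_stft(sig)
sig_rec = basj_istft(X)
n = len(sig_rec)
err = np.max(np.abs(sig_rec[2048:n-2048] - sig[2048:n-2048]))
print "round-trip error:", err    # should be down near machine precision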

Ok, so all we'll do now is modify the Inverse Short-Time Fourier Transform (ISTFT) so that it spaces out its resynthesis frames in time...


def my_istft(X, scale_t, overlap=4):   
    fftsize=(X.shape[1]-1)*2
    hop = int( fftsize / overlap * scale_t )   # scale_t spreads the resynthesis frames out in time
    w = scipy.hanning(fftsize+1)[:-1]
    x = scipy.zeros(X.shape[0]*hop)
    wsum = scipy.zeros(len(x))   # window-power accumulator, same length as x
    for n,i in enumerate(range(0, len(x)-fftsize, hop)): 
        x[i:i+fftsize] += scipy.real(np.fft.irfft(X[n])) * w   # overlap-add
        wsum[i:i+fftsize] += w ** 2.
    pos = np.where( wsum != 0 )
    x[pos] /= wsum[pos]
    return x
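So, assuming we've already read a mono signal into sig (the full script below does this), a naive 2x slowdown is just:

sig2 = my_istft(basj_stft(sig), 2.0)   # 2x time-stretch, no phase handling yet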

With all that said, here's our first time-stretching code, and believe me it sounds TERRIBLE...



#!/usr/bin/env python
# vocoder.py: Scale an input waveform by amounts in time & frequency (independently)

import sys
import scipy.io.wavfile as wavfile
import scipy.signal as signal
import scipy
import numpy as np
import pylab as pl
import math

# stft and istft code taken from user Basj's post at http://stackoverflow.com/a/20409020
def basj_stft(x, fftsize=2048, overlap=4):   
    hop = fftsize / overlap
    w = scipy.hanning(fftsize+1)[:-1]      # periodic window via the +1)[:-1] trick, for better reconstruction
    return np.array([np.fft.rfft(w*x[i:i+fftsize]) for i in range(0, len(x)-fftsize, hop)])

def basj_istft(X, overlap=4):   
    fftsize=(X.shape[1]-1)*2
    hop = fftsize / overlap
    w = scipy.hanning(fftsize+1)[:-1]
    x = scipy.zeros(X.shape[0]*hop)
    wsum = scipy.zeros(X.shape[0]*hop) 
    for n,i in enumerate(range(0, len(x)-fftsize, hop)): 
        x[i:i+fftsize] += scipy.real(np.fft.irfft(X[n])) * w   # overlap-add
        wsum[i:i+fftsize] += w ** 2.
    pos = np.where( wsum != 0 )
    x[pos] /= wsum[pos]
    return x


def my_istft(X, scale_t, overlap=4):   
    fftsize=(X.shape[1]-1)*2
    hop = int( fftsize / overlap * scale_t )   # scale_t spreads the resynthesis frames out in time
    w = scipy.hanning(fftsize+1)[:-1]
    x = scipy.zeros(X.shape[0]*hop)
    wsum = scipy.zeros(len(x))   # window-power accumulator, same length as x
    for n,i in enumerate(range(0, len(x)-fftsize, hop)): 
        x[i:i+fftsize] += scipy.real(np.fft.irfft(X[n])) * w   # overlap-add
        wsum[i:i+fftsize] += w ** 2.
    pos = np.where( wsum != 0 )
    x[pos] /= wsum[pos]
    return x


#----------- parse command line arguments and read wav file into memory -----
if len(sys.argv) < 4:
    sys.exit('Usage: %s <wav_filename> <time_scaling> <freq_scaling>' % sys.argv[0])
wav_filename = sys.argv[1]
scale_t = np.float( sys.argv[2] )
scale_f = np.float( sys.argv[3] )

print "reading file ",wav_filename,"..."
samplerate, data = wavfile.read(wav_filename)
if (len(data.shape) > 1):    # take left channel of stereo track
    data = data[:,0]

#convert integer data to floats, and scale by max val
sig = 1.0*data
maxval = np.amax(sig)
print "maxval = ",maxval
sig = sig / maxval
#--------------------- end of preparing input file -------------------- 

 

sigSTFT = basj_stft( sig )
sig2 = my_istft( sigSTFT, scale_t )
 

#save output file
newname = wav_filename.replace(".wav","_out.wav")
print "saving output file ",newname
wavfile.write(newname,samplerate,sig2)



The output sounds really "phasey."   A sample Led Zeppelin clip shifted in frequency by a multiplicative factor of 1.5  (by running "./vocoder.py theocean_short.wav 1.0 1.5") yields this audio clip.

The above code is a "vocoder," but it's not a "phase vocoder," because we made no attempt to line up the phases between the resynthesized frames.
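To see why that matters, here's a toy illustration (mine, not from any of the references): overlap-add two identical windowed sinusoid grains at a spacing that isn't a whole number of periods, and they partially cancel where they overlap:

fs, f0 = 44100, 441.0                  # 441 Hz has a period of exactly 100 samples
t = np.arange(400)
grain = np.hanning(400) * np.sin(2*np.pi*f0*t/fs)
out = np.zeros(650)
out[0:400]   += grain                  # frame 1 starts at sample 0
out[250:650] += grain                  # frame 2 starts 2.5 periods later
# In the overlap (samples 250-399) the two grains are 180 degrees out of
# phase, so the sum dips toward zero instead of sustaining.  That kind of
# dip-and-recover at every frame boundary is the "phasey" sound.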

At this point it's probably worth trying out the librosa example to hear a 'good' phase vocoder.  So we'll add the following bits of code in the appropriate place(s)...

import librosa

vocoder_type = 'librosa'   # 'mine' or 'librosa'

if ('librosa' == vocoder_type):
    sig2 = librosa.effects.time_stretch(sig, 1.0/scale_t)  # time_stretch actually time-shrinks!
    half_steps = 12 * np.log2( scale_f )   # e.g. scale_f = 1.5 -> about 7.02 half-steps
    sig2 = librosa.effects.pitch_shift(sig2, samplerate, n_steps=half_steps)

else:
    pass   # ...do all my original stuff here (as above)
...Yeah, that works great.  As expected.

Turns out that it's better to stretch the STFT in time first, filling in the missing frames via interpolation, as described in the aforementioned MATLAB code [7].  With that we get much cleaner results, provided we take care to increment the phase from frame to frame.  (This is the "phase" part of "phase vocoder.")

In the following "full" piece of code, we can run my original vocoder, librosa's version, or my "new" vocoder:

#!/usr/bin/env python
# vocoder.py: Scale an input waveform by amounts in time & frequency (independently)

import sys
import scipy.io.wavfile as wavfile
import scipy.signal as signal
import scipy
import numpy as np
import pylab as pl
import math
import librosa


# stft and istft code taken from user Basj's post at http://stackoverflow.com/a/20409020
def basj_stft(x, fftsize=2048, overlap=4):   
    hop = fftsize / overlap
    w = scipy.hanning(fftsize+1)[:-1]      # periodic window via the +1)[:-1] trick, for better reconstruction
    return np.array([np.fft.rfft(w*x[i:i+fftsize]) for i in range(0, len(x)-fftsize, hop)])

def basj_istft(X, overlap=4):   
    fftsize=(X.shape[1]-1)*2
    hop = fftsize / overlap
    w = scipy.hanning(fftsize+1)[:-1]
    x = scipy.zeros(X.shape[0]*hop)
    wsum = scipy.zeros(X.shape[0]*hop) 
    for n,i in enumerate(range(0, len(x)-fftsize, hop)): 
        x[i:i+fftsize] += scipy.real(np.fft.irfft(X[n])) * w   # overlap-add
        wsum[i:i+fftsize] += w ** 2.
    pos = np.where( wsum != 0 )
    x[pos] /= wsum[pos]
    return x


def my_istft(X, scale_t, overlap=4):   
    fftsize=(X.shape[1]-1)*2
    hop = int( fftsize / overlap * scale_t )   # scale_t spreads the resynthesis frames out in time
    w = scipy.hanning(fftsize+1)[:-1]
    x = scipy.zeros(X.shape[0]*hop)
    wsum = scipy.zeros(len(x))   # window-power accumulator, same length as x
    for n,i in enumerate(range(0, len(x)-fftsize, hop)): 
        x[i:i+fftsize] += scipy.real(np.fft.irfft(X[n])) * w   # overlap-add
        wsum[i:i+fftsize] += w ** 2.
    pos = np.where( wsum != 0 )
    x[pos] /= wsum[pos]
    return x


def stretch_stft( X, scale_t ):  # time-stretching: stretches the STFT via linear interpolation
    # X = input STFT
    # scale_t = time-stretching scale factor
    n = len(X)
    n2 = int( round( n *scale_t) ) 
    X2 = np.zeros( (n2, len(X[0]) ), dtype=complex  )

    phase_counter = np.angle(X[0])
    
    for i2 in range(n2-1):                  # i2 counts along new spectrogram
        i = 1.0 * i2/ scale_t               # i is the "fractional index" on the original stft
        ibgn = min( int(i), n-2 )           # clamp so that ibgn+1 stays in range
        di = i - ibgn                                               # goes between 0 and 1
        mag = (1.0-di)*np.abs(X[ibgn,:]) + di*np.abs(X[ibgn+1,:])   # linear interpolation 
        X2[i2,:] = mag * np.exp(1.j* phase_counter)                 # resynthesize

        dphi = np.angle(X[ibgn+1,:]) - np.angle(X[ibgn,:])          # phase change per frame
        #dphi = dphi  * scale_t # <--- Adding this makes it worse!  # scale phase diff with time stretch
        phase_counter = phase_counter + dphi                        # compute phase for next frame
    X2[n2-1] = X[n-1]
    return X2



#----------- parse command line arguments and read wav file into memory -----
if len(sys.argv) < 4:
    sys.exit('Usage: %s <wav_filename> <time_scaling> <freq_scaling>' % sys.argv[0])
wav_filename = sys.argv[1]
scale_t = np.float( sys.argv[2] )
scale_f = np.float( sys.argv[3] )

print "reading file ",wav_filename,"..."
samplerate, data = wavfile.read(wav_filename)
if (len(data.shape) > 1):    # take left channel of stereo track
    data = data[:,0]

#convert integer data to floats, and scale by max val
sig = 1.0*data
maxval = np.amax(sig)
print "maxval = ",maxval
sig = sig / maxval
#--------------------- end of preparing input file --------------------


vocoder_type = 'mynew'  # 'librosa', 'mynew', 'mine'

if ('librosa' == vocoder_type):
    print 'Using librosa vocoder'
    sig2 = librosa.effects.time_stretch(sig, 1.0/scale_t) 
    half_steps = 12 * np.log2( scale_f )
    sig2 = librosa.effects.pitch_shift(sig2, samplerate, n_steps=half_steps)

elif ('mynew' == vocoder_type):
    print 'Using my new vocoder '
    sigSTFT = basj_stft( sig )

    print ' Stretching the STFT....'
    stretched_STFT = stretch_stft ( sigSTFT, scale_t )

    print "resynthesizing via stft..."
    sig2 = basj_istft( stretched_STFT )

else:
    print 'Using my vocoder '
    sigSTFT = basj_stft( sig )
    sig2 = my_istft( sigSTFT, scale_t )
 

#save output file
newname = wav_filename.replace(".wav","_out.wav")
print "saving output file ",newname
wavfile.write(newname,samplerate,sig2)





The librosa output sounds the best, and I wish I understood why scaling "dphi" (in my stretch_stft code) by the time-scaling amount actually makes things worse instead of better.  (Best guess in hindsight: basj_istft lays the new frames down with the same hop used for analysis, so advancing each bin's phase by the unscaled dphi is what keeps its instantaneous frequency unchanged; scaling dphi would shift the frequencies rather than the timing.)
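For reference, here's a sketch of the textbook phase-propagation step from Laroche & Dolson [10], for the general case where the synthesis hop hop_s differs from the analysis hop hop_a.  This is my own untested rendering of their recipe, not code from the paper:

def propagate_phase(X, hop_a, hop_s):
    fftsize = (X.shape[1]-1)*2
    k = np.arange(X.shape[1])
    omega = 2.0*np.pi*k*hop_a/fftsize     # expected phase advance per analysis hop, per bin
    phase = np.angle(X[0])
    X2 = np.empty_like(X)
    X2[0] = X[0]
    for tt in range(1, X.shape[0]):
        dphi = np.angle(X[tt]) - np.angle(X[tt-1]) - omega     # deviation from expected
        dphi = dphi - 2.0*np.pi*np.round(dphi/(2.0*np.pi))     # wrap deviation to [-pi, pi]
        phase = phase + (omega + dphi) * hop_s/hop_a           # "true" advance, rescaled
        X2[tt] = np.abs(X[tt]) * np.exp(1.j*phase)             # keep magnitudes, impose phases
    return X2

Note that when hop_s equals hop_a, as in the frame-interpolation scheme above, the rescaling factor is 1, which squares with the unscaled dphi sounding best.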

Moving on, we need to resample "sig2" if we want pitch-shifting.  Now, scipy.signal has a resample() routine, but in my experience it's notoriously, horrendously slow unless you're careful to make sure the number of samples is a power of 2.  Hence we have the code my_resample, which zero-pads the input array up to a power-of-2 length:


def my_resample(x, newlen):
    method = 0
    if (0 == method):
        # pad signal so that its length is a power of 2 = much faster FFT resampling
        orig_len = len(x)
        p2_len = int( math.pow(2, math.ceil(math.log(orig_len)/math.log(2))) )
        x3 = np.zeros(p2_len)
        x3[0:orig_len] = x                  # copy the whole signal; zeros pad the rest
        x2 = signal.resample(x3, int(newlen*p2_len/orig_len))
        x2 = x2[0:newlen]                   # trim back to the requested length
    else:
        # crude decimation: grab every stride-th sample, no anti-alias filtering
        stride = int( len(x) / newlen )
        x2 = np.zeros(newlen)
        for i2 in range(0, newlen):
            x2[i2] = x[i2*stride]
    return x2





With that, we just add in a final call to my_resample, include the pitch scaling in the call to the time-stretching routine, and we're done.  Here's the relevant section:

elif ('mynew' == vocoder_type):
    print 'Using my new vocoder '
    sigSTFT = basj_stft( sig )

    print ' Stretching the STFT....'
    stretched_STFT = stretch_stft ( sigSTFT, scale_t * scale_f )   # stretch by both factors; resampling below undoes scale_f in time

    print "resynthesizing via stft..."
    sig2 = basj_istft( stretched_STFT )

    print "resampling as per frequency scaling"
    newlen = int( len(sig2) / scale_f )   # fewer samples at the same sample rate = higher pitch
    sig2 = my_resample( sig2, newlen )
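To make the combined scaling concrete: a pure pitch shift up a fifth with no tempo change would be scale_t = 1.0, scale_f = 1.5.  The STFT gets stretched to 1.0 * 1.5 = 1.5x its length (same pitches, just slower), and resampling down to len(sig2)/1.5 samples then speeds playback back up to the original duration while multiplying every frequency by 1.5, i.e. 12*log2(1.5), about 7 half-steps up.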





If you want to download the full code, here's a link.

Enjoy.  My day's over. :-)

-Dr H


References:
  1.  Adam Savage's One Day Builds: http://www.tested.com
  2.  Vague: http://theproaudiofiles.com/vocal-tuning-pitch-correction/
  3.  I like this. Describes steps via pictures & non-math: https://en.m.wikipedia.org/wiki/Audio_time-scale/pitch_modification#Untangling_phase_and_time
  4.  Another nice description without calculus, has pictures: http://www.guitarpitchshifter.com/algorithm.html
  5.  My compressor model: http://www.scotthawley.com/compressor/
  6.  librosa: https://github.com/bmcfee/librosa
  7.  MATLAB phase vocoder: http://www.ee.columbia.edu/ln/rosa/matlab/pvoc/
  8.  SHAART: http://hedges.belmont.edu/~shawley/SHAART/
  9.  Basj's STFT/ISTFT Python code: http://stackoverflow.com/a/20409020
  10. Laroche & Dolson, "Improved Phase Vocoder Time-Scale Modification of Audio": http://www.ee.columbia.edu/~dpwe/papers/LaroD99-pvoc.pdf
