This week, I managed to prove (to myself, at least) the power of MFCCs in speech recognition. I was quite skeptical that I could get anything to actually recognize speech, despite many sources saying how vital they are to DSP and speech recognition.
A tutorial on Tensorflow I found a couple of months ago sparked the idea: if 2-dimensional images can be represented as a 1-dimensional array and used to train the model, perhaps the same could be done with features extracted from an audio file. After all, the extracted features are nothing but an array of coefficients. So this week, armed with weeks of knowledge of basic Tensorflow and signal processing, I finally tried to get it to work. And of course, many problems arose.
After hours of struggling with mismatches in the shape of the data, waiting for the huge dataset to reload when I made a mistake, and getting no results, I finally put together the last piece of code that made it run correctly, and immediately second-guessed the accuracy of the model (“0.99 out of 100, right???”).
Of course, when training a model, a result this good could be a case of overfitting. And indeed it is, because it is only 95% accurate when using separate test data. And even this percentage isn’t the whole story. The test data comes from the same dataset, which has a lot of recordings of each digit, but using only 4 voices. It’s quite possible that there are patterns found in the voices that would not exist in other voices. This would make it great using a random sample from the original dataset, but possibly useless for someone else. There’s also the problem of noise, which MFCC is strongly affected by. So naturally, I recorded my own voice speaking digits and ran it with the model. Unfortunately, I could only manage approximately 50% accuracy, although it is consistently accurate with digits 0, 1, 2, 4 and 6. Much better than chance, at least!
This is a very simple model, which allows you to extract only MFCCs from an audio recording of a spoken digit (0 through 9) and plug it into the model to get an answer. But MFCCs may not tell the whole story, so the next step will be to use additional extracted features to get this model to perform better. There is also much more tweaking I can do with the model to see if I can obtain better results.
I’d like to step through the actual code next week and describe the steps taken to achieve this result. In the meantime, I have a lot more tweaking and refactoring to do.
I would like to mention a very important concept that I studied this week in the context of DSP: convolution. With the help of Allen Downey’s ThinkDSP and related lecture, I learned a bit more detail on filtering of signals. Convolution is essentially sweeping one signal over another to get a new signal. In DSP, this is used for things such as low-pass filters and adding echo to audio.
Think of an impulse as an instantaneous tone consisting of many (or all) frequencies. If you record this noise in a room, you will get a recording of the “impulse response”. That is, how all of the frequencies are affected by the room over time. The discrete Fourier transform of this response is essentially a filter, because it gives the amplitude of each frequency in the impulse response, including all echos and any muffling. Multiplying these amplitudes by the DFT of an entirely different audio signal will modify each frequency in the exact same way. And thus, to the human hear, this different audio signal will sound like it does in the same room. If this concept is interesting, I encourage you to watch the lecture and work through the examples in the book.
I think these topics may come in handy if I need to pre-process recordings, in the event that noise is in fact causing errors in the above model.
From the blog CS@Worcester – Inquiries and Queries by James Young and used with permission of the author. All other rights reserved by the author.