After getting a simple machine learning model to recognize spoken digits, I was able to begin the iterative process of improving the model. Using only MFCCs, the model was failing more often than desirable, reaching a maximum of 60% accuracy on validation data (my own voice, which was not used in training the model).
Below you will see plots of a sample of results from validating the model. For each digit, there are the extracted MFCC features, the actual spoken digit, the digit predicted by the model, and the certainty. There is also a plot of the model’s certainty for every other digit for that recording.
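For reference, here is a rough sketch (not the exact code from this project) of how one of these panels could be produced: extract MFCCs from a recording, ask the trained model for its per-digit certainty, and plot both. The file paths, the saved model name, and the input reshaping are all assumptions about how the data and model might be stored.

```python
# A sketch of plotting MFCC features alongside the model's certainty per digit.
# The recording path, model file, and input shape are hypothetical.
import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt
from tensorflow import keras

model = keras.models.load_model("digit_model.h5")               # hypothetical saved model
signal, sr = librosa.load("validation/3_james_0.wav", sr=None)  # hypothetical recording

mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)

# The exact input shape depends on how the model was built; a flat feature
# vector is assumed here purely for illustration.
probs = model.predict(mfcc.flatten()[np.newaxis, :])[0]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3))
librosa.display.specshow(mfcc, x_axis="time", ax=ax1)
ax1.set_title("MFCC features")
ax2.bar(range(10), probs)
ax2.set_title(f"Actual: 3, Predicted: {probs.argmax()} ({probs.max():.0%})")
ax2.set_xlabel("Digit")
plt.tight_layout()
plt.show()
```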

This is just a sample of a larger validation set, and the actual results for this first model were only 45% accurate. But it shows that for all of these digits except 3 and 5, the model was 99% to 100% certain of the result. The differences in the MFCCs are subtle, but recordings with stark differences in color appear more likely to be classified correctly, whereas the 5 is clearly closer in color to the 1 it was mistaken for. Additionally, every single audio clip of 3 was mistaken for a 0 using this model.
Retraining the model with different parameters may help in this case, but we can also hypothesize about the reason for these mistakes. Perhaps the MFCCs are picking up patterns in the vowels that make “zero” and “three” look identical. If that’s the case, features that can detect consonants might help improve results. This sounds fairly obvious in hindsight, so it might be a good next step for the next iteration.
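If we did want to test that hypothesis, one option (just a sketch, not something done in this post) would be to stack consonant-sensitive features on top of the MFCCs, for example delta MFCCs and zero-crossing rate, since the “z” in “zero” and the “th” in “three” show up more in frame-to-frame change and noisiness than in the vowel-dominated MFCCs.

```python
# Sketch of a consonant-friendlier feature set, assuming librosa is used for
# feature extraction (the n_mfcc value and the choice of features are illustrative).
import librosa
import numpy as np

def extract_features(signal, sr):
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)
    delta = librosa.feature.delta(mfcc)               # frame-to-frame change
    zcr = librosa.feature.zero_crossing_rate(signal)  # noisiness, a crude consonant cue
    # All three use the same default hop length, so the frame counts line up
    return np.vstack([mfcc, delta, zcr])
```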
But first, let’s retrain the model without any changes.

Okay! This 3 was very accurately predicted. But the total accuracy of validation was only 50% (remember, this only shows a sample of 10). Inspecting the actual results now shows that 3 is sometimes mistaken for a 2, and vice versa. This model is slightly better, but still flawed. Which makes sense, because no changes have been made to the model and we just got lucky that it learned to be a bit better this time.
I’ve been training with 25 epochs, getting 95–97% accuracy during training and 93–97% accuracy on test data (from the same dataset as the training data, but not used to train the model). Those results are pretty good, so maybe we can use fewer epochs and prevent some overfitting.
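In code, that change is small. The sketch below assumes a Keras-style model and uses placeholder variable names rather than the actual training script; it lowers the epoch count and optionally lets training stop early once the test loss stops improving.

```python
# Hedged sketch of training with fewer epochs, assuming a Keras-style model;
# x_train/y_train/x_test/y_test are placeholders for the real arrays.
from tensorflow import keras

early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True
)

history = model.fit(
    x_train, y_train,
    validation_data=(x_test, y_test),
    epochs=15,               # down from the original 25
    callbacks=[early_stop],  # stop once test loss stops improving
)
```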

This certainly looks promising. With 95% accuracy during training and 93.8% accuracy on test data, the results are still pretty good. More importantly, the validation data with my voice is now 57.5% accurate! Only a single 3 was mistaken for a 0.
So far I’ve been using a dataset of 4 voices to train and test, and my own voice to validate. But more data is probably better, so let’s add my voice to the training data and hold out a random sample to validate.
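That reshuffle could look something like the sketch below, where the `dataset_*` and `my_voice_*` variables are placeholders for however the recordings are actually loaded: fold my recordings into the pool, then hold out a random, digit-balanced sample for validation.

```python
# Sketch of folding my recordings into the dataset and holding out a random
# sample for validation; dataset_* and my_voice_* are placeholder arrays.
import numpy as np
from sklearn.model_selection import train_test_split

features = np.concatenate([dataset_features, my_voice_features])
labels = np.concatenate([dataset_labels, my_voice_labels])

# Hold out a random 10% (balanced across digits) for validation
x_train, x_val, y_train, y_val = train_test_split(
    features, labels, test_size=0.1, stratify=labels, random_state=42
)
```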

The plot is looking good! Each of these was very accurately predicted. Training and test accuracy were both 97%. The validation data was 100% accurate. Of course, now that the validation data contains only voices the model was trained on, it’s more likely to be correct. Furthermore, the sample is small. So let’s see what happens if we use a new voice to validate. I had my roommate record himself saying each digit and used only his voice for validation data.

In general, the model is much more certain of its guesses. The final validation result was 80% accuracy, so not perfect, but a major improvement. This much of an improvement came just from adding more data and making small modifications to the model.
The importance of collecting data in order to improve a model is apparent. Even with 80% accuracy, there is still some predictive power. If the model proves useful, further data can be collected as it is used, and this new data can be cleaned and used to train better models.
From the blog CS@Worcester – Inquiries and Queries by James Young and used with permission of the author. All other rights reserved by the author.