Storing files on a server from a mobile app is a nifty trick, but this week for my independent study I began running code on the server to extract audio features and predict the spoken digit.
In its current state, my app allows a user to record an audio file. Once the recording is finished, the file is uploaded to the server, which extracts the audio features and submits them to the machine learning model, which currently predicts a spoken digit. The server then lets the user look at specific information about the audio file: a graph of the model's certainty for each digit, the waveplot, and the MFCC features.
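The post doesn't show the server code, but here's a minimal sketch of the kind of pipeline described, assuming a Python server using librosa for feature extraction and some already-trained classifier with a `predict` method (the function and parameter names are mine, not the project's):

```python
import librosa
import numpy as np

def predict_digit(audio_path, model, n_mfcc=13):
    """Extract MFCC features from an uploaded recording and
    return the model's per-digit certainty scores."""
    # Load the recording; librosa resamples to 22050 Hz by default.
    y, sr = librosa.load(audio_path)

    # Compute MFCCs: an array of shape (n_mfcc, n_frames).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

    # Average each coefficient over time so every recording yields a
    # fixed-length feature vector (a hypothetical choice; the real
    # model may well be trained on the full MFCC matrix instead).
    features = np.mean(mfcc, axis=1).reshape(1, -1)

    # Assume the model outputs one certainty score per digit 0-9.
    certainties = model.predict(features)[0]
    return int(np.argmax(certainties)), certainties
```

The certainty scores returned here are what would feed the per-digit certainty graph shown to the user.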
This basic framework leaves room for growth. First, I have been taking care to design the app so that additional features can easily be added for the user's viewing pleasure, and I plan on adding a spectrogram and FFT this week. Second, the machine learning model is currently trained on MFCC features only, but it could be retrained with other features to improve accuracy. And although it currently only guesses spoken digits, additional models could be trained to build a more complex system that analyzes different kinds of audio data for different applications.
The biggest issue in this project has been finding datasets large enough to train a model. I'd love to extend the machine learning features of this app, but unfortunately the amount of work required is far out of scope for a single person in a single semester. Although there are many large human speech datasets, training a model on them in a supervised manner would require hours of manually labeling the data.
Luckily, I’ve learned enough about signal processing to make that a main aspect of the project. And as I said at the beginning of the semester, my main goal was to gain experience in the Android framework and software development in general. Having to overcome unexpected challenges and find creative ways to approach them has probably been the most important learning experience in this project.
I also continue to be reminded of the importance of knowing the shape of your data and what it actually represents before trying to work with it. MFCC features just aren't displayed the same way as a spectrogram or a waveplot, so each of these requires special consideration when plotting and, in the future, when training machine learning models with them.
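A quick way to see the difference is to print the shapes, again assuming librosa and a hypothetical file name:

```python
import librosa
import numpy as np

y, sr = librosa.load("recording.wav")  # hypothetical file name

# A waveform is one-dimensional: one amplitude value per sample.
print(y.shape)     # e.g. (22050,) for one second at 22050 Hz

# A spectrogram is two-dimensional: frequency bins x time frames.
spec = np.abs(librosa.stft(y))
print(spec.shape)  # e.g. (1025, 44) with librosa's defaults

# MFCCs are also two-dimensional, but the rows are cepstral
# coefficients, not frequency bins, so a plot of one can't be
# read the same way as a plot of the other.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfcc.shape)  # e.g. (13, 44)
```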
And to finish, I’d like to describe my biggest issue of the week: deciding how to get the data to a user after running the server-side code. The naive approach would be to send all the data at once in the response, but that would take a long time, and the user might not even want all of it. Instead, the app sends an HTTP request for a JSON object of metadata about a given audio recording. This metadata lists each extracted feature along with a link to download it, if desired. The app itself can then decide whether to download them. In my case, I currently have an interface that handles the API calls and passes back each file’s download link individually in a callback method when the HTTP request succeeds. The app displays each link as it is received.
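The post doesn't name the server framework, but a metadata endpoint like the one described could look something like this minimal Flask-style sketch, with hypothetical route, field names, and placeholder values:

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical route and JSON layout; the real API may differ.
@app.route("/recordings/<recording_id>/metadata")
def recording_metadata(recording_id):
    # Return a link to each extracted feature instead of the raw
    # data, so the app can decide what to download and when.
    return jsonify({
        "id": recording_id,
        "features": [
            {"name": "certainty",
             "url": f"/recordings/{recording_id}/certainty.png"},
            {"name": "waveplot",
             "url": f"/recordings/{recording_id}/waveplot.png"},
            {"name": "mfcc",
             "url": f"/recordings/{recording_id}/mfcc.png"},
        ],
    })
```

The payoff of this design is that the response stays small and the client keeps control: it can fetch the links lazily and surface each one in its callback as it arrives.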
This week I also had to refactor an old project for an assignment, and I chose my first attempt at a Scrabble game in Python. The contrast between that project and this one was a reminder of the tools I’ve picked up over the past four years. I never would have been able to juggle this many different technologies and still understand the architecture without the help of many software engineering concepts.
From the blog CS@Worcester – Inquiries and Queries by James Young and used with permission of the author. All other rights reserved by the author.