One of the (stupid) side projects I have wanted to work on is an easier way to find lyrics while freestyle singing. I had been hyping this project for a while to friends and coworkers. With all of us going our own ways at work, I felt compelled to get the MVP done before we stopped seeing each other. Rap Owl uses a combination of AI and crowdsourcing to produce lyrics for all sorts of songs! Or at least that is the end-goal marketing speak; for now, the MVP is what we will be discussing.
I like to make up songs. They are not good songs, but for about 20 seconds (or one verse) I am able to rhyme while belting out ad hoc lyrics. At times I have been told this borders on being annoying. After a 1500 mile road trip with my best friend, I have found making up songs to be an amazing creative outlet while passing the time. Sadly my rap skills are not as good as my best friend's, so to make up for this I knew I needed to build a tool to help. A few years ago I stumbled upon https://rhymebrain.com/en, a tool that uses Levenshtein distances (a measure of how many single-character edits it takes to turn one string into another) and a bunch of proprietary magic to find words that sound the same. The guy who built the tool is a genius, and I do not have the time or desire to create such an opus. I explored porting the basics of the Levenshtein distance functions to the frontend via Python (roughly a 10x performance cost), and moving similar code to ASM would be even worse (transmitting the dictionary alone would be horrible). Thankfully, what I am good at is connecting to backend APIs, and due to the graciousness of the engineer who created Rhyme Brain we can make throttled requests! The biggest piece that would make this tool useful in the space I am focusing on is using the speech API to find words that sound like what a person is saying. The result is a freestyle singing tool that continuously looks up rhyming words as you pause, since that is how a good flow can be found.
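For reference, the textbook dynamic-programming version of Levenshtein distance is only a few lines. This is the generic algorithm, not Rhyme Brain's proprietary variant:

```javascript
// Classic Levenshtein distance: the minimum number of single-character
// insertions, deletions, and substitutions needed to turn `a` into `b`.
// Sweeps the DP table one row at a time to keep memory at O(n).
function levenshtein(a, b) {
  const m = a.length;
  const n = b.length;
  // prev[j] = distance between the first i-1 chars of a and first j chars of b
  let prev = Array.from({ length: n + 1 }, (_, j) => j);
  for (let i = 1; i <= m; i++) {
    const curr = [i]; // distance from a[0..i] to the empty prefix of b
    for (let j = 1; j <= n; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution (free on a match)
      );
    }
    prev = curr;
  }
  return prev[n];
}
```

For example, `levenshtein("kitten", "sitting")` is 3, the canonical textbook case.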
Digging into the pieces of this project, my limited scope MVP looks something like this:
webkitSpeechRecognition API >>> Identify the last word in the bar >>> GET request to Rhyme Brain >>> Display results
webkitSpeechRecognition demo here: https://github.com/googlearchive/webplatform-samples/tree/master/webspeechdemo
This is one of those (not) super awesome APIs that are only supported in Chrome. Luckily the documentation is very clear, and the demo is as simplified as I would expect (a single HTML file!!!). Getting the demo to work on my machine was tricky, mostly because of my machine's configuration: too many extensions (bordering on a 90's Yahoo toolbar) and the software I use to manage my sound card. After verifying that the microphone web API worked on my phone, I was able to determine that one of the many extensions was blocking the XHR call from making its way to Google. The next day I ran into a second issue of my microphone not capturing audio, which I fixed by diving into the laptop's audio settings and toggling the input.
Then the scope creep happened. If this were a real startup, I would pursue this product by building models of the different rap cadences from existing songs. This would allow different rapping styles to be selected on the frontend. Then we would store the lyrics to try to determine the optimal flow. Existing songs could be parodied to get that viral growth. Eventually, enough data on how people write songs would be collected, and an AI model to write songs would be able to do 90% of the work. Thankfully this is just a side project! Finally, I got so fed up with my perpetual hyping of Rap Owl that I built the MVP using vanilla JS.
Using the webkitSpeechRecognition API, I was able to detect the last word rapped before a pause by comparing the final result to the interim (inferred) result. Then I do a fetch to the heavily throttled https://rhymebrain.com/en.
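The whole pipeline can be sketched in vanilla JS. To be clear, this is my hedged reconstruction and not the actual Rap Owl source: `lastWord`, `fetchRhymes`, and `startListening` are names I am making up here, and the request goes to Rhyme Brain's public `getRhymes` endpoint as described on its API page:

```javascript
// Pull the last word out of a transcribed bar.
function lastWord(transcript) {
  const words = transcript.trim().split(/\s+/).filter(Boolean);
  return words.length ? words[words.length - 1].toLowerCase() : null;
}

// Query Rhyme Brain's public (throttled) API for rhymes.
async function fetchRhymes(word) {
  const url =
    "https://rhymebrain.com/talk?function=getRhymes&word=" +
    encodeURIComponent(word);
  const res = await fetch(url);
  return res.json(); // array of { word, score, ... } objects
}

// Wire up speech recognition (Chrome-only webkit prefix) so that each
// final result -- i.e. each pause -- triggers a rhyme lookup.
function startListening(onRhymes) {
  const rec = new webkitSpeechRecognition();
  rec.continuous = true;
  rec.interimResults = true;
  rec.onresult = async (event) => {
    const result = event.results[event.results.length - 1];
    if (result.isFinal) { // the pause ended the bar
      const word = lastWord(result[0].transcript);
      if (word) onRhymes(await fetchRhymes(word));
    }
  };
  rec.start();
}
```

Usage would be something like `startListening(words => render(words))`, with `render` being whatever displays the results.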
This is where the UX conundrum comes in, as being inundated with 500 matching words is overwhelming. I do not have a good solution for the time being. Scrolling words over the lyrics as they are detected seems reasonable, but I will be saving that for another day since I only sing on long car trips.
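One idea I have toyed with (not implemented in the MVP): Rhyme Brain's `getRhymes` responses include a `score` field per word, so the flood could be trimmed to the top handful before rendering. `topRhymes` here is a hypothetical helper of my own:

```javascript
// Keep only the highest-scoring rhymes before showing them, so the
// singer sees a dozen strong candidates instead of 500 weak ones.
function topRhymes(rhymes, limit = 12) {
  return [...rhymes]                      // copy so the input stays untouched
    .sort((a, b) => b.score - a.score)    // best score first
    .slice(0, limit)
    .map((r) => r.word);
}
```

The `limit` of 12 is arbitrary; the right number is really a question of how many words a person can scan mid-verse.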
As for CI, apparently I set up a build and deploy pipeline a while ago and had forgotten about it! Getting the code deployed was as easy as pushing. I was so proud of my previous self for making sure that deploying was easier for my future self. Once it was deployed, it became apparent that access to the microphone requires HTTPS. The documentation is sparse, but media devices require a secure context (HTTPS) unless the host is localhost.
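The browser does expose this directly: `window.isSecureContext` is true on HTTPS pages and on localhost. A minimal guard sketch, where `canUseMicrophone` is my own hypothetical helper (it takes the window object as a parameter so it can be exercised outside a browser):

```javascript
// Returns true only when the page is a secure context (HTTPS or
// localhost) AND the media-device API is actually exposed -- the two
// preconditions for microphone access.
function canUseMicrophone(win) {
  return Boolean(
    win &&
      win.isSecureContext &&
      win.navigator &&
      win.navigator.mediaDevices
  );
}
```

In the page itself this would be called as `canUseMicrophone(window)` before starting recognition, to show a friendly message instead of a silent failure.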
The HTTPS was tricky to implement. There are tons of tutorials that make it seem so easy on AWS, but none of them worked. They all suggest that the dropdown will have the CloudFront distribution listed when you set up a new A record in Route 53; this was not the case for me. I went to sleep hoping the issue would resolve itself by morning; it did not. I did some research and found documentation explaining that the CloudFront distribution needs to finish deploying, which it had already been set up to do.
That was when I started to think outside the box. My payment card recently got a new number as part of some fraud protection push, and half of the services I use were able to resolve the new card number automatically. Amazon is not one of those companies. After updating my billing to the new card number and paying a few past-due invoices, I was able to see the CloudFront distribution deploying.
Within 30 minutes the service had propagated enough to be selectable from the dropdown within Route 53. It turns out only one CloudFront distribution is needed per domain. With it live, I was back to coding, trying to manage the flood of new words constantly showing. Sharing the link is worse, as it currently only works in Chrome. While looking for a better approach, I found that MDN has a nice little demo: https://github.com/mdn/web-speech-api/tree/master/speech-color-changer
Since I had hyped this so much, I found a template I had purchased in the past and added the rap-owl.js library to it. So now I have a clean page with reasonable performance that hosts the MVP. And I shared the hell out of it at work. The pressure of being laid off in 60 days triggered so many feelings, but the worst was not being able to share the projects I had been discussing with coworkers for years. Motivation comes from many places, and even now I am still mourning the loss of my job. Not because of the work, but because of the people with whom I formed deep bonds.
Overall this was a fun project with a limited scope. I found it easy to integrate a small amount of code into a performant marketing page template. I have all the notes for what it would take to expand on this work from a technical standpoint, but I have not done any market research. The incumbent songwriting software seems out of date and ripe for enhancement. The demo does not work on mobile Chrome, as the final and interim result APIs are not consistent between mobile and desktop. Someone is calling out the issue on Stack Overflow, but it would appear that the webkitSpeechRecognition API is not mature enough to build on in 2020. The ideas, knowledge, and goals from this project are being transferred to a new project that will focus on generating alternative lyrics for songs, similar to how Weird Al makes silly covers, or on helping you write more concisely.