Generating realistic speech on-the-fly with Azure Functions and Cognitive Service’s Speech APIs

Last week Chloe Condon, a Cloud Developer Advocate for Microsoft, posted a great article and accompanying open source project for helping people handle awkward social situations. The project – combining Azure Functions, Twilio and a Flic button (available from Shortcut Labs), allows a user to trigger a fake call using a discrete Bluetooth button, which triggers an Azure function, which in turn uses the Twilio API to make a call to a specified number and play a pre-recorded MP3. You can read much more detail about the project in the great article Chloe wrote about it over on Medium.

As a side note, Chloe describes herself as an ambivert, a term which I will admit I had never come across, but after reading the description fits me to a tee. As with Chloe, people assume I am an extrovert, but whilst I am totally comfortable presenting to a room of conference goers and interacting with folks before and after, I soon find myself needing to recharge my batteries and thinking of any excuse to extricate myself from the situation – even just for a short time. Hence, this project resonated with me (as well half of Twitter it seems!).

One of the things that struck me when first looking at the app was the fact that a pre-recorded MP3 was needed. Now, this obviously means that you can have some great fun with this, potentially playing your favorite artist down the phone, but wouldn’t it be good if you could generate natural sounding speech dynamically at the point at which you made the call? Step in the Speech service from the Microsoft Cognitive Services suite – this is what I am going to show you how to do as part of this post.

The Speech service has, over the last year or so, gone through some dramatic improvements, with one of the most incredible, from my perspective, being neural voices. This is ability to have speech generated that is almost indistinguishable from a real human voice. You can read the blog post where neural voices were announced here.

So, based on all of this, what I wanted to achieve was the ability to trigger an Azure function – passing the text to be turned into speech – and have that generate an MP3 file for me and have that available to use immediately.

This is what I am going to show you how to do in this article and below you can hear an example of speech generated using the new neural capabilities of the service.

Let’s get started….

(more…)