Making Amazon Alexa smarter with Microsoft Cognitive Services

Recently, those of us who work at Mando were lucky enough to receive an Amazon Echo Dot to start playing with, to see if we could innovate with them in any interesting ways. As I have been doing a lot of work recently with the Microsoft Bot Framework and Microsoft Cognitive Services, this was something I was keen to do.  The Echo Dot, the hardware that sits on top of the Alexa service, is certainly a very nice piece of kit, but I quickly found some limitations once I started extending it with skills of my own.  In this post I will talk about my experience so far and how you might be able to use Microsoft services to make up for some of Alexa's current shortcomings.

Firstly, for those of you who don't know what the Amazon Echo devices are, they are essentially Amazon's take on a digital personal assistant for use around your home. The devices are always listening (unless you disable this feature) for you to say "Alexa", after which you can ask them to do any number of tasks, from playing music from Spotify or Amazon Music through to controlling the heating in your home (think Siri / Cortana for your house!). Out of the box the devices come with a number of 'skills' built in or configurable within the associated Android / iOS app. After configuring the device I tested playing music by saying "Alexa, play some popular music" and this works very well, even when citing a specific artist.  I also tried some other skills, such as getting a summary of the latest news and asking the device about the current weather in a specific location.  It has to be said that when using the out-of-the-box skills I was very impressed by Alexa's hit rate in terms of understanding what I wanted to do, although from speaking with some of my colleagues, their experiences varied.

I then quickly moved on to exploring how the Echo can be extended by implementing new skills, which you can, if you wish, make available within the store for others to use.  In essence, a new skill is made up of two key parts:

  1. An Alexa interaction model which describes how users talk to the Alexa service and the commands they can use
  2. The code behind the skill, written using Node.js

Anatomy of a skill

The intention of this post is not to tell you all the ins and outs of creating skills for Alexa, but it will help to show you the basic components.  If you are already familiar with building Alexa skills then you can skip past this to the "What's the problem?" section.

A basic interaction model

An interaction model is defined using 3 constituent parts.

The first is the intent schema.  This, as shown below, defines the various tasks the user might want to achieve using your skill.  In this example we have a single intent, the task of finding the time at a given location.  To go along with the intent we also define slots.  Slots are the parts of the phrase that will change depending on the request; in this case, the name of the location the user would like the current time for.

{
  "intents": [
    {
      "intent": "GetTimeInLocation",
      "slots": [
        {
          "name": "location",
          "type": "location"
        }
      ]
    }
  ]
}

The second part is your slot types.  Above we have specified a slot type of 'location', and we need to give Alexa some example values of the sort of thing we are expecting the user to say. Therefore our custom slot would be defined as follows:

Slot name: location
Slot values: orlando | new york | london | manchester | san francisco

The final piece of the interaction model is the sample utterances, which give Alexa example phrases the user might use to indicate a certain intent.  As you can see below, I have defined three possible ways the user might ask for the time in a given location, and within each utterance is a placeholder for the location the user would like to check.  Each sample utterance starts with the intent name, in this case "GetTimeInLocation", and is followed by the phrase the user might use.

GetTimeInLocation what the current time is in {location}
GetTimeInLocation what is the time right now in {location}
GetTimeInLocation in {location} what is the current time
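
To invoke the skill, the user combines the wake word and the skill's invocation name with one of these phrases. Assuming an invocation name of "location time" (the one used in the example later in this post), a full request would sound something like this:

Alexa, ask location time what the current time is in london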

The code behind a basic Alexa skill

The code below shows the core of a simple Alexa skill which allows the user to ask what the time is in a given location right now.  When you speak to Alexa, the service determines the user's intent based on the model you have specified; in this example there is only a single intent, "GetTimeInLocation". Based on the simple interaction model described above, if the user said, "Alexa, ask location time what the time is right now in Orlando", Alexa would trigger the code below and pass the location Orlando in via the intent handler we have specified.

var APP_ID = "YOUR_ALEXA_APP_ID_GOES_HERE";

/**
* The AlexaSkill prototype and helper functions
*/
var AlexaSkill = require('./AlexaSkill');

var SampleSkill = function () {
    AlexaSkill.call(this, APP_ID);
};

// Extend AlexaSkill
SampleSkill.prototype = Object.create(AlexaSkill.prototype);
SampleSkill.prototype.constructor = SampleSkill;

SampleSkill.prototype.eventHandlers.onSessionStarted = function (sessionStartedRequest, session) {
    // any initialization logic goes here
};

SampleSkill.prototype.eventHandlers.onLaunch = function (launchRequest, session, response) {
    var speechOutput = "Welcome to the sample Alexa skill!";
    response.tell(speechOutput);
};

SampleSkill.prototype.eventHandlers.onSessionEnded = function (sessionEndedRequest, session) {
    // any cleanup logic goes here
};

SampleSkill.prototype.intentHandlers = {
    // register custom intent handlers
    "GetTimeInLocation": function (intent, session, response) {

        if (intent.slots.location.value) {
            // your code to get the time in the given location goes here
            response.tell('The current time in ' + intent.slots.location.value + ' is …');
        } else {
            response.tell('Sorry, I wasn\'t sure which location you wanted to check the time for');
        }
    }
};

// Create the handler that responds to the Alexa Request.
exports.handler = function (event, context) {
    var sampleSkill = new SampleSkill();
    sampleSkill.execute(event, context);
};
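
The handler above leaves the actual time lookup as an exercise. As a minimal sketch of how you might fill it in, assuming you are happy to keep a small hard-coded map from the slot values in your interaction model to IANA time zone names (the LOCATION_TIMEZONES map and getTimeInLocation helper below are hypothetical names, not part of the Alexa SDK), you could do something like this:

// Hypothetical lookup from our custom slot values to IANA time zone names
var LOCATION_TIMEZONES = {
    'orlando': 'America/New_York',
    'new york': 'America/New_York',
    'london': 'Europe/London',
    'manchester': 'Europe/London',
    'san francisco': 'America/Los_Angeles'
};

function getTimeInLocation(location) {
    var timezone = LOCATION_TIMEZONES[location.toLowerCase()];
    if (!timezone) {
        return null;
    }
    // Requires a Node.js runtime with full ICU support; otherwise use a time zone library
    return new Date().toLocaleTimeString('en-GB', {
        timeZone: timezone,
        hour: '2-digit',
        minute: '2-digit'
    });
}

The intent handler could then call getTimeInLocation(intent.slots.location.value) and fall back to the apology response when it returns null.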

 

What’s the problem?

So, if you have implemented a basic skill like the one described above, then you should be able to use one of the example utterances, like "what is the time right now in Orlando", and in all likelihood Alexa will understand your intent correctly and route your request to the right intent handler in your code. "Great! What is the big problem then?" I hear you cry! Well, so far during my testing of Alexa I have found that if you ask for something in a way that differs even a little from the sample utterances you have specified, Alexa is not great at understanding what it is you would like to do.  This means that the user needs to know, in many cases, exactly the right way to ask for something.  This isn't an ideal user experience and can be frustrating for the user, who from their perspective is asking for something in a perfectly reasonable way.

Having worked a lot with Microsoft's Cognitive Services, and LUIS (Language Understanding Intelligent Service) in particular, I have become used to being able to define similar interaction models to those you build with Alexa, but then having the ability to monitor user input and train the model over time to understand the different phrases that are received and which intents they relate to.  This continuous improvement step really is crucial for building a great user experience, but sadly, as of right now, there doesn't seem to be a similar function available within Alexa.

Using LUIS to make Alexa smarter – an experiment

Now that I had my skill, but was finding the natural language understanding capabilities of Alexa to be a little less than ideal, I wanted to see if I could use the Microsoft LUIS Cognitive Service in tandem with Alexa to improve the situation.  If you are not familiar with LUIS and how you can use it to build language models, check out my post here.

My overall goal with this experiment was to see if I could effectively replace Alexa's intent determination process with that of the Microsoft LUIS service.  Ideally, within my skill I would like to get the whole phrase that the user spoke and send it to LUIS for analysis to determine the user's intent.  However, the first problem I found was that in the request sent to our intent handler we don't get the text of the whole phrase the user spoke; instead we just receive the likely intent and any values for custom slots.  To get around this problem, I altered my interaction model to have only a single intent, "GetUserIntent", and a single slot type called "phrase" which contains complete sample utterances that the user might use.  Here you would add a number of samples covering any intent in your LUIS model. Finally, we set a single sample utterance which is made up of just our custom slot.  This should tell Alexa to pass the whole phrase spoken by the user into our custom slot and therefore make it available in our skill.

{
  "intents": [
    {
      "intent": "GetUserIntent",
      "slots": [
        {
          "name": "phrase",
          "type": "phrase"
        }
      ]
    }
  ]
}


Slot name: phrase
Slot values: what is the current time in london | in orlando what is the time right now | what time is it in new york


GetUserIntent {phrase}

 

Now that we have modified our interaction model to pass through the whole phrase spoken by the user, we need to alter our skill code to handle this and use the LUIS service to determine the user's intent. In the modified code below for our skill's intent handlers, you can see that I am creating a handler for our single intent, "GetUserIntent", taking the value of the "phrase" custom slot and passing it to the LUIS service.  Right now I am only checking whether the result matches a single intent, "GetTimeInLocation", but you could easily use a switch statement or something similar to check against multiple intents.

LuisSkill.prototype.intentHandlers = {
    // register custom intent handlers
    "GetUserIntent": function (intent, session, response) {

        var phrase = null;

        if (intent.slots.phrase.value) {
            phrase = intent.slots.phrase.value;
        }

        if (phrase) {
            var request = require("request");
            var url = "https://api.projectoxford.ai/luis/v1/application?id=YOUR_APP_ID&subscription-key=YOUR_SUBSCRIPTION_KEY&q=" + encodeURIComponent(phrase) + "&timezoneOffset=0.0";

            request({
                url: url,
                json: true
            }, function (error, urlresponse, body) {

                if (!error && urlresponse.statusCode === 200) {
                    // LUIS returns its result in the response body, with intents ordered by score
                    if (body.intents && body.intents.length > 0 && body.intents[0].intent === "GetTimeInLocation") {
                        // do something here to get the time in the location
                    }
                }
            });
        }
        else {
            response.tell('Sorry, I wasn\'t sure what you asked me');
        }
    }
};
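
Inside the success branch above you still need to pull the location out of the LUIS result and respond to the user. The snippet below is a minimal sketch of how that could look, assuming your LUIS model defines a "location" entity and reusing the hypothetical getTimeInLocation helper from the earlier sketch; the entities array shown reflects the LUIS v1 JSON response, so check the shape of the body your own model returns:

// Sketch only: find the first entity LUIS tagged as a location
var locationEntity = (body.entities || []).filter(function (entity) {
    return entity.type === "location";
})[0];

if (locationEntity) {
    var time = getTimeInLocation(locationEntity.entity);
    if (time) {
        response.tell('The current time in ' + locationEntity.entity + ' is ' + time);
        return;
    }
}

response.tell('Sorry, I wasn\'t sure which location you wanted to check the time for');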

 

Summary

On the whole I really like the development model for Alexa Skills, and the devices really are very good, but if you are looking to implement anything more than basic natural language processing then I think you will find that Alexa struggles unless you put in almost every combination of phrase that the user might say. I am hopeful that at some point Amazon will update the Alexa service to offer the same flexibility and continuous training capabilities as LUIS.  For now though, hopefully this post shows you how you might use the Microsoft LUIS service to overcome this issue.  I have only performed limited testing so far, but initial results are pretty good.  Let me know how you get on if you try the above approach, or if you have your own approach I would love to hear about that too!

2 thoughts to “Making Amazon Alexa smarter with Microsoft Cognitive Services”

  1. Hi,

     I am stuck with the same problem you attempt to address here. Your solution works only if all possible user queries are made available in the "phrase" slot. The interaction model will not return the user-entered phrase if the phrase is not present in the slot. Have you found any way to solve this since you wrote this blog?
