Build Bot with Voice (Skype)


In the previous post "BUILD BOT with Microsoft Bot Framework Rest Api", I described the raw REST API behind a text messaging bot. The SDK is built on this REST API, so learning the REST API helps you understand the underlying technology.
In this post, I show the same for a calling (voice) bot instead of a text messaging bot. I hope this helps you understand the essence of voice communication in Microsoft Bot Framework.

With Skype, you (developers) can provide a calling (voice) bot using Microsoft Bot Framework. (In Microsoft Bot Framework, the calling bot is supported only by the Skype channel.)

Note : If you use the SDK (.NET, Node.js), you can refer to the official documentation for a step-by-step tutorial on the calling bot.

Note : Currently only the en-US culture is supported in the calling bot. Therefore, if you want to handle ja-JP (Japanese), please use the play and record communication described later in this post.

Settings

The prerequisite settings are the same as for a normal chatbot in Microsoft Bot Framework. (Please refer to the documentation.)
But when you create a calling bot, you need some additional settings for calling. First, open the bot settings page, go to the Skype channel configuration page, and enable the "Calls" feature in the Skype channel.
You must also set the call endpoint (which I explain later) in these settings.

Authentication

When your bot communicates through the webhook, every request has an Authorization header like the following, and you must verify this token each time. For how to verify this token, please refer to the previous post "BUILD BOT with Microsoft Bot Framework Rest Api".

POST https://example.com/yourbot/callback
Accept: application/json
Authorization: Bearer eyJ0eXAiOi...
Content-Type: application/json; charset=utf-8

...
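
As a small illustration of the first step, here is a minimal Python sketch that pulls the bearer token out of the request headers. The function name extract_bearer_token is my own (hypothetical), and the sketch only extracts the token; it does not verify it. The verification itself is covered in the previous post.

```python
from typing import Mapping, Optional


def extract_bearer_token(headers: Mapping[str, str]) -> Optional[str]:
    """Return the token from an 'Authorization: Bearer ...' header, or None."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    value = normalised.get("authorization", "")
    scheme, _, token = value.strip().partition(" ")
    if scheme.lower() != "bearer" or not token.strip():
        return None
    return token.strip()


# The token still has to be verified before the request is trusted.
token = extract_bearer_token({"Authorization": "Bearer eyJ0eXAiOi..."})
print(token is not None)
```

If the function returns None, the bot should reject the request before looking at the body.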

Call endpoint and Callback endpoint

Calling (voice) communication needs two endpoints on your bot side: one is the "call" endpoint and the other is the "callback" endpoint.

When the user starts to communicate with your bot, the first webhook arrives at your call endpoint as the following HTTP request.

POST https://example.com/yourbot/call
Authorization: Bearer eyJ0eXAiOi...
Content-Type: application/json; charset=utf-8

{
  "id": "13107307-7bd6-4c5e-9a1b-65b98464cee6",
  "participants": [
    {
      "identity": "29:12BgCTOvVtWCWb0LlRkes7g428GXh_A4Gl9qbfce7YteH4zcD5pqSlQB-OMF1MVRM",
      "languageId": "en",
      "originator": true
    },
    {
      "identity": "28:d82c7c25-ddb4-426a-8b59-76a8a034abb4",
      "originator": false
    }
  ],
  "isMultiparty": false,
  "presentedModalityTypes": [
    "audio"
  ],
  "callState": "incoming"
}
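
To show how a bot might read this request, the following Python sketch picks the caller out of the participants list by using the "originator" flag from the body above. The function name find_originator and the shortened identity values are placeholders of mine, not part of the API.

```python
from typing import Any, Dict, Optional


def find_originator(call: Dict[str, Any]) -> Optional[str]:
    """Return the identity of the participant who started the call, if any."""
    for participant in call.get("participants", []):
        if participant.get("originator"):
            return participant.get("identity")
    return None


# Trimmed-down version of the incoming call body (identities are placeholders).
incoming_call = {
    "id": "13107307-7bd6-4c5e-9a1b-65b98464cee6",
    "participants": [
        {"identity": "29:example-caller", "languageId": "en", "originator": True},
        {"identity": "28:example-bot", "originator": False},
    ],
    "callState": "incoming",
}

caller = find_originator(incoming_call)
print(caller)
```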

When your bot accepts this call request, it returns the callback endpoint in the response body.

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8

{
  "links": {
    "callback": "https://example.com/yourbot/callback"
  },
  ... (here, explain later)
  
}

After that, the Bot Framework (or Skype Bot Platform) calls this callback endpoint for subsequent communication. After the call is accepted, all requests (webhooks) in this conversation are sent to this callback URL.

In the Skype configuration settings page which I explained previously, you must set the "call" endpoint in the "Calling Webhook" textbox.

Key concepts - Actions and Outcomes

Every event is notified through the callback webhook, and your bot replies in the corresponding HTTP response. (the request-reply pattern)
When your bot wants to perform some operation, it must set that operation in the HTTP response as an "action". When the user has responded to this action (or some system event has occurred), the event is included as the "operation outcome" in the next callback webhook (HTTP request body). That is, each outcome corresponds to a specific action.

The types of actions (and their outcomes) are: answer (answerOutcome), playPrompt (playPromptOutcome), recognize (recognizeOutcome), record (recordOutcome), reject (rejectOutcome), and hangup (hangupOutcome).

Why is this style of communication used?
When we created a chatbot using only text messages (see "BUILD BOT with Microsoft Bot Framework Rest Api"), the communication pattern could be essentially one-way. Imagine an alarm bot: this sort of bot stays idle while waiting and is triggered only when some event occurs.
But a calling (voice) bot stays connected until hang-up. That is the reason the calling bot uses the request-reply pattern with "actions" and "outcomes", and the two sides keep communicating with each other until hang-up.

Now, let's see the communication flow!

First of all, when your bot accepts the initial call request (described above), it replies with the following "answer" action. (If your bot refuses the call, use the "reject" action instead.)

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8

{
  "links": {
    "callback": "https://example.com/yourbot/callback"
  },
  "actions": [
    {
      "operationId": "673048f9-4442-440b-93a3-faa7433c977a",
      "action": "answer",
      "acceptModalityTypes": [
        "audio"
      ]
    }
  ],
  "notificationSubscriptions": [
    "callStateChange"
  ]
}

You can include multiple actions in one HTTP response. For example, if your bot accepts the initial call request and also plays a message to the user, it can send the following HTTP response.
When you use the "playPrompt" action with a text value, Microsoft Bot Framework (Skype Bot Platform) converts the text to voice (speech) using the built-in text-to-speech engine.

HTTP/1.1 200 OK
Content-Length: 383
Content-Type: application/json; charset=utf-8

{
  "links": {
    "callback": "https://example.com/yourbot/callback"
  },
  "actions": [
    {
      "operationId": "673048f9-4442-440b-93a3-faa7433c977a",
      "action": "answer",
      "acceptModalityTypes": [
        "audio"
      ]
    },
    {
      "operationId": "030eeb97-8210-48fd-b497-d761154f0b5a",
      "action": "playPrompt",
      "prompts": [
        {
          "value": "Welcome to test bot",
          "voice": "male"
        }
      ]
    }
  ],
  "notificationSubscriptions": [
    "callStateChange"
  ]
}
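
If you build these response bodies in code, a few small helpers keep the shape consistent. The following Python sketch reproduces the response above. The helper names (answer_action, play_prompt_action, build_workflow) are hypothetical, and generating each operationId with uuid4 is my assumption based on the GUID-like values in the samples.

```python
import json
import uuid
from typing import Any, Dict, List


def answer_action() -> Dict[str, Any]:
    """Action that accepts the call with audio only."""
    return {
        "operationId": str(uuid.uuid4()),  # assumption: any unique GUID-like id
        "action": "answer",
        "acceptModalityTypes": ["audio"],
    }


def play_prompt_action(text: str, voice: str = "male") -> Dict[str, Any]:
    """Action that speaks the given text with the built-in text-to-speech."""
    return {
        "operationId": str(uuid.uuid4()),
        "action": "playPrompt",
        "prompts": [{"value": text, "voice": voice}],
    }


def build_workflow(callback_url: str, actions: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Wrap the actions in the response envelope used throughout this post."""
    return {
        "links": {"callback": callback_url},
        "actions": actions,
        "notificationSubscriptions": ["callStateChange"],
    }


body = build_workflow(
    "https://example.com/yourbot/callback",
    [answer_action(), play_prompt_action("Welcome to test bot")],
)
print(json.dumps(body, indent=2))
```

The order of the list matters for the reader of the code: the call is answered first, and the prompt is played after that.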

When the playPrompt action has completed, the Bot Framework calls the following callback (webhook) with the outcome. The "id" of the outcome is the operationId of the playPrompt action, and the "success" outcome means that the prompt was successfully delivered to the user.

POST https://example.com/yourbot/callback
Authorization: Bearer eyJ0eXAiOi...
Content-Type: application/json; charset=utf-8

{
  "id": "13107307-7bd6-4c5e-9a1b-65b98464cee6",
  "operationOutcome": {
    "type": "playPromptOutcome",
    "id": "030eeb97-8210-48fd-b497-d761154f0b5a",
    "outcome": "success"
  },
  "callState": "established"
}
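
Because each outcome carries the operationId of the action that caused it, a bot can remember the actions it has sent and match them when the callback arrives. The following Python sketch shows one way to do that. The pending dictionary and the function match_outcome are my own illustration, and where you keep this state between requests is up to your bot.

```python
from typing import Any, Dict, Optional, Tuple


def match_outcome(
    pending: Dict[str, str], callback_body: Dict[str, Any]
) -> Optional[Tuple[str, str, bool]]:
    """Return (action name, outcome type, succeeded) for a known operation."""
    outcome = callback_body.get("operationOutcome")
    if not outcome:
        return None  # this request carries no operation outcome
    # Remove the operation from the pending set once its outcome has arrived.
    action = pending.pop(outcome.get("id"), None)
    if action is None:
        return None  # outcome for an operation this bot did not record
    return action, outcome.get("type", ""), outcome.get("outcome") == "success"


# operationId of the playPrompt action sent earlier, mapped to its action name.
pending = {"030eeb97-8210-48fd-b497-d761154f0b5a": "playPrompt"}

callback_body = {
    "id": "13107307-7bd6-4c5e-9a1b-65b98464cee6",
    "operationOutcome": {
        "type": "playPromptOutcome",
        "id": "030eeb97-8210-48fd-b497-d761154f0b5a",
        "outcome": "success",
    },
    "callState": "established",
}

result = match_outcome(pending, callback_body)
print(result)
```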

If your bot wants to ask the user something, it uses the "recognize" action in the HTTP response.
For example, the following asks the user to choose by pressing a dial pad digit.

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8

{
  "links": {
    "callback": "https://example.com/yourbot/callback"
  },
  "actions": [
    {
      "operationId": "32e1a166-f557-4b22-9fd7-64a742d5f040",
      "action": "recognize",
      "playPrompt": {
        "operationId": "bab59923-63d2-48a0-9d34-c0fbadb54435",
        "action": "playPrompt",
        "prompts": [
          {
            "value": "If you want to report technical issues, press 1. If you want to ask about our products, press 2.",
            "voice": "male"
          }
        ]
      },
      "bargeInAllowed": true,
      "choices": [
        {
          "name": "1",
          "dtmfVariation": "1"
        },
        {
          "name": "2",
          "dtmfVariation": "2"
        }
      ]
    }
  ],
  "notificationSubscriptions": [
    "callStateChange"
  ]
}
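
The "recognize" action above has a regular shape, so it can be generated from a list of digits. The following Python sketch builds only the action (not the whole response). The function name dtmf_menu is hypothetical, and the uuid4 operationIds are my assumption.

```python
import uuid
from typing import Any, Dict, List


def dtmf_menu(prompt_text: str, digits: List[str], voice: str = "male") -> Dict[str, Any]:
    """Build a 'recognize' action that offers one choice per dial pad digit."""
    return {
        "operationId": str(uuid.uuid4()),
        "action": "recognize",
        # The nested playPrompt is spoken before the bot waits for the digit.
        "playPrompt": {
            "operationId": str(uuid.uuid4()),
            "action": "playPrompt",
            "prompts": [{"value": prompt_text, "voice": voice}],
        },
        "bargeInAllowed": True,
        "choices": [{"name": digit, "dtmfVariation": digit} for digit in digits],
    }


menu = dtmf_menu("For technical issues, press 1. For products, press 2.", ["1", "2"])
print([choice["name"] for choice in menu["choices"]])
```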

When your bot wants to offer a choice by voice (not dial pad digits), it sends the following HTTP response. In this example, if the user says "yes" or "okay", "Yes" is returned as the chosen result.

Note that this is not general speech recognition (speech-to-text), but a choice by voice. If the user says other words, your bot cannot recognize them.
If your bot needs speech-to-text itself, please use the recording capability which I explain later.

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8

{
  "links": {
    "callback": "https://example.com/yourbot/callback"
  },
  "actions": [
    {
      "operationId": "9a109320-7f30-46f4-a894-a47a4eb8b398",
      "action": "recognize",
      "playPrompt": {
        "operationId": "01105459-d549-4327-b85d-0b0c94b62e8e",
        "action": "playPrompt",
        "prompts": [
          {
            "value": "Please answer yes or no.",
            "voice": "male"
          }
        ]
      },
      "bargeInAllowed": true,
      "choices": [
        {
          "name": "Yes",
          "speechVariation": [
            "Yes",
            "Okay"
          ]
        },
        {
          "name": "No",
          "speechVariation": [
            "No",
            "None"
          ]
        }
      ]
    }
  ],
  "notificationSubscriptions": [
    "callStateChange"
  ]
}
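
To make the relation between the spoken variations and the returned choice name concrete, here is a small Python sketch. speech_choices builds the "choices" array shown above, and simulate_choice is only a local simulation of the mapping for testing your own bot logic. It is not the recognizer of the platform, and both names are hypothetical.

```python
from typing import Any, Dict, List, Optional


def speech_choices(variations: Dict[str, List[str]]) -> List[Dict[str, Any]]:
    """Build the 'choices' array from {choice name: spoken variations}."""
    return [
        {"name": name, "speechVariation": list(words)}
        for name, words in variations.items()
    ]


def simulate_choice(choices: List[Dict[str, Any]], spoken: str) -> Optional[str]:
    """Local stand-in: return the choice name whose variation matches the word."""
    normalised = spoken.strip().lower()
    for choice in choices:
        if normalised in (word.lower() for word in choice["speechVariation"]):
            return choice["name"]
    return None  # any other word is not recognised


choices = speech_choices({"Yes": ["Yes", "Okay"], "No": ["No", "None"]})
print(simulate_choice(choices, "okay"))
```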

When the user has responded to the "recognize" action, the following callback (webhook) is called. This is an example of the dial pad digit choice.

POST https://example.com/yourbot/callback
Authorization: Bearer eyJ0eXAiOi...
Content-Type: application/json; charset=utf-8

{
  "id": "13107307-7bd6-4c5e-9a1b-65b98464cee6",
  "operationOutcome": {
    "type": "recognizeOutcome",
    "id": "09c78c7c-33fc-488c-808b-0db83de1b433",
    "choiceOutcome": {
      "completionReason": "dtmfOptionMatched",
      "choiceName": "1"
    },
    "outcome": "success"
  },
  "callState": "established"
}

If the user takes too long to respond, the following system event is returned as the recognize outcome.

POST https://example.com/yourbot/callback
Authorization: Bearer eyJ0eXAiOi...
Content-Type: application/json; charset=utf-8

{
  "id": "13107307-7bd6-4c5e-9a1b-65b98464cee6",
  "operationOutcome": {
    "type": "recognizeOutcome",
    "id": "32e1a166-f557-4b22-9fd7-64a742d5f040",
    "choiceOutcome": {
      "completionReason": "initialSilenceTimeout"
    },
    "outcome": "failure",
    "failureReason": "InitialSilenceTimeout"
  },
  "callState": "established"
}
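
A bot usually has to handle both of the recognize outcomes above: the matched choice and the failure. The following Python sketch reads only the fields that appear in these two samples. The function name interpret_recognize and the "selected" / "failed" labels are my own.

```python
from typing import Any, Dict, Tuple


def interpret_recognize(callback_body: Dict[str, Any]) -> Tuple[str, str]:
    """Return ('selected', choice name) or ('failed', reason)."""
    outcome = callback_body.get("operationOutcome") or {}
    if outcome.get("type") != "recognizeOutcome":
        raise ValueError("not a recognizeOutcome")
    choice = outcome.get("choiceOutcome") or {}
    if outcome.get("outcome") == "success":
        return "selected", choice.get("choiceName", "")
    # On failure, prefer the reason in choiceOutcome, then the top-level one.
    reason = choice.get("completionReason") or outcome.get("failureReason") or "unknown"
    return "failed", reason


matched = {
    "operationOutcome": {
        "type": "recognizeOutcome",
        "choiceOutcome": {"completionReason": "dtmfOptionMatched", "choiceName": "1"},
        "outcome": "success",
    }
}
timed_out = {
    "operationOutcome": {
        "type": "recognizeOutcome",
        "choiceOutcome": {"completionReason": "initialSilenceTimeout"},
        "outcome": "failure",
        "failureReason": "InitialSilenceTimeout",
    }
}

print(interpret_recognize(matched))
print(interpret_recognize(timed_out))
```

On a failure like the timeout, the bot can, for example, reply with the same "recognize" action again or hang up.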

If your bot wants to hang up (disconnect), your bot sends the following HTTP response with the "hangup" action.

HTTP/1.1 200 OK
Cache-Control: no-cache
Content-Type: application/json; charset=utf-8

{
  "links": {
    "callback": "https://example.com/yourbot/callback"
  },
  "actions": [
    {
      "operationId": "d2cb708e-f8ab-4aa1-bcf6-b9396afe4b70",
      "action": "hangup"
    }
  ],
  "notificationSubscriptions": [
    "callStateChange"
  ]
}

Conversation with Play / Record

Using Microsoft Bot Framework, your bot can also play media or record audio as binary. These capabilities are widely used in real scenarios: playing music on hold, recording messages for someone, etc.
In particular, you can build interactive conversations with these capabilities as follows. You can also use them when you need high-quality voice guidance.

  • Record user's request (speech) and get binary
  • Call external speech-to-text engine (like Bing Speech API) and get text value
  • Select speech binary for response and play

Let's see these capabilities.
If your bot plays an existing audio file, it sends an HTTP response like the following. As you can see, your bot can use an audio file URI in the playPrompt action instead of a text value.

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8

{
  "links": {
    "callback": "https://example.com/yourbot/callback"
  },
  "actions": [
    {
      "operationId": "d18e7f63-0400-48ff-964b-302cf4910dd3",
      "action": "playPrompt",
      "prompts": [
        {
          "fileUri": "http://example.com/test.wma"
        }
      ]
    }
  ],
  "notificationSubscriptions": [
    "callStateChange"
  ]
}

If you want to record the user's response, send the "record" action. (The recording starts!)

HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8

{
  "links": {
    "callback": "https://example.com/yourbot/callback"
  },
  "actions": [
    {
      "operationId": "efe617d7-4de5-42e9-b4e1-90dfd5850e49",
      "action": "record",
      "playPrompt": {
        "operationId": "e2381379-20b0-4fae-b341-afcdd8187323",
        "action": "playPrompt",
        "prompts": [
          {
            "value": "What is your flight ?",
            "voice": "male"
          }
        ]
      },
      "maxDurationInSeconds": 10,
      "initialSilenceTimeoutInSeconds": 5,
      "maxSilenceTimeoutInSeconds": 2,
      "playBeep": true,
      "stopTones": [
        "#"
      ]
    }
  ],
  "notificationSubscriptions": [
    "callStateChange"
  ]
}

When the recording has completed, your bot receives the following "record" outcome in MIME multipart format.
Your bot can retrieve the audio binary from this raw data and proceed with subsequent operations.

For example, Microsoft Bot Framework does not have a speech recognition feature (speech-to-text) itself, but you can get the text with an external speech recognition service (like Bing Speech API), and you might then apply language understanding using LUIS (Language Understanding Intelligent Service).

POST https://example.com/yourbot/callback
Authorization: Bearer eyJ0eXAiOi...
Content-Type: multipart/form-data; boundary="test-0123"

--test-0123
Content-Type: application/json; charset=utf-8
Content-Disposition: form-data; name=conversationResult

{
  "id": "13107307-7bd6-4c5e-9a1b-65b98464cee6",
  "operationOutcome": {
    "type": "recordOutcome",
    "id": "efe617d7-4de5-42e9-b4e1-90dfd5850e49",
    "completionReason": "completedSilenceDetected",
    "lengthOfRecordingInSecs": 5.0459999999999994,
    "format": "wma",
    "outcome": "success"
  },
  "callState": "established"
}
--test-0123
Content-Type:audio/x-ms-wma
Content-Disposition:form-data; name=recordedAudio

(This is the audio binary ...)
--test-0123--
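
To retrieve the JSON and the audio binary from such a request, you need a multipart parser. The following is a minimal Python sketch that uses only the standard library email package. The function name parse_record_callback and the fake audio bytes are my own, and I assume that the part names are "conversationResult" and "recordedAudio" as in the sample above. Please check it against real recordings before relying on it.

```python
import json
from email import policy
from email.parser import BytesParser
from typing import Any, Dict, Optional, Tuple


def parse_record_callback(content_type: str, body: bytes) -> Tuple[Dict[str, Any], Optional[bytes]]:
    """Split the multipart callback into (JSON result, recorded audio bytes)."""
    # The email parser expects headers first, so prepend the HTTP Content-Type.
    prefix = ("Content-Type: " + content_type + "\r\nMIME-Version: 1.0\r\n\r\n").encode("ascii")
    message = BytesParser(policy=policy.default).parsebytes(prefix + body)
    if not message.is_multipart():
        raise ValueError("expected a multipart body")
    result: Dict[str, Any] = {}
    audio: Optional[bytes] = None
    for part in message.iter_parts():
        name = part.get_param("name", header="content-disposition")
        payload = part.get_payload(decode=True)
        if name == "conversationResult":
            result = json.loads(payload.decode("utf-8"))
        elif name == "recordedAudio":
            audio = payload
    return result, audio


fake_audio = b"\x00\x01fake-wma-bytes\xff\xfe"  # placeholder, not real audio
sample_body = b"\r\n".join([
    b"--test-0123",
    b"Content-Type: application/json; charset=utf-8",
    b"Content-Disposition: form-data; name=conversationResult",
    b"",
    json.dumps({"operationOutcome": {"type": "recordOutcome", "format": "wma"}}).encode("utf-8"),
    b"--test-0123",
    b"Content-Type:audio/x-ms-wma",
    b"Content-Disposition:form-data; name=recordedAudio",
    b"",
    fake_audio,
    b"--test-0123--",
    b"",
])

result, audio = parse_record_callback('multipart/form-data; boundary="test-0123"', sample_body)
print(result["operationOutcome"]["type"], len(audio or b""))
```

The returned audio bytes are what you would pass to an external speech-to-text service.
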
Comments (2)

  1. Hello!

    I'm struggling quite hard to get it work on a C# project. I tried to send it to Bing Speech API but I never get a response back. I am not sure what's going on. It lags too much.

    Can you please share any of the samples if you have one?

  2. Daniel says:

    How to get any user data like skype-name or session-id?
    Is there any constant data during the conversation until hang-up?