[MUSIC PLAYING]

SABA ZAIDI: My husband and I have been in a somewhat long-distance relationship for a few years now. We talk on the phone and message a lot. But perhaps what's most enjoyable for me is when I visit him. I leave little notes all over his apartment. And sometimes when I'm home alone, he'll start changing the color of the smart lights in my living room to let me know that he's thinking of me. Our human conversations take a lot of different forms. We don't just talk verbally. Depending on where we are and what we're trying to say, we might write, hold hands, or even change the color of smart lights. It's these human conversations that have inspired us to create conversations with technology. It's no surprise, then, that we're starting to incorporate more modalities into our digital interactions as well. But with this increased number of devices and complexity of interactions, it might feel overwhelming to design for the Google Assistant. If you look at the Google Assistant today, I can talk to it not just through the speaker in my living room, but also through my car or my headphones. I can tap on my phone or my watch. But how do we design for this increased number and complexity of devices? I'm Saba Zaidi, and I'm an interaction designer on the Google Assistant team. I'll be talking about how you can design actions across surfaces, and give you some frameworks and some tips.

ULAS KIRAZCI: And hi, I'm Ulas. I'm an engineer on Actions on Google. And later on, I'm going to tell you how to build an action using the design principles Saba will talk about.

SABA ZAIDI: OK. So before we can get started to understand how to design actions across surfaces, we need to get a better understanding of what those experiences can look like. Let's start with a journey through a user's day. You wake up in the morning to the sound of an alarm ringing on your Google Home. Without even getting out from under your blanket, you can say, hey, Google, stop. You get up, get ready. And as you're about to head out, you want to make sure you don't need an umbrella. So you turn to the smart display in your hallway and you ask, hey, Google, what's the weather today? You're able to hear a summary, and also see the hourly forecast at a glance. As you're walking to your car, you take out your phone and tap on it to ask the Google Assistant to order your favorite from Starbucks. It's able to connect you to Starbucks, where you can quickly place your order. As you're driving, you want to listen to your favorite podcast or the news. And you can ask the Google Assistant for help in a hands-free way. And it can connect you, for example, to the NPR News update. You go about your day. And when you come back home, it's time to make dinner for your family. You turn to the smart display in your kitchen and you ask, for example, Tasty for help with a recipe, like in this case, pizza bombs. Tasty comes back, and you can hear and see on the screen step-by-step instructions and the ingredients. After dinner, it's time to unwind with your family in the living room. You decide to play a BuzzFeed personality quiz. And it's a great way for your family to get together around a shared device and have some fun. And then you head to bed. You say, hey, Google, good night. It's able to start your custom bedtime routine, which includes, for example, setting your alarms or telling you about your day tomorrow.

So as you can see, users were able to interact with the Google Assistant much like in human conversations, in a lot of different ways and contexts.
And although there were a lot of different devices, there are some overarching principles. So let's take a look at them. First, the experience was familiar. Whether it was morning, evening, or the commute, users can access their favorite Google Assistant actions whenever, wherever they need them. The Assistant was available in different contexts. You saw it being used on the go and at home, up close and from a distance, and in shared and private settings. So as you're thinking about your actions, think about all the different contexts they may be used in. And lastly, different devices lend themselves to different modes of interaction. Some are voice only, some are visual, and some are a mix of both. And we'll talk a bit more about the strengths and weaknesses of each of these modalities in a bit.

First, let's take a deeper look at a couple of those devices and see how these principles apply. You've already heard about smart displays; they were announced earlier this year. And even though it's a new device, users can expect it to feel familiar. It's essentially like a Google Home with a screen. It's designed to be used at home, from a distance, and as a shared device. And as you can see in this example action, even though there's a screen, users still interact with the device through voice. They don't have to tap through complex app navigation. Instead, the visuals are designed to be seen at a distance. And of course, the user can walk up to the screen and touch it if they want. Next, let's take a quick look at phones. One thing to note here is that we're making phones more visually assistive as well, similar to smart displays, allowing for a greater focus on the content. These devices, as you know, are great for use cases on the go, up close, and in a private environment. And users can interact with the Google Assistant on the phone through both voice and visuals.

So hopefully that gave you a better sense of what the experiences on the Google Assistant look like across devices. Now, in order to design for so many devices, it helps to have a vocabulary for categorizing them. At Google, we use a design framework called the multimodal spectrum. It helps us categorize devices based on their interaction types. On one end, you have voice-only devices, like the Google Home and other smart speakers, that you have to hear or talk to. On the other end, you have visual-only devices, like a phone or a Chromebook that's on mute, and most watches. So you have to look at these devices or touch them. And in the middle, you have what we call multimodal devices, in that they're a mix of both. Cars and smart displays, which rely primarily on voice but have optional visuals, are known as voice-forward devices. Phones and Chromebooks with audio on, which can use a mix of both voice and visuals, are known as intermodal devices.

So now we have a vocabulary for categorizing these devices. But before we can start designing for them, it helps to understand the strengths and weaknesses of each of these modalities. Let's talk about voice first. Voice is great for natural input. We've been using it for millennia. Whether you're a kid, a senior, or someone who's not tech savvy, it's still really intuitive. It's great for hands-free, far-field use cases, like setting a timer in the kitchen while you're cooking. And it helps reduce task navigation.
So for example, if you were out on a run, you could ask your favorite fitness action about your workouts, instead of having to pull out your phone, navigate to that app, and search for that answer. Similarly, you could ask the Google Assistant to play the next song without having to mess with any controls. So voice has a lot of benefits. It's great. But it does have some limitations. And that's where visuals come in.

Think about the last time you were at a cafe. You probably walked past all the pastries, looked at the menu, and made eye contact with the cashier. Think about how difficult that interaction would be if it was just through voice. Have a listen to what the menu would sound like through voice alone.

[AUDIO PLAYBACK]
- Espresso, latte, vanilla latte, cappuccino, mocha, Americano, flat white, hot chocolate, black coffee, and tea.
[END PLAYBACK]

SABA ZAIDI: That feels pretty overwhelming, right? It's like watching all the options go by and trying to catch the right one. It's like looking at a ticker. Voice is ephemeral and linear. And that makes it very difficult to hold a lot of information in your head. By contrast, the menu is a lot easier to scan if it looks like this. And you can imagine that the problem gets compounded if you also had to compare prices and calories. So visuals are great for scanning and comparing. We also use them a lot to reference objects in the world. So I can look at all the baked goods and then point to the one that I want, instead of, for example, having to hear or say out loud something like, that small sugar cookie with a chocolate drizzle. So voice and visuals both have their benefits. And it often is useful to use both. In this example, we usually prefer to look at the menu, but then talk to the cashier in order to check out and pay. The same benefits of using both voice and visuals exist in the digital world as well. And that's what makes multimodal devices, like phones and smart displays, such a unique opportunity. By leveraging the best of both voice and visuals, they're able to provide really rich interactions.

But how do we design for them along with designing for speakers? One thing to keep in mind, again, is to start with a human conversation. You might have an app already, but avoid the temptation to duplicate it. Instead, try to observe a relevant conversation in the real world, or role-play it with a colleague, and write down that dialogue. You'll realize that not everything that's in your app works well as conversation, or vice versa. Instead, think of your action as a companion to your app that's faster in certain use cases. I won't go into detail on how to write good dialogue or create a persona, but I highly recommend you check out our brand-new conversation design website at that link there. It goes into great detail on how to get started.

So we've learned that we need to create spoken dialogue and then add visuals to it. So let's take a look at an example of how to do that and how that helps us scale. If you haven't already, I'd encourage you to check out the Google I/O 2018 action that helps you learn more about this event. We started by writing spoken dialogue as if it was for a voice-only device, like a Google Home. And it includes turns like this one. So a user can say, browse sessions. And we respond with a spoken response like, here are some of the topics left for today, and so on.
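For reference, here's a minimal sketch of what a spoken-only turn like this might look like in fulfillment code, assuming the actions-on-google Node.js client library (v2) with Dialogflow; the intent name and prompt text are illustrative, not the action's actual code.

```js
// Minimal sketch of a spoken-only conversational turn (illustrative names).
const { dialogflow } = require('actions-on-google');
const functions = require('firebase-functions');

const app = dialogflow();

// Hypothetical Dialogflow intent matched by a query like "browse sessions".
app.intent('Browse Sessions', (conv) => {
  // The spoken prompt is written first, as if for a voice-only device.
  conv.ask('Here are some of the topics left for today: Android, web, ' +
    'and machine learning. Which are you interested in?');
});

exports.fulfillment = functions.https.onRequest(app);
```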
Now, in order to take this dialogue and scale it, we need to take every turn like this and think about all the ways we can incorporate visual components into it. This would include, for example, display prompts, cards, and suggestion chips. In our example, we can accompany that spoken prompt with a display prompt like, "Which topic are you interested in?" This helps carry the written conversation on a screen. We can add a list of sessions as a card. A user could tap on that, for example. And we could have a suggestion chip like, "None of these." And this helps the user know how to pivot or follow up the conversation.

Once we've constructed our response to have spoken and visual elements, we can then map that response to the multimodal spectrum from earlier. So depending on whether the device has visual or audio capabilities, and how prominent voice is on it, we can choose the right components. You already saw what our response would look like on a Google Home. We would simply have the spoken prompt. Let's take a look at what the response would look like across some of the other devices. A smart display is a voice-forward device. So we still keep the full spoken prompt and let it carry the whole conversation. We don't really need a display prompt anymore, especially if we're going to have better visuals like the list and the chips. A phone, on the other hand, is an intermodal device. And we need to have both the voice and the visuals carry the conversation. In this case, you might notice that we shortened the spoken prompt, because we can direct the user to look at the screen for more details. And Ulas will talk more about how you can do that. And of course, the rest of the visual components are there as well. And finally, if your phone is on silent, we would simply ignore the spoken prompt, and the visual components are able to carry the complete conversation.

So now we've learned that in order to scale our dialogue, we need to write spoken prompts and then add visuals to them. And that helps us map across devices. But how do we know what kinds of visuals to add? I'd like to leave you with five tips for how you can incorporate visuals into your dialogue. For that, let's take this made-up Assistant action called the National Anthem Player on a smart display. As the name suggests, a user can ask for a country, and it will come back with the national anthem for that country. When you invoke this action, it gives you a welcome message that sounds like this.

[AUDIO PLAYBACK]
- Welcome to National Anthem Player. I can play the national anthems from 20 different countries, including the United States, Canada, and the United Kingdom. Which would you like to hear?
[END PLAYBACK]

SABA ZAIDI: So as you can see, the device is currently writing on the screen exactly what it's saying out loud. And this is really a missed opportunity, especially given that by now, we've learned that visuals have some strengths over voice, and that smart displays are great for showing rich, immersive visuals. So tip number one is to consider cards rather than just display prompts. In this case, we've swapped out the words for a carousel. Users can quickly browse through the list and select the country that they want. You'll notice this is similar to the menu example we looked at in the cafe, where visuals are helping someone scan and compare options. Additionally, things like maps, charts, and images are also great on visual devices, because they are difficult to describe through voice, similar to the cookie.
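As a rough sketch of this tip, here's how the made-up National Anthem Player's welcome might pair a spoken prompt with a carousel and suggestion chips instead of a display prompt, assuming the actions-on-google Node.js client library (v2) with Dialogflow; the item keys, synonyms, chip labels, and image URLs are placeholders.

```js
// Sketch: carry the visuals with a carousel rather than echoing the prompt.
const { dialogflow, Carousel, Suggestions, Image } = require('actions-on-google');
const functions = require('firebase-functions');

const app = dialogflow();

app.intent('Default Welcome Intent', (conv) => {
  // The spoken prompt stays conversational.
  conv.ask('Welcome to National Anthem Player. Which country\'s anthem ' +
    'would you like to hear?');
  // A carousel, rather than a display prompt, carries the visuals;
  // the user's selection comes back via the actions_intent_OPTION event.
  // (On voice-only surfaces you'd fall back to a spoken list; capability
  // checks are covered later in the talk.)
  conv.ask(new Carousel({
    items: {
      US: {
        title: 'United States',
        synonyms: ['USA', 'America'],
        image: new Image({ url: 'https://example.com/us-flag.png', alt: 'US flag' }),
      },
      CA: {
        title: 'Canada',
        synonyms: ['CA'],
        image: new Image({ url: 'https://example.com/ca-flag.png', alt: 'Canadian flag' }),
      },
      GB: {
        title: 'United Kingdom',
        synonyms: ['UK', 'Britain'],
        image: new Image({ url: 'https://example.com/uk-flag.png', alt: 'UK flag' }),
      },
    },
  }));
  // Suggestion chips hint at how to follow up or pivot.
  conv.ask(new Suggestions(['Surprise me', 'Most popular']));
});

exports.fulfillment = functions.https.onRequest(app);
```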
Second, consider varying your spoken and your display prompts. This is particularly useful for intermodal devices, which might have a display prompt next to a card, where some of that information might be redundant. So in this case, we're stripping the example countries out of the display prompt, because the card already shows a lot of examples. Third, consider visuals for suggestions. Here, we know that the user is a repeat user. So we're reordering the list so that their most frequently visited countries show up first. We're also using suggestion chips to hint to the user how they can follow up or pivot the conversation. This kind of discovery can be quite difficult through voice alone. Next, you can use visuals to increase your brand expression. We used to allow you to change your voice and choose a logo, but now we're also going to allow you to choose a font and a background image. And Ulas will talk more about how you can do that. As you can see here, the experience looks a lot more custom and immersive. And lastly, visual devices are great for continuing conversations that started on a voice-only device. So for example, if I use this national anthem action on a Google Home and I want to see the lyrics, the action can, after a few steps, send a notification to my phone. And I can take out my phone and read them there. So hopefully those five tips will help you incorporate more visuals into your dialogue.

Let's summarize what we've learned so far on how to design actions for the Assistant. First, users interact with the Google Assistant in a variety of different ways and contexts. This could include at home or on the go, up close or at a distance, and through voice or visuals. In order to design across so many modalities, it helps to keep in mind the multimodal spectrum and think of your responses as having visual components as well as spoken components. And lastly, learn and leverage the strengths of each of these modalities. We learned, for example, that visuals are great for scanning, brand expression, and discovery. So instead of just showing on the screen what you're saying out loud, try using cards. All right. Now, I'm going to hand it over to my colleague, Ulas, who's going to talk about how you can develop these actions.

ULAS KIRAZCI: Thank you, Saba.

[APPLAUSE]

ULAS KIRAZCI: Hi. So we said that the Assistant runs on many types of devices. And in the future, it will run on many others. Luckily, we make it easy for you to build your action in a way that will run well on all of these devices today, as well as devices in the future. So let's go through an example. To walk you through this, I'm going to use a test action I created called California Surf Report, which gives wave height and weather information for beaches in California, for surfers. Currently, I only have spoken responses, no visuals yet. So let's see what this sounds like on a voice-only device like the Google Home.

[AUDIO PLAYBACK]
- --looks fair for most of today. Waves will be from two to three feet in the morning, to three to four feet in the afternoon. Expect waist-high swell in the morning with northwest winds. Shoulder-high surf in the afternoon with southwest winds.
[END PLAYBACK]

ULAS KIRAZCI: OK, great. Pretty informative. So now, let's take a look at what this sounds like and looks like on a voice-forward device, like a smart display. Switch to demo, please. OK, Google. Talk to California Surf Report.

GOOGLE ASSISTANT: OK.
Let's get the test version of California Surf Report.

GOOGLE ASSISTANT: Welcome to California Surf Report.

ULAS KIRAZCI: Tell me the surf report for Santa Cruz.

GOOGLE ASSISTANT: Surf in Santa Cruz Beach looks fair for most of today. Waves will be from two to three feet in the morning, to three to four feet in the afternoon.

ULAS KIRAZCI: OK. I cut it short because I think we all get the idea. So as you can see, spoken responses are a good way to get started and work well on many devices. But when we have a screen, we can make it a lot better. So let's see how. One of the best visuals we can add, and one of the easiest, is a BasicCard. And here's an example from the Node.js client library of how to add a BasicCard to your responses. We start with a spoken prompt as usual, and the second ask statement adds a BasicCard. A BasicCard can have a title, subtitle, body text, and an optional image. So let's see what this looks like on our smart display again. Let's switch to the demo, please. OK, Google. Show me the surf report for Santa Cruz.

GOOGLE ASSISTANT: Surf in Santa Cruz Beach looks fair for most of today. Waves will be from two to three feet in the morning, to three to four feet in the afternoon. Expect waist-high swell in the morning with northwest winds, shoulder-high surf in the afternoon with southwest winds.

ULAS KIRAZCI: OK, great. Now it looks much better. We have a nice visual for it rather than huge lines of text. And here are some other kinds of cards you can use. There's a carousel card, which allows you to display a set of things that the user can choose from. There's also a newly introduced table card. Another great way to add visuals to your action is to use suggestion chips. Suggestion chips allow the user to understand what they can do at this turn in the conversation. And they also simplify user input. You can learn more about responses at the link. And by the way, all of this, as I promised, works equally well on an intermodal device, like a phone. As you can see, we have formatted the font sizes and the layout to fit the intermodal form factor.

OK, great. So next, maybe what we want to do is shorten the spoken response a bit, because it's a bit repetitive with the information that's already on the card. Users can just look at the display for those details. So how do we do this? We have a feature in the API called capabilities. So instead of thinking, if Google Home, do this, else if smart display, do that, think about what capabilities the surface the user is interacting with you on has. Does it have a screen? Can it output audio? The capabilities of the device are reported to you in every webhook call, so you know what they are on every conversation turn. And here's a sample list of capabilities that we support. And you can learn more at the link. So in our use case, what we're looking for is the screen output capability. This indicates that the user's device has a screen, so we can show them a card. Oh, and by the way, if you don't want your responses to differentiate between devices with displays and voice-only ones, you can always add a card, and we'll just strip it out for you silently. So this makes it easy for you to build if you don't want to differentiate. And here's, again, the Node.js client library snippet that shows how to use this. In the first if statement, we determine that the user's device does not have a screen. So we have the full content in the spoken response. In the else clause, we know that there is a screen.
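For reference, here's a rough reconstruction of that kind of handler (the actual slide code isn't shown in this transcript), assuming the actions-on-google Node.js client library (v2); the intent name, prompt wording, and image URL are placeholders.

```js
// Sketch: branch on the SCREEN_OUTPUT capability of the current surface.
const { dialogflow, BasicCard, Image } = require('actions-on-google');
const functions = require('firebase-functions');

const app = dialogflow();

app.intent('Surf Report', (conv) => {
  const hasScreen =
    conv.surface.capabilities.has('actions.capability.SCREEN_OUTPUT');

  if (!hasScreen) {
    // Voice-only device: the spoken prompt carries the full report.
    conv.ask('Surf in Santa Cruz Beach looks fair for most of today. ' +
      'Waves will be from two to three feet in the morning, to three to ' +
      'four feet in the afternoon. Expect waist-high swell in the morning ' +
      'with northwest winds, shoulder-high surf in the afternoon with ' +
      'southwest winds.');
    return;
  }

  // There's a screen: shorten the spoken prompt and lead the user to it...
  conv.ask('Surf in Santa Cruz Beach looks fair for most of today, with ' +
    'two to three foot waves in the morning and three to four foot waves ' +
    'in the afternoon. Here\'s the report.');
  // ...and let a BasicCard carry the details.
  conv.ask(new BasicCard({
    title: 'Santa Cruz Beach',
    subtitle: 'Fair conditions',
    text: 'Morning: 2-3 ft, waist-high swell, NW winds  \n' +
      'Afternoon: 3-4 ft, shoulder-high surf, SW winds',
    image: new Image({
      url: 'https://example.com/santa-cruz-surf.png',
      alt: 'Santa Cruz surf conditions',
    }),
  }));
});

exports.fulfillment = functions.https.onRequest(app);
```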
So we shorten the spoken response and end it with a phrase like, "Here's the report," to lead the user to the screen. And then we append the BasicCard to the response. So let's see how this looks on our smart display now. Switch to demo, please. OK, Google. Show me the shortened surf report for Santa Cruz.

GOOGLE ASSISTANT: Surf in Santa Cruz Beach looks fair for most of today, with two to three foot waves in the morning and three to four foot waves in the afternoon. Here's the report.

ULAS KIRAZCI: So that sounds a lot more concise and user friendly. Another way you can use capabilities is to require that your action only run on devices that have a certain capability. This is what we call static capabilities. And you can configure these through the Actions on Google console, as you can see here. But only use this if your action absolutely makes no sense without that capability. So for example, the National Anthem Player action that Saba talked about would not make sense on a device without audio. So that would be a good place to use it. However, the surf report app works equally well on voice-only and display-only devices. So it wouldn't be a good place to use this. You can configure all of this using the Actions on Google developer console.

Another high-quality and easy way to target multiple surfaces is to use client library features we call helpers. So far, I've been asking California Surf Report for the surf report with a beach name. But if I don't say the beach name, I get a prompt, "Which beach?" Now, this doesn't tell me what I can say. It doesn't tell me which beaches this action actually supports. So we can fix that with a helper called askWithCarousel. What askWithCarousel does is present the user with a list of options to pick from, and associate visuals with each item. In addition, when the user utters a query to select an item, Google does the matching of the query to the item. So we can deal much better with variations in how people pick things. So let's make our prompt better with the askWithCarousel helper. And again, here's the Node.js library snippet. We start with the spoken response for the prompt, and we add a carousel to it. The carousel is made up of items. And each item has a list of phrases that you think the user might say to match that item, and visuals associated with it, so the user can understand what they're about to tap on. So let's switch to the demo and see what this looks like. OK, Google. Show me the surf report.

GOOGLE ASSISTANT: Which beach do you want to see the report for?

ULAS KIRAZCI: OK, so this is the example where I'm a little confused as a user. OK, Google. Show me the beach carousel.

GOOGLE ASSISTANT: Which beach would you like to know about?

ULAS KIRAZCI: OK, great. Now it's much easier for me to understand what the possible options are, and even tap on one if I want to go with that. And we continuously improve the experience with these helpers. That's one of the advantages of helpers: we continue to modify them and optimize them for new surfaces.

Now, since we launched the current API last year, we've come up with smart displays. As you noticed, on smart displays each conversation turn takes up the entire screen. So given this fact, maybe we can make our visuals more branded and give them a little bit more flair. So we're introducing styling options this year. Here's how it works. Let's switch to the demo, please. So here is a new tab in the Actions on Google console called theme customization.
You can modify the background color, the primary color-- that's the font color of the text-- and the typography, and even set a background image. So let's try a few things here. Let's say we want to make this cursive. Let's add a background image. OK, so this is the landscape aspect ratio image. And then we want to add a portrait image as well. All right, all set. Now, all we have to do is save. And then we click Test right here to update our test version. All right, now let's see what this looks like on the demo. OK, Google. Show me the surf report for Santa Cruz.

GOOGLE ASSISTANT: Surf in Santa Cruz Beach looks fair for most of today, with two to three foot waves in the morning and three to four foot waves in the afternoon. Here's the report.

ULAS KIRAZCI: I think that looks really beautiful now. Smart displays are coming out later this year. However, you can start building your action against these visuals today, using the updated simulator. We've added a new simulator device type for smart displays, as you can see here. And we've also added a display tab, which shows you the full-screen version of what you would get on a smart display. You can also use this with a phone. And on the left side, as usual, you have the spoken responses, as well as an input box where you can put user queries.

One last thing. We said that the Assistant is in many places. So if the user is interacting with your action using a voice-only device, maybe they also have a device that has a display on it, for example, a phone. So what if, in your current turn in the conversation, you really want your response to display something? For example, in the surf report action, the user might ask us for the full report. And we want to return the hour-by-hour wave height graph. So how do we do that? There's a feature in the API called multi-surface conversations. And here's how it works. In each API call to your webhook, we report not only the capabilities of the device that the user is currently using, but also the union of the capabilities of all the devices the user owns. So in this example, what you see is that the current device only has a voice output capability and has no screen. But in the available surfaces, we can see the screen output capability. So the user seems to have another device with a screen on it. So how do we use this? Again, the client library has a function to help you inspect whether the user has a certain capability among their devices. Now, we've determined that the user has a device with this capability. How do we transfer the user to the other device? We have a function, ask for new surface, that does this. And you can give it a notification that will appear on the target device, in addition to the list of capabilities you require for continuing your conversation. I'm not going to demo this, but here's what it looks like. So let's say the user said, show me the full report. And they're talking to you on a Google Home. So you would call the ask-for-new-surface function that I showed you earlier. And we ask the user for permission to send the conversation over to their phone. If the user accepts, then a notification is sent to the new device. The conversation ends on the current device. And when the user taps the notification, they resume the conversation from where you left off, like this. Note that this is not just for single responses. This works equally well when you want to continue the conversation.
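For reference, here's a minimal sketch of that hand-off, assuming the actions-on-google Node.js client library (v2); the intent names, prompts, and report contents are illustrative rather than the demo's actual code.

```js
// Sketch: check available surfaces and ask to move to one with a screen.
const { dialogflow, NewSurface, BasicCard } = require('actions-on-google');
const functions = require('firebase-functions');

const app = dialogflow();

app.intent('Full Surf Report', (conv) => {
  const screenHere =
    conv.surface.capabilities.has('actions.capability.SCREEN_OUTPUT');
  const screenAvailable =
    conv.available.surfaces.capabilities.has('actions.capability.SCREEN_OUTPUT');

  if (!screenHere && screenAvailable) {
    // Ask the user's permission to move the conversation to a device
    // with a screen; the notification appears on the target device.
    conv.ask(new NewSurface({
      context: 'The full report has an hour-by-hour wave height graph.',
      notification: 'Full surf report',
      capabilities: ['actions.capability.SCREEN_OUTPUT'],
    }));
  } else {
    conv.ask('Here is the full report.');
  }
});

// In Dialogflow, this intent is triggered by the actions_intent_NEW_SURFACE
// event once the user responds to the transfer request.
app.intent('Full Surf Report - New Surface', (conv) => {
  const result = conv.arguments.get('NEW_SURFACE');
  if (result && result.status === 'OK') {
    // The conversation continues here on the new device, with full context.
    conv.ask('Here is the full report.');
    conv.ask(new BasicCard({
      title: 'Santa Cruz: hour-by-hour wave height',
      text: 'Morning: 2-3 ft  \nAfternoon: 3-4 ft',
    }));
  } else {
    conv.ask('No problem. Anything else?');
  }
});

exports.fulfillment = functions.https.onRequest(app);
```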
So we bring the full context over so you can continue from where you left off. So to sum up, we've built a lot of features in the API for you to add visuals to your responses. So please use them. And we take your responses and optimize them as well as possible for all these surfaces, and for future surfaces, without extra work from you. And if you want to customize your responses, always think of capabilities, not individual device types. This way, we can run your action on new devices without any extra work from you.

SABA ZAIDI: Thanks.

[APPLAUSE]

Thank you. So I'd just like to end with an invitation for you. Next time you order coffee at a cafe or give a presentation like this one, start to notice all the different modes of interaction you use every day. And let the richness of those human conversations inspire how you design actions for your users. And help us evolve what it means to have conversations with technology. Here are some links to the resources we mentioned and how to give feedback. Also, come talk to us. We'll be with our team at the Assistant office hours and Sandboxes, ready to answer your questions and show off some of the devices. So thank you, and good luck.

ULAS KIRAZCI: Thank you.

[MUSIC PLAYING]