[MUSIC PLAYING]

SABA ZAIDI: My husband and I have been in a somewhat long-distance relationship for a few years now. We talk on the phone and message a lot. But perhaps what's most enjoyable for me is when I visit him. I leave little notes all over his apartment. And sometimes when I'm home alone, he'll start changing the color of the smart lights in my living room to let me know that he's thinking of me. Our human conversations take a lot of different forms. We don't just talk verbally. Depending on where we are and what we're trying to say, we might write, hold hands, or even change the color of smart lights. It's these human conversations that have inspired us to create conversations with technology. It's no surprise, then, that we're starting to incorporate more modalities into our digital interactions as well. But with this increased number of devices and complexity of interactions, it might feel overwhelming to design for the Google Assistant. If you look at the Google Assistant today, I can talk to it not just through the speaker in my living room, but also through my car or my headphones. I can tap on my phone or my watch. But how do we design for this increased number and complexity of devices? I'm Saba Zaidi, and I'm an interaction designer on the Google Assistant team. I'll be talking about how you can design actions across surfaces, and give you some frameworks and some tips.

ULAS KIRAZCI: And hi, I'm Ulas. I'm an engineer on Actions on Google. And later on, I'm going to tell you how to build an action using the design principles Saba will talk about.

SABA ZAIDI: OK. So before we can get started to understand how to design actions across surfaces, we need to get a better understanding of what those experiences can look like. Let's start with a journey through a user's day. You wake up in the morning to the sound of an alarm ringing on your Google Home. Without even getting out from under your blanket, you can say, hey, Google, stop. You get up, get ready. And as you're about to head out, you want to make sure you don't need an umbrella. So you turn to the smart display in your hallway and you ask, hey, Google, what's the weather today? You're able to hear a summary, and also see the hourly forecast at a glance. As you're walking to your car, you take out your phone and tap on it to ask the Google Assistant to order your favorite from Starbucks. It's able to connect you to Starbucks, where you can quickly place your order. As you're driving, you want to listen to your favorite podcast or the news. And you can ask the Google Assistant for help in a hands-free way. And it can connect you, for example, to the NPR News update. You go about your day. And when you come back home, it's time to make dinner for your family. You turn to the smart display in your kitchen and you ask, for example, Tasty for help with a recipe, like in this case, pizza bombs. Tasty comes back, and you can hear and see on the screen step-by-step instructions and the ingredients. After dinner, it's time to unwind with your family in the living room. You decide to play a BuzzFeed personality quiz. And it's a great way for your family to get together around a shared device and have some fun. And then you head to bed. You say, hey, Google, good night. It's able to start your custom bedtime routine, which includes, for example, setting your alarms or telling you about your day tomorrow.

So as you can see, users were able to interact with the Google Assistant much like in human conversations, in a lot of different ways and contexts.
And although there were a lot of different devices, there are some overarching principles. So let's take a look at them. First, the experience was familiar. Whether it was morning, evening, or the commute, users can access their favorite Google Assistant actions whenever, wherever they need them. The Assistant was available in different contexts. You saw it being used on the go and at home, up close and from a distance, and in shared and private settings. So as you're thinking about your actions, think about all the different contexts they may be used in. And lastly, different devices lend themselves to different modes of interaction. Some are voice only, some are visual, and some are a mix of both. And we'll talk a bit more about the strengths and weaknesses of each of these modalities in a bit.

First, let's take a deeper look at a couple of those devices and see how these principles apply. You've already heard about smart displays; they were announced earlier this year. And even though it's a new device, users can expect it to feel familiar. It's essentially like a Google Home with a screen. It's designed to be used at home, from a distance, and as a shared device. And as you can see in this example action, even though there's a screen, users still interact with the device through voice. They don't have to tap through complex app navigation. Instead, the visuals are designed to be seen at a distance. And of course, the user can walk up to the screen and touch it if they want. Next, let's take a quick look at phones. One thing to note here is that we're making phones more visually assistive as well, similar to smart displays, allowing for a greater focus on the content. These devices, as you know, are great for use cases on the go, up close, and in a private environment. And users can interact with the Google Assistant on the phone through both voice and visuals.

So hopefully that gave you a better sense of what the experiences on the Google Assistant look like across devices. Now, in order to design for so many devices, it helps to have a vocabulary for categorizing them. At Google, we use a design framework called the multimodal spectrum. It helps us categorize devices based on their interaction types. On one end, you have voice-only devices, like the Google Home and other smart speakers, that you have to hear or talk to. On the other end, you have visual-only devices, like a phone or a Chromebook that's on mute, and most watches. So you have to look at these devices or touch them. And in the middle, you have what we call multimodal devices, in that they're a mix of both. Cars and smart displays, which rely primarily on voice but have optional visuals, are known as voice-forward devices. Phones and Chromebooks with audio on, which can use a mix of both voice and visuals, are known as intermodal devices.

So now we have a vocabulary for categorizing these devices. But before we can start designing for them, it helps to understand the strengths and weaknesses of each of these modalities. Let's talk about voice first. Voice is great for natural input. We've been using it for millennia. Whether you're a kid, a senior, or someone who's not tech savvy, it's still really intuitive. It's great for hands-free, far-field use cases, like setting a timer in the kitchen while you're cooking. And it helps reduce task navigation.
So for example, if you were out on a run, you could ask your favorite fitness action about your workouts, instead of having to pull out your phone, navigate to that app, and search for that answer. Similarly, you could ask the Google Assistant to play the next song without having to mess with any controls. So voice has a lot of benefits. It's great. But it does have some limitations. And that's where visuals come in.

Think about the last time you were at a cafe. You probably walked past all the pastries, looked at the menu, and made eye contact with the cashier. Think about how difficult that interaction would be if it was just through voice. Have a listen to what the menu would sound like through voice alone.

[AUDIO PLAYBACK]
- Espresso, latte, vanilla latte, cappuccino, mocha, Americano, flat white, hot chocolate, black coffee, and tea.
[END PLAYBACK]

SABA ZAIDI: That feels pretty overwhelming, right? It's like watching all the options go by and trying to catch the right one. It's like looking at a ticker. Voice is ephemeral and linear. And that makes it very difficult to hold a lot of information in your head. By contrast, the menu is a lot easier to scan if it looks like this. And you can imagine that the problem gets compounded if you also had to compare prices and calories. So visuals are great for scanning and comparing. We also use them a lot to reference objects in the world. So I can look at all the baked goods and then point to the one that I want, instead of, for example, having to hear or say out loud something like, that small sugar cookie with a chocolate drizzle. So voice and visuals both have their benefits. And it often is useful to use both. In this example, we usually prefer to look at the menu, but then talk to the cashier in order to check out and pay. The same benefits of using both voice and visuals exist in the digital world as well. And that's what makes multimodal devices, like phones and smart displays, such a unique opportunity. By leveraging the best of both voice and visuals, they're able to provide really rich interactions.

But how do we design for them along with designing for speakers? One thing to keep in mind, again, is to start with a human conversation. You might have an app already, but avoid the temptation to duplicate it. Instead, try to observe a relevant conversation in the real world, or role-play it with a colleague, and write down that dialogue. You'll realize that not everything that's in your app works well as conversation, or vice versa. Instead, think of your action as a companion to your app that's faster in certain use cases. I won't go into detail on how to write good dialogue or create a persona, but I highly recommend you check out our brand-new conversation design website at that link there. It goes into great detail on how to get started.

So we've learned that we need to create spoken dialogue and then add visuals to it. So let's take a look at an example of how to do that and how that helps us scale. If you haven't already, I'd encourage you to check out the Google I/O 2018 action that helps you learn more about this event. We started by writing spoken dialogue as if it was for a voice-only device, like a Google Home. And it includes turns like this one. So a user can say, browse sessions. And we respond with a spoken response like, here are some of the topics left for today, and so on.
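For reference, here's a minimal sketch of what a spoken-only turn like this might look like in fulfillment code, assuming the actions-on-google Node.js client library (v2) with Dialogflow; the intent name and prompt text are illustrative, not the action's actual code.

```js
// Minimal sketch of a spoken-only conversational turn (illustrative names).
const { dialogflow } = require('actions-on-google');
const functions = require('firebase-functions');

const app = dialogflow();

// Hypothetical Dialogflow intent matched by a query like "browse sessions".
app.intent('Browse Sessions', (conv) => {
  // The spoken prompt is written first, as if for a voice-only device.
  conv.ask('Here are some of the topics left for today: Android, web, ' +
    'and machine learning. Which are you interested in?');
});

exports.fulfillment = functions.https.onRequest(app);
```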
Now, in order to take this dialogue and scale it, we need to take every turn like this and think about all the ways we can incorporate visual components into it. This would include, for example, display prompts, cards, and suggestion chips. In our example, we can accompany that spoken prompt with a display prompt like, "Which topic are you interested in?" This helps carry the written conversation on a screen. We can add a list of sessions as a card. A user could tap on that, for example. And we could have a suggestion chip like, "None of these." And this helps the user know how to pivot or follow up the conversation.

Once we've constructed our response to have spoken and visual elements, we can then map that response to the multimodal spectrum from earlier. So depending on whether the device has visual or audio capabilities, and how prominent voice is on it, we can choose the right components. You already saw what our response would look like on a Google Home. We would simply have the spoken prompt. Let's take a look at what the response would look like across some of the other devices. A smart display is a voice-forward device. So we still keep the full spoken prompt and let it carry the whole conversation. We don't really need a display prompt anymore, especially if we're going to have better visuals like the list and the chips. A phone, on the other hand, is an intermodal device. And we need to have both the voice and the visuals carry the conversation. In this case, you might notice that we shortened the spoken prompt, because we can direct the user to look at the screen for more details. And Ulas will talk more about how you can do that. And of course, the rest of the visual components are there as well. And finally, if your phone is on silent, we would simply ignore the spoken prompt, and the visual components are able to carry the complete conversation.

So now we've learned that in order to scale our dialogue, we need to write spoken prompts and then add visuals to them. And that helps us map across devices. But how do we know what kinds of visuals to add? I'd like to leave you with five tips for how you can incorporate visuals into your dialogue. For that, let's take this made-up Assistant action called the National Anthem Player on a smart display. As the name suggests, a user can ask for a country, and it will come back with the national anthem for that country. When you invoke this action, it gives you a welcome message that sounds like this.

[AUDIO PLAYBACK]
- Welcome to National Anthem Player. I can play the national anthems from 20 different countries, including the United States, Canada, and the United Kingdom. Which would you like to hear?
[END PLAYBACK]

SABA ZAIDI: So as you can see, the device is currently writing on the screen exactly what it's saying out loud. And this is really a missed opportunity, especially given that by now, we've learned that visuals have some strengths over voice, and that smart displays are great for showing rich, immersive visuals. So tip number one is to consider cards rather than just display prompts. In this case, we've swapped out the words for a carousel. Users can quickly browse through the list and select the country that they want. You'll notice this is similar to the menu example we looked at in the cafe, where visuals are helping someone scan and compare options. Additionally, things like maps, charts, and images are also great on visual devices, because they are difficult to describe through voice, similar to the cookie.
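As a rough sketch of this tip, here's how the made-up National Anthem Player's welcome might pair a spoken prompt with a carousel and suggestion chips instead of a display prompt, assuming the actions-on-google Node.js client library (v2) with Dialogflow; the item keys, synonyms, chip labels, and image URLs are placeholders.

```js
// Sketch: carry the visuals with a carousel rather than echoing the prompt.
const { dialogflow, Carousel, Suggestions, Image } = require('actions-on-google');
const functions = require('firebase-functions');

const app = dialogflow();

app.intent('Default Welcome Intent', (conv) => {
  // The spoken prompt stays conversational.
  conv.ask('Welcome to National Anthem Player. Which country\'s anthem ' +
    'would you like to hear?');
  // A carousel, rather than a display prompt, carries the visuals;
  // the user's selection comes back via the actions_intent_OPTION event.
  // (On voice-only surfaces you'd fall back to a spoken list; capability
  // checks are covered later in the talk.)
  conv.ask(new Carousel({
    items: {
      US: {
        title: 'United States',
        synonyms: ['USA', 'America'],
        image: new Image({ url: 'https://example.com/us-flag.png', alt: 'US flag' }),
      },
      CA: {
        title: 'Canada',
        synonyms: ['CA'],
        image: new Image({ url: 'https://example.com/ca-flag.png', alt: 'Canadian flag' }),
      },
      GB: {
        title: 'United Kingdom',
        synonyms: ['UK', 'Britain'],
        image: new Image({ url: 'https://example.com/uk-flag.png', alt: 'UK flag' }),
      },
    },
  }));
  // Suggestion chips hint at how to follow up or pivot.
  conv.ask(new Suggestions(['Surprise me', 'Most popular']));
});

exports.fulfillment = functions.https.onRequest(app);
```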
Second, consider varying your spoken and your display prompts. This is particularly useful for intermodal devices, which might have a display prompt next to a card, where some of that information might be redundant. So in this case, we're stripping the example countries out of the display prompt, because the card already shows a lot of examples. Third, consider visuals for suggestions. Here, we know that the user is a repeat user. So we're reordering the list so that their most frequently visited countries show up first. We're also using suggestion chips to hint to the user how they can follow up or pivot the conversation. This kind of discovery can be quite difficult through voice alone. Next, you can use visuals to increase your brand expression. We used to allow you to change your voice and choose a logo, but now we're also going to allow you to choose a font and a background image. And Ulas will talk more about how you can do that. As you can see here, the experience looks a lot more custom and immersive. And lastly, visual devices are great for continuing conversations that started on a voice-only device. So for example, if I use this national anthem action on a Google Home and I want to see the lyrics, the action can, after a few steps, send a notification to my phone. And I can take out my phone and read them there. So hopefully those five tips will help you incorporate more visuals into your dialogue.

Let's summarize what we've learned so far on how to design actions for the Assistant. First, users interact with the Google Assistant in a variety of different ways and contexts. This could include at home or on the go, up close or at a distance, and through voice or visuals. In order to design across so many modalities, it helps to keep in mind the multimodal spectrum and think of your responses as having visual components as well as spoken components. And lastly, learn and leverage the strengths of each of these modalities. We learned, for example, that visuals are great for scanning, brand expression, and discovery. So instead of just showing on the screen what you're saying out loud, try using cards. All right. Now, I'm going to hand it over to my colleague, Ulas, who's going to talk about how you can develop these actions.

ULAS KIRAZCI: Thank you, Saba.

[APPLAUSE]

ULAS KIRAZCI: Hi. So we said that the Assistant runs on many types of devices. And in the future, it will run on many others. Luckily, we make it easy for you to build your action in a way that will run well on all of these devices today, as well as devices in the future. So let's go through an example. To walk you through this, I'm going to use a test action I created called California Surf Report, which gives wave height and weather information for beaches in California, for surfers. Currently, I only have spoken responses, no visuals yet. So let's see what this sounds like on a voice-only device like the Google Home.

[AUDIO PLAYBACK]
- --looks fair for most of today. Waves will be from two to three feet in the morning, to three to four feet in the afternoon. Expect waist-high swell in the morning with northwest winds. Shoulder-high surf in the afternoon with southwest winds.
[END PLAYBACK]

ULAS KIRAZCI: OK, great. Pretty informative. So now, let's take a look at what this sounds like and looks like on a voice-forward device, like a smart display. Switch to demo, please. OK, Google. Talk to California Surf Report.

GOOGLE ASSISTANT: OK.
Let's get the test version of California Surf Report.

GOOGLE ASSISTANT: Welcome to California Surf Report.

ULAS KIRAZCI: Tell me the surf report for Santa Cruz.

GOOGLE ASSISTANT: Surf in Santa Cruz Beach looks fair for most of today. Waves will be from two to three feet in the morning, to three to four feet in the afternoon.

ULAS KIRAZCI: OK. I cut it short because I think we all get the idea. So as you can see, spoken responses are a good way to get started and work well on many devices. But when we have a screen, we can make it a lot better. So let's see how. One of the best visuals we can add, and one of the easiest, is a BasicCard. And here's an example from the Node.js client library of how to add a BasicCard to your responses. We start with a spoken prompt as usual, and the second ask statement adds a BasicCard. A BasicCard can have a title, subtitle, body text, and an optional image. So let's see what this looks like on our smart display again. Let's switch to the demo, please. OK, Google. Show me the surf report for Santa Cruz.

GOOGLE ASSISTANT: Surf in Santa Cruz Beach looks fair for most of today. Waves will be from two to three feet in the morning, to three to four feet in the afternoon. Expect waist-high swell in the morning with northwest winds, shoulder-high surf in the afternoon with southwest winds.

ULAS KIRAZCI: OK, great. Now it looks much better. We have a nice visual for it rather than huge lines of text. And here are some other kinds of cards you can use. There's a carousel card, which allows you to display a set of things that the user can choose from. There's also a newly introduced table card. Another great way to add visuals to your action is to use suggestion chips. Suggestion chips allow the user to understand what they can do at this turn in the conversation. And they also simplify user input. You can learn more about responses at the link. And by the way, all of this, as I promised, works equally well on an intermodal device, like a phone. As you can see, we have formatted the font sizes and the layout to fit the intermodal form factor.

OK, great. So next, maybe what we want to do is shorten the spoken response a bit, because it's a bit repetitive with the information that's already on the card. Users can just look at the display for those details. So how do we do this? We have a feature in the API called capabilities. So instead of thinking, if Google Home, do this, else if smart display, do that, think about what capabilities the surface the user is interacting with you on has. Does it have a screen? Can it output audio? The capabilities of the device are reported to you in every webhook call, so you know what they are on every conversation turn. And here's a sample list of capabilities that we support. And you can learn more at the link. So in our use case, what we're looking for is the screen output capability. This indicates that the user's device has a screen, so we can show them a card. Oh, and by the way, if you don't want your responses to differentiate between devices with displays and voice-only ones, you can always add a card, and we'll just strip it out for you silently. So this makes it easy for you to build if you don't want to differentiate. And here's, again, the Node.js client library snippet that shows how to use this. In the first if statement, we determine that the user's device does not have a screen. So we have the full content in the spoken response. In the else clause, we know that there is a screen.
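For reference, here's a rough reconstruction of that kind of handler (the actual slide code isn't shown in this transcript), assuming the actions-on-google Node.js client library (v2); the intent name, prompt wording, and image URL are placeholders.

```js
// Sketch: branch on the SCREEN_OUTPUT capability of the current surface.
const { dialogflow, BasicCard, Image } = require('actions-on-google');
const functions = require('firebase-functions');

const app = dialogflow();

app.intent('Surf Report', (conv) => {
  const hasScreen =
    conv.surface.capabilities.has('actions.capability.SCREEN_OUTPUT');

  if (!hasScreen) {
    // Voice-only device: the spoken prompt carries the full report.
    conv.ask('Surf in Santa Cruz Beach looks fair for most of today. ' +
      'Waves will be from two to three feet in the morning, to three to ' +
      'four feet in the afternoon. Expect waist-high swell in the morning ' +
      'with northwest winds, shoulder-high surf in the afternoon with ' +
      'southwest winds.');
    return;
  }

  // There's a screen: shorten the spoken prompt and lead the user to it...
  conv.ask('Surf in Santa Cruz Beach looks fair for most of today, with ' +
    'two to three foot waves in the morning and three to four foot waves ' +
    'in the afternoon. Here\'s the report.');
  // ...and let a BasicCard carry the details.
  conv.ask(new BasicCard({
    title: 'Santa Cruz Beach',
    subtitle: 'Fair conditions',
    text: 'Morning: 2-3 ft, waist-high swell, NW winds  \n' +
      'Afternoon: 3-4 ft, shoulder-high surf, SW winds',
    image: new Image({
      url: 'https://example.com/santa-cruz-surf.png',
      alt: 'Santa Cruz surf conditions',
    }),
  }));
});

exports.fulfillment = functions.https.onRequest(app);
```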
So we shorten the spoken response and end it with a phrase like, "Here's the report," to lead the user to the screen. And then we append the BasicCard to the response. So let's see how this looks on our smart display now. Switch to demo, please. OK, Google. Show me the shortened surf report for Santa Cruz.

GOOGLE ASSISTANT: Surf in Santa Cruz Beach looks fair for most of today, with two to three foot waves in the morning and three to four foot waves in the afternoon. Here's the report.

ULAS KIRAZCI: So that sounds a lot more concise and user friendly. Another way you can use capabilities is to require that your action only run on devices that have a certain capability. This is what we call static capabilities. And you can configure these through the Actions on Google console, as you can see here. But only use this if your action absolutely makes no sense without that capability. So for example, the National Anthem Player action that Saba talked about would not make sense on a device without audio. So that would be a good place to use it. However, the surf report app works equally well on voice-only and display-only devices. So it wouldn't be a good place to use this. You can configure all of this using the Actions on Google developer console.

Another high-quality and easy way to target multiple surfaces is to use client library features we call helpers. So far, I've been asking California Surf Report for the surf report with a beach name. But if I don't say the beach name, I get a prompt, "Which beach?" Now, this doesn't tell me what I can say. It doesn't tell me which beaches this action actually supports. So we can fix that with a helper called askWithCarousel. What askWithCarousel does is present the user with a list of options to pick from, and associate visuals with each item. In addition, when the user utters a query to select an item, Google does the matching of the query to the item. So we can deal much better with variations in how people pick things. So let's make our prompt better with the askWithCarousel helper. And again, here's the Node.js library snippet. We start with the spoken response for the prompt, and we add a carousel to it. The carousel is made up of items. And each item has a list of phrases that you think the user might say to match that item, and visuals associated with it, so the user can understand what they're about to tap on. So let's switch to the demo and see what this looks like. OK, Google. Show me the surf report.

GOOGLE ASSISTANT: Which beach do you want to see the report for?

ULAS KIRAZCI: OK, so this is the example where I'm a little confused as a user. OK, Google. Show me the beach carousel.

GOOGLE ASSISTANT: Which beach would you like to know about?

ULAS KIRAZCI: OK, great. Now it's much easier for me to understand what the possible options are, and even tap on one if I want to go with that. And we continuously improve the experience with these helpers. That's one of the advantages of helpers: we continue to modify them and optimize them for new surfaces.

Now, since we launched the current API last year, we've come up with smart displays. As you noticed, on smart displays each conversation turn takes up the entire screen. So given this fact, maybe we can make our visuals more branded and give them a little bit more flair. So we're introducing styling options this year. Here's how it works. Let's switch to the demo, please. So here is a new tab in the Actions on Google console called theme customization.
You can modify the background color, the primary color-- that's the font color of the text-- and the typography, and even set a background image. So let's try a few things here. Let's say we want to make this cursive. Let's add a background image. OK, so this is the landscape aspect ratio image. And then we want to add a portrait image as well. All right, all set. Now, all we have to do is save. And then we click Test right here to update our test version. All right, now let's see what this looks like on the demo. OK, Google. Show me the surf report for Santa Cruz.

GOOGLE ASSISTANT: Surf in Santa Cruz Beach looks fair for most of today, with two to three foot waves in the morning and three to four foot waves in the afternoon. Here's the report.

ULAS KIRAZCI: I think that looks really beautiful now. Smart displays are coming out later this year. However, you can start building your action against these visuals today, using the updated simulator. We've added a new simulator device type for smart displays, as you can see here. And we've also added a display tab, which shows you the full-screen version of what you would get on a smart display. You can also use this with a phone. And on the left side, as usual, you have the spoken responses, as well as an input box where you can put user queries.

One last thing. We said that the Assistant is in many places. So if the user is interacting with your action using a voice-only device, maybe they also have a device that has a display on it, for example, a phone. So what if, in your current turn in the conversation, you really want your response to display something? For example, in the surf report action, the user might ask us for the full report. And we want to return the hour-by-hour wave height graph. So how do we do that? There's a feature in the API called multi-surface conversations. And here's how it works. In each API call to your webhook, we report not only the capabilities of the device that the user is currently using, but also the union of the capabilities of all the devices the user owns. So in this example, what you see is that the current device only has a voice output capability and has no screen. But in the available surfaces, we can see the screen output capability. So the user seems to have another device with a screen on it. So how do we use this? Again, the client library has a function to help you inspect whether the user has a certain capability among their devices. Now, we've determined that the user has a device with this capability. How do we transfer the user to the other device? We have a function, ask for new surface, that does this. And you can give it a notification that will appear on the target device, in addition to the list of capabilities you require for continuing your conversation. I'm not going to demo this, but here's what it looks like. So let's say the user said, show me the full report. And they're talking to you on a Google Home. So you would call the ask-for-new-surface function that I showed you earlier. And we ask the user for permission to send the conversation over to their phone. If the user accepts, then a notification is sent to the new device. The conversation ends on the current device. And when the user taps the notification, they resume the conversation from where you left off, like this. Note that this is not just for single responses. This works equally well when you want to continue the conversation.
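For reference, here's a minimal sketch of that hand-off, assuming the actions-on-google Node.js client library (v2); the intent names, prompts, and report contents are illustrative rather than the demo's actual code.

```js
// Sketch: check available surfaces and ask to move to one with a screen.
const { dialogflow, NewSurface, BasicCard } = require('actions-on-google');
const functions = require('firebase-functions');

const app = dialogflow();

app.intent('Full Surf Report', (conv) => {
  const screenHere =
    conv.surface.capabilities.has('actions.capability.SCREEN_OUTPUT');
  const screenAvailable =
    conv.available.surfaces.capabilities.has('actions.capability.SCREEN_OUTPUT');

  if (!screenHere && screenAvailable) {
    // Ask the user's permission to move the conversation to a device
    // with a screen; the notification appears on the target device.
    conv.ask(new NewSurface({
      context: 'The full report has an hour-by-hour wave height graph.',
      notification: 'Full surf report',
      capabilities: ['actions.capability.SCREEN_OUTPUT'],
    }));
  } else {
    conv.ask('Here is the full report.');
  }
});

// In Dialogflow, this intent is triggered by the actions_intent_NEW_SURFACE
// event once the user responds to the transfer request.
app.intent('Full Surf Report - New Surface', (conv) => {
  const result = conv.arguments.get('NEW_SURFACE');
  if (result && result.status === 'OK') {
    // The conversation continues here on the new device, with full context.
    conv.ask('Here is the full report.');
    conv.ask(new BasicCard({
      title: 'Santa Cruz: hour-by-hour wave height',
      text: 'Morning: 2-3 ft  \nAfternoon: 3-4 ft',
    }));
  } else {
    conv.ask('No problem. Anything else?');
  }
});

exports.fulfillment = functions.https.onRequest(app);
```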
So we bring the full context over so you can continue from where you left off. So to sum up, we've built a lot of features in the API for you to add visuals to your responses. So please use them. And we take your responses and optimize them as well as possible for all these surfaces, and for future surfaces, without extra work from you. And if you want to customize your responses, always think of capabilities, not individual device types. This way, we can run your action on new devices without any extra work from you.

SABA ZAIDI: Thanks.

[APPLAUSE]

Thank you. So I'd just like to end with an invitation for you. Next time you order coffee at a cafe or give a presentation like this one, start to notice all the different modes of interaction you use every day. And let the richness of those human conversations inspire how you design actions for your users. And help us evolve what it means to have conversations with technology. Here are some links to the resources we mentioned and how to give feedback. Also, come talk to us. We'll be with our team at the Assistant office hours and Sandboxes, ready to answer your questions and show off some of the devices. So thank you, and good luck.

ULAS KIRAZCI: Thank you.

[MUSIC PLAYING]