The Future of Voice Design / by Gavin Lau


Speech interfaces are on the rise. Consumers are excited, taking note, and playing with tools like Google Home and Amazon Alexa to see what voice can do and how it can fit into their daily lives. Don’t be fooled: this moment isn’t a fleeting one. The adoption of voice as a mainstream medium has been a long time coming. When I submitted my session for IA Summit ’17, the reviewing team of designers was excited to learn more about voice, and others said (almost reluctantly), “now is the time to get on board with this stuff.”

Designing for voice involves writing prompts and responses that help users get acquainted with the actions they can take and make it easy for them to accomplish the task at hand. My work as an Interactive Communication Designer involves a lot of writing! Designing conversations is like designing an information architecture: ultimately we want to help people achieve an end task and find what they’re looking for, but with no visual interface there are many unique things we have to account for to help them get there.

As it happens, UX designers already possess the skills they need to design effectively for voice. This isn’t a revolution, but rather a natural extension of the skills many of us have been using in designing for other digital mediums. Knowing how to design well for this growing medium will be integral to the products and applications that will shape the ways we communicate with our devices.

Principles of VUI Design

Designing conversations is ultimately the goal of voice design. In the same way that UX designers create visual interfaces that are easy to navigate, drive users to take action, and remove obstacles in their way, voice interfaces lead users to accomplish tasks with as little confusion and as few barriers as possible. In the case of voice, though, it happens through a seamless conversation.

Error Handling

I’ll cover several principles of Voice User Interface design during my talk at IAS ’17 this year, but I couldn’t help talking about my favorite principle early: error handling. Handling errors is one aspect of designing for voice that makes it unique, and it directly parallels the ways we think about designing thoughtful and clear visual interfaces.

Error handling is an essential component of designing thoughtful voice interactions. How can users understand the limits and possibilities of a system without any visual cues? The answer is through thoughtful handling of errors. When designing for a voice interface, designers have to anticipate what users will say. I’ll use my work in designing interactive phone calls for healthcare as an example.

In a situation where I’m designing an interactive phone call for patients who need to get a colonoscopy, I can expect that when I ask if patients would like to schedule a colonoscopy now, they might respond in several ways: yes, no, stop calling me, I don’t need this, etc. For each of these expected answers, I can design thoughtful responses and can even capture variations on each of these types of responses, like: “yeah, sure, okay, no way, no thank you,” or even “never call me again.” But in some cases, patients get confused, lost, or curious about our systems and speak “out of grammar,” that is, outside the set of responses I’ve anticipated.

For example, a user might say something like… “pickles.” 

My system obviously doesn’t know how to respond to the word “pickles,” since I as a designer didn’t expect a patient to respond to my question about colonoscopies in this way. While I can’t be helpful in this scenario, I can thoughtfully redirect the user, or “error handle,” to better get them where they want to go.

When a user reaches a limit, a poorly designed system might say nothing at all, or ask the user to repeat themselves until they say something the system can act on. As with a visual interface, designers must be thoughtful about these potential error paths, and help users get back to completing their desired action sooner. In this case, I might say something like, “I’m really sorry, I didn’t quite hear what you said. Would you like to schedule a colonoscopy now? Just say yes, no, or I don’t need one.” In this situation, I’m designing a response that shows the user has reached a limit in the system, but that also provides suggestions for moving forward and getting where they want to go.
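To make the redirect concrete, here is a minimal sketch in Python of how an out-of-grammar answer like “pickles” could fall through to a reprompt. The phrase table, the intent names, and the functions `normalize`, `match_intent`, and `handle_turn` are all hypothetical and only illustrate the idea; a real phone system would sit on top of a speech recognizer rather than exact text matching.

```python
import re

# Hypothetical grammar: each intent maps to phrases the designer anticipated.
GRAMMAR = {
    "yes": {"yes", "yeah", "sure", "okay"},
    "no": {"no", "no way", "no thank you", "i don't need this", "i don't need one"},
    "opt_out": {"stop calling me", "never call me again"},
}

QUESTION = "Would you like to schedule a colonoscopy now?"

# The reprompt admits the miss, repeats the question, and suggests valid answers.
REPROMPT = (
    "I'm really sorry, I didn't quite hear what you said. "
    + QUESTION
    + " Just say yes, no, or I don't need one."
)


def normalize(utterance):
    """Lowercase, straighten apostrophes, drop punctuation, collapse spaces."""
    text = utterance.lower().replace("\u2019", "'")
    text = re.sub(r"[^a-z' ]", " ", text)
    return " ".join(text.split())


def match_intent(utterance):
    """Return the intent name for an in-grammar utterance, else None."""
    text = normalize(utterance)
    for intent, phrases in GRAMMAR.items():
        if text in phrases:
            return intent
    return None


def handle_turn(utterance):
    """Return (intent, reprompt). The reprompt is set only when out of grammar."""
    intent = match_intent(utterance)
    if intent is None:
        return None, REPROMPT
    # In grammar: no reprompt needed, the call flow continues with this intent.
    return intent, None
```

Calling `handle_turn("pickles")` returns no intent plus the reprompt text, while `handle_turn("Yeah")` returns the `"yes"` intent and no reprompt. The design choice worth noticing is that the fallback never leaves the caller in silence: it always restates the question and names the answers the system can act on.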

Designing Grammars

Another important principle of VUI design is anticipating how users will respond—also called designing grammars. In the context of designing interactive phone calls for Emmi, I’m typically asking patients lots of questions over the phone—things like, “Have you gotten a flu vaccine yet this year?” or “Would you like to transfer to schedule a screening now?” In both of these examples, I can expect the responses to be fairly straightforward. People will typically respond with a “yes,” “no,” or “I don’t know.” By anticipating these responses, I can train my voice interface to expect these answers, and can also frame my questions to make it clear to patients how to answer these questions. 

In some cases, though, identifying how users will respond gets tricky, and users may break our expectations. Often, as users begin to trust their VUIs, they start using more complex responses and phrases that they expect the system to understand.

For example, I might ask more complicated questions to patients with specific conditions like heart failure. In one case, I wanted to ask patients, “When you cough, are you experiencing any mucus?” in order to determine whether they were having any serious complications their doctors should know about. In a simple scenario, I would expect users to just say “yes” or “no,” but in reality users respond as they would in a polite conversation with a person. They’ll add in extra words like, “yes I am” or “no I’m not” or sometimes even “yes, mucus” or “no mucus.” By anticipating these normal variances in conversation, I can design a system that is smarter, more closely mimics real conversation, and builds the trust of users.

In other words, by listening to recordings of patients interacting with my VUIs and doing lots of research, I can better anticipate what these variances will be and prepare for them to prevent confusion and frustrating user errors.
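As an illustration of the mucus question, a grammar can treat the core answer as required and the polite extras as optional. The sketch below is one possible approach in Python using regular expressions; the patterns and the function `classify_answer` are hypothetical, and the optional phrases stand in for whatever variants the recordings actually reveal.

```python
import re

# Hypothetical patterns: a core answer followed by optional polite extras.
YES_PATTERN = re.compile(r"(yes|yeah)( i am| i do)?( mucus)?")
NO_PATTERN = re.compile(r"no( i'm not| i am not| i don't)?( mucus)?")


def normalize(utterance):
    """Lowercase, straighten apostrophes, drop punctuation, collapse spaces."""
    text = utterance.lower().replace("\u2019", "'")
    text = re.sub(r"[^a-z' ]", " ", text)
    return " ".join(text.split())


def classify_answer(utterance):
    """Return "yes", "no", or None when the answer is out of grammar."""
    text = normalize(utterance)
    # fullmatch requires the whole utterance to fit the pattern,
    # so unrelated trailing words are not silently accepted.
    if YES_PATTERN.fullmatch(text):
        return "yes"
    if NO_PATTERN.fullmatch(text):
        return "no"
    return None
```

With these patterns, “yes I am,” “yes, mucus,” and plain “yes” all classify as `"yes"`, while “no I’m not” and “no mucus” classify as `"no"`. Anything else returns `None` and can be routed to an error-handling reprompt. As new variants show up in recordings, they can be added to the optional groups.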