Drawing invisible boundaries in conversational interfaces

Anyone who has worked on textual conversation interfaces, like chatbots, will tell you that the central challenge is dealing with the long tail of crazy things people will type. People love to abuse chatbots. Something about text-based conversation UIs invites Turing tests. Every game player remembers the moment they first abandoned their assigned mission in Grand Theft Auto to start driving around the city, crashing into cars and running over pedestrians, just to exercise their freedom and see what happens when they escape the scripted plot tree.

However, this type of user roaming or trolling happens much less with voice interfaces. Sure, the first time a user tries Siri or Alexa or whatever Google's voice assistant is called (it really needs a name, IMO, to avoid inheriting everything the word "Google" stands for), they may ask something ridiculous or snarky. But that type of rogue input tends to trail off quickly, whereas in textual conversation UIs it doesn't.

I suspect some form of the uncanny valley is at work, and I blame the affordances of text interfaces. Most text conversation UIs are visually indistinguishable from the messaging UIs used to communicate primarily with other human beings, so they invite the user to probe their intelligence boundaries. Unfortunately, the seamless polish of the UI isn't matched by the capabilities of today's chatbots, most of which are just dumb decision trees.

On the other hand, none of the voice assistants to date sounds close to replicating the natural way a human speaks. These voice assistants may have more human timbre, but the stiff elocution, the mispronunciations, and the frequent mistakes in comprehension all quickly inform the user that what they're dealing with is something of quite limited intelligence. The affordances draw palpable, if invisible, boundaries in the user's mind, and they quickly realize the low ROI on trying anything outside what is likely to be in the hard-coded response tree. In fact, I'd argue that the small jokes these UIs insert, like canned answers to questions such as "what is the meaning of life?", may actually set these assistants up to disappoint people even more by encouraging more such questions the assistant isn't ready to answer (I found it amusing when Alexa answered my question "Is Jon Snow dead?" two seasons ago, but was disappointed when it still gave the same stale answer a season later, after the show had already answered the question months earlier).

The same invisible boundaries work immediately when speaking to one of those automated voice customer service menus. You immediately know to speak to these as if you're addressing an idiot who is also hard of hearing, and the goal is to complete the interaction as quickly as possible, or to divert to a human customer service rep at the earliest possible moment.

[I read on Twitter that one shortcut to get to a human when speaking to an automated voice response system is to curse, that the use of profanity is often a built-in trigger to turn you over to an operator. This is both amusing and clever design, but it also feels like an odd admission of guilt on the part of the system designer.]

Given how little distinguishes a textual UI from any other messaging window, it is not easy to lower the user's expectations. However, given where the technology is for now, it may be necessary to erect such guardrails. Perhaps the font for the assistant should be some fixed-width typeface, to distinguish it from a human. Maybe some mechanical sound effects could convey the robotic nature of the machine writing the words, and perhaps the syntax should be less human in some ways, to lower expectations.

One of the huge problems with voice assistants, after all, is that the failures, when they occur, feel catastrophic from the user perspective. I may try a search on Google that doesn't return the results I want, but at least something comes back, and I'm usually sympathetic to the idea that what I want may not exist in an easily queryable form on the internet. However, though voice assistant errors occur much less frequently than before, when they do, it feels as if you're speaking to a careless design, and I mean careless in every sense of the word, from poorly crafted (why didn't the developer account for this obvious query?) to uncaring (as in emotionally cold).

Couples go to counseling over feeling as if they aren't being heard by each other. Some technologies can get away with promising more than they deliver, but when it comes to tech built around conversation, with all the expectations that very human mode of communication has accrued over the years, it's a dangerous game. In a map of the human brain, the neighborhoods of "you don't understand" and "you don't care" share a few exit ramps.

Information previews in modern UIs

[I don't know if Facebook invented this (and if they didn't, I'm sure one of my readers will alert me to who did), but it's certainly the service which has used it to greatest effect, which I suppose is the case for anything they put to use given their scale.]

One problem with embedded videos as opposed to text online has always been the high cost of sampling the video. Especially for interviews, I'd almost always rather just have the transcript than be forced to wade through an entire video. Scanning text is more efficient than scanning online video.

Facebook has, for some time now, autoplayed videos in the News Feed with the audio on mute. Not only does it catch your eye, it automatically gives you a motion preview of the video itself (without annoying you with the audio), thus lowering the sampling cost. To play the video, you click on it and it activates the audio. I'm sure the rollout of this UI change increased video clicks in the News Feed quite a bit. Very clever. I've already seen this in many mobile apps and expect it to become a standard for video online.

[It's trickier when videos include pre-roll ads; it's not a great user experience to be enticed to watch a video by an autoplayed clip, then to be dropped into an ad as soon as you act on your interest.]

Someday, the autoplayed samples could be even smarter; perhaps the video uploader could define in and out points for a specific sample, or perhaps the algorithm which selects the sample could be smarter about the best moment to select.

It's not just video where sampling costs should be minimized. Twitter shows a title, image, and excerpt for some links in its timelines, helping you preview what you'll get if you click through. They show these for some but not all links; I suspect they'd increase clickthroughs quite a bit if they displayed those preview Twitter cards more consistently.

Business Insider and Buzzfeed linkbait-style headlines are a text analogue, albeit one with a poor reputation among some. Given the high and increasing competition for user attention at every waking moment, it's not clear that services can leave any such tactical stones unturned.

Fitts's Law, the Tesla Model S, and touchscreen car interfaces

Fitts’s Law can accurately predict the time it will take a person to move their pointer, be it a glowing arrow on a screen or a finger tip attached to their hand, from its current position to the target they have chosen to hit.
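The law itself is a one-line formula. Here's a minimal sketch using the common Shannon formulation; the constants `a` and `b` are device-dependent and the values below are purely illustrative, not measured:

```python
import math

def fitts_time(distance, width, a=0.1, b=0.15):
    """Predicted movement time in seconds under Fitts's law,
    Shannon formulation: T = a + b * log2(D/W + 1).
    a and b are empirical, device-dependent constants;
    the defaults here are illustrative only."""
    index_of_difficulty = math.log2(distance / width + 1)  # in bits
    return a + b * index_of_difficulty

# A small, distant target takes longer to hit than a large, near one:
near_big = fitts_time(distance=100, width=50)
far_small = fitts_time(distance=500, width=10)
assert far_small > near_big
```

The practical upshot: you can make a target faster to acquire either by moving it closer or by making it bigger.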

Much more about Fitts's Law here from Tog. This bit was instructive:

Paul Fitts was not a computer guy. He was working on military cockpit design when he discovered his famous Law. Paul Fitts never had to deal with the issue of stability because stuff inside aircraft cockpits is inherently stable. The few things that do move only do so because the pilot moved them, as when he or she pushes a control stick to the side or advances the throttle forward. The rest of the targets the pilot must acquire—the pressure adjustment on the altitude indicator, the Gatling gun arm switch, the frequency dial on the radio, and fuel pump kill switch—stay exactly where they were originally installed. Everything in that cockpit is a predictable target, either always in the same place or, in the case of things like the throttle, within a fixed area and exactly where you left it. Once you become familiar with the cockpit and settle into flying the same plane hour after hour after hour, you hardly look at your intended targets at all. Your motor memory carries your hand right to the target, with touch zeroing you in.

I had heard of Fitts's Law but didn't know its history, and it came to mind as I was driving my Tesla Model S recently.

In almost every respect, I really love the car. I took ownership of my Model S in December 2012 after having put down a deposit over 3.5 years earlier, and I long ago stopped thinking of it as anything other than a car, perhaps the most critical leading indicator as to whether it can cross the chasm as a technology. I've forgotten what it's like to stop and pump gas (what are gas prices these days anyway?), it's roomy enough I can throw my snowboard, road bike, and other things in the back with room to spare, and I still haven't tired of occasionally flooring it and getting compressed back into my seat like I'm being propelled by a giant rubber band that has been released after being stretched to its limit. Most of all, it's still a thrill when I fire the car up to find a new software update ready to install, almost as if the Model S were a driveable iPad.

It's the ability to update the interface via software that gives me hope that a few things in the interface might be adjusted.* In a Model S, most of the controls are accessible via a giant touchscreen in the center of the console. There aren't many buttons or switches except on the steering wheel, which handles some of the more common actions, like changing the thermostat, adjusting volume on the sound system, skipping ahead on a musical track, and making phone calls.

When the car first came out, one of the early complaints was the lack of physical controls. I was concerned as well. Physical controls are useful because, without looking away from the road, I can run my fingers across a bunch of controls to locate the one I want without activating the wrong ones by mistake as I search. With a touchscreen, there is no physical contour differentiating controls; you have to actually look at the screen to hit the appropriate control, taking your eyes off the road.

[I also confess to some nostalgia for physical controls for their aesthetics: controls give physical manifestation to the functionality of a car. The more controls a car has, the more it appeals to geeks who love functionality, and physical controls also give car designers an opportunity to show off their skills. I find many old-school car dashboards quite sexy with all their knobs and switches and levers. Touchscreens tend to hide all of that, which has more of a minimalist appeal that may be more modern.]

In practice, I have not missed them as much as I thought I would because a lot can be operated by physical controls on the steering wheel.

However, one task that a touch screen makes difficult, in practice, is hitting a button while in a car that's in motion. It turns out that road vibration makes it very hard to keep your arm and hand steady and to hit touch targets on a touchscreen with precision. That's why I rely so much on my steering wheel controls in the Model S to do things like adjust the volume or change the temperature. Not only are the controls accessible without having to move my hands or look at the touchscreen, but the steering wheel acts as an anchor for my hand, taking road vibration out of the equation.

Maybe there is an analogue to Fitts's Law for touchscreen interfaces in cars, or anywhere else your body is being jostled or in motion. What you'd want in such cases is maximum forgiveness in the UI, because it's hard to accurately hit a specific spot on the screen.

Matthaeus Krenn recently published a proposal for touch screen car interfaces that takes this idea to the logical extreme. You can read about it and watch a video demo as well. Essentially Krenn transforms the entire touchscreen in the Tesla into one single control with maximum forgiveness for your fingers to be jostled horizontally since only the vertical movement of your hand matters. By using the entire screen and spreading the input across a larger vertical distance, you can have a much larger margin of error to get the desired change. Krenn also tracks the number of fingers on the screen to allow access to different settings.
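The core of the concept can be sketched in a few lines. This is my own rough approximation of the idea, not Krenn's actual implementation; the setting names, ranges, and scaling are all assumptions for illustration:

```python
# Sketch of a Krenn-style full-screen control: the whole touchscreen is
# one control, only vertical travel matters (horizontal jitter from road
# vibration is ignored), and the number of fingers down selects the setting.
# Setting names and mapping are invented for this example.

SETTINGS_BY_FINGER_COUNT = {
    1: "volume",
    2: "temperature",
    3: "fan_speed",
}

def adjust(state, finger_count, drag_start_y, drag_end_y, screen_height):
    """Map a vertical drag to a change in one setting (all settings 0.0-1.0)."""
    setting = SETTINGS_BY_FINGER_COUNT.get(finger_count)
    if setting is None:
        return state  # unrecognized gesture: change nothing
    # Spread the setting's full range across the entire screen height,
    # so a jostled hand produces only a small, forgiving change.
    delta = (drag_start_y - drag_end_y) / screen_height  # dragging up increases
    new_state = dict(state)
    new_state[setting] = min(1.0, max(0.0, state[setting] + delta))
    return new_state
```

Because the margin of error spans the whole screen, precision matters far less than with a grid of small buttons.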

This is an interesting proposal, but for some of the most accessed functions of the car, controls on the steering wheel are still superior. The left scrollwheel on the Model S steering wheel is more convenient for changing the volume of the stereo and toggling between play and stop (you can press the scrollwheel) than the touchscreen. The right scrollwheel is more convenient for changing the car temperature and turning the climate control on and off than the touchscreen. Both scrollwheels allow you to keep both hands on the steering wheel rather than having to take the right hand off to access the touchscreen.

Actually, the ideal solution to almost all of these problems is a combination of the steering wheel controls and another interface that already exists in the car: voice. The ideal car interface from a safety standpoint would allow you to keep your eyes on the road and both hands on the steering wheel at all times. The scrollwheel and steering wheel buttons and voice commands satisfy both conditions.

In the Model S, to issue a voice command, you press and hold the upper right button on the steering wheel and issue your voice command, after which there is a delay while the car processes your command.

Unfortunately, for now, the number of voice commands available in the Tesla Model S is quite limited:

  • Navigation — you can say "navigate to" or "drive to" or "where is" followed by an address or destination
  • Audio — you can say "play" or "listen to" and then say an artist name or song title and artist name and it will try to set up the right playlist or find the specific track using Slacker Radio (one of the bundled audio services for Model S's sold in the U.S.)
  • Phone — if you connect a phone via Bluetooth, you can say "call" or "dial" followed by the name of a contact in your phone contact book

I'm not sure why the command list is so limited. When I first got the car I tried saying things like "Open the sunroof" or "Turn on the air conditioning" to no avail.

Perhaps the hardware/software for voice processing in the car aren't powerful enough to handle more sophisticated commands? Perhaps, though it seems like voice commands are sent to the cloud for processing which should enable more sophisticated voice processing when you have cellular connectivity. Or perhaps the car can offload voice processing to select cell phones with more onboard computing power.

In time, I hope more and more controls are accessible by voice. I'd love to have voice controls passed through to my phone via Bluetooth, too. For example, I'd love to ask my phone to play my voicemails through the car's audio system, or read my latest text message. For safety reasons, it's better not to fiddle with any controls while driving, analog or touchscreen-based.

Perhaps this is a problem with a closing window given the possibility of self-driving cars in the future, but that is still a technology whose arrival date is uncertain. In the meantime, with more and more companies like Apple and Google moving into the car operating system space, I hope voice controls are given greater emphasis as a primary mode of interaction between driver and car.

* One other thing I'd love to see in a refresh of the software would be less 3D in the digital instrument cluster above the steering wheel. I have some usability concerns with the currently in-vogue flat interfaces in mobile phone UIs, but the digital instrument cluster in a car is not meant to be touched, and the strange lighting reflection and shadow effects used there in the Tesla feel oddly old-fashioned. It's one interface where flat design seems the better fit for such a modern marvel.

I Want Sandy

Does anyone remember I Want Sandy? It was one of the first virtual assistants out there, launched in 2007, I believe, but it shut down just a short while later after its creator moved on to another job.

I really loved I Want Sandy, and no other virtual assistants that have popped up since have captivated me the same way. Watching the Spike Jonze movie Her, I was reminded of why I was so taken by I Want Sandy: it was the method of interaction.

You used the service by sending Sandy emails in human readable language: “Sandy, remind me to pick up the dry cleaning at 6:50pm tonight.” Sandy would respond with a confirmation, and if I remember correctly you could have Sandy set up to either email you a reminder, text you, or both.

It was essentially a command line interface, and yet the fact that you had to write an email to use it was subtly and critically different. Though I knew it was just software on the other end, interacting with it in a manner typically reserved for interacting with other humans created a powerful illusion of intimacy and humanity. Email isn't even an efficient way of interacting with software: you have to wait for a reply email confirmation, and sometimes, if Sandy didn't understand my command, she'd reply asking me to clarify and I'd have to change my command and resend it. With an AI built right into one's calendar, you could fix such an issue immediately, and yet my brain converted the muscle memory of writing emails into a sensation of conversing with another person.
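The parsing underneath that illusion was presumably simple. Here's a toy sketch of the kind of command Sandy handled; the pattern and field names are my own guesses at the format, not Sandy's actual code:

```python
import re

# Toy parser for a Sandy-style reminder email, e.g.
# "Sandy, remind me to pick up the dry cleaning at 6:50pm tonight."
# The regex and field names are invented for illustration.
COMMAND = re.compile(
    r"remind me to (?P<task>.+?) at (?P<time>\d{1,2}:\d{2}\s*(?:am|pm))",
    re.IGNORECASE,
)

def parse_reminder(email_body):
    match = COMMAND.search(email_body)
    if match is None:
        return None  # Sandy would reply asking you to rephrase
    return {"task": match.group("task"), "time": match.group("time")}

parse_reminder("Sandy, remind me to pick up the dry cleaning at 6:50pm tonight.")
# → {'task': 'pick up the dry cleaning', 'time': '6:50pm'}
```

A few patterns like this cover a surprising share of requests; the magic was in the framing, not the parsing.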

The most efficient way to do something is not always the most human. My first short in film school had to be shot and edited on film. It was a giant pain in the ass to use the giant and ancient K-E-M flatbed editing machines to edit our film. I spent several all-nighters in one of UCLA's editing rooms splicing and taping together strips of my 16mm black and white film, running the new cuts forward and backward.

Our next short we edited on Final Cut Pro on the computer, and it was an order of magnitude faster and easier. However, editing a digitized abstraction of the actual film itself put a mental barrier between me and my movie that removed some of the intimacy from the process. I felt more detached from my movie than I had when I had been manipulating it with my own fingertips. It had been so fun on the flatbed, using a scroll wheel to ramp up the speed at which my film played forward or backward, even if the machines often broke down.

The shift from mouse and keyboard interfaces to touchscreen interfaces is another example of a method of interaction that feels more human, and if voice interaction ever gets to the point where we're speaking to our computers like Joaquin Phoenix speaks to his operating system in Her, that will be an even larger leap towards more human (humane?) interfaces and interactions.

Apple and Google are taking steps in that direction with Siri and Google Now, but I'd love a few more human touches. I think they'd make users much more tolerant of the current defects of those systems. Creating an illusion of personality is difficult, of course, but tiny flourishes go a long way. One time I recall sending I Want Sandy an email at 3 in the morning asking "her" to remind me of something the next morning. Her reply began with "Wow! You're up late! Get some sleep soon" or something like that. Simple to code, powerfully effective.

When I get reminders on iOS or Android of meetings, they always come in cold and flat. Instead of just “1 on 1 with Joe at 1:00pm” popping up with a chirp on my phone, what if it were phrased “Eugene, don't forget you have a 1 on 1 with Joe in 10 minutes!” What if it came in via a text message, as if a person were texting me? What if there were a smiley emoji at the end of the text?

I know some folks would hate that type of false anthropomorphism, but perhaps you could choose whether to turn it on or off.

I still miss Sandy. For a short period it seemed as if some folks might resurrect her, but nothing came of it. I keep thinking one of these days I'll find a note from her in my inbox.