One of the things anyone who has worked on textual conversation interfaces, like chatbots, will tell you is that the challenge is dealing with the long tail of crazy things people will type. People love to abuse chatbots. Something about text-based conversation UI's invites Turing tests. Every game player remembers the moment they first abandoned their assigned mission in Grand Theft Auto to start driving around the city crashing into cars, running over pedestrians, just to exercise their freedom and explore just what happens when they escape the plot mechanic tree.
However, this type of user roaming or trolling happens much less with voice interfaces. Sure, the first time a user tries Siri or Alexa or whatever Google's voice assistant is called (it really needs a name, IMO, to avoid inheriting everything the word "Google" stands for), they may ask something ridiculous or snarky. However, that type of rogue input tends to trail off quickly, whereas it doesn't in textual conversation UI's.
I suspect some form of the uncanny valley and blame the affordances of text interfaces. Most text conversation UI's are visually indistinguishable from those of a messaging UI used to communicate primarily with other human beings. Thus it invites the user to probe its intelligence boundaries. Unfortunately, the seamless polish of the UI isn't matched by the capabilities of chatbots today, most of which are just dumb trees.
On the other hand, none of the voice assistants to date sounds close to replicating the natural way a human speaks. These voice assistants may have more human timbre, but the stiff elocution, the mispronunciations, the frequent mistakes in comprehension, all quickly inform the user that what they are dealing with is something of quite limited intelligence. The affordances draw palpable, if invisible, boundaries in the user's mind, and they quickly realize the low ROI on trying anything other than what is likely to be in the hard-coded response tree. In fact, I'd argue that the small jokes that these UI's insert, like answering random questions like "what is the meaning of life?" may actually set these assistants up to disappoint people even more by encouraging more such questions the assistant isn't ready to answer (I found it amusing when Alexa answered my question, "Is Jon Snow dead?" two seasons ago, but then was disappointed when it still had the same abandoned answer a season later, after the question had already been answered by the program months ago).
The same invisible boundaries work immediately when speaking to one of those automated voice customer service menus. You immediately know to speak to these as if you're addressing an idiot who is also hard of hearing, and the goal is to complete the interaction as quickly as possible, or to divert to a human customer service rep at the earliest possible moment.
[I read on Twitter that one shortcut to get to a human when speaking to an automated voice response system is to curse, that the use of profanity is often a built-in trigger to turn you over to an operator. This is both an amusing and clever design but also feels like some odd admission of guilt on the part of the system designer.]
It is not easy, given the simplicity of textual UIs, to lower the user's expectations. However, given where the technology is for now, it may be necessary to erect such guardrails. Perhaps the font for the assistant should be some fixed-width typeface, to distinguish it from a human. Maybe some mechanical sound effects could convey the robotic nature of the machine writing the words, and perhaps the syntax should be less human in some ways, to lower expectations.
One of the huge problems with voice assistants, after all, is that the failures, when they occur, feel catastrophic from the user perspective. I may try a search on Google that doesn't return the results I want, but at least something comes back, and I'm usually sympathetic to the idea that what I want may not exist in an easily queryable form on the internet. However, though voice assistant errors occur much less frequently than before, when they do, it feels as if you're speaking to a careless design, and I mean careless in all sense of the word, from poorly crafted (why didn't the developer account for this obvious query) and uncaring (as in emotionally cold).
Couples go to counseling over feeling as if they aren't being heard by each other. Some technology can get away with promising more than they can deliver, but when it comes to tech that is built around conversation, with all the expectations that very human mode of communication has accrued over the years, it's a dangerous game. In a map of the human brain, the neighborhoods of "you don't understand" and "you don't care" share a few exit ramps.