I was talking to my phone the other day, like you do, while barreling down I-25 in Colorado. I had a practical problem to solve: determining where would be the best place to gas up, knowing that I would have to pee before the tank was empty, and wanting to make the most economical use of time and money on this leg of the road trip. I wanted to pee and gas up at the same place, buy gas at optimal price and volume, and if possible, avoid excruciating discomfort: a win-win-win. We have powerful networked computers with us at all times. Why not take advantage?
I know the road pretty well and I had reduced the information-gathering phase of the problem to a Q&A, alternating two types of questions: “How far away is X?” and “What’s the price of gas in X?” All went well for a couple of iterations until I popped the question, “How far away is Trinidad?” A heartbeat later, loud and proud, Google Assistant showed me this:
Of course what I meant was Trinidad, Colorado, not Trinidad, the island nation in the Caribbean Sea. If this had been a real-life, human-to-human conversation, Google’s response would be classified as a joke, or a Gricean fail: the algorithmically determined answer to my question (despite being factual) violated the maxim of relation, under which people engaged in conversation “try to be relevant, saying things that are pertinent to the discussion.” But the thing to note here is that this was not a discussion, or a conversation. It was a human talking to a computer, and getting a lame response back from it.
When we talk to voice-activated, speech-enabled software, we’re aware that we’re not talking to a human, despite every encouragement from the developers of these programs to act as if we are. Google Assistant asks “How can I help you?” and our lifelong conditioning is to assume that the pronoun “I” refers to another consciousness, like our own — not to thousands of lines of computer code.
The purveyors of speech agents encourage this illusion further in their documentation about them. Google has programmers working on “conversational Actions”; Amazon encourages developers to work on “engaging voice experiences” that will “delight customers,” and Apple offers SiriKit to help coders take advantage of “voice and natural language recognition . . . to get information and handle user requests.” But none of these things is a substitute for human conversation. All of them are piecemeal simulations of human conversation.
Of course I could have approached this entire exercise with my phone conversationally, as if it actually channeled a human interlocutor. I could have started out with “I wonder where I should gas up?” and then introduced the other variables that would affect my decision in a logical sequence: miles left in this tank, fluctuating price of gas, Interstate exit distribution, eventual need to pee. Or I could have delivered the scenario in a context-rich question, along the lines of “Where should I gas up because I know I’ll have to pee before I run out of gas, and there aren’t that many exits down the road, and gas might get more expensive, and I want to pee and gas up at the same place and get the best deal.” If you have ever tried either of these approaches with your phone, you will know why I didn’t bother. But I thought it would be able to handle the simpler Q&A.
The critical component missing from human-software speech interactions is the ability to evaluate context. This will be the hardest nut for AI to crack, but it will also be the only action that can turn a query-response exercise with a computer or robot into what humans engage in as conversation. The layers of context that envelop human conversation are thick, deep, and probably unfathomable by software, because humans got to the game of conversation first, leveraging millennia of evolution as sentient, information-storing organisms to do it. Here are some layers of context in my exercise with the digital assistant that it would have a hard time either knowing, finding the bandwidth for, or surmising the relevance of:
- Geographical: I’m moving southward on a highway at 70 mph while I ask my questions. GPS software could have made this known to the digital assistant in a way that would have weighted Trinidad, Colorado, as more relevant to my query than the other Trinidad; but the pointer to this fact was clearly not part of the deal in my interaction.
- Physio-mechanical: A human traveling at superhuman speed across terrain depends on a fuel source that requires periodic replenishing. This is obvious real-world knowledge for a human, but not likely to figure into the algorithms that inform a talking computer.
- Socio-economic: As you drive south on I-25 in Colorado, exits become increasingly scarce past Colorado Springs, and gas typically (but not always) becomes sharply more expensive once you are past Pueblo. Again, these are facts discoverable by software, but of unknown or possibly unknowable relevance, because the software hasn’t really grokked where I’m going and why it might matter. It’s only evaluating one question at a time.
- Human-existential: We have to pee at regular intervals and we cannot ignore the need indefinitely. The older you are, the more often you have to pee. And it is very unwieldy to attend to the need while you’re driving a car.
- Cultural: People in developed nations observe the conventions of disposing of bodily excretions with a modicum of privacy and sanitation, and we rely on the infrastructure that is predictably in place to accommodate this. Violating this cultural norm is inconvenient and usually more trouble than observing it.
All of the above are intuitively obvious to a human, to the point that they do not need to be made explicit in conversation; at the most, they can be merely alluded to as a way of making another human aware of their current relevance. We can think of some of these variables as Bayesian priors, and AI has made great strides in using priors to discern the patterns in language at the level of phoneme, morpheme, letter, word, phrase, and sentence. But at the conversational level, AI has a much bigger job to do in even getting its “mind” around what the priors are, because so many of them are drawn from the experience of being a human.
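To make the idea of a prior concrete: a minimal sketch (not any real assistant’s code, and with hypothetical coordinates and names) of how a location-based prior could have disambiguated my “Trinidad” question. Each candidate place gets a prior that decays with its distance from the speaker; the nearer Trinidad wins overwhelmingly.

```python
# A sketch of geographic disambiguation with a distance-based prior.
# CANDIDATES, scale_km, and the exponential decay are all illustrative
# assumptions, not how any actual assistant works.
import math

# (name, latitude, longitude) for the two readings of "Trinidad"
CANDIDATES = [
    ("Trinidad, Colorado, USA", 37.17, -104.50),
    ("Trinidad (Caribbean)", 10.69, -61.22),
]

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = (math.sin(dp / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def rank_candidates(user_lat, user_lon, scale_km=500.0):
    """Rank candidates by a normalized prior that decays with distance.

    Both names match the query equally well, so the likelihood term is
    uniform and the posterior is just the normalized prior.
    """
    scored = []
    for name, lat, lon in CANDIDATES:
        d = haversine_km(user_lat, user_lon, lat, lon)
        prior = math.exp(-d / scale_km)  # nearer places are a priori likelier
        scored.append((prior, name))
    total = sum(p for p, _ in scored)
    return sorted(((p / total, n) for p, n in scored), reverse=True)

# A driver on I-25 near Pueblo, Colorado:
for prob, name in rank_candidates(38.25, -104.61):
    print(f"{prob:.4f}  {name}")
```

With the speaker a hundred-odd kilometers from Trinidad, Colorado, and several thousand from the Caribbean, even this crude prior assigns essentially all of the probability to the Colorado reading, which is exactly the weighting my assistant failed to apply.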
An extraordinary thing about being a human is a fact we take completely for granted: it takes one to truly know one. When we engage in a conversation with another human in a common language, though it seems effortless to us, a fairly miraculous thing takes place: a convergence of millennia of evolution. We emit a stream of air, energized and modulated by acoustic waves from our vocal cords, to produce sounds. These sounds, which we further modify by flapping bits of meat in our mouths with astonishingly precise timing, have the capacity to transmit huge volumes of information, with impressive economy, to another human. But the efficiency of the communication depends largely on the fact that we draw on a vast store of common experience. We give very little thought to the pragmatics of conversation — why we’re actually engaging in it and what we hope to achieve with it — because we can largely figure all of that out intuitively. A computer on the receiving end of a human conversation can’t do that.
A conceit in artificial intelligence these days is that software might become an equal player in this game — that it might be able to parse all of the information contained in human speech, and respond to it with equal richness and complexity. It won’t, despite fantasies to the contrary presented to us in entertainment, like the robot TARS in “Interstellar,” the femme fatale robots in “Ex Machina,” or the digital Samantha in “Her.” Conversations with artificial intelligences will never be like this because these intelligences will always lack the thing that is our birthright: the skin envelope, equipped with sophisticated sensory equipment, learning ability, information storage capacity, and communicative virtuosity, in which we negotiate the world. What we have now with machines is a limp simulacrum of conversation with some bells and whistles that, when they work, seem to add a “human touch.”
Can we impart to machines the many channels of conversational context and implicature, all of which require considerable computation and bandwidth? I don’t think it’s an ANI problem (one requiring artificial narrow intelligence); it’s an AGI problem (one requiring artificial general intelligence). Adding capabilities to speech agents piecemeal, as developers are now encouraged to do, is not going to generalize what is narrow. It seems likely that machines will never be able to fully flesh out what we mean in what we say, in the way that we do intuitively, because they are not flesh.
But do keep talking to your phone or other smart device as if it might fully understand you. Those who work with Alexa, Siri, and Google Assistant need all the training data that they can get.