Transformer Models are a Clutch
How should we think about (the flavor of) AI known as transformer models or large language models (LLMs)? Should we be afraid? Should we be concerned? What kind of thing are they?
There’s a widely-read essay by Ted Chiang, one of my favorite authors, comparing these models to a “blurry JPEG of the web” — a hazy picture, a kind of imperfect memorization engine. Language models like ChatGPT spit out text that sounds authoritative, that seems all-knowing, while in reality they’re just skating over the thin ice of statistics. And often the ice is so thin we end up all wet, in a bath of hallucination.
Therefore, these models are — for anyone who cares about understanding the world — an object of suspicion. For Chiang, yes, we should be afraid and concerned, because in a world where the truth is so hard to make out, we are being further misled.
This is heartfelt, and I can relate to it. But I can’t go along with it. For one thing, Chiang seems wilfully blind to the demonstrated (and quite mysterious — and worthy of curiosity) ability of transformer models to truly manipulate ideas and generate fresh ones. For another thing, he assumes — with a pessimism so profound it verges on wishful optimism — that progress on this technology will stop, here, in hallucination land.
It won’t.
But aside from all that, Chiang’s thinking on LLMs is just . . . a bit confining. I want to offer another point of view, to enlarge a bit: these models are partly a way to know things, yes, because they retain a lot of information that they have learned. But bigger than that, transformer models will be a way to do things. Blurry JPEGs don’t do anything, they just sit there. Language models, in reality, open up a new domain for action.
So I’ll tell you one very specific thing transformer models will be used to do.
(They will do lots of other things, but I’ll describe one.)
I’ve made a lot of software. It’s so fascinating, honestly. When you make software, the game you play is this: you come with your human hands into a machine palace of crystal, a clockwork labyrinth where every gear is made out of diamond and links with hundreds of other gears, all set in that particular arrangement by other people. It’s a social space. The palace will do exactly what you tell it to do — if you can get the gears to interlock the right way — and it will do it as much as you want.
But it’s monstrously, numbingly literal; if you put one semicolon in the wrong place, the entire thing doesn’t work. If you put an extra zero somewhere, then you’re suddenly in the position of Mickey Mouse in the Sorcerer’s Apprentice, thinking to command a simple mop and bucket to do the chores, and instead, you’re in a chaos of thousands of mops and buckets wrecking everything, and there’s a river of water running down the steps.
Even when it’s under control, it’s just different from us. That simple piece of logic that counts from 1 to 1000 — that loop, that little twist — it will count from 1 to 1000 every time you ask it to, forever, until the machine decays; it will never skip a number, it will never get tired, it will never reach 1001. A person made of human stuff, an approximator, one of the roughs, can’t fully grasp a thing like that loop; it’s not a natural object. It doesn’t work the way we do. It’s rigid in a way that we are not. It’s loyal, but it’s unforgiving.
In my job as a software designer and maker and thinker, a lot of what I focused on was what we call “human-computer interaction”. I was interested in (and it’s a fun privilege to work on) developing novel ways for humans and computers to relate. I worked on new types of data visualization, to help people see and work with information in ways that had not yet been tried; I worked on helping to invent large-scale touchable pixel walls, and rolled them out into public spaces in airports; I worked with hand gesture, body pose, and proximity as novel forms of computer input. I created collaboration systems to try to help people work together better, via their technology.
The unifying thread in all this is really just one question: how might people (with their organic intents and purposes) relate more usefully to and through computers (with their rigid, alien potential)? How to mediate between these two worlds? How to translate?
And in fact, most software engineers find themselves in a fairly similar position. Even if they aren’t working in human-computer interaction per se, the job is a translator job: to ferry between the world of human concerns and needs — lists of requirements, graphic designs, bug reports, whiteboard sketches — and the world of the unforgiving clockwork palace, with its spectre of a million mops and buckets going crazy.
Software makers are a kind of clutch in the system — that sliding metal plate between two independent worlds that are rotating at different, incompatible speeds. The clutch helps to synchronize. Without the clutch, the gears just grind, and nothing useful happens, and no one goes anywhere.
This negotiation between the machine world and the human world happens at other levels, too. It happens all the time to us as end-users of software, mostly unconsciously. Think about a piece of ordinary consumer software, like Spotify. Maybe, like me, you use Spotify on your phone, but also in your car. Maybe even on your computer as well. There are numerous different versions of Spotify.
And each version of Spotify has a different feature set, which you’ve had to learn, basically by memorization. The version in the car can’t add songs to a playlist, and it can’t let you tweak the sound with an equalizer. Kind of a strange limitation, but maybe it just “makes sense to you” that it works that way?
Spotify on a phone can do both of those tasks, but it can’t do other things, like multi-select a bunch of songs and drag them into a playlist all at once. On the computer, you can do exactly that. But — on the computer, you can’t boss Spotify around with speech in some of the ways that you can in the car. And if you touch computer-Spotify with your fingers, nothing happens.
So, we have three distinct pieces of software here, all called “Spotify”, and they’re each as rigid and hierarchical as the underlying software palace they sprang from. Each one ships from the software factory exactly as it is; there is no one on earth, really, who can make them be any different from what they are, except the software teams at Spotify that make them. There is no way for a user to mix and match capabilities between these three experiences of Spotify, to suit their purposes — let alone mix them with the capabilities of some other piece of software. “Hey Spotify, put those last three songs in a new playlist called ‘birthday,’ and also put in that disco song that came up yesterday when I was in the shower, and share this playlist with Eve, but use WhatsApp to do it.” No.
This rigidity of software systems is so ordinary we may not see it any more. As users of the software, we’re accustomed to being the malleable part. People who are comfortable with technology do this bridging function, they feather this clutch, all the time. They comport themselves to fit the crystal structure that doesn’t comport to them (because it can’t). They use Spotify in the dashboard of their car and then fish around for their phone while they’re driving, so they can achieve what they want (put a song in a playlist) with the other version of Spotify.
And it’s not Spotify’s fault; the designers of Spotify have to play by the same rules as every other software designer in the palace. If they start adding a bunch of features, the product stops being “intuitive,” whatever that really means, and the product dies. Most of the work of successful software design is actually saying no to things, because each software interaction surface will really only hold so much weight. You have to make hard choices about what to include, and not everything is going to fit.
What does this have to do with language models? Well, hopefully it’s already clear — language models aren’t like Spotify at all. They don’t come from the crystal palace, they come from . . . someplace else. They are literally — and I do mean this literally — grown. Not authored. Not machined.
And so they turn out soft, like people. They turn out to be capable of capturing — and indeed, matching — the skewed, organic nature of human language and human ways of thinking. But they also have the patience and persistence of machines; they will also tirelessly mop the floor, forever. And yet they have enough common sense to represent and hold onto the intuition that we only wanted one mop and bucket. They can hold the concept that our goal was for the floor to be cleaned — and that we didn’t want a waterfall down the stairs.
Transformer models are a new, third thing that fits into the space between human ways and software ways.
So I’m reasonably certain that they will be used in just the translator position that I’ve been describing. Here’s a recipe for how they could fit in:
1. Find a way to represent — to write down — the full set of capabilities of Spotify, across all its software surfaces. This can come from the pixels of the interface, from accessibility layers that have already been built, from manual annotations, from existing APIs, from code comments, or from all the code. (There’s a small sketch of what this could look like just after this list.)
2. Teach this content to a language model. (Update the model when Spotify changes.)
3. Build interfaces that put the model into the loop of people’s interactions with Spotify. This could happen by supplements (help me do it) and by substitutes (do it for me).
4. Goto 1; now do this for every app.
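To give steps 1 and 2 a concrete shape, here is a minimal sketch, in Python, of what “writing down” an app’s capabilities might look like, in a form a model could then be taught or shown. Everything in it is hypothetical: the Capability record, the particular Spotify and WhatsApp capabilities, the list of surfaces. It’s one plausible encoding, not anything these apps actually expose.

```python
from dataclasses import dataclass

# Step 1 (hypothetical encoding): one capability of an app, written down in a
# form a language model can read -- wherever it was harvested from: pixels,
# an accessibility layer, annotations, an API, or the code itself.
@dataclass
class Capability:
    app: str
    name: str
    description: str   # plain language, for the model
    parameters: dict   # parameter name -> plain-language description
    surfaces: list     # where it exists today: "phone", "car", "desktop"

CAPABILITIES = [
    Capability("Spotify", "add_to_playlist",
               "Add one or more songs to a named playlist.",
               {"songs": "song identifiers", "playlist": "playlist name"},
               ["phone", "desktop"]),
    Capability("Spotify", "recently_played",
               "Return the songs played most recently.",
               {"limit": "how many songs to return"},
               ["phone", "car", "desktop"]),
    Capability("WhatsApp", "share",
               "Send a link to a contact.",
               {"contact": "person to send to", "content": "what to send"},
               ["phone"]),
]

def capabilities_as_text(caps):
    """Render the capability list as plain text for step 2: teach it to a model."""
    return "\n".join(
        f"{c.app}.{c.name}({', '.join(c.parameters)}) -- {c.description}"
        for c in caps
    )

print(capabilities_as_text(CAPABILITIES))
```

The point isn’t the exact shape; it’s that the capability set becomes legible, in language, to something that reads language.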
Once the model can begin to marshal the user’s intentions across all available apps, in the aggregate, the concept of apps will start to dissolve, and now we’re getting somewhere.
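Here is the continuation of the sketch: step 3, with a model in the loop, marshalling one spoken intention into capability calls across two different apps. The model itself is stubbed out, since the point is the shape of the loop, not any particular vendor’s API; a real system would hand the capability text and the user’s words to a language model and parse a structured plan out of its reply. The plan and the little print-only adapters are, again, hypothetical.

```python
# Continuing the sketch: step 3, the model in the loop. plan_with_model is a
# stand-in for a real model call; it returns the kind of structured plan a
# model might produce from the user's words plus the capability text.
def plan_with_model(user_intent, capability_text):
    # A real system would send capability_text + user_intent to a language
    # model here, and parse its reply. This stub just shows the output shape.
    return [
        ("Spotify.recently_played", {"limit": 3}),
        ("Spotify.add_to_playlist", {"songs": "<those three>", "playlist": "birthday"}),
        ("WhatsApp.share", {"contact": "Eve", "content": "playlist: birthday"}),
    ]

def execute(plan, handlers):
    """Dispatch each planned call to whichever app actually implements it."""
    for call, args in plan:
        handlers[call](**args)

if __name__ == "__main__":
    intent = ("Put those last three songs in a new playlist called birthday, "
              "and share it with Eve, but use WhatsApp to do it.")
    plan = plan_with_model(intent, capability_text="")
    execute(plan, handlers={
        # Hypothetical adapters into each app; here they only print.
        "Spotify.recently_played": lambda limit: print("fetch last", limit, "songs"),
        "Spotify.add_to_playlist": lambda songs, playlist: print("add", songs, "to", playlist),
        "WhatsApp.share": lambda contact, content: print("send", content, "to", contact),
    })
```

Notice that nothing in the dispatcher cares which “version” of an app a capability came from; the model plans against the whole capability set, and the boundaries between surfaces stop mattering.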
Running this play doesn’t require any further technical progress in LLMs — mostly, the change has to come in the ground rules of apps, operating systems, APIs, and business incentives. (And there are some with incentives to resist this! Microsoft tried to hold back a similar tide in the 1990s, seeing business model displacement from the rise of the web.) It will probably take a few generations of technology to work this through, maybe a decade or more.
But it will play out, because the gravitational pull will be strong. This is how people wanted software to work all along. They just didn’t know to ask for it yet.
And once it starts to work this way, it will rewrite human-computer interaction — it will change the contract between people and software. I can’t see how this doesn’t overturn the field that I’ve spent a lot of my career working in.
Playing this mediator role, sitting at the human-computer interface layer, isn’t all that language models can do; but it’s one thing that they can and will do — transformer models will translate our organic intentions into computational action.
They are a clutch.