Boris Johnson during Prime Minister's Questions

Can we beat OpenAI's chatbot?

Image credit: UK Parliament/Jessica Taylor/Handout via REUTERS

ChatGPT exposes our failings more than its own.

Should we make Boris Johnson take a Turing test? Now that OpenAI’s ChatGPT has surfaced, we have to question whether we can tell the difference between someone of the former prime minister’s calibre in debate and speech-making and an AI.

ChatGPT was launched amid some big claims, such as the ability to understand and create correct software code and credible reports and articles. But its owners are less certain about just how good it is at the job – and perhaps more importantly, whether it has the ability to work out when it is doing a bad job.

It does not take long to work out where the underlying AI, a slightly trimmed version of OpenAI’s GPT-3 in the case of ChatGPT, falls down. Arithmetic is one of the more obvious weak spots. Though large language models can cope with very simple sums – ask one for two-plus-two, and you would expect it to be right – anything that takes a bit more reasoning to calculate will generally trip them up. Much of that has to do with the way the models are structured and trained. Two-plus-two is easy because it’s a phrase that often appears in text. Bigger numbers often stump the models because they have never been trained on the intricacies of carrying digits. 
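To see what "carrying digits" actually involves, the schoolbook procedure can be written out in a few lines of Python. This is a minimal sketch for illustration only: the function name add_with_carry is hypothetical, and nothing here reflects how GPT-3 or ChatGPT works internally. It simply makes explicit the step-by-step rule that the paragraph above says the models have not been trained on.

```python
def add_with_carry(a: str, b: str) -> str:
    """Add two non-negative decimal numbers given as digit strings.

    Hypothetical helper: works column by column from the right,
    carrying into the next column exactly as taught at school.
    """
    # Pad the shorter number with leading zeros so the columns line up.
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)

    digits = []
    carry = 0
    # Walk the columns from least significant to most significant.
    for digit_a, digit_b in zip(reversed(a), reversed(b)):
        total = int(digit_a) + int(digit_b) + carry
        digits.append(str(total % 10))  # digit written in this column
        carry = total // 10             # amount carried to the next column

    # A carry left over after the final column becomes a new leading digit.
    if carry:
        digits.append(str(carry))

    return "".join(reversed(digits))
```

Calling add_with_carry("2", "2") gives "4" with no carrying at all, while add_with_carry("987654321", "123456789") has to carry in every column to reach "1111111110". The second case is the kind of sum that cannot be answered by recalling a familiar phrase.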

There is a way round this with a model like GPT-3, identified by researchers at MIT’s CSAIL, where the trick is to feed the model a few carefully chosen prompts before asking for the result you want. However, this is reminiscent of an example that DeepMind computer scientist Ian Goodfellow often cites: Clever Hans, a performing horse that supposedly showed a talent for arithmetic by stamping a hoof the right number of times in response to a question. In reality, Hans simply picked up on the tells provided by his owner at key points.
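The idea of feeding a model a few chosen prompts before the real question can be sketched as nothing more than string assembly. The example below is an illustration built on assumptions: the "Q:"/"A:" layout, the worked sums and the function name build_few_shot_prompt are all invented here, and they are not the prompts used by the CSAIL researchers. Sending the text to a model is deliberately left out.

```python
def build_few_shot_prompt(examples, question):
    """Assemble a prompt that shows worked examples before the real question.

    `examples` is a list of (question, answer) pairs. The Q:/A: layout is
    an assumed convention for this sketch, not a documented requirement.
    """
    lines = []
    for example_question, example_answer in examples:
        lines.append(f"Q: {example_question}")
        lines.append(f"A: {example_answer}")
    # The real question goes last, with the answer left open for the model.
    lines.append(f"Q: {question}")
    lines.append("A:")
    return "\n".join(lines)


# Hypothetical worked examples chosen for this illustration.
worked_examples = [
    ("What is 128 + 367?", "495"),
    ("What is 905 + 286?", "1191"),
]
prompt = build_few_shot_prompt(worked_examples, "What is 4578 + 3694?")
```

The resulting prompt ends with an unanswered "A:" line, so whatever the model writes next is nudged by the pattern set in the examples. That is also why the Clever Hans comparison bites: the hints are supplied by the person asking.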

The lucky thing for ChatGPT is people don’t typically ask it to act as the world’s most expensive, energy-intensive calculator. Instead, it winds up being tested on tasks that humans can find tricky or are simply too lazy to research. This is where the problem with this kind of language model really sits.

The unsettling part of ChatGPT is not how good it is. It does not take long to examine a language model to find that, at its core, it has no real idea what it is doing. It can connect concepts that it has seen presented together but the link between them is tenuous. But that does make it good enough to bang on about things that will make people think it’s doing a bang-up job. Its ability at that is exposing just how much human society revolves around non-sequiturs and vaguely competent-sounding rubbish, an issue that philosopher Harry Frankfurt described in his 1986 essay and later book: On Bullshit.

It is perhaps not surprising that several years ago IBM chose public debate as the demonstration of the abilities of its Project Debater AI. Debating has a much better image than its reality, which itself reflects the persuasive power of those who tend to do well in public debates. More tends to hinge on emotive language and framing issues in a positive or negative way than actual evidence. In the event, this proved instrumental in the machine losing, but largely because the audience was only too aware that a machine was reeling off points, rather than a poor choice of tactics in itself in an environment where a fully reasoned argument is usually a hindrance.

Interviewed by newspaper The Hindu after the contest, Project Debater’s adversary Harish Natarajan explained, “One way in which I was able to defeat the machine was to think a little bit more about human emotion and the way those operated. To be precise the machine did talk about emotion. It asked what it was like to be poor and noted the terrible conditions that an individual suffered through. But this never came across to the audience as genuine for the obvious reason that these emotions were not genuine from the machine.”

The problem we have as humans is that we don’t use reasoning skills nearly as much as we would like to pretend. This argument lies at the core of Dan Sperber and Hugo Mercier’s book The Enigma of Reason, which is a popular read in some machine-learning circles, and is a little different from that made by Daniel Kahneman in his own Thinking, Fast and Slow, which also makes regular appearances in machine-learning reading lists.

The language models that power things like ChatGPT work almost on the instinctual level: connecting elements that seem connected because they appear more often than not close together in text. For some AI researchers, that corresponds to Kahneman’s System 1, though some such as University of Washington professor of computer science Yejin Choi have argued that the reflex-action Perception stage, which comes before the “think fast” System 1, is a closer match for what happens in today’s crop of artificial neural networks.
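A toy version of "connecting elements that appear close together in text" can be written as a word-pair counter. This is a deliberately crude sketch and an assumption-heavy one: real language models are neural networks trained on vast corpora, not lookup tables, and the function names train_bigrams and most_likely_next are hypothetical. It only illustrates how a familiar continuation can be produced with no notion of what the words mean.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count which word follows which in the training text."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        following[current_word][next_word] += 1
    return following


def most_likely_next(following, word):
    """Return the word seen most often after `word`, or None if unseen."""
    counts = following.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Tiny made-up corpus for this illustration.
corpus = "two plus two is four . two plus three is five . two plus two is four"
model = train_bigrams(corpus)
```

Here most_likely_next(model, "plus") returns "two" simply because that pairing turns up most often in the corpus, and most_likely_next(model, "seven") returns None because the word was never seen. Nothing in the table adds anything up.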

System 2, or full reasoning, is not something that today’s AI models do. Getting there almost certainly needs additions to the existing architectures, if not a complete change, and possibly much greater progress towards what the research community calls artificial general intelligence. Our problem, though, when competing with these machines, as Sperber and Mercier frequently point out in their book, is that we don’t use anything like System 2 all that much in everyday circumstances, even when it seems System 2 is in action. We just assume we and others do.

So, when in 2019 broadcaster Jeremy Vine approvingly described Johnson arriving at some awards bash in the days before he became prime minister and formulating a speech on the spot that largely consisted of combining quickly learned keywords such as “securitisation” with some long-remembered crowd pleaser about the movie ‘Jaws’, you have to wonder whether the only thing we have over the machines is that we have an easier job of being convincingly human. But with the words spoken by an actor, a deepfake video or inscribed into a written email, that advantage subsides quickly.

For the moment there are tells, as Bosworth MP Luke Evans found in using a script in Churchillian language provided by a ChatGPT session, in the Christmas adjournment debate on Tuesday (20th December). Kevin Brennan, MP for Cardiff West, asked, “Did the honourable gentleman write that himself or was it written by artificial intelligence?”

Evans responded by talking about how it represents an incredible step forward “but with that come huge issues about autonomy, liability, fairness, safety, morality and even ownership of creativity…the problem with such intelligence is that we risk creating an echo chamber. Now when a sixteen-year-old writes a school essay on what happened with Brexit, the algorithm will drive an answer, which will be read and put into a marking algorithm”.

The one advantage we still have in avoiding that vicious circle is that it is highly likely ChatGPT will make some critical mistakes. But, given our own prejudices, opinions and frequent lack of attention to detail, will we notice? And when will we demand proof of life as a result?

