As I start to write this article, I can tell you that, according to our friend Ian Betteridge, the answer to the question in the headline is No.
But that's not the whole story. This article, as Part One, will give what I hope is an easy-to-understand explanation of how the Large Language Model behind Copilot actually works, and will conclude that, precisely because of how it works, the answer to the question of whether you can trust the summary has to be No.
I don't want to just make a confident assertion about something which is becoming a shibboleth in our industry without supporting evidence, so Part Two will look at Copilot's summary of this article and see how it stacks up.
What is a Large Language Model? How does it work?
One description which has been coined for LLMs is 'Stochastic Parrots'. No, I don't really get what that actually means, either.
The description I've coined for them is, essentially, that they are Probability Engines for Words.
At the most basic level, the predictive word finder on your phone is a probability engine - you poke out a word, and it guesses the three words most likely to follow it. Another example of a probability engine is a Markov Chain. A Markov Chain takes a body of work - in this case, text - as an input; it looks at each individual component, or word, in the corpus, and for each occurrence of that word it notes which words follow it, and thus the probability that any given word follows any other given word. Take the sentence
The cat sat on the mat
There are five unique words in that sentence. The word 'the' is followed by the words 'cat' and 'mat', so the probability of either 'cat' or 'mat' following 'the' is 1/2. The other words each appear only once, so the probability of the word 'sat' following the word 'cat' is 1, and so on. So were you to ask your Markov Chain generator to create some new text based on the corpus of that sentence, it would either generate for you the sentence
The cat sat on the mat
or the sentence
The mat sat on the cat
because, probabilistically speaking, both those sentences are equally likely.
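If you want to see that bookkeeping in action, here's a minimal sketch in Python - purely illustrative, not anything Copilot actually runs - which tallies which words follow which in that sentence and turns the counts into the probabilities described above.

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat".split()

# Tally which words follow each word in the corpus.
followers = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    followers[current_word][next_word] += 1

# Turn the raw counts into probabilities.
for word, counts in followers.items():
    total = sum(counts.values())
    probabilities = {nxt: count / total for nxt, count in counts.items()}
    print(word, probabilities)

# the {'cat': 0.5, 'mat': 0.5}
# cat {'sat': 1.0}
# sat {'on': 1.0}
# on  {'the': 1.0}
```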
If we extend the sentence to read
The cat sat on the mat by the door
we change the probabilities of what might follow the word 'the' to 1/3 for each of the words 'cat', 'mat' and 'door', so your Markov Chain generator might generate the sentence
The door sat on the cat by the mat
The Markov Chain has no idea what these words represent or mean; it literally sees each word in one single dimension - it's a word - plus the probability that any other word might follow it. So a Markov Chain generator is just as likely to generate complete gibberish, or text which looks meaningful at face value but is absurd - neither mats nor doors ever sit on cats, despite what the frog might have you believe. You can control just how gibberishy, as opposed to how faithful to the corpus, the generated text is by the Order - an Order 1 Markov Chain will be as random as it gets, whilst an Order 10 chain will be almost identical to the text you gave it as a corpus.
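To make the idea of Order concrete, here's a toy sketch - again, purely illustrative - of a word-level chain whose 'state' is the last N words. The higher the order, the more context pins down what can come next, and the more faithfully the output tracks the corpus.

```python
import random
from collections import defaultdict

def build_chain(words, order=1):
    """Map each run of `order` consecutive words to the words observed to follow it."""
    chain = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chain[state].append(words[i + order])
    return chain

def generate(chain, start, max_words=20):
    """Walk the chain one word at a time until we reach a state with no followers."""
    output = list(start)
    for _ in range(max_words):
        state = tuple(output[-len(start):])
        followers = chain.get(state)
        if not followers:
            break
        output.append(random.choice(followers))
    return " ".join(output)

words = "the cat sat on the mat by the door".split()

# Order 1: a single word of context, so the output wanders at random.
print(generate(build_chain(words, order=1), start=("the",)))

# Order 3: three words of context, which here can only reproduce the original sentence.
print(generate(build_chain(words, order=3), start=("the", "cat", "sat")))
```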
But Copilot / ChatGPT / whatever doesn't generate gibberish
So the question I'm putting in your mouth there is 'so how does that work?'. Let me give you the answer - it works by assigning some degree of meaning to what each individual word represents, by way of having been 'trained', and then employing a Neural Network. What's a Neural Network and how does it work? Go and read the Wikipedia article linked there. Oh, you don't understand a word of that Wikipedia article? Congratulations, you are just like the overwhelming majority of the human race, including, in fact, most of the people responsible for creating and maintaining them, in not having the faintest idea how they actually work. Srsly, even among the people who understand the maths and the principles, very few truly understand how a neural network works; they just accept that it kind of does and prefer not to be asked too many detailed questions.
But you didn't come here to be fobbed off just like that, so let me try to summarise my understanding of somebody else's summary of somebody else's understanding of somebody else's summary of somebody else's understanding of it.
The Multidimensional Matrix
The Markov Chain just treats the words in the corpus in one dimension - the simple fact of them being, in fact, words.
But language is multidimensional; if you think back to English at school, it's made up of what your English teacher called 'parts of speech'. 'The' is an article - the definite article, so to speak - 'cat', 'mat', and 'door' are nouns, 'sat' is a verb, and 'on' is a preposition. The human beings training the LLM will have assigned each word in the corpus that first dimension, of parts of speech.
However, there are other dimensions to language - a 'cat' is an 'animal', just like a 'dog' or a 'frog' is also an 'animal'. A 'door' is a fixture in a 'building', just like a 'window' is. A 'wall' is part of the structure of a 'building', just as a 'roof' or a 'floor' is, and inside a building you might find a 'sofa' or a 'table' or a 'chair', which are items of 'furniture', and on the floor you might find a 'carpet' or 'tiles' or a 'mat' which are 'floorcoverings'.
Nouns are often paired with adjectives, so you might have a 'black' cat, a black mat, or a black door. But whilst you'll often find a 'green' door or a green mat, unless the CRISPR People go rogue you're unlikely to ever find a green cat, even though you'll often find a green frog; you'll often see a 'fluffy' cat and a fluffy mat, but you'll rarely see a fluffy frog or door. At least you're unlikely to see a fluffy door outside certain specialist private entertainment establishments.
Thus, the people training an LLM have the tedious and meticulous job of logging every word in the corpus and assigning to it, according to the multidimensional matrix of semantic and syntactic meanings, what it represents within the sentence. There's a certain irony that the process of training an AI is quite, erm...
...robotic!
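To give a flavour of what those extra dimensions might look like, here's a toy, hand-labelled sketch. The dimension names are entirely invented for illustration - a real model works with thousands of dimensions, not a handful - but it shows how a few extra labels per word let a program 'know' things a bare Markov Chain cannot.

```python
# A toy "multidimensional matrix": each word gets a handful of hand-picked dimensions.
# The dimensions here are invented purely for illustration.
words = {
    "cat":  {"pos": "noun",    "animal": 1, "part_of_building": 0, "on_floor": 1, "can_be_fluffy": 1},
    "frog": {"pos": "noun",    "animal": 1, "part_of_building": 0, "on_floor": 0, "can_be_fluffy": 0},
    "mat":  {"pos": "noun",    "animal": 0, "part_of_building": 0, "on_floor": 1, "can_be_fluffy": 1},
    "door": {"pos": "noun",    "animal": 0, "part_of_building": 1, "on_floor": 0, "can_be_fluffy": 0},
    "sat":  {"pos": "verb",    "animal": 0, "part_of_building": 0, "on_floor": 0, "can_be_fluffy": 0},
    "the":  {"pos": "article", "animal": 0, "part_of_building": 0, "on_floor": 0, "can_be_fluffy": 0},
}

def plausible_sitter(subject, surface):
    """A crude rule in the spirit of the article: animals sit on things found on the floor."""
    return words[subject]["animal"] == 1 and words[surface]["on_floor"] == 1

print(plausible_sitter("cat", "mat"))   # True  - cats do sit on mats
print(plausible_sitter("mat", "cat"))   # False - mats are not animals, so they don't sit on things
```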
So with this extra information available to it through its training, the Probability Engine 'knows' that a door is a fixed part of a building, that a mat will be found on the floor, and that a cat is an animal which often sits on things, often on the floor. On that basis - all the additional probabilities of words with certain functions within our language following other words with certain linguistic functions - ChatGPT is highly likely to generate the sentence
The cat sat on the mat by the door
and is highly unlikely to generate the sentence
The mat sat on the door by the cat
All in all, each word in the English language ends up being categorised in about 12,000 dimensions.
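As a toy illustration of how those dimensions turn into probabilities, here's a sketch which scores a handful of candidate next words against a made-up 'context' vector. The three dimensions and all the numbers are invented, and the real machinery works over thousands of dimensions, but the principle - the candidate whose dimensions best match the context gets the highest probability - is the same.

```python
import math

# Invented "context" after seeing "the cat sat on the ...": roughly, "expect a
# floor-level, sit-on-able noun next". A real model derives this from the whole
# preceding text across thousands of learned dimensions.
context = [0.1, 0.9, 0.8]               # [animal-ish, found-on-the-floor, sit-on-able]

# Invented word vectors in the same three dimensions.
candidates = {
    "mat":  [0.0, 1.0, 1.0],
    "door": [0.1, 0.0, 0.1],
    "cat":  [1.0, 0.3, 0.2],
}

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    exps = {word: math.exp(score) for word, score in scores.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}

# Score each candidate by how well its vector lines up with the context vector.
scores = {word: sum(c * v for c, v in zip(context, vec)) for word, vec in candidates.items()}
print(softmax(scores))   # 'mat' comes out as by far the most probable next word
```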
Hallucinations
Now whilst we know one is unlikely ever to encounter a slimy green cat, or indeed a fluffy tabby frog, depending on how deeply the LLM was trained on the dimensions of the corpus (i.e., how much money the trainer wanted to spend on how many humans spending how much time tediously clicking buttons for every single word), the neural network may have been told the extra specific information that you get green frogs and doors, that you get black cats and mats and frogs and doors, and that you get fluffy cats but you don't get fluffy frogs. Hopefully it's been told that cats sit on mats and frogs sit on logs, but similarly it may or may not have been told that dogs don't sit on frogs.
So depending on how meticulously the AI was trained on the corpus of text, it's still entirely possible for the generator to 'hallucinate' and generate some text which describes frogs as fluffy little things being sat on by dogs being laughed at by a green cat. Indeed, no matter how meticulously trained the model is, a general language model is probably more likely to have been trained to 'think' a dog might sit on a frog, because a lot of the text in the corpus will be works of fiction in which literally anything can happen, and it's unlikely that the training would have covered the unlikelihood of dogs sitting on frogs. Or you can deliberately encourage the LLM to hallucinate by asking it to write you a four-paragraph essay about the influence of Led Zeppelin on the development of modern contemporary sculpture.
The AI has been told that certain words have certain linguistic functions, and it's been told that certain types of words can be connected to certain other types of words, but it still no more understands what a cat or a dog or a frog actually is than the Markov Chain does. It's been told what an essay is, it's been told how many words are in a typical paragraph, it's been told that Led Zeppelin were an influential rock band, but ask it to generate an essay about Led Zeppelin and sculpture and it'll do just that, on the same probabilistic basis on which the Markov Chain generates its text - all that has changed is that the probabilities of words following other words are much more refined. When your Large Language Model is generating text, it's still doing it the same way a Markov Chain does - a word at a time, based on the probability that certain words follow other words in the corpus of text it has been trained on.
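Stripped of all the clever machinery, the generation loop looks something like this sketch. The stub standing in for the trained model is invented - in reality that's where the neural network and its thousands of dimensions live - but the one-word-at-a-time, never-look-back shape of the loop is the point.

```python
import random

def next_word_probabilities(text_so_far):
    """Stand-in for the whole trained model. In reality this is where the neural
    network lives; here it just returns an invented distribution over a few words."""
    return {"mat": 0.7, "door": 0.2, "frog": 0.1}

def generate(prompt, max_words=50):
    """One word at a time, sampled from the probabilities, with no going back
    to sense-check what has already been written."""
    words = prompt.split()
    for _ in range(max_words):
        probabilities = next_word_probabilities(" ".join(words))
        next_word = random.choices(list(probabilities), weights=list(probabilities.values()))[0]
        words.append(next_word)
    return " ".join(words)

print(generate("the cat sat on the", max_words=5))
```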
What is a summary, then?
I'm very glad you asked me that question. A summary is a shortened version of a longer piece of text.
But it's something more than just a shortened version of a long text. In a proper summary, the person doing the summarising knows which parts of the longer piece are important and which parts are descriptive fluff. Because they understand the text they're summarising, they know what will be easy to understand and so won't need rewording, and what will be harder to understand and so will need rewording. And they'll read it through in the round, hopefully several times, to give it a final sense check and establish whether they've covered the whole piece adequately.
Copilot does not do this, however. Copilot does not understand the text; Copilot has no sense of what might be important and what might be an amusing cultural allusion written for the purpose of raising a smile amongst knowing readers, such as my references here to frogs sitting on logs. When you ask Copilot to summarise a long article, it's basically going to do what it does when you ask it to write you a four-paragraph essay - it's just going to look at the prompt and the probabilities of each word following each other word in the prompt, and continue word by word in that same probabilistic manner. If you ask for a four-paragraph summary, it'll generate four paragraphs of words based on the probabilities of all the words following all the other words, and it'll do it word by word, and it won't go back and sense-check what it's done because it has no sense.
Does it really matter?
This is a reasonable question, which can in part be answered by another question - does the original text it's being asked to summarise actually matter? If all you're doing is asking Copilot to summarise what was discussed in a meeting based on the transcript of what was said, just to give a flavour of the discussion for the benefit of people who aren't going to bother reading the full (corrected, if anybody can be bothered correcting it) transcript and who have nothing more than a passing interest in what was discussed, then it probably doesn't matter.
And indeed, getting Copilot to generate a first-pass summary of the text can do a lot of the heavy lifting, leaving the person responsible for it more time to make sure the summary is fair and to fill in the gaps. Many years ago I worked with somebody whose mother was a professional translator, and I once asked if she felt threatened by machine translation; the reply was, not a bit of it - many professional translators use machines for the first pass, so they can then concentrate on getting the nuance right. If you, as the person doing the summary, have read through the original text and noted the important facts, and you've then gone and edited Copilot's summary to ensure no important facts are left out and no trivialities are included, then that's a good use of your time in not having to do the actual writing of the bulk of the summary.
But if the longer piece of text actually is important - if important decisions were recorded in the original, or if important decisions are going to be made on the basis of the summary - then be very careful about simply accepting whatever the Probability Engine squirts out at you. In all probability, serious mistakes could be made by doing so.
So, is my assertion supported by the evidence? Let's find out!