How Do LLMs Work?

Introduction

Large Language Models (LLMs) like ChatGPT generate text by predicting the most likely next token based on the input they receive. While they can produce remarkably human-like text, they do not think or understand meaning in the way humans do. Instead, they rely on statistical probabilities derived from massive datasets to generate coherent responses. This article breaks down the fundamental concepts behind LLMs, including tokenization, training, neural networks, and their limitations.

How LLMs Predict the Next Token

At their core, LLMs function by continuously predicting what token should come next in a given sequence of text. Each time they receive input, they generate a probability distribution over possible next tokens and select one based on these probabilities.

Tokens: The Building Blocks of LLMs

A token is the smallest unit of text an LLM processes. Unlike words, tokens can represent complete words, partial words, punctuation, or even spaces. The model’s vocabulary consists of all possible tokens it can recognize, typically created using a method like Byte Pair Encoding (BPE) to ensure efficient text representation.

For example, the open-source GPT-2 model has a vocabulary of 50,257 tokens.

Here is an example of GPT-2 token encoding and decoding in Python. Note that tokens do not map one-to-one onto words: token 21831 represents ‘ fox’ with its leading space, and ‘Payment’ is split across two tokens.
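
A minimal sketch of this, assuming the tiktoken package is installed (the transformers GPT-2 tokenizer should produce the same IDs); the input text is illustrative:

```python
# A minimal sketch of GPT-2 encoding/decoding using the tiktoken package.
import tiktoken

enc = tiktoken.get_encoding("gpt2")   # GPT-2's BPE vocabulary of 50,257 tokens

ids = enc.encode("The quick brown fox")
print(ids)                            # a list of integer token IDs

# Decode each ID individually to see how the text was split into tokens.
for token_id in ids:
    print(token_id, repr(enc.decode([token_id])))

print(enc.decode(ids))                # decoding the full list restores the original text
```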

Next-Token Prediction in Action

Given an input sentence, the LLM ranks possible next tokens by probability. For instance, if the model receives the phrase “The best thing about AI is its ability to”, it assigns a probability to every token in its vocabulary and ranks the candidates for what comes next.
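
A sketch of how such a ranking can be inspected with the open-source GPT-2 model (assuming the transformers and torch packages are installed; GPT-2 stands in here for larger models):

```python
# Sketch: rank GPT-2's most likely next tokens for a prompt.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The best thing about AI is its ability to"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(input_ids).logits          # shape: (1, sequence_length, vocab_size)

probs = torch.softmax(logits[0, -1], dim=-1)  # probability distribution over the next token
top = torch.topk(probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode([int(token_id)])!r}  {prob.item():.2%}")
```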

Since LLMs generate text one token at a time, they loop through this process repeatedly until sufficient text has been generated.

Pseudocode Example
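
In Python-style pseudocode, the generation loop looks roughly like this (tokenize, model_probabilities, sample_from, and END_OF_TEXT are placeholder names, not a real API):

```python
# Python-style pseudocode for the generation loop; helper names are placeholders.
def generate(prompt, max_new_tokens):
    tokens = tokenize(prompt)                  # turn the prompt into token IDs
    for _ in range(max_new_tokens):
        probs = model_probabilities(tokens)    # a probability for every token in the vocabulary
        next_token = sample_from(probs)        # pick one token, with some randomness
        tokens.append(next_token)
        if next_token == END_OF_TEXT:          # stop early if the model signals it is done
            break
    return detokenize(tokens)
```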

The selection process incorporates randomness to avoid repetitive or unnatural output. Hyperparameters such as temperature, top_p, and top_k control this randomness (a sampling sketch follows the list below):

  • Temperature: Controls how random or predictable the model’s output is. Lower temperatures make the output more focused and consistent; higher temperatures make it more diverse and creative.

  • Top-p (nucleus sampling): Limits selection to the smallest set of the most probable tokens whose cumulative probability reaches p.

  • Top-k: Restricts selection to the k most probable tokens.
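
A minimal sketch of how these three settings reshape a probability distribution before sampling (a standalone numpy illustration of the common approach, not any particular library's implementation; the toy probabilities are made up):

```python
# Illustration of temperature, top-k, and top-p (nucleus) sampling on a toy distribution.
import numpy as np

def sample_next_token(probs, temperature=1.0, top_k=None, top_p=None):
    probs = np.asarray(probs, dtype=float)

    # Temperature: rescale the distribution (low = sharper, high = flatter).
    logits = np.log(probs) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Top-k: keep only the k most probable tokens.
    if top_k is not None:
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)

    # Top-p: keep the smallest set of tokens whose cumulative probability reaches p.
    if top_p is not None:
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cumulative, top_p) + 1]
        mask = np.zeros_like(probs)
        mask[keep] = probs[keep]
        probs = mask

    probs /= probs.sum()                       # renormalize what is left, then sample
    return np.random.choice(len(probs), p=probs)

# Toy distribution over the 5-token vocabulary used later in this article.
vocab = ["I", "You", "love", "AI", "AppSec"]
probs = [0.05, 0.05, 0.10, 0.50, 0.30]
print(vocab[sample_next_token(probs, temperature=0.7, top_k=3, top_p=0.9)])
```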

How LLMs Are Trained

Large language models gain their knowledge from vast and diverse text datasets. During training they repeatedly predict missing or upcoming words within sentences, loosely akin to how humans learn language by using context to anticipate what comes next in a conversation or text. By training on these extensive datasets, the models pick up patterns, grammar, and even facts about the world.


Let’s consider a simplified example with a tiny vocabulary: I, You, love, AI, AppSec

Training Data

Train a model with three example sentences:

  • I love AI

  • I love AppSec

  • You love AI

Building a Probability Table

Counting how often each token follows another in the three training sentences gives:

  • I → love: 2

  • You → love: 1

  • love → AI: 2, AppSec: 1

  • AI → (never followed by another token)

  • AppSec → (never followed by another token)

Dividing each count by the total for its row gives the probability of each token following another:

  • I → love: 100%

  • You → love: 100%

  • love → AI: 67%, AppSec: 33%

  • AI → I: 25%, You: 25%, love: 25%, AppSec: 25%

  • AppSec → I: 25%, You: 25%, love: 25%, AI: 25%

*Note that for the tokens ‘AI’ and ‘AppSec’, there is a hole in the data: neither is ever followed by another token in the training sentences. To compensate, their probability was split evenly across the other four possible tokens. This prevents the model from getting stuck, but it can produce strange results. Holes in training data are one of the reasons LLMs sometimes hallucinate: the generated text reads well but contains factual errors or inconsistencies.
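
A short sketch that reproduces both tables by counting bigrams in the training sentences, including the even split for the holes (variable names are illustrative):

```python
# Build the bigram count and probability tables from the three training sentences.
from collections import Counter, defaultdict

vocab = ["I", "You", "love", "AI", "AppSec"]
sentences = ["I love AI", "I love AppSec", "You love AI"]

counts = defaultdict(Counter)
for sentence in sentences:
    words = sentence.split()
    for current, nxt in zip(words, words[1:]):
        counts[current][nxt] += 1

probabilities = {}
for token in vocab:
    total = sum(counts[token].values())
    if total > 0:
        probabilities[token] = {nxt: counts[token][nxt] / total for nxt in vocab}
    else:
        # A "hole" in the data: spread probability evenly over the other four tokens.
        others = [t for t in vocab if t != token]
        probabilities[token] = {nxt: 1 / len(others) if nxt in others else 0.0 for nxt in vocab}

for token, row in probabilities.items():
    print(token, {nxt: round(p, 2) for nxt, p in row.items()})
```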

Real LLMs don’t build such count tables by hand; instead, they use deep neural networks to learn and generalize these probabilities from their training data.

Neural Networks and Transformers

The Role of Neural Networks

A neural network processes input tokens, applies complex mathematical transformations, and outputs a probability distribution for the next token.
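
As a rough illustration only (real LLMs are far larger and their weights are learned, not random), a toy network can turn token IDs into a probability distribution over the vocabulary with a final softmax:

```python
# Toy forward pass: embeddings -> linear layer -> softmax over the vocabulary.
# Sizes and random weights are purely illustrative.
import numpy as np

vocab_size, embedding_dim = 5, 8
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(vocab_size, embedding_dim))      # one vector per token
output_weights = rng.normal(size=(embedding_dim, vocab_size))

def next_token_distribution(token_ids):
    hidden = embeddings[token_ids].mean(axis=0)   # crude "context" representation
    logits = hidden @ output_weights              # one score per vocabulary token
    exp = np.exp(logits - logits.max())           # softmax turns scores into probabilities
    return exp / exp.sum()

print(next_token_distribution([0, 2]))            # e.g. token IDs for "I love"
```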

The Transformer Architecture

Most modern LLMs, including ChatGPT, use a Transformer architecture. This design enables efficient processing of long-range text dependencies using mechanisms like:

  • Self-Attention: Allows the model to focus on the most relevant parts of the input when predicting each token (a minimal sketch follows this list).

  • Positional Encoding: Ensures the model understands the order of words.
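
A minimal sketch of the scaled dot-product self-attention computation (random weights and illustrative sizes; real Transformers use multiple attention heads stacked across many layers):

```python
# Scaled dot-product self-attention on a toy sequence.
import numpy as np

def self_attention(x):
    d = x.shape[-1]
    rng = np.random.default_rng(0)
    w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

    q, k, v = x @ w_q, x @ w_k, x @ w_v             # queries, keys, values
    scores = q @ k.T / np.sqrt(d)                   # how much each token attends to every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ v                              # weighted mix of value vectors

sequence = np.random.default_rng(1).normal(size=(4, 16))   # 4 tokens, 16-dim embeddings
print(self_attention(sequence).shape)                       # -> (4, 16)
```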

Context Windows: How Much Text Can LLMs Remember?

An LLM’s context window determines how many tokens it can consider before predicting the next one. It sets how long a conversation the model can carry on without forgetting details from earlier in the exchange, and it caps the size of documents or code samples the model can process at once.

For reference:

  • GPT-2: 1,024 tokens

  • GPT-4 Turbo: 128,000 tokens (about the length of Harry Potter and the Sorcerer's Stone)

  • Claude 3: 200,000 tokens

  • Gemini 1.5: 1,000,000 tokens (enough to fit the Lord of the Rings trilogy plus The Hobbit)

A larger context window allows for more coherent and context-aware responses. In general, increasing an LLM’s context window can mean better accuracy, fewer hallucinations, more coherent responses, longer conversations, and an improved ability to analyze long sequences of data.
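
To check whether a document fits within a given context window, you can count its tokens; a sketch assuming the tiktoken package (the file name is hypothetical, and 128,000 is just the GPT-4 Turbo figure from the list above):

```python
# Count how many tokens a document uses and compare against a context window.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # tokenizer used by GPT-4-era OpenAI models
context_window = 128_000                     # GPT-4 Turbo's advertised limit

with open("my_document.txt", encoding="utf-8") as f:   # hypothetical file name
    text = f.read()

n_tokens = len(enc.encode(text))
print(f"{n_tokens} tokens; fits in the context window: {n_tokens <= context_window}")
```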

Limitations and Hallucinations

Despite their capabilities, LLMs have limitations:

  • Hallucinations: They may generate false but plausible-sounding information.

  • Lack of Reasoning: They don’t truly understand meaning but recognize patterns.

  • Training Biases: LLMs reflect biases in their training data.

Because of these limitations, LLM outputs should always be reviewed before being used in critical applications.

Conclusion

LLMs are powerful tools that generate text by predicting tokens based on statistical probabilities. While they excel at pattern recognition, they don’t truly understand or reason like humans. By understanding how they work—especially tokenization, training, and neural networks—you can better appreciate their strengths and limitations.
