The Building Block of Large Language Models (LLMs): Tokens Explained
Have you ever read about LLM tokens and wondered what they are and why they matter in the world of LLMs? This article walks through the basics of tokens: what they are, why they exist, and how they differ from LLM parameters, which we discussed in our previous article.
LLM tokens are the basic units these models use to represent words, phrases, or even individual characters. Think of them as the building blocks of language that the model learns from during training. For example, in the sentence "The quick brown fox jumps over the lazy dog," each word could be treated as a token. However, many models break words down into smaller subword units or even individual characters, depending on their tokenization strategy.
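To see what this looks like in practice, here is a minimal sketch using the open-source tiktoken library (the tokenizer family used by OpenAI's GPT models). It assumes tiktoken is installed, and the splits shown in the comments are indicative only, since they depend on the encoding you load.

```python
# A minimal tokenization sketch, assuming the tiktoken package is installed (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a BPE encoding used by recent GPT models

text = "The quick brown fox jumps over the lazy dog"
token_ids = enc.encode(text)                        # integer IDs, one per token
pieces = [enc.decode([tid]) for tid in token_ids]   # the text piece behind each ID

print(token_ids)  # a short list of integers
print(pieces)     # e.g. ['The', ' quick', ' brown', ...] -- common words often map to single tokens
```

Rarer or longer words would be split into several subword tokens, which is how the model keeps its vocabulary to a manageable size.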
LLM parameters, on the other hand, are like the "brain" of the model. They are the adjustable numerical values (weights and biases) that the model learns from vast amounts of text data during training. These parameters capture the patterns and relationships between words and phrases, allowing the model to generate coherent and contextually relevant text.
So, how do tokens and parameters work together? When you provide an input prompt to the AI model, it breaks down the text into a sequence of tokens. Then, using its learned parameters, the model predicts the next token in the sequence. This process continues iteratively, with the model generating one token at a time until it has produced the desired output.
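That loop can be sketched in a few lines of Python. Note that `model` and `tokenizer` below are hypothetical placeholders standing in for any trained LLM and its tokenizer, not a specific library API:

```python
# Illustrative sketch of autoregressive, token-by-token generation.
# `tokenizer` and `model` are hypothetical placeholders, not a real library API.
def generate(prompt, tokenizer, model, max_new_tokens=50, eos_id=None):
    token_ids = tokenizer.encode(prompt)               # break the prompt into tokens
    for _ in range(max_new_tokens):
        next_id = model.predict_next_token(token_ids)  # use learned parameters to pick the next token
        token_ids.append(next_id)                      # feed it back in and repeat
        if eos_id is not None and next_id == eos_id:   # stop at an end-of-sequence token
            break
    return tokenizer.decode(token_ids)                 # turn the token IDs back into text
```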
Tokens are essential because they give LLMs a structured way to understand and process natural human language. By breaking text down into smaller units, these models can capture the nuances and complexities of language, enabling them to generate human-like text. The model's job is not only to predict the next token, but also to learn an embedding for each token that places related words close together; the better these embeddings are, the better the model becomes at predicting the right word.
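To give a feel for what "close together" means, embeddings are just vectors, and tokens that appear in similar contexts end up with vectors pointing in similar directions. Here is a toy sketch with made-up 4-dimensional vectors (real models learn embeddings with hundreds or thousands of dimensions):

```python
# Toy illustration of embedding similarity; the vectors below are made up for illustration.
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

emb = {
    "dog":  np.array([0.8, 0.1, 0.6, 0.2]),
    "fox":  np.array([0.7, 0.2, 0.5, 0.3]),
    "lazy": np.array([0.1, 0.9, 0.2, 0.7]),
}

print(cosine_similarity(emb["dog"], emb["fox"]))   # high -- related tokens sit close together
print(cosine_similarity(emb["dog"], emb["lazy"]))  # lower -- less related tokens sit further apart
```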
Different tokenization techniques are used when training an LLM. Here's how tokens are used in the training process:
1. Tokenization: The first step is to tokenize the training data, which means breaking the text into smaller units called tokens. This is typically done using techniques like byte-pair encoding (BPE), WordPiece, or unigram tokenization. The tokenization process creates a vocabulary of unique tokens that the model can understand.
2. Token Embeddings: Each token in the vocabulary is assigned a numerical representation called an embedding vector. These embeddings are learned during training and capture the semantic and contextual relationships between tokens.
3. Input Representation: The tokenized text from the training data is converted into sequences of token embeddings, which serve as the input to the LLM. These sequences are fed into the model during training.
4. Language Modeling Objective: LLMs are trained on a language modeling objective, which involves predicting the next token in a sequence given the previous tokens. The model learns to assign probabilities to the next token based on the input sequence.
5. Token Prediction: During training, the LLM takes the input sequence of token embeddings and predicts the probability distribution over the entire vocabulary for the next token. The model's parameters (weights and biases) are adjusted to minimize the difference between the predicted distribution and the actual next token in the training data.
6. Backpropagation: The errors from the token prediction are propagated back through the model using backpropagation, and the model's parameters are updated to improve its ability to predict the next token accurately.
7. Iterative Training: This process of token prediction and parameter update is repeated over the entire training dataset, allowing the model to learn patterns and relationships between tokens (a minimal code sketch of these steps follows this list).
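To make steps 2-6 concrete, here is a minimal PyTorch-style sketch of one training step. A tiny embedding table plus a linear output head stands in for a real transformer; it is only meant to show how the loss and backpropagation fit together, not how a production model is built:

```python
# Minimal next-token training sketch, assuming PyTorch is installed.
import torch
import torch.nn as nn

vocab_size, embed_dim = 1000, 64
embedding = nn.Embedding(vocab_size, embed_dim)   # step 2: token ID -> embedding vector
lm_head = nn.Linear(embed_dim, vocab_size)        # maps embeddings to scores over the vocabulary
optimizer = torch.optim.Adam(list(embedding.parameters()) + list(lm_head.parameters()))
loss_fn = nn.CrossEntropyLoss()

batch = torch.randint(0, vocab_size, (8, 16))     # step 3: a toy batch of token-ID sequences
inputs, targets = batch[:, :-1], batch[:, 1:]     # step 4: the target is the *next* token at each position

logits = lm_head(embedding(inputs))               # step 5: predicted distribution over the vocabulary
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                   # step 6: backpropagate the prediction error
optimizer.step()                                  # adjust the parameters (weights and biases)
optimizer.zero_grad()                             # step 7: repeat this over the whole dataset
```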
By training on large amounts of text data, LLMs learn to capture the statistical properties of language, enabling them to generate contextually relevant text when prompted with new input sequences during inference.
Several prominent LLMs rely on tokens for text generation, including:
1. GPT (Generative Pre-trained Transformer) models, such as GPT-3 by OpenAI, are token-based and widely known for their text generation capabilities.
2. BERT (Bidirectional Encoder Representations from Transformers), developed by Google, is a transformer-based language model that uses tokenization for input representation.
3. XLNet, developed jointly by Google and Carnegie Mellon University, is another transformer-based language model that operates on tokenized input sequences.
4. RoBERTa (Robustly Optimized BERT Pretraining Approach), created by Facebook AI Research, is a variant of BERT that also relies on tokenized input.
5. T5 (Text-to-Text Transfer Transformer), developed by Google, is a unified transformer model that handles various text-based tasks by operating on tokenized input and output sequences.
Here are some examples of how tokens might be represented in different LLMs:
1. Word-level tokens: "The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"
2. Subword-level tokens: "The", "quick", "brown", "fox", "jump", "##s", "over", "the", "lazy", "dog" (illustrative only; the exact splits depend on the tokenizer's learned vocabulary, and common words often stay whole)
3. Character-level tokens: "T", "h", "e", " ", "q", "u", "i", "c", "k", " ", "b", "r", "o", "w", "n", " ", "f", "o", "x", " ", "j", "u", "m", "p", "s", " ", "o", "v", "e", "r", " ", "t", "h", "e", " ", "l", "a", "z", "y", " ", "d", "o", "g"
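A few lines of Python make the difference between these granularities clear; the subword splits mentioned in the comments are indicative only, since they depend on the tokenizer's learned vocabulary:

```python
# Comparing tokenization granularities for the same sentence.
text = "The quick brown fox jumps over the lazy dog"

word_tokens = text.split()   # word level: 9 tokens, split on whitespace
char_tokens = list(text)     # character level: 43 tokens, spaces included

print(word_tokens)
print(len(char_tokens))

# Subword level (indicative only): a BPE/WordPiece tokenizer keeps common words whole
# and splits rarer ones, e.g. "tokenization" -> ["token", "##ization"].
```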
It's important to note that while tokens are the fundamental units used for training, LLMs can also learn higher-level linguistic concepts and relationships through the learned token embeddings and the model's architecture.
So, the next time you interact with a Large Language Model like ChatGPT, remember that it's the combination of tokens and parameters that makes it possible for the model to understand and generate human-like text. These building blocks are the foundation of these powerful language models, enabling them to assist us in various tasks and applications.