Tokenization and Embedding: Worked Example
Tokenization and embedding are key steps in processing input sequences for transformers. Here's a detailed explanation with a practical example:
Step 1: Tokenization
Tokenization splits a text sequence into smaller units (tokens), which can be words, subwords, or characters, depending on the tokenizer used (e.g., WordPiece, Byte Pair Encoding).
Example:
Suppose we have the sentence:
"I love mathematics."
A subword tokenizer might split this into:
["I", "love", "math", "##ematics", "."]
- The ## prefix indicates a subword (a continuation of the previous word).
- Each token is assigned a unique token ID based on a vocabulary.
Assume the token IDs are:
{"I": 1, "love": 2, "math": 3, "##ematics": 4, ".": 5}
So, the input sequence becomes:
[1, 2, 3, 4, 5]
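As a minimal sketch in plain Python, using the toy vocabulary above (a real subword tokenizer such as WordPiece or BPE, e.g. via the Hugging Face transformers library, learns both the splits and the vocabulary from data):

```python
# Toy vocabulary and split mirroring the example above.
vocab = {"I": 1, "love": 2, "math": 3, "##ematics": 4, ".": 5}

tokens = ["I", "love", "math", "##ematics", "."]   # output of a subword tokenizer
token_ids = [vocab[t] for t in tokens]             # map each token to its ID

print(token_ids)   # [1, 2, 3, 4, 5]
```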
Step 2: Embedding Lookup
The token IDs are mapped to dense vectors using an embedding matrix. This matrix, \(E\), is a learnable parameter of size \(V \times d\), where:
- \(V\): Vocabulary size.
- \(d\): Embedding dimension.
Example:
Let \(V = 5\) (vocabulary size) and \(d = 3\) (embedding dimension). A simple embedding matrix might look like:
\[
E = \begin{bmatrix}
0.1 & 0.2 & 0.3 \\
0.4 & 0.5 & 0.6 \\
0.7 & 0.8 & 0.9 \\
1.0 & 1.1 & 1.2 \\
1.3 & 1.4 & 1.5
\end{bmatrix}
\]
Each row corresponds to the embedding of one token ID (row 1 for ID 1, row 2 for ID 2, and so on).
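Here is a small NumPy sketch of this step, with the toy values hard-coded; in a real model \(E\) is a randomly initialized, learnable parameter (for instance torch.nn.Embedding(V, d) in PyTorch):

```python
import numpy as np

V, d = 5, 3   # toy vocabulary size and embedding dimension

# Embedding matrix E with one d-dimensional row per token ID.
# Hard-coded to the toy values above; a real model initializes E
# randomly and learns it by backpropagation.
E = np.array([[0.1, 0.2, 0.3],   # token ID 1: "I"
              [0.4, 0.5, 0.6],   # token ID 2: "love"
              [0.7, 0.8, 0.9],   # token ID 3: "math"
              [1.0, 1.1, 1.2],   # token ID 4: "##ematics"
              [1.3, 1.4, 1.5]])  # token ID 5: "."

print(E.shape)   # (5, 3) == (V, d)
```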
Step 3: Embedding the Input
For the input sequence [1, 2, 3, 4, 5], the embeddings are retrieved by indexing \(E\) with each token ID:
\[
X = \begin{bmatrix} E_{1} \\ E_{2} \\ E_{3} \\ E_{4} \\ E_{5} \end{bmatrix}
= \begin{bmatrix}
0.1 & 0.2 & 0.3 \\
0.4 & 0.5 & 0.6 \\
0.7 & 0.8 & 0.9 \\
1.0 & 1.1 & 1.2 \\
1.3 & 1.4 & 1.5
\end{bmatrix}
\]
Each row of \(X\) is the embedding of one token. (In this toy example every vocabulary entry appears exactly once, so \(X\) happens to equal \(E\).)
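The same lookup in NumPy, repeated here self-contained from the Step 2 sketch (note the shift from the 1-based toy IDs to NumPy's 0-based row indices):

```python
import numpy as np

# Toy embedding matrix from Step 2 (rows are token IDs 1..5).
E = np.array([[0.1, 0.2, 0.3],
              [0.4, 0.5, 0.6],
              [0.7, 0.8, 0.9],
              [1.0, 1.1, 1.2],
              [1.3, 1.4, 1.5]])

token_ids = np.array([1, 2, 3, 4, 5])

# Embedding lookup is just row indexing; "- 1" converts the
# 1-based toy IDs to 0-based row indices.
X = E[token_ids - 1]
print(X.shape)   # (5, 3): one d-dimensional vector per token
```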
Step 4: Adding Positional Encoding
To account for the order of tokens in the sequence, positional encodings are added to the embeddings.
For simplicity, let’s assume the positional encoding vectors are:
\[
P = \begin{bmatrix}
0.01 & 0.02 & 0.03 \\
0.04 & 0.05 & 0.06 \\
0.07 & 0.08 & 0.09 \\
0.10 & 0.11 & 0.12 \\
0.13 & 0.14 & 0.15
\end{bmatrix}
\]
Adding these to the embeddings gives the final input representation:
\[
X + P = \begin{bmatrix}
0.11 & 0.22 & 0.33 \\
0.44 & 0.55 & 0.66 \\
0.77 & 0.88 & 0.99 \\
1.10 & 1.21 & 1.32 \\
1.43 & 1.54 & 1.65
\end{bmatrix}
\]
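A NumPy sketch of this step using the toy values assumed above; the sinusoidal helper at the end is the encoding from "Attention Is All You Need", shown only for reference (the toy \(P\) here is arbitrary, and many models instead learn positional embeddings):

```python
import numpy as np

# Token embeddings X from Step 3, shape (seq_len, d) = (5, 3).
E = np.array([[0.1, 0.2, 0.3],
              [0.4, 0.5, 0.6],
              [0.7, 0.8, 0.9],
              [1.0, 1.1, 1.2],
              [1.3, 1.4, 1.5]])
X = E[np.array([1, 2, 3, 4, 5]) - 1]

# Toy positional encodings assumed above: 0.01, 0.02, ..., 0.15 row by row.
P = 0.01 * np.arange(1, 16).reshape(5, 3)

X_pos = X + P                 # element-wise sum injects order information
print(X_pos[0])               # -> [0.11 0.22 0.33] (up to float rounding)

# For reference: the sinusoidal encoding from "Attention Is All You Need".
def sinusoidal_encoding(seq_len, d):
    pos = np.arange(seq_len)[:, None]                      # (seq_len, 1)
    idx = np.arange(d)[None, :]                            # (1, d)
    angles = pos / np.power(10000.0, (2 * (idx // 2)) / d)
    return np.where(idx % 2 == 0, np.sin(angles), np.cos(angles))

X_sin = X + sinusoidal_encoding(*X.shape)
```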
Summary
- Tokenization: Breaks the input into tokens and maps them to token IDs.
- Embedding Lookup: Maps token IDs to dense vectors using the embedding matrix \(E\).
- Positional Encoding: Adds sequence order information to embeddings.
These processed embeddings are then fed into the transformer layers for further computation.
This work is licensed under a Creative Commons Attribution 4.0 International License.