Saturday, December 28, 2024

x̄ - > Tokenization and Embedding e.g., WordPiece, Byte Pair Encoding : Worked Example

 

Tokenization and Embedding: Worked Example

Tokenization and embedding are key steps in processing input sequences for transformers. Here's a detailed explanation with a practical example:


Step 1: Tokenization

Tokenization splits a text sequence into smaller units (tokens), which can be words, subwords, or characters, depending on the tokenizer used (e.g., WordPiece, Byte Pair Encoding).

Example:

Suppose we have the sentence:

"I love mathematics."

A subword tokenizer might split this into:

["I", "love", "math", "##ematics", "."]
  • The ## prefix (a WordPiece convention) marks a token that continues the preceding word rather than starting a new one.
  • Each token is assigned a unique integer token ID from the tokenizer's vocabulary.

Assume the token IDs are:

{"I": 1, "love": 2, "math": 3, "##ematics": 4, ".": 5}

So, the input sequence becomes:

[1, 2, 3, 4, 5]
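The split and ID mapping above can be reproduced with a minimal sketch of greedy longest-match-first subword matching, the strategy WordPiece uses at inference time. The vocabulary is the toy one from this example. The "[PAD]" name for ID 0, the regex pre-tokenizer, and the function names are assumptions made for illustration, not part of any real tokenizer library.

```python
import re

# Toy vocabulary from the example; "[PAD]" as the name for ID 0 is an assumption.
VOCAB = {"[PAD]": 0, "I": 1, "love": 2, "math": 3, "##ematics": 4, ".": 5}

def wordpiece(word, vocab):
    """Split one word into subwords by greedy longest-match-first search."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # A real tokenizer would emit an unknown token; the toy vocabulary has none.
            raise ValueError(f"cannot tokenize {word!r} with this vocabulary")
        tokens.append(piece)
        start = end
    return tokens

def tokenize(text, vocab):
    """Pre-split on words and punctuation, then apply subword matching."""
    words = re.findall(r"\w+|[^\w\s]", text)
    return [tok for w in words for tok in wordpiece(w, vocab)]

tokens = tokenize("I love mathematics.", VOCAB)
token_ids = [VOCAB[t] for t in tokens]
print(tokens)
print(token_ids)
```

Production tokenizers learn their vocabulary from a corpus and handle casing, Unicode normalization, and unknown tokens, all of which this sketch leaves out.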

Step 2: Embedding Lookup

The token IDs are mapped to dense vectors using an embedding matrix. This matrix, $W_e$, is a learnable parameter of size $V \times d$, where:

  • $V$: Vocabulary size.
  • $d$: Embedding dimension.

Example:

Let $V = 6$ (vocabulary size) and $d = 4$ (embedding dimension). A simple embedding matrix might look like:

$$
W_e = \begin{bmatrix}
0.1 & 0.2 & 0.3 & 0.4 \\ % Token 0 (padding)
0.5 & 0.6 & 0.7 & 0.8 \\ % Token "I" (1)
0.9 & 1.0 & 1.1 & 1.2 \\ % Token "love" (2)
1.3 & 1.4 & 1.5 & 1.6 \\ % Token "math" (3)
1.7 & 1.8 & 1.9 & 2.0 \\ % Token "##ematics" (4)
2.1 & 2.2 & 2.3 & 2.4    % Token "." (5)
\end{bmatrix}
$$

Each row corresponds to the embedding of a token.


Step 3: Embedding the Input

For the input sequence [1, 2, 3, 4, 5], the embeddings are retrieved by indexing the rows of $W_e$:

$$
\text{Embeddings} = \begin{bmatrix}
0.5 & 0.6 & 0.7 & 0.8 \\ % "I"
0.9 & 1.0 & 1.1 & 1.2 \\ % "love"
1.3 & 1.4 & 1.5 & 1.6 \\ % "math"
1.7 & 1.8 & 1.9 & 2.0 \\ % "##ematics"
2.1 & 2.2 & 2.3 & 2.4    % "."
\end{bmatrix}
$$

Each row in the resulting matrix corresponds to the embedding of a token.
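As a rough illustration, the lookup is plain row indexing. The sketch below uses nested Python lists to stay dependency-free, with the values from this example. It also checks that indexing gives the same result as multiplying a one-hot vector by the matrix, which is the usual way to see the lookup as a linear map. The helper names `one_hot` and `vec_mat` are hypothetical.

```python
# Embedding matrix from the example: V = 6 rows (one per token ID), d = 4 columns.
W_e = [
    [0.1, 0.2, 0.3, 0.4],  # token 0 (padding)
    [0.5, 0.6, 0.7, 0.8],  # "I" (1)
    [0.9, 1.0, 1.1, 1.2],  # "love" (2)
    [1.3, 1.4, 1.5, 1.6],  # "math" (3)
    [1.7, 1.8, 1.9, 2.0],  # "##ematics" (4)
    [2.1, 2.2, 2.3, 2.4],  # "." (5)
]
token_ids = [1, 2, 3, 4, 5]

# Lookup: select row i of W_e for each token ID i.
embeddings = [W_e[i] for i in token_ids]

def one_hot(i, size):
    """Vector of length `size` with 1.0 at position i and 0.0 elsewhere."""
    return [1.0 if j == i else 0.0 for j in range(size)]

def vec_mat(v, M):
    """Row vector times matrix, written out with plain loops."""
    return [sum(v[r] * M[r][c] for r in range(len(M))) for c in range(len(M[0]))]

# Equivalent view: one-hot(i) times W_e picks out the same row.
via_one_hot = [vec_mat(one_hot(i, len(W_e)), W_e) for i in token_ids]

for row in embeddings:
    print(row)
```

In practice frameworks implement this as an indexed gather rather than a matrix product, since the one-hot form wastes work on zeros.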


Step 4: Adding Positional Encoding

To account for the order of tokens in the sequence, positional encodings are added to the embeddings.
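One common choice is the fixed sinusoidal scheme from the original transformer paper, where even dimensions use a sine and odd dimensions a cosine of the position scaled by a power of 10000. The following is a minimal sketch of that formula; the function name is hypothetical, and many models use learned positional embeddings instead.

```python
import math

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos][2k] = sin(pos / 10000^(2k/d)), PE[pos][2k+1] = cos(pos / 10000^(2k/d))."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):  # i runs over the even indices 2k
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = sinusoidal_positional_encoding(seq_len=5, d_model=4)
for row in pe:
    print([round(x, 3) for x in row])
```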

For simplicity, let’s assume the positional encoding vectors are:

$$
\text{Positional Encodings} = \begin{bmatrix}
0.0 & 0.1 & 0.2 & 0.3 \\
0.0 & 0.2 & 0.4 & 0.6 \\
0.0 & 0.3 & 0.6 & 0.9 \\
0.0 & 0.4 & 0.8 & 1.2 \\
0.0 & 0.5 & 1.0 & 1.5
\end{bmatrix}
$$

Adding these to the embeddings:

$$
\text{Final Embeddings} = \begin{bmatrix}
0.5 & 0.7 & 0.9 & 1.1 \\
0.9 & 1.2 & 1.5 & 1.8 \\
1.3 & 1.7 & 2.1 & 2.5 \\
1.7 & 2.2 & 2.7 & 3.2 \\
2.1 & 2.7 & 3.3 & 3.9
\end{bmatrix}
$$
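The element-wise sum can be checked with a short sketch using the example's values. Results are rounded to one decimal place only to hide floating-point noise (for instance, 0.7 + 0.2 evaluates to 0.8999999999999999 in binary floating point).

```python
# Token embeddings for "I", "love", "math", "##ematics", "." (from the example).
embeddings = [
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
    [1.3, 1.4, 1.5, 1.6],
    [1.7, 1.8, 1.9, 2.0],
    [2.1, 2.2, 2.3, 2.4],
]

# Simplified positional encodings assumed in the example, one row per position.
positional_encodings = [
    [0.0, 0.1, 0.2, 0.3],
    [0.0, 0.2, 0.4, 0.6],
    [0.0, 0.3, 0.6, 0.9],
    [0.0, 0.4, 0.8, 1.2],
    [0.0, 0.5, 1.0, 1.5],
]

# Add matching entries row by row; both matrices have shape (sequence length, d).
final_embeddings = [
    [round(e + p, 1) for e, p in zip(e_row, p_row)]
    for e_row, p_row in zip(embeddings, positional_encodings)
]

for row in final_embeddings:
    print(row)
```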

Summary

  1. Tokenization: Breaks the input into tokens and maps them to token IDs.
  2. Embedding Lookup: Maps token IDs to dense vectors using $W_e$.
  3. Positional Encoding: Adds sequence order information to embeddings.

These processed embeddings are then fed into the transformer layers for further computation.

This work is licensed under a Creative Commons Attribution 4.0 International License.


