Saturday, December 28, 2024

Tokenization and Embedding (e.g., WordPiece, Byte Pair Encoding): Worked Example

 


Before a transformer can process text, the text must be split into tokens and each token mapped to a vector. The steps below walk through both stages with a small worked example:


Step 1: Tokenization

Tokenization splits a text sequence into smaller units (tokens), which can be words, subwords, or characters, depending on the tokenizer used (e.g., WordPiece, Byte Pair Encoding).

Example:

Suppose we have the sentence:

"I love mathematics."

A subword tokenizer might split this into:

["I", "love", "math", "##ematics", "."]
  • The ## prefix (a WordPiece convention) marks a token that continues the preceding word rather than starting a new one.
  • Each token is assigned a unique token ID based on a vocabulary.
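
The split above can be reproduced with a minimal sketch of greedy longest-match-first subword splitting, the strategy commonly used by WordPiece-style tokenizers at inference time. The function name `wordpiece_tokenize`, the `[UNK]` token, and the five-entry vocabulary are assumptions made for this example; a real tokenizer learns a much larger vocabulary from data, and BPE tokenizers apply learned merge rules instead.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into subwords by greedy longest-match-first."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation of the current word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        tokens.append(match)
        start = end
    return tokens


# Toy vocabulary (an assumption for this example).
vocab = {"I", "love", "math", "##ematics", "."}

# Splitting on whitespace and punctuation is done by hand here.
words = ["I", "love", "mathematics", "."]
tokens = [t for w in words for t in wordpiece_tokenize(w, vocab)]
print(tokens)  # ['I', 'love', 'math', '##ematics', '.']
```

A word that cannot be fully covered by vocabulary pieces, such as "maths" with this toy vocabulary, comes back as a single `[UNK]` token.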

Assume the token IDs are:

{"I": 1, "love": 2, "math": 3, "##ematics": 4, ".": 5}

So, the input sequence becomes:

[1, 2, 3, 4, 5]
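
In code, the vocabulary is just a mapping from token strings to integers. The sketch below uses a plain dictionary; the name `[PAD]` for ID 0 is an assumption (this example reserves ID 0 for padding, but does not name the token).

```python
# Toy vocabulary: token string -> token ID.
# ID 0 is reserved for padding; the "[PAD]" name is an assumption.
token_to_id = {"[PAD]": 0, "I": 1, "love": 2, "math": 3, "##ematics": 4, ".": 5}

tokens = ["I", "love", "math", "##ematics", "."]
input_ids = [token_to_id[t] for t in tokens]
print(input_ids)  # [1, 2, 3, 4, 5]

# The inverse mapping recovers tokens from IDs, e.g. when decoding output.
id_to_token = {i: t for t, i in token_to_id.items()}
print([id_to_token[i] for i in input_ids])  # ['I', 'love', 'math', '##ematics', '.']
```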

Step 2: Embedding Lookup

The token IDs are mapped to dense vectors using an embedding matrix. This matrix, $W_e$, is a learnable parameter of size $V \times d$, where:

  • $V$: Vocabulary size.
  • $d$: Embedding dimension.

Example:

Let $V = 6$ (vocabulary size: the five tokens above plus a padding token at ID 0) and $d = 4$ (embedding dimension). A simple embedding matrix might look like:

$$
W_e = \begin{bmatrix}
0.1 & 0.2 & 0.3 & 0.4 \\ % Token 0 (padding)
0.5 & 0.6 & 0.7 & 0.8 \\ % Token "I" (1)
0.9 & 1.0 & 1.1 & 1.2 \\ % Token "love" (2)
1.3 & 1.4 & 1.5 & 1.6 \\ % Token "math" (3)
1.7 & 1.8 & 1.9 & 2.0 \\ % Token "##ematics" (4)
2.1 & 2.2 & 2.3 & 2.4    % Token "." (5)
\end{bmatrix}
$$

Each row corresponds to the embedding of a token.
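
Selecting a row can also be written as a matrix product, which may help connect the lookup to ordinary linear algebra. The one-hot vector $o_t$ is notation introduced here for illustration: it has a 1 at position $t$ and 0 elsewhere.

```latex
% Row lookup as a one-hot product, for token ID t with 0 <= t < V.
o_t^{\top} W_e = W_e[t, :] \in \mathbb{R}^{d}

% Example for "love" (t = 2), using the matrix above:
\begin{bmatrix} 0 & 0 & 1 & 0 & 0 & 0 \end{bmatrix} W_e
  = \begin{bmatrix} 0.9 & 1.0 & 1.1 & 1.2 \end{bmatrix}
```

In practice the product is never computed; implementations index the row directly, which gives the same result at far lower cost.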


Step 3: Embedding the Input

For the input sequence [1, 2, 3, 4, 5], the embeddings are retrieved by indexing $W_e$:

$$
\text{Embeddings} = \begin{bmatrix}
0.5 & 0.6 & 0.7 & 0.8 \\ % "I"
0.9 & 1.0 & 1.1 & 1.2 \\ % "love"
1.3 & 1.4 & 1.5 & 1.6 \\ % "math"
1.7 & 1.8 & 1.9 & 2.0 \\ % "##ematics"
2.1 & 2.2 & 2.3 & 2.4    % "."
\end{bmatrix}
$$

Each row in the resulting matrix corresponds to the embedding of a token.
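
The lookup itself is plain row indexing. A minimal sketch with nested Python lists, using the values from this example (a real model would hold the embedding matrix in a tensor library and learn its values during training):

```python
# Embedding matrix from the example: one row per token ID, d = 4 columns.
W_e = [
    [0.1, 0.2, 0.3, 0.4],  # ID 0: padding
    [0.5, 0.6, 0.7, 0.8],  # ID 1: "I"
    [0.9, 1.0, 1.1, 1.2],  # ID 2: "love"
    [1.3, 1.4, 1.5, 1.6],  # ID 3: "math"
    [1.7, 1.8, 1.9, 2.0],  # ID 4: "##ematics"
    [2.1, 2.2, 2.3, 2.4],  # ID 5: "."
]

input_ids = [1, 2, 3, 4, 5]

# Each token ID selects one row, giving a (sequence length) x d matrix.
embeddings = [W_e[i] for i in input_ids]

for row in embeddings:
    print(row)
```

Because ID 0 (padding) does not occur in the input, the first row of the matrix is never selected here.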


Step 4: Adding Positional Encoding

To account for the order of tokens in the sequence, positional encodings are added to the embeddings.
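
One widely used choice is the fixed sinusoidal encoding from the original transformer paper, where even dimensions use a sine and odd dimensions a cosine of the position scaled by a power of 10000. The sketch below illustrates that scheme only; it is not the source of the simplified values used in this worked example, and the function name is hypothetical.

```python
import math


def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model list of sinusoidal position vectors."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            # Dimensions i (even) and i + 1 (odd) share one frequency.
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe


pe = sinusoidal_positional_encoding(seq_len=5, d_model=4)
print(pe[0])  # [0.0, 1.0, 0.0, 1.0]
```

Many models learn their positional embeddings instead of fixing them; either way the result is one vector of length d per position.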

For simplicity, let’s assume the positional encoding vectors are:

$$
\text{Positional Encodings} = \begin{bmatrix}
0.0 & 0.1 & 0.2 & 0.3 \\
0.0 & 0.2 & 0.4 & 0.6 \\
0.0 & 0.3 & 0.6 & 0.9 \\
0.0 & 0.4 & 0.8 & 1.2 \\
0.0 & 0.5 & 1.0 & 1.5
\end{bmatrix}
$$
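
These toy values follow a simple pattern: the entry for position p (counting from 0) and dimension j equals 0.1 × (p + 1) × j. The sketch below regenerates the matrix from that observation; the formula is read off the numbers above and is not a standard encoding scheme.

```python
seq_len, d = 5, 4

# Entry (p, j) = 0.1 * (p + 1) * j, rounded to hide floating-point noise.
positional = [
    [round(0.1 * (p + 1) * j, 1) for j in range(d)]
    for p in range(seq_len)
]

for row in positional:
    print(row)
```

Each position gets a distinct vector, which is the property the model needs to tell positions apart.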

Adding these to the embeddings:

$$
\text{Final Embeddings} = \begin{bmatrix}
0.5 & 0.7 & 0.9 & 1.1 \\
0.9 & 1.2 & 1.5 & 1.8 \\
1.3 & 1.7 & 2.1 & 2.5 \\
1.7 & 2.2 & 2.7 & 3.2 \\
2.1 & 2.7 & 3.3 & 3.9
\end{bmatrix}
$$
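
The addition is element-wise: row i of the embeddings is added to row i of the positional encodings. A minimal sketch that checks the arithmetic, with results rounded to one decimal place to hide floating-point noise:

```python
embeddings = [
    [0.5, 0.6, 0.7, 0.8],  # "I"
    [0.9, 1.0, 1.1, 1.2],  # "love"
    [1.3, 1.4, 1.5, 1.6],  # "math"
    [1.7, 1.8, 1.9, 2.0],  # "##ematics"
    [2.1, 2.2, 2.3, 2.4],  # "."
]
positional = [
    [0.0, 0.1, 0.2, 0.3],
    [0.0, 0.2, 0.4, 0.6],
    [0.0, 0.3, 0.6, 0.9],
    [0.0, 0.4, 0.8, 1.2],
    [0.0, 0.5, 1.0, 1.5],
]

# Add matching entries of the two matrices, row by row.
final = [
    [round(e + p, 1) for e, p in zip(e_row, p_row)]
    for e_row, p_row in zip(embeddings, positional)
]

print(final[0])  # [0.5, 0.7, 0.9, 1.1]
print(final[4])  # [2.1, 2.7, 3.3, 3.9]
```

Both matrices must have the same shape (sequence length × d) for the sum to be defined, which is why positional vectors share the embedding dimension.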

Summary

  1. Tokenization: Breaks the input into tokens and maps them to token IDs.
  2. Embedding Lookup: Maps token IDs to dense vectors using $W_e$.
  3. Positional Encoding: Adds sequence order information to embeddings.

These processed embeddings are then fed into the transformer layers for further computation.

This work is licensed under a Creative Commons Attribution 4.0 International License.
