Tokenization and Embedding: Worked Example
Tokenization and embedding are key steps in processing input sequences for transformers. Here's a detailed explanation with a practical example:
Step 1: Tokenization
Tokenization splits a text sequence into smaller units (tokens), which can be words, subwords, or characters, depending on the tokenizer used (e.g., WordPiece, Byte Pair Encoding).
Example:
Suppose we have the sentence:
"I love mathematics."
A subword tokenizer might split this into:
["I", "love", "math", "##ematics", "."]
- The ## prefix indicates a subword (a continuation of the previous word).
- Each token is assigned a unique token ID based on a vocabulary.
Assume the token IDs are:
{"I": 1, "love": 2, "math": 3, "##ematics": 4, ".": 5}
So, the input sequence becomes:
[1, 2, 3, 4, 5]
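As a minimal sketch in plain Python, using the toy vocabulary above (a real subword tokenizer such as WordPiece or BPE, e.g. via the Hugging Face transformers library, learns both the splits and the vocabulary from data):

```python
# Toy vocabulary and split mirroring the example above.
vocab = {"I": 1, "love": 2, "math": 3, "##ematics": 4, ".": 5}

tokens = ["I", "love", "math", "##ematics", "."]   # output of a subword tokenizer
token_ids = [vocab[t] for t in tokens]             # map each token to its ID

print(token_ids)   # [1, 2, 3, 4, 5]
```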
Step 2: Embedding Lookup
The token IDs are mapped to dense vectors using an embedding matrix. This matrix, \(E\), is a learnable parameter of size \(V \times d\), where:
- \(V\): Vocabulary size.
- \(d\): Embedding dimension.
Example:
Let \(V = 5\) (vocabulary size) and \(d = 3\) (embedding dimension). A simple embedding matrix might look like:
\[
E = \begin{bmatrix}
0.1 & 0.2 & 0.3 \\
0.4 & 0.5 & 0.6 \\
0.7 & 0.8 & 0.9 \\
1.0 & 1.1 & 1.2 \\
1.3 & 1.4 & 1.5
\end{bmatrix}
\]
Each row corresponds to the embedding of one token ID (row 1 for ID 1, row 2 for ID 2, and so on).
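Here is a small NumPy sketch of this step, with the toy values hard-coded; in a real model \(E\) is a randomly initialized, learnable parameter (for instance torch.nn.Embedding(V, d) in PyTorch):

```python
import numpy as np

V, d = 5, 3   # toy vocabulary size and embedding dimension

# Embedding matrix E with one d-dimensional row per token ID.
# Hard-coded to the toy values above; a real model initializes E
# randomly and learns it by backpropagation.
E = np.array([[0.1, 0.2, 0.3],   # token ID 1: "I"
              [0.4, 0.5, 0.6],   # token ID 2: "love"
              [0.7, 0.8, 0.9],   # token ID 3: "math"
              [1.0, 1.1, 1.2],   # token ID 4: "##ematics"
              [1.3, 1.4, 1.5]])  # token ID 5: "."

print(E.shape)   # (5, 3) == (V, d)
```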
Step 3: Embedding the Input
For the input sequence [1, 2, 3, 4, 5], the embeddings are retrieved by indexing \(E\) with each token ID:
\[
X = \begin{bmatrix} E_{1} \\ E_{2} \\ E_{3} \\ E_{4} \\ E_{5} \end{bmatrix}
= \begin{bmatrix}
0.1 & 0.2 & 0.3 \\
0.4 & 0.5 & 0.6 \\
0.7 & 0.8 & 0.9 \\
1.0 & 1.1 & 1.2 \\
1.3 & 1.4 & 1.5
\end{bmatrix}
\]
Each row of \(X\) is the embedding of one token. (In this toy example every vocabulary entry appears exactly once, so \(X\) happens to equal \(E\).)
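The same lookup in NumPy, repeated here self-contained from the Step 2 sketch (note the shift from the 1-based toy IDs to NumPy's 0-based row indices):

```python
import numpy as np

# Toy embedding matrix from Step 2 (rows are token IDs 1..5).
E = np.array([[0.1, 0.2, 0.3],
              [0.4, 0.5, 0.6],
              [0.7, 0.8, 0.9],
              [1.0, 1.1, 1.2],
              [1.3, 1.4, 1.5]])

token_ids = np.array([1, 2, 3, 4, 5])

# Embedding lookup is just row indexing; "- 1" converts the
# 1-based toy IDs to 0-based row indices.
X = E[token_ids - 1]
print(X.shape)   # (5, 3): one d-dimensional vector per token
```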
Step 4: Adding Positional Encoding
To account for the order of tokens in the sequence, positional encodings are added to the embeddings.
For simplicity, let’s assume the positional encoding vectors are:
\[
P = \begin{bmatrix}
0.01 & 0.02 & 0.03 \\
0.04 & 0.05 & 0.06 \\
0.07 & 0.08 & 0.09 \\
0.10 & 0.11 & 0.12 \\
0.13 & 0.14 & 0.15
\end{bmatrix}
\]
Adding these to the embeddings gives the final input representation:
\[
X + P = \begin{bmatrix}
0.11 & 0.22 & 0.33 \\
0.44 & 0.55 & 0.66 \\
0.77 & 0.88 & 0.99 \\
1.10 & 1.21 & 1.32 \\
1.43 & 1.54 & 1.65
\end{bmatrix}
\]
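A NumPy sketch of this step using the toy values assumed above; the sinusoidal helper at the end is the encoding from "Attention Is All You Need", shown only for reference (the toy \(P\) here is arbitrary, and many models instead learn positional embeddings):

```python
import numpy as np

# Token embeddings X from Step 3, shape (seq_len, d) = (5, 3).
E = np.array([[0.1, 0.2, 0.3],
              [0.4, 0.5, 0.6],
              [0.7, 0.8, 0.9],
              [1.0, 1.1, 1.2],
              [1.3, 1.4, 1.5]])
X = E[np.array([1, 2, 3, 4, 5]) - 1]

# Toy positional encodings assumed above: 0.01, 0.02, ..., 0.15 row by row.
P = 0.01 * np.arange(1, 16).reshape(5, 3)

X_pos = X + P                 # element-wise sum injects order information
print(X_pos[0])               # -> [0.11 0.22 0.33] (up to float rounding)

# For reference: the sinusoidal encoding from "Attention Is All You Need".
def sinusoidal_encoding(seq_len, d):
    pos = np.arange(seq_len)[:, None]                      # (seq_len, 1)
    idx = np.arange(d)[None, :]                            # (1, d)
    angles = pos / np.power(10000.0, (2 * (idx // 2)) / d)
    return np.where(idx % 2 == 0, np.sin(angles), np.cos(angles))

X_sin = X + sinusoidal_encoding(*X.shape)
```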
Summary
- Tokenization: Breaks the input into tokens and maps them to token IDs.
- Embedding Lookup: Maps token IDs to dense vectors using the embedding matrix \(E\).
- Positional Encoding: Adds sequence order information to embeddings.
These processed embeddings are then fed into the transformer layers for further computation.
This work is licensed under a Creative Commons Attribution 4.0 International License.