BPE Tokenizer Showcase

Visualize how Byte-Pair Encoding learns a vocabulary and tokenizes text.

1. Learning Phase (Training)

Training Process

Shows the token sequence after each merge step. The pair that will be merged next is highlighted in yellow.
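
A single training step can be sketched as follows. This is a minimal illustration, not the code behind this page: the helper name `merge_pair` and the sample string are assumptions, and the sequence is assumed to start as single characters.

```python
def merge_pair(tokens, pair):
    """Replace each occurrence of `pair` with one merged token, scanning left to right."""
    merged = []
    i = 0
    while i < len(tokens):
        # A match consumes two tokens, so occurrences never overlap.
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("aaabdaaabac")            # hypothetical training text
step1 = merge_pair(tokens, ("a", "a"))
print(step1)  # ['aa', 'a', 'b', 'd', 'aa', 'a', 'b', 'a', 'c']
```

Repeating this step with a newly chosen pair each time produces the sequence of states shown in the panel.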

Pair Frequency Counts

The most frequent pair (highlighted) is selected for the next merge.
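
Counting adjacent pairs might look like the sketch below. The function names and the tie-breaking rule (the pair seen first wins) are assumptions made for illustration; the page does not state how ties are resolved.

```python
from collections import Counter

def count_pairs(tokens):
    """Count every adjacent (left, right) pair in the token sequence."""
    return Counter(zip(tokens, tokens[1:]))

def most_frequent_pair(tokens):
    """Return the pair with the highest count, or None if there are no pairs."""
    counts = count_pairs(tokens)
    if not counts:
        return None
    # max() returns the first maximal key, and a Counter iterates in
    # insertion order, so ties go to the pair that appeared first.
    return max(counts, key=counts.get)

tokens = list("aaabdaaabac")   # hypothetical training text
counts = count_pairs(tokens)
print(counts[("a", "a")])          # 4
print(most_frequent_pair(tokens))  # ('a', 'a')
```

Note that overlapping occurrences are all counted here ("aaa" contributes two `('a', 'a')` pairs), which is a simplification some implementations avoid.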

Learned Merge Rules

Merge rules are listed in the order they were learned, and they are applied in that same order.
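
The full training loop that yields an ordered list of merge rules can be sketched as below. The name `learn_merges`, the `num_merges` parameter, the sample text, and the first-seen tie-breaking are all assumptions for illustration.

```python
from collections import Counter

def learn_merges(text, num_merges):
    """Return (merge rules in learned order, final token sequence)."""
    tokens = list(text)  # assumption: training starts from single characters
    merges = []
    for _ in range(num_merges):
        counts = Counter(zip(tokens, tokens[1:]))
        if not counts:
            break  # fewer than two tokens left, nothing to merge
        # Assumption: ties go to the pair seen first in the sequence.
        best = max(counts, key=counts.get)
        merges.append(best)
        # Rewrite the sequence with every occurrence of `best` merged.
        merged, i = [], 0
        while i < len(tokens):
            if tuple(tokens[i:i + 2]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens

merges, final_tokens = learn_merges("aaabdaaabac", num_merges=3)
print(merges)        # [('a', 'a'), ('aa', 'a'), ('aaa', 'b')]
print(final_tokens)  # ['aaab', 'd', 'aaab', 'a', 'c']
```

Later rules can build on earlier ones (`'aaa'` only exists because `'aa'` was created first), which is why the order has to be preserved.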

Final Vocabulary

Contains the initial characters plus every new token created by a merge.
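
One way to assemble such a vocabulary is sketched here. The ID assignment (sorted characters first, then merged tokens in merge order) is an assumption; the page does not specify how IDs are numbered, and `build_vocab` is a hypothetical helper.

```python
def build_vocab(text, merges):
    """Map each token to an integer ID: base characters, then merged tokens."""
    vocab = {}
    for ch in sorted(set(text)):
        vocab[ch] = len(vocab)
    for left, right in merges:
        token = left + right
        if token not in vocab:      # skip a token that already has an ID
            vocab[token] = len(vocab)
    return vocab

# Hypothetical merge rules, as a training run on this text might produce.
merges = [("a", "a"), ("aa", "a"), ("aaa", "b")]
vocab = build_vocab("aaabdaaabac", merges)
print(vocab)
# {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'aa': 4, 'aaa': 5, 'aaab': 6}
```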

2. Inference Phase (Tokenization)

Tokenized Output

Text is broken down using the learned vocabulary.
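
A common way to tokenize new text with BPE is to split it into characters and replay the learned merge rules in order. The sketch below assumes that approach; the `tokenize` helper and the merge list are hypothetical, and this page may implement the step differently.

```python
def tokenize(text, merges):
    """Split text into characters, then apply each merge rule in learned order."""
    tokens = list(text)
    for pair in merges:
        merged, i = [], 0
        while i < len(tokens):
            if tuple(tokens[i:i + 2]) == pair:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

# Hypothetical merge rules from an earlier training run.
merges = [("a", "a"), ("aa", "a"), ("aaa", "b")]
print(tokenize("aaabac", merges))  # ['aaab', 'a', 'c']
print(tokenize("baab", merges))    # ['b', 'aa', 'b']
```

Characters that never appeared during training would pass through unmerged here; a real tokenizer needs a policy for them, such as an unknown token or a byte-level fallback.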

Token IDs

The final integer representation of the text: each token is replaced by its ID in the vocabulary.
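
Mapping tokens to IDs and back can be sketched with a plain dictionary lookup. The vocabulary and its ID numbering below are hypothetical values chosen for illustration.

```python
# Hypothetical vocabulary: base characters first, then merged tokens.
vocab = {"a": 0, "b": 1, "c": 2, "d": 3, "aa": 4, "aaa": 5, "aaab": 6}

def encode(tokens, vocab):
    """Look up the integer ID of each token."""
    return [vocab[token] for token in tokens]

def decode(ids, vocab):
    """Invert the vocabulary and join the tokens back into text."""
    id_to_token = {i: token for token, i in vocab.items()}
    return "".join(id_to_token[i] for i in ids)

ids = encode(["aaab", "a", "c"], vocab)
print(ids)                 # [6, 0, 2]
print(decode(ids, vocab))  # aaabac
```

Because every token is a concatenation of the characters it was merged from, decoding is just concatenation, and the round trip recovers the original text.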