BPE Tokenizer Showcase

Visualize how Byte-Pair Encoding learns a vocabulary and tokenizes text.

1. Learning Phase (Training)

Input Text

The token sequence at each step. The next pair to be merged is highlighted in yellow.

The most frequent pair (highlighted) is selected for the next merge.
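
The pair selection described above can be sketched in a few lines of Python. This is a minimal illustration, not the demo's actual code: the function names are hypothetical, and the tie-breaking rule (the pair that appears first in the sequence wins) is an assumption.

```python
from collections import Counter

def count_pairs(tokens):
    """Count every adjacent pair of tokens in the sequence."""
    return Counter(zip(tokens, tokens[1:]))

def most_frequent_pair(tokens):
    """Return the most frequent adjacent pair, or None if no pair exists."""
    counts = count_pairs(tokens)
    if not counts:
        return None
    # Assumption: ties go to the pair seen first. Counter keeps insertion
    # order, and max() returns the first maximal key it encounters.
    return max(counts, key=counts.get)

tokens = list("aaabdaaabac")
print(most_frequent_pair(tokens))  # ('a', 'a')
```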

The learned merge rules, listed in the order they were learned. Tokenization applies them in this same order.
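
A rough sketch of how an ordered merge list could be produced, assuming character-level starting tokens and a fixed number of merges. The names `merge_pair` and `learn_merges` are illustrative, and the demo may break ties or stop differently.

```python
from collections import Counter

def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of `pair`, scanning left to right."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def learn_merges(text, num_merges):
    """Return the merges in the order they were learned, plus the final token sequence."""
    tokens = list(text)  # assumption: training starts from single characters
    merges = []
    for _ in range(num_merges):
        counts = Counter(zip(tokens, tokens[1:]))
        if not counts:
            break  # fewer than two tokens left, nothing to merge
        pair = max(counts, key=counts.get)  # assumption: first-seen pair wins ties
        merges.append(pair)
        tokens = merge_pair(tokens, pair)
    return merges, tokens

merges, tokens = learn_merges("aaabdaaabac", 3)
print(merges)  # [('a', 'a'), ('aa', 'a'), ('aaa', 'b')]
print(tokens)  # ['aaab', 'd', 'aaab', 'a', 'c']
```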

The vocabulary includes the initial characters and every new merged token.
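
One way such a vocabulary could be assembled is shown below. The ID scheme is an assumption made for illustration (initial characters in sorted order, then merged tokens in merge order), and `build_vocab` is a hypothetical name.

```python
def build_vocab(text, merges):
    """Map each token to an integer ID: initial characters first, then merged tokens."""
    vocab = {}
    # Assumption: initial characters get IDs in sorted order.
    for ch in sorted(set(text)):
        vocab[ch] = len(vocab)
    # Each merge adds one new token, in the order the merges were learned.
    for left, right in merges:
        vocab.setdefault(left + right, len(vocab))
    return vocab

merges = [("a", "a"), ("aa", "a"), ("aaa", "b")]
print(build_vocab("aaabdaaabac", merges))
# {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'aa': 4, 'aaa': 5, 'aaab': 6}
```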

Text to Tokenize

Text is broken down using the learned vocabulary.
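
A minimal sketch of tokenizing new text, assuming it is first split into characters and the learned merges are then replayed in training order. The function name and the example merge list are illustrative only.

```python
def tokenize(text, merges):
    """Split text into characters, then apply each learned merge in order."""
    tokens = list(text)
    for left, right in merges:
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

# Example merge list, as might be learned from the text "aaabdaaabac".
merges = [("a", "a"), ("aa", "a"), ("aaa", "b")]
print(tokenize("aaabc", merges))  # ['aaab', 'c']
print(tokenize("abd", merges))    # ['a', 'b', 'd']
```

Replaying merges in training order matters: ("aa", "a") can only match after ("a", "a") has already produced the token "aa".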

The final integer representation of the text.
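
The last step, mapping tokens to integers and back, might look like the sketch below. The vocabulary shown is a made-up example, and raising on unknown tokens is an assumption; a real tokenizer may map them to a dedicated unknown-token ID or fall back to bytes.

```python
def encode(tokens, vocab):
    """Look up the integer ID of each token (raises KeyError on unknown tokens)."""
    return [vocab[token] for token in tokens]

def decode(ids, vocab):
    """Invert the vocabulary and join the tokens back into text."""
    id_to_token = {token_id: token for token, token_id in vocab.items()}
    return "".join(id_to_token[token_id] for token_id in ids)

# Hypothetical vocabulary: four characters plus three merged tokens.
vocab = {"a": 0, "b": 1, "c": 2, "d": 3, "aa": 4, "aaa": 5, "aaab": 6}
ids = encode(["aaab", "c"], vocab)
print(ids)                 # [6, 2]
print(decode(ids, vocab))  # aaabc
```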