Building a BPE Tokenizer from Scratch: Lessons from Creating the ChatGPT Algorithm

Author
  Nino, Senior Tech Editor

The famous quote by Richard Feynman, "What I cannot create, I do not understand," resonates deeply when you are staring at a terminal at 2 AM, watching a Python script learn that the characters 't' and 'h' should be merged into a single token. Every time a developer interacts with an LLM via n1n.ai, there is a silent, complex process happening before the model ever processes a single thought: tokenization.

To truly grasp how modern Large Language Models (LLMs) like OpenAI o3 or DeepSeek-V3 process information, I decided to step away from high-level abstractions like HuggingFace and build a Byte Pair Encoding (BPE) tokenizer from the ground up. This project, named TewToken, is a bilingual BPE tokenizer trained on English and Hindi, built using zero machine learning libraries.

Why Tokenization is the Foundation of LLMs

Computers are fundamentally incapable of understanding human language. They operate on binary and floating-point math. Therefore, the first step in any Natural Language Processing (NLP) pipeline is converting text into numbers. However, the method of conversion dictates the model's efficiency and vocabulary size.

  1. Word-level Tokenization: Assigning a unique ID to every word (e.g., "run": 1, "running": 2). This leads to a massive vocabulary and fails to handle out-of-vocabulary (OOV) words like "ChatGPT."
  2. Character-level Tokenization: Treating every letter as a token. This results in tiny vocabularies but extremely long sequences that exhaust the context window of models available on n1n.ai.
  3. Subword Tokenization (BPE): The middle ground used by GPT-4 and Llama. It breaks words into frequent chunks (e.g., "unbelievable" → ["un", "believ", "able"]).
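The sequence-length trade-off is easy to see on a single word (the subword split below is illustrative, not the output of any particular tokenizer):

```python
word = "unbelievable"

char_tokens = list(word)                    # character-level tokenization
subword_tokens = ["un", "believ", "able"]   # a BPE-style subword split

print(len(char_tokens))     # 12 tokens at character level
print(len(subword_tokens))  # 3 tokens with subwords
```

Same word, four times fewer tokens: across a whole prompt, that difference decides how much text fits in the context window.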

The BPE Algorithm: A Greedy Frequency Logic

Byte Pair Encoding is not a neural network. It is a greedy statistical algorithm. It starts with individual characters and iteratively merges the most frequent adjacent pair of tokens into a new, single token.
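The greedy loop is easiest to see on Philip Gage's original toy string: count every adjacent pair, replace the most frequent one with a new symbol, repeat. A minimal sketch of one iteration:

```python
from collections import Counter

# Toy sequence of symbols (Philip Gage's classic BPE example)
seq = list("aaabdaaabac")

# Count every adjacent pair of symbols
pairs = Counter(zip(seq, seq[1:]))
best = pairs.most_common(1)[0]
print(best)  # (('a', 'a'), 4)

# Greedily replace each occurrence of the winning pair with a merged symbol
merged = []
i = 0
while i < len(seq):
    if i < len(seq) - 1 and (seq[i], seq[i + 1]) == best[0]:
        merged.append(seq[i] + seq[i + 1])  # new token "aa"
        i += 2
    else:
        merged.append(seq[i])
        i += 1
print(merged)  # ['aa', 'a', 'b', 'd', 'aa', 'a', 'b', 'a', 'c']
```

Run this loop a few thousand times on a real corpus and the merged symbols become your subword vocabulary.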

The Implementation Workflow

To build TewToken, I followed a strict implementation path:

  • Data Acquisition: Using youtube-transcript-api to gather real-world conversational data in English and Hindi.
  • Preprocessing: Cleaning transcripts using Regex to remove metadata like [Music] or [Applause] and normalizing whitespace.
  • Vocabulary Building: Initializing a base vocabulary of all unique characters.
  • The Merge Loop: The core logic where the magic happens.
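The preprocessing step above can be sketched with two regular expressions (a minimal version; the real cleaning pass handled more edge cases):

```python
import re

def clean_transcript(text):
    # Remove bracketed metadata such as [Music] or [Applause]
    text = re.sub(r'\[[^\]]*\]', '', text)
    # Collapse runs of whitespace into single spaces
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

print(clean_transcript("hello [Music]  world [Applause]"))  # hello world
```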

Here is the simplified logic for the merge loop in Python:

import re
from collections import defaultdict

def get_pair_freqs(vocab):
    # Count occurrences of adjacent pairs
    pairs = defaultdict(int)
    for word_tuple, freq in vocab.items():
        for i in range(len(word_tuple) - 1):
            pairs[(word_tuple[i], word_tuple[i+1])] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace every occurrence of the pair with its merged token across the vocab.
    # Words are stored as tuples of symbols; joining with spaces lets the regex
    # match the pair only on exact symbol boundaries.
    new_vocab = {}
    bigram = re.escape(' '.join(pair))
    # (?<!\S) / (?!\S): the pair must be flanked by whitespace (or the string
    # edge), so each symbol is matched whole, never inside a larger symbol
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    new_token = ''.join(pair)
    for word in vocab:
        w_out = p.sub(new_token, ' '.join(word))
        new_vocab[tuple(w_out.split())] = vocab[word]
    return new_vocab
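Putting the two functions together, training is just: count pairs, pick the most frequent, merge it, record the rule, repeat. A minimal driver on a hypothetical toy corpus (the helper definitions are repeated from above so the snippet runs standalone):

```python
import re
from collections import defaultdict

def get_pair_freqs(vocab):
    # Count occurrences of adjacent pairs, weighted by word frequency
    pairs = defaultdict(int)
    for word_tuple, freq in vocab.items():
        for i in range(len(word_tuple) - 1):
            pairs[(word_tuple[i], word_tuple[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace the pair with a merged version across the entire vocab
    new_vocab = {}
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    new_token = ''.join(pair)
    for word in vocab:
        w_out = p.sub(new_token, ' '.join(word))
        new_vocab[tuple(w_out.split())] = vocab[word]
    return new_vocab

# Hypothetical word frequencies; each word is a tuple of symbols
# ending with an explicit end-of-word marker
vocab = {
    ('l', 'o', 'w', '</w>'): 5,
    ('l', 'o', 'w', 'e', 'r', '</w>'): 2,
    ('n', 'e', 'w', 'e', 's', 't', '</w>'): 6,
    ('w', 'i', 'd', 'e', 's', 't', '</w>'): 3,
}

merges = []          # the ordered merge table the encoder replays later
for _ in range(10):
    pairs = get_pair_freqs(vocab)
    if not pairs:
        break        # every word has collapsed into a single token
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    merges.append(best)

print(merges[:3])    # the first few learned rules, in order
```

The order of `merges` is part of the model: it is exactly what the encoder must replay at inference time.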

Pro Tip: The Importance of Word Boundaries

One of the biggest hurdles I faced was the </w> marker. Without a suffix to denote the end of a word, the tokenizer cannot distinguish between "the" as a standalone word and "the" as a prefix in "there." In production environments accessible via n1n.ai, these small details prevent catastrophic decoding errors where spaces disappear or words fuse together.
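A sketch of how the marker enters the pipeline, assuming words are stored as symbol tuples as in the merge loop above:

```python
from collections import Counter

def build_vocab(text):
    # Append an explicit end-of-word marker to every word, so "the" as a
    # standalone word and "the" as a prefix of "there" remain distinct
    counts = Counter(text.split())
    return {tuple(word) + ('</w>',): freq for word, freq in counts.items()}

vocab = build_vocab("the theory is there in the thesis")
print(vocab[('t', 'h', 'e', '</w>')])  # → 2: only the standalone "the"
```

At decode time, every `</w>` is turned back into a space, which is how word boundaries survive the round trip.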

Performance Bottlenecks: Python vs. Rust

While my Python implementation was logically sound, it hit a massive performance wall. Training 8,000 merges on a modest corpus took 320 seconds. In contrast, the Rust-based tokenizers used by HuggingFace or OpenAI can perform the same task in under 2 seconds.

| Implementation | Language | Time (8k merges) |
| --- | --- | --- |
| TewToken v1.0 | Python | 320s |
| HF Tokenizers | Rust | ~2s |

This discrepancy exists because Python pays interpreter overhead on every string manipulation and dictionary lookup inside the loop, while Rust compiles the same logic down to tight, cache-friendly machine code and can parallelize the counting across cores. This is why enterprise-grade LLM applications rely on optimized backends.

Key Lessons Learned from Scratch

  1. Merge Order is Immutable: If you learn that t + h → th as Rule #3, and th + e → the as Rule #17, you must apply them in that exact order during encoding. Shuffling the rules breaks the entire logic.
  2. Data Quality > Data Quantity: Cleaning 100MB of text yielded a better vocabulary than 1GB of raw, noisy data. Corrupted bytes and metadata "pollute" the merge frequency, leading to inefficient tokens.
  3. Bilingual Complexity: Handling Hindi required careful attention to Unicode normalization. Using ensure_ascii=False in JSON exports was mandatory to keep the vocabulary readable.
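Lesson 1 can be made concrete with a tiny encoder sketch (the merge table below is hypothetical):

```python
def encode_word(word, merges):
    # Replay the learned merge rules in the exact order they were learned
    tokens = list(word) + ['</w>']
    for a, b in merges:
        out, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and tokens[i] == a and tokens[i + 1] == b:
                out.append(a + b)   # apply this rule wherever the pair occurs
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

# Hypothetical merge table; the order is part of the model
merges = [('t', 'h'), ('th', 'e'), ('e', '</w>')]
print(encode_word('there', merges))  # ['the', 'r', 'e</w>']
print(encode_word('the', merges))    # ['the', '</w>']
```

Swap the first two rules and encoding breaks: `('th', 'e')` can never fire because the symbol `th` does not exist until `('t', 'h')` has been applied.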

Conclusion

Building the algorithm behind ChatGPT from scratch demystified the "black box" of AI for me. It shifted my perspective from seeing LLMs as magical entities to seeing them as highly optimized statistical engines. Whether you are building your own tokenizer or integrating state-of-the-art models like Claude 3.5 Sonnet via n1n.ai, understanding these low-level mechanics is what separates a script kiddie from a senior AI engineer.

Get a free API key at n1n.ai