Tokenizer

The component that converts text into numerical tokens that AI models can process and generate.

Definition

A tokenizer is the preprocessing component that breaks input text into discrete units called tokens before feeding them to a language model. Tokens can represent whole words, subwords, individual characters, or even byte sequences depending on the tokenization strategy.

Different models use different tokenizers. GPT models use byte-pair encoding (BPE), which splits uncommon words into smaller subword pieces. Understanding tokenization matters because model pricing, context limits, and processing speed are all measured in tokens rather than words.
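A minimal sketch of what this looks like in code, using OpenAI's tiktoken library (one widely used BPE tokenizer). The sample sentence and the encoding name are illustrative, and exact token counts vary by model and encoding.

import tiktoken

# "cl100k_base" is the encoding used by several recent GPT models.
enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenizers split text into subword pieces before the model sees it."
token_ids = enc.encode(text)  # a list of integers, one per token

print(f"{len(text.split())} words -> {len(token_ids)} tokens")
print(token_ids)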

Why It Matters

Tokenization directly affects AI costs and capabilities. Most API providers charge per token, and context windows are measured in tokens. A single English word typically translates to one to three tokens (roughly four characters of English text per token on average), but specialized terminology, non-English languages, and code often tokenize less efficiently, producing more tokens per word.

Knowing how tokenization works helps teams optimize prompts for cost, estimate expenses for large-scale AI operations, and understand why some content types are more expensive to process than others.
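To make this concrete, here is a minimal sketch of prompt cost estimation with tiktoken; the prompt text and the $0.01-per-1,000-token price are hypothetical placeholders, so substitute your provider's actual input rate.

import tiktoken

PRICE_PER_1K_INPUT_TOKENS = 0.01  # hypothetical rate, for illustration only

enc = tiktoken.get_encoding("cl100k_base")

prompt = "Summarize the following blog post in three bullet points: ..."
n_tokens = len(enc.encode(prompt))

estimated_cost = n_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS
print(f"{n_tokens} tokens, roughly ${estimated_cost:.4f} per call")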

Examples in Practice

The word "marketing" is typically one token, but "photojournalism" might be split into "photo," "journal," and "ism," consuming three tokens. This explains why technical or compound terms cost more to process.
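You can inspect a split like this directly. Below is a minimal sketch with tiktoken that decodes each token of a word back into the text it covers; the exact pieces depend on the encoding, so the output may differ from the illustration above.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ["marketing", "photojournalism"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]  # the subword text behind each token
    print(f"{word!r}: {len(ids)} token(s) -> {pieces}")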

A content team running AI analysis on 10,000 blog posts can estimate costs by checking token counts with a tokenizer tool before committing to the full batch, avoiding surprise bills.
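A minimal sketch of that pre-flight estimate follows; load_posts() is a hypothetical helper standing in for however the team reads its posts, and the per-token price is again a placeholder rather than a real quote.

import tiktoken

PRICE_PER_1K_INPUT_TOKENS = 0.01  # hypothetical rate, for illustration only

enc = tiktoken.get_encoding("cl100k_base")

def load_posts():
    # Placeholder: in practice, read the 10,000 posts from files or a database.
    return ["First blog post text...", "Second blog post text..."]

total_tokens = sum(len(enc.encode(post)) for post in load_posts())
print(f"{total_tokens} input tokens across all posts")
print(f"Estimated input cost: ${total_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS:.2f}")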

