Token
A token is a unit of text (such as a word, part of a word, or symbol) that an AI model processes when reading or generating language.
Tokens are not always complete words: they can be whole words, subwords, single characters, or punctuation marks, depending on how the tokenization scheme breaks text apart.
Tokenization is the process of converting text into tokens before feeding it to a language model. Different models use different tokenization schemes: one model might tokenize "running" as a single token, while another breaks it into "run" and "ning". Common approaches include byte-pair encoding (BPE), which iteratively merges the most frequent adjacent symbol pairs in the training corpus, and WordPiece, which chooses merges that maximize the likelihood of the training data rather than raw pair frequency.
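To see tokenization in practice, the sketch below uses OpenAI's tiktoken library (an assumption; any tokenizer library would illustrate the same point) to encode a sentence and inspect the individual pieces:

```python
import tiktoken  # pip install tiktoken

# Load a BPE encoding; cl100k_base is the scheme used by several OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenization splits text into subword units."
token_ids = enc.encode(text)

print(len(text.split()), "words ->", len(token_ids), "tokens")
# Decode each ID individually to see the text piece behind each token.
print([enc.decode([tid]) for tid in token_ids])
```

Running this shows that token boundaries need not match word boundaries: a common word is typically a single token (often including its leading space), while a rarer word like "Tokenization" splits into several pieces.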
Understanding tokens is important for several reasons. First, language models have context windows measured in tokens, not words, so knowing token counts helps predict how much text a model can process at once. Second, API pricing for many language models is based on token usage, so understanding tokenization helps estimate costs. Third, token efficiency affects speed and capacity: more tokens per request mean higher latency and less room left in the context window.
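As a concrete illustration of token-based pricing, here is a minimal cost estimator. The per-token rates are hypothetical placeholders, not any provider's actual prices:

```python
# Hypothetical rates for illustration; real prices vary by provider and model.
PRICE_PER_1K_INPUT_TOKENS = 0.0005   # USD per 1,000 input tokens (assumed)
PRICE_PER_1K_OUTPUT_TOKENS = 0.0015  # USD per 1,000 output tokens (assumed)

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate request cost in USD from input and output token counts."""
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT_TOKENS \
         + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT_TOKENS

# A 10,000-token prompt with a 2,000-token response:
print(f"${estimate_cost(10_000, 2_000):.4f}")  # $0.0080
```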
The relationship between words and tokens varies by language and content. English text typically averages roughly 1.3 tokens per word, though the ratio shifts with punctuation, special characters, and formatting. Some tokens represent common whole words, while others represent rare words broken into subword pieces. Keeping this in mind helps when sizing prompts against a context window or budgeting token usage.
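The 1.3 tokens-per-word figure supports a quick back-of-the-envelope estimator for English text. The sketch below assumes that ratio and is only a heuristic; use an actual tokenizer when accuracy matters:

```python
def rough_token_count(text: str, tokens_per_word: float = 1.3) -> int:
    """Rough token estimate for English text using the ~1.3 tokens/word heuristic."""
    return round(len(text.split()) * tokens_per_word)

print(rough_token_count("The quick brown fox jumps over the lazy dog."))  # ~12
```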