WeeBytes
What is a Context Window in AI Models?
Beginner · AI & ML · Fundamentals · Knowledge


A context window is the maximum amount of text an AI model can consider at once — its system prompt, conversation history, documents, and your current question all have to fit inside. Think of it as the model's working memory: finite, measured in tokens, and critical to how the model performs.
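Since the window is measured in tokens rather than words, it helps to have a rough conversion for budgeting. The sketch below uses the ¾-words-per-token heuristic mentioned above; real tokenizers split text into subwords, so actual counts vary by model and language.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4/3 tokens per English word.

    Real tokenizers (BPE variants) operate on subwords, so this
    heuristic is only for ballpark context-budget planning.
    """
    words = len(text.split())
    return round(words * 4 / 3)

# A 3,000-word document lands right at GPT-3.5's original 4K-token limit.
print(estimate_tokens("word " * 3000))  # → 4000
```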

Every language model has a hard limit on how much text it can process in a single inference call. This is the context window, measured in tokens (roughly ¾ of a word in English). GPT-3.5 originally had a 4K token window, roughly 3,000 words. Today's frontier models have windows ranging from 128K tokens (about 300 pages) to over 1 million tokens (about 2,500 pages); Gemini and Claude offer the largest windows currently available. Everything the model sees on a single turn lives in this window: the system prompt that defines its behavior, the full conversation history, any documents you've uploaded, your current message, and (for reasoning models) the model's internal thinking tokens.

When the window fills up, the oldest content gets truncated or summarized, which is why long conversations sometimes feel like the AI "forgets" earlier context.

Longer context windows have made dramatic new use cases possible: feeding entire codebases, legal documents, or research papers to the model in a single call. But bigger isn't free: larger contexts mean higher latency, higher API cost, and sometimes reduced performance on tasks where the relevant information sits in the middle of the context (a phenomenon called "lost in the middle"). Understanding context window mechanics is foundational for working effectively with any LLM-based system.
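The oldest-first truncation described above can be sketched in a few lines. This is a minimal illustration, not any provider's actual algorithm: the message format, the `fit_to_window` name, and the word-count-as-token-count stand-in are all assumptions for the demo (production systems use the model's real tokenizer and often summarize rather than drop turns).

```python
def fit_to_window(messages, max_tokens, count_tokens):
    """Drop the oldest non-system messages until the total fits.

    messages: list of {"role": ..., "content": ...} dicts, oldest first.
    count_tokens: callable returning a token count for one message.
    The system prompt is pinned; only conversation turns are evicted.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    total = sum(count_tokens(m) for m in system + rest)
    while rest and total > max_tokens:
        total -= count_tokens(rest.pop(0))  # evict the oldest turn
    return system + rest

# Stand-in counter: one token per word (real tokenizers differ).
count = lambda m: len(m["content"].split())
history = [
    {"role": "system", "content": "be concise"},
    {"role": "user", "content": "first question here"},
    {"role": "user", "content": "second question"},
]
trimmed = fit_to_window(history, max_tokens=5, count_tokens=count)
# The oldest user turn is evicted; the system prompt survives.
```

Note the design choice of pinning the system prompt: if it were evicted along with old turns, the model would lose its behavioral instructions entirely, which is why real chat stacks truncate only the conversation history.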

context-window · tokens · llm-fundamentals

Want more like this?

WeeBytes delivers 25 cards like this every day — personalised to your interests.

Start learning for free