Context Window

The maximum amount of text a language model can consider at once — both your input and its output must fit within this limit.

A context window is the total amount of text an LLM can hold in memory at once. Your prompt, any retrieved documents, conversation history, and the model's output all have to fit inside the same window — measured in tokens. Exceed it and content gets cut off, or your request errors entirely.
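To make the shared budget concrete, here is a minimal sketch of a pre-flight check. It assumes a rough average of 4 characters per token for English text, which is only a heuristic; real tokenizers vary by model and language. The names estimate_tokens and remaining_budget are hypothetical helpers, not part of any library.

```python
import math

# Assumption: ~4 characters per token is a rough average for English text.
# Use the model's real tokenizer when accuracy matters.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Crude token estimate based on character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def remaining_budget(parts: list[str], max_output_tokens: int, window: int) -> int:
    """Tokens left after all inputs plus the reserved output.

    A negative result means the request would overflow the window.
    """
    used = sum(estimate_tokens(p) for p in parts)
    return window - used - max_output_tokens


if __name__ == "__main__":
    system_prompt = "You are a helpful assistant."
    history = "user: hi\nassistant: hello"
    documents = "lorem ipsum " * 500
    left = remaining_budget([system_prompt, history, documents],
                            max_output_tokens=1_000, window=8_000)
    print("tokens to spare:", left)
```

The point of reserving max_output_tokens up front is that the model's reply draws from the same window as everything you send it.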

This is a hard product constraint, not a soft limitation to engineer around later. If the data you need fits in the window, you can often ship a straightforward prompt and be done. If it doesn't, you're building retrieval (RAG), chunking logic, summarization pipelines, and caching — each one adding latency, cost, and new failure modes. The engineering surface area grows fast once you're stitching context together manually.

Context windows have expanded dramatically. Early production models topped out at roughly 2K to 4K tokens. Current frontier models support 128K to 1M tokens. That sounds like the problem is solved. It isn't. Larger windows are expensive to fill: in a standard transformer, self-attention compute grows quadratically with sequence length, and models don't attend equally well to everything in a long context. Important information buried in the middle of a 200K-token prompt often gets underweighted compared to content at the start or end. Longer doesn't mean better-read.
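As a rough illustration of the quadratic growth, the sketch below compares the relative cost of attention at two sequence lengths. It assumes plain full self-attention, where every token attends to every other token, and it ignores the parts of the model that scale linearly as well as any optimizations a given provider may use. The function name relative_attention_cost is hypothetical.

```python
def relative_attention_cost(tokens: int, baseline_tokens: int) -> float:
    """Ratio of pairwise attention work at `tokens` versus `baseline_tokens`.

    Assumes full self-attention, whose work scales with the square of length.
    """
    if tokens <= 0 or baseline_tokens <= 0:
        raise ValueError("token counts must be positive")
    return (tokens / baseline_tokens) ** 2


if __name__ == "__main__":
    # A context 32x longer implies about 32 * 32 times the attention work.
    for n in (4_000, 32_000, 128_000):
        print(n, relative_attention_cost(n, 4_000))
```

Under these assumptions, growing a prompt from 4K to 128K tokens multiplies the length by 32 and the attention work by about 1,024.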

Match your context strategy to your actual data sizes and retrieval patterns before defaulting to the largest available window. Filling a bigger window costs more per call, because pricing typically scales with the number of tokens processed. Sometimes chunking and retrieval are cheaper and more reliable than stuffing everything in.
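A back-of-the-envelope comparison can help with that decision. The sketch below assumes a hypothetical price of $3.00 per million input tokens and made-up sizes (a 200K-token corpus versus five retrieved 1K-token chunks plus a 500-token question); substitute your provider's actual pricing and your own measurements. It also leaves out the cost of building and running the retrieval system itself.

```python
def input_cost(tokens: int, price_per_million: float) -> float:
    """Input cost in dollars for one call, given a per-million-token price."""
    return tokens * price_per_million / 1_000_000


# Hypothetical price; check your provider's current rate card.
PRICE_PER_MILLION = 3.00

# Option A: send the whole 200K-token corpus with every call.
full_context_cost = input_cost(200_000, PRICE_PER_MILLION)

# Option B: retrieve five 1K-token chunks and add a 500-token question.
retrieval_cost = input_cost(5 * 1_000 + 500, PRICE_PER_MILLION)

if __name__ == "__main__":
    print(f"full context: ${full_context_cost:.4f} per call")
    print(f"retrieval:    ${retrieval_cost:.4f} per call")
```

The per-call gap matters most at volume: a difference that looks trivial for one request compounds across thousands of requests per day.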

Further reading:

Anthropic Claude Models Documentation — context window specs for Claude models
Google Gemini 1.5 Announcement — the 1M token context window introduction

Related Terms