Glossary
Core concepts and terminology for working with large language models and AI infrastructure.
- Token
- The fundamental unit of text that a language model processes. A token can be a word, subword, or character. Models have context windows measured in tokens, and API pricing is typically per-token. Roughly, 1 token ≈ 0.75 English words.
- Context
- The total input provided to a model during inference, including system prompts, conversation history, and user messages. The context window is the maximum number of tokens a model can process in a single request. Larger contexts allow more background information but increase latency and cost.
- Reasoning
- A model's internal chain-of-thought process where it breaks down complex problems into intermediate steps before producing a final answer. Reasoning models allocate extra compute at inference time to explore multiple solution paths, improving accuracy on math, logic, and multi-step tasks.
- Streaming
- A delivery mode where the model sends tokens incrementally as they are generated, rather than returning the full response at once. Streaming reduces time to first token and makes the interface feel responsive. Server-Sent Events (SSE) are a common transport for streaming responses.
- Tool
- An external function or API that a model can invoke during a conversation. Tools extend model capabilities beyond text generation, enabling web search, code execution, database queries, or file operations. The model emits structured tool calls, which the application executes; the results are then returned to the model as context.
- Embedding
- A dense vector representation of text in high-dimensional space, typically 768 to 3072 dimensions. Embeddings capture semantic meaning — similar texts cluster together. They power semantic search, clustering, and retrieval systems by enabling cosine-similarity comparisons between documents and queries.
- RAG
- Retrieval-Augmented Generation. A pattern that combines a retrieval system with a generative model. When a user asks a question, relevant documents are fetched from a knowledge base and injected into the model's context. RAG grounds responses in authoritative data and reduces hallucination.
- Temperature
- A sampling parameter (0.0 to 2.0; the exact range varies by provider) controlling output randomness. Low temperatures (0.0–0.3) produce focused, near-deterministic outputs suited to factual tasks. High temperatures (0.7–1.5) increase variety and creativity. Even temperature 0 does not guarantee identical outputs across requests.
- top-p
- A nucleus sampling parameter (0.0 to 1.0) that limits token selection to the smallest set of most likely tokens whose cumulative probability reaches p. A top-p of 0.1 considers only the most likely tokens that together make up the top 10% of probability mass, producing focused outputs. Often used alongside or instead of temperature for fine-grained sampling control.
- JSON mode
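- A constrained generation mode in which the model is forced to output valid JSON. The grammar is restricted at the token level, so completed output is syntactically correct, though a response cut off by the token limit can still be truncated, and valid syntax alone does not ensure the output matches a particular schema. Useful for structured data extraction, tool calling, and any pipeline that parses model responses programmatically.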
- Logprobs
- Log probabilities returned alongside generated tokens, indicating the model's confidence in each token choice. Logprobs enable uncertainty quantification and token-level debugging. Values are log-likelihoods, so they are at most zero; closer to zero means higher confidence.
- Idempotency
- A property ensuring that submitting the same request multiple times produces the same result without duplicate side effects. Idempotency keys are unique identifiers sent with API requests. If a network failure occurs, retrying with the same key returns the original result rather than creating a duplicate.
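
As a rough illustration of the Idempotency entry, the sketch below shows a server-side store that deduplicates requests by key. `PaymentService`, `charge`, and the `ch_` identifier format are hypothetical names invented for this example, and the in-memory dictionary stands in for whatever durable storage a real API would use.

```python
import uuid


class PaymentService:
    """Hypothetical in-memory service that deduplicates requests by idempotency key."""

    def __init__(self):
        self._results = {}  # idempotency key -> result of the first request
        self.charges = []   # side effects that were actually performed

    def charge(self, idempotency_key, amount):
        # A repeated key returns the stored result without charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]

        # First time this key is seen: perform the side effect once.
        charge_id = f"ch_{len(self.charges) + 1}"
        self.charges.append((charge_id, amount))

        result = {"charge_id": charge_id, "amount": amount}
        self._results[idempotency_key] = result
        return result


service = PaymentService()
key = str(uuid.uuid4())           # the client generates one key per logical request
first = service.charge(key, 500)
retry = service.charge(key, 500)  # e.g. a retry after a network timeout
```

Here `retry` is the same result as `first`, and `service.charges` holds a single entry, so the retry created no duplicate side effect.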
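
The Temperature and top-p entries can be made concrete with a small sketch of how the two parameters reshape a next-token distribution. The logits are made-up values, the function names are illustrative, and real inference stacks may differ in details such as tie-breaking or how temperature 0 is handled.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; temperature must be greater than 0."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches p, then renormalize over that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens

sharp = softmax_with_temperature(logits, 0.2)  # low temperature: mass concentrates
flat = softmax_with_temperature(logits, 1.5)   # high temperature: mass spreads out
nucleus = top_p_filter(softmax_with_temperature(logits, 1.0), 0.9)
```

With these inputs, the most likely token receives far more probability under `sharp` than under `flat`, and `nucleus` drops the least likely token before sampling would take place.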
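
The cosine-similarity comparison mentioned in the Embedding entry, which is also the retrieval step in RAG, can be sketched as follows. The three-dimensional vectors are toy values chosen by hand for readability; real embeddings come from an embedding model and have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings" (assumed values, not produced by a real model).
documents = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.1],
    "api rate limits": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]

# Rank documents from most to least similar to the query.
ranked = sorted(
    documents,
    key=lambda name: cosine_similarity(query, documents[name]),
    reverse=True,
)
```

In a RAG pipeline, the top entries of `ranked` would be the documents injected into the model's context.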
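
To illustrate the Logprobs entry, the sketch below converts log probabilities back to probabilities and finds the token the model was least sure about. The token strings and values are invented for the example, and the list-of-pairs layout is an assumption; each API returns logprobs in its own response format.

```python
import math

# Made-up (token, logprob) pairs; log probabilities are at most zero.
token_logprobs = [
    ("The", -0.05),
    (" capital", -0.20),
    (" is", -0.01),
    (" Paris", -1.60),
]

# exp() recovers the probability the model assigned to each chosen token.
token_probs = [(token, math.exp(lp)) for token, lp in token_logprobs]

# Summing logprobs gives the log probability of the whole sequence.
sequence_logprob = sum(lp for _, lp in token_logprobs)
sequence_prob = math.exp(sequence_logprob)

# The lowest logprob marks the least confident token choice.
least_confident = min(token_logprobs, key=lambda pair: pair[1])
```

Flagging tokens like `least_confident` is one simple form of the token-level debugging the entry describes.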