Glossary

The vocabulary.

Concepts behind vibe coding: agents, protocols, sampling, prompts. Each entry is two sentences a model can quote, plus ~150 more words for humans, cross-linked to the tools that use it.

50 terms

Agentic coding
/glossary/agentic-coding
A development style centred on coding agents — the human describes intent, the agent plans and executes the work, and the human reviews diffs or PRs rather than authoring code line by line. The umbrella term for vibe coding and adjacent practices.
→
AI agent
/glossary/ai-agent
An LLM-powered program that pursues a goal over multiple steps by calling tools, observing results, and replanning. Agents differ from single-turn chatbots in that they can take real actions (run shell commands, edit files, hit APIs) and iterate without human intervention between steps.
→
AI Overview
/glossary/ai-overview
Google Search's generative answer block that sits above the classic blue-link results. Powered by Gemini, it summarises top sources and links out to them. The biggest single citation surface for answer engine optimisation (AEO) and generative engine optimisation (GEO) as of 2026.
→
Autonomous coding
/glossary/autonomous-coding
Long-horizon, unattended coding sessions where an agent works for minutes or hours without human approval at each step. Contrasts with approval-gated modes where the user confirms each shell command or file edit.
→
Chain of thought
/glossary/chain-of-thought
Asking an LLM to spell out intermediate reasoning steps before producing the final answer. Often improves accuracy on multi-step problems, with reported gains ranging from negligible to tens of percentage points depending on the model and benchmark.
→
Citation
/glossary/citation
When a generative-search system attributes part of its answer to a specific source URL, usually with a footnote-style number or an inline link. The visible signal that a brand was used to ground a model's answer.
→
Code completion
/glossary/code-completion
The category of AI assistance that suggests next tokens, lines, or short blocks of code inside an existing editor. Typically implemented as IDE extensions backed by a model server, and characterised by low latency and per-keystroke suggestions.
→
Coding agent
/glossary/coding-agent
An AI agent specialised for software-engineering tasks. Exposes file editing, terminal execution, code search, and git operations as tools and pursues multi-step coding goals like "add a /signup route with tests".
→
Context engineering
/glossary/context-engineering
The discipline of shaping what an LLM sees in its context window — instructions, examples, retrieved data, conversation history — to elicit reliable behaviour. Evolved from 'prompt engineering' to acknowledge that the prompt is one piece of a larger context budget.
→
Context window
/glossary/context-window
The maximum number of tokens an LLM can attend to in a single inference call — both the prompt and the generated output count against it. As of 2026, frontier models range from 200k tokens (GPT-5) to 1M+ tokens (Gemini 2.5, Claude Sonnet 4.6 with 1M extension).
→
DeepSeek R1
/glossary/deepseek-r1
DeepSeek's open-weights reasoning model released January 2025. First open model to match OpenAI o1-level reasoning benchmarks. Disrupted the assumption that frontier capabilities require proprietary checkpoints.
→
Embedding
/glossary/embedding
A fixed-length vector of floating-point numbers that represents the semantic meaning of a piece of text, image, or other input. Used for similarity search, clustering, and retrieval. Common embedding models output 768 to 3072 dimensions.
→
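To make the similarity-search use of embeddings concrete, here is a minimal sketch of cosine similarity in plain Python. The three 4-dimensional vectors are made up for illustration; they do not come from any real embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 4-dimensional embeddings; real models emit far more dimensions.
cat = [0.9, 0.1, 0.0, 0.2]
kitten = [0.8, 0.2, 0.1, 0.3]
invoice = [0.0, 0.9, 0.8, 0.1]

sim_close = cosine_similarity(cat, kitten)    # related concepts score near 1
sim_far = cosine_similarity(cat, invoice)     # unrelated concepts score near 0
```

Similarity search then amounts to ranking stored vectors by this score against the query vector.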
Eval
/glossary/eval
A reproducible test that measures how an LLM or LLM application performs on a specific task. Common forms include golden test sets, rubric grading, and A/B comparisons. The closest thing to unit tests for prompts.
→
Fine-tune cost
/glossary/fine-tune-cost
The full price of fine-tuning an LLM: GPU-hours during training, storage and hosting of the resulting weights or adapter, and any per-token premium at inference time. As of 2026 fine-tuning is rarely needed; frontier models with good system prompts cover most use cases.
→
Fine-tuning
/glossary/fine-tuning
Further training of a pretrained LLM on a smaller, domain-specific dataset to specialise its behaviour — for tone, format, vocabulary, or task accuracy. Cheaper than training from scratch but still requires curated data and compute.
→
Function calling
/glossary/function-calling
The specific implementation of tool use exposed by OpenAI in 2023 and now standard across providers. The model returns a structured JSON object specifying which function to call and with what arguments; the runtime handles execution and result injection.
→
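The function-calling round trip can be sketched without any provider SDK. In the example below the model output is simulated as a JSON string, and `get_weather`, `TOOLS`, and `dispatch` are hypothetical names; the exact wire format differs between providers.

```python
import json

def get_weather(city: str) -> dict:
    """Hypothetical tool; a real one would call a weather API."""
    return {"city": city, "forecast": "sunny"}

# Registry mapping function names the model may request to Python callables.
TOOLS = {"get_weather": get_weather}

def dispatch(tool_call_json: str) -> str:
    """Parse a model-emitted call, run the function, return the result as JSON."""
    call = json.loads(tool_call_json)
    func = TOOLS[call["name"]]
    result = func(**call["arguments"])
    return json.dumps(result)

# Simulated model output: which function to call and with what arguments.
model_output = '{"name": "get_weather", "arguments": {"city": "Oslo"}}'
tool_result = dispatch(model_output)
```

In a real runtime, `tool_result` would be appended to the conversation so the model can use it in its next turn.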
Hallucination
/glossary/hallucination
When an LLM emits text that sounds plausible but is factually incorrect or unsupported by its inputs. In coding, hallucinations show up as invented function signatures, fictitious library APIs, or non-existent CLI flags. The single biggest reliability risk in vibe-coding workflows.
→
Inline edit
/glossary/inline-edit
An editing mode where the user selects a range of code, describes a change in natural language, and the AI rewrites the selection in place. Popularised by Cursor as "Cmd+K". Sits between tab completion and full agent mode.
→
JSON mode
/glossary/json-mode
A specific structured-output mode that constrains the model's response to syntactically valid JSON (any shape). A precursor to schema-constrained output; still useful when you want JSON but don't have a fixed schema.
→
JSON Schema
/glossary/json-schema
A specification for describing the shape of a JSON value: required fields, types, allowed enums, nested objects. It has been published as a series of IETF Internet-Drafts but is not a ratified IETF standard. The contract LLM providers use to constrain structured-output decoding.
→
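As an illustration of what a JSON Schema constrains, the sketch below writes a small schema as a Python dict and checks values against it. The field names are invented, and `check` is a deliberately simplified stand-in that covers only `required`, `type`, and `enum` on flat objects; it is not a conforming validator.

```python
# A JSON Schema expressed as a Python dict (field names are illustrative).
SCHEMA = {
    "type": "object",
    "required": ["title", "priority"],
    "properties": {
        "title": {"type": "string"},
        "priority": {"type": "string", "enum": ["low", "medium", "high"]},
        "estimate_hours": {"type": "number"},
    },
}

# Maps JSON Schema type names to Python types for this simplified check.
PY_TYPES = {"string": str, "number": (int, float), "object": dict}

def check(value: dict, schema: dict) -> list:
    """Return a list of violations; covers only required/type/enum on flat objects."""
    errors = []
    for name in schema.get("required", []):
        if name not in value:
            errors.append(f"missing required field: {name}")
    for name, rules in schema.get("properties", {}).items():
        if name not in value:
            continue
        if not isinstance(value[name], PY_TYPES[rules["type"]]):
            errors.append(f"wrong type for {name}")
        elif "enum" in rules and value[name] not in rules["enum"]:
            errors.append(f"value not allowed for {name}")
    return errors

ok = check({"title": "Fix login", "priority": "high"}, SCHEMA)
bad = check({"priority": "urgent"}, SCHEMA)
```

Production code would normally rely on a full validator library rather than a hand-rolled check like this one.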
JSONL
/glossary/jsonl
JSON Lines: a file format where each line is a self-contained JSON object. Widely used for LLM training data, evaluation sets, and batch inference inputs. One example per line, no top-level array, easy to stream.
→
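A short sketch of writing and reading JSONL with the standard library. An in-memory buffer stands in for a file, and the `prompt`/`completion` keys are just an example record shape, not a required one.

```python
import io
import json

records = [
    {"prompt": "2 + 2 =", "completion": "4"},
    {"prompt": "Capital of France?", "completion": "Paris"},
]

# Write: one JSON object per line, no enclosing array.
buffer = io.StringIO()
for record in records:
    buffer.write(json.dumps(record) + "\n")

# Read: iterate line by line, so a large file never has to be loaded at once.
buffer.seek(0)
loaded = [json.loads(line) for line in buffer if line.strip()]
```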
Latency budget
/glossary/latency-budget
The maximum acceptable end-to-end response time for an LLM feature, allocated across retrieval, model inference, post-processing, and network. Drives architecture decisions: model size, streaming, caching, parallelism.
→
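A latency budget is ultimately arithmetic over stage allocations. The sketch below uses invented numbers (a 2000 ms budget and four stages) purely to show the bookkeeping; none of the values are measurements.

```python
# Illustrative stage allocations in milliseconds; not measurements.
BUDGET_MS = 2000
allocations_ms = {
    "network": 150,
    "retrieval": 250,
    "model_inference": 1400,
    "post_processing": 100,
}

def remaining_budget(budget_ms: int, stages: dict) -> int:
    """Headroom left after all stage allocations; negative means over budget."""
    return budget_ms - sum(stages.values())

headroom_ms = remaining_budget(BUDGET_MS, allocations_ms)
```

When the headroom goes negative, the usual levers are the ones the definition names: a smaller model, streaming, caching, or running stages in parallel.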
LLM
/glossary/llm
Large Language Model: a transformer-based neural network trained on text (and increasingly images, audio, code) to predict the next token. Modern LLMs range from around 1 B to 1 T+ parameters and serve as the engine behind coding agents and AI assistants.
→
LLM as judge
/glossary/llm-as-judge
Using an LLM (often a stronger one than the one being tested) to grade outputs against a rubric. Replaces or supplements human grading for evals at scale. Accuracy of the judge is itself a metric you have to measure.
→
LoRA
/glossary/lora
Low-Rank Adaptation: a parameter-efficient fine-tuning method that freezes the base weights and trains small low-rank matrices instead, cutting trainable parameters and adapter storage by orders of magnitude and substantially reducing GPU memory. Standard in 2026 for fine-tuning open-weight models.
→
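The saving from LoRA follows from simple parameter counting. Assuming a single d × k weight matrix adapted with a rank-r pair B (d × r) and A (r × k), the sketch below compares the trainable parameter counts. The sizes 4096 and rank 8 are illustrative choices, not taken from any particular model.

```python
def full_params(d: int, k: int) -> int:
    """Trainable parameters when updating a d x k weight matrix directly."""
    return d * k

def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for the low-rank pair B (d x r) and A (r x k)."""
    return r * (d + k)

d, k, r = 4096, 4096, 8
reduction = full_params(d, k) / lora_params(d, k, r)   # 256.0 for these sizes
```

This counts trainable parameters for one matrix only. Total GPU memory during training shrinks by less, because the frozen base weights still have to be held in memory.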
LoRA adapter
/glossary/lora-adapter
The small file produced by LoRA fine-tuning. Contains only the low-rank update matrices for the targeted layers (commonly the attention projections), typically 10-200 MB. Can be loaded on top of the base model at inference to get customised behaviour without storing a second full copy of the model.
→
Lost in the middle
/glossary/lost-in-middle
A failure mode of long-context LLMs where retrieval accuracy is high for content at the start and end of the prompt but drops in the middle. Documented in a 2023 paper and still observable in 2026 frontier models at extreme context lengths.
→
MCP
/glossary/mcp
Model Context Protocol — an open protocol introduced by Anthropic in November 2024 that lets LLM clients (editors, agents, assistants) connect to external tools and data sources through a standard JSON-RPC interface.
→
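Since MCP messages are JSON-RPC, a tool invocation can be sketched as a plain JSON object. In the example below, `tools/call` is the MCP method for invoking a tool; the tool name `read_file` and its `path` argument are hypothetical and would in practice come from whatever a given server advertises.

```python
import json

def make_request(request_id: int, method: str, params: dict) -> str:
    """Serialise a JSON-RPC 2.0 request."""
    return json.dumps(
        {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params}
    )

# "read_file" and its argument are hypothetical; a client would first discover
# the available tools from the server before calling one.
request = make_request(
    1, "tools/call", {"name": "read_file", "arguments": {"path": "README.md"}}
)
decoded = json.loads(request)
```

The server replies with a JSON-RPC response carrying the same `id`, which is how the client matches results to requests.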
MCP server
/glossary/mcp-server
A program that implements the Model Context Protocol and exposes tools, resources, or prompts to an LLM client. Typical examples: a filesystem server, a GitHub server, a Postgres server. Servers run locally or remotely and communicate over stdio or HTTP.
→
Multi-file edit
/glossary/multi-file-edit
A single agent action that modifies several files in one logical change — for example, renaming a function across imports, refactoring a component and its tests together, or applying a codemod. Distinguishes modern coding agents from per-file autocompletion.
→
p99
/glossary/p99
The 99th-percentile response time: 99 out of 100 requests finish within this time and the slowest 1 takes longer. The metric to watch for LLM latency, since tail latency on chat APIs can be 5-10× the median.
→
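One common way to compute p99 is the nearest-rank method, sketched below on synthetic latencies. The sample data is invented to show how two slow outliers leave the median untouched while dominating the tail; other percentile definitions (for example, interpolating ones) can give slightly different values.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Synthetic latencies in ms: 98 fast requests and 2 slow outliers.
latencies_ms = [200] * 98 + [1500, 4000]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
```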
Plan/Act mode
/glossary/plan-act-mode
A two-stage agent workflow: in Plan mode the agent reasons about the task and outputs a structured plan; in Act mode the agent executes the plan with file edits and commands. Lets the user approve the plan before any code is touched.
→
Prompt engineering
/glossary/prompt-engineering
The practice of crafting LLM input — instructions, examples, formatting, role assignments — to elicit reliable, structured, on-task outputs. Includes system prompts, few-shot examples, chain-of-thought scaffolding, and output schemas.
→
Quantization
/glossary/quantization
Reducing the numerical precision of a model's weights (e.g. from 16-bit float to 4-bit integer) to shrink memory footprint and speed up inference, with some accuracy loss. Common formats: GGUF, AWQ, GPTQ, FP8.
→
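The core idea of quantization can be shown with a symmetric, per-tensor scheme in plain Python. This is a simplified sketch with made-up weights; real formats such as GGUF, AWQ, or GPTQ use more elaborate schemes (for example, separate scales per group of weights).

```python
def quantize(weights, bits=4):
    """Symmetric per-tensor quantization to signed integers of the given width."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax     # one float shared by all weights
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map integers back to approximate floating-point weights."""
    return [v * scale for v in q]

weights = [0.70, -0.32, 0.11, -0.02]
q, scale = quantize(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The rounding error per weight is bounded by half the scale, which is the accuracy loss the definition refers to.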
Quantization formats
/glossary/quantization-format
Different schemes for compressing LLM weights below 16-bit (FP16/BF16) precision. GGUF (llama.cpp), AWQ, GPTQ, FP8, EXL2 are the common ones in 2026. Each trades quality for size and inference speed differently.
→
RAG
/glossary/rag
Retrieval-Augmented Generation — a pattern where the application retrieves relevant text chunks from a knowledge base (vector DB, search index) and includes them in the LLM prompt at query time, so the model answers from grounded sources instead of pure memorisation.
→
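The retrieve-then-prompt flow of RAG can be sketched end to end with a toy retriever. To stay dependency-free, the example ranks chunks by shared words, which is a crude stand-in for the embedding similarity a real system would use; the knowledge-base sentences and the prompt wording are invented.

```python
def score(query: str, chunk: str) -> int:
    """Crude relevance: count of shared lowercase words (stand-in for embedding similarity)."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def build_prompt(query: str, chunks: list, top_k: int = 2) -> str:
    """Retrieve the top_k chunks and place them ahead of the question."""
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
    context = "\n".join(f"- {c}" for c in ranked[:top_k])
    return f"Answer using only these sources:\n{context}\n\nQuestion: {query}"

knowledge_base = [
    "Refunds are issued within 14 days of purchase.",
    "Support is available on weekdays from 9 to 17.",
    "Shipping to Norway takes 3 to 5 days.",
]
prompt = build_prompt("How many days do refunds take?", knowledge_base, top_k=1)
```

The resulting `prompt` string is what would be sent to the LLM, so the answer is grounded in the retrieved chunk rather than in the model's memory.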
ReAct pattern
/glossary/react-pattern
Reasoning + Acting — a prompting pattern from a 2022 paper where the LLM interleaves chain-of-thought with explicit tool calls. The pattern that most modern coding agents implement under the hood.
→
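The ReAct loop can be sketched with a scripted stand-in for the model. Everything here is hypothetical: `scripted_model` replaces a real LLM call, `calculator` is a toy tool, and the `Action: tool[input]` text format is one convention among several.

```python
def calculator(expression: str) -> str:
    """Hypothetical tool: adds two integers written as 'a + b'."""
    a, b = expression.split("+")
    return str(int(a) + int(b))

TOOLS = {"calculator": calculator}

def scripted_model(transcript: str) -> str:
    """Stand-in for an LLM: acts first, answers once an observation exists."""
    if "Observation:" not in transcript:
        return "Thought: I need to add the numbers.\nAction: calculator[17 + 25]"
    observation = transcript.rsplit("Observation: ", 1)[1].strip()
    return f"Thought: I have the result.\nFinal Answer: {observation}"

def react(question: str, max_steps: int = 5) -> str:
    """Alternate model steps and tool calls until a final answer appears."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = scripted_model(transcript)
        transcript += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        # Parse "Action: tool[input]" and feed the tool result back as an observation.
        action = step.split("Action:", 1)[1].strip()
        name, arg = action.split("[", 1)
        transcript += f"Observation: {TOOLS[name](arg.rstrip(']'))}\n"
    return "no answer"

answer = react("What is 17 + 25?")
```

The `max_steps` cap is a design choice worth keeping in real agents too, since a model that never emits a final answer would otherwise loop indefinitely.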
Reasoning token
/glossary/reasoning-token
Tokens spent by a reasoning model (o-series, DeepSeek R1, Claude with extended thinking) on chain-of-thought before the visible answer; some providers hide or summarise them. Typically billed as output tokens at the output rate. Can be 10-100× the visible-answer length on hard problems.
→
Retrieval
/glossary/retrieval
The lookup stage of a RAG pipeline — fetching relevant text chunks from a corpus, given a query embedding. Quality of retrieval is usually the bottleneck on RAG quality, not the LLM itself.
→
Rubric
/glossary/rubric
A structured grading scheme — usually a list of dimensions, each with explicit criteria — used by human graders or LLM-as-judge to score model outputs. The contract that makes an eval reproducible.
→
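A rubric can be represented as data and combined into a single score. The dimensions, criteria, weights, and the 1 to 5 rating scale below are all invented for illustration; how ratings are produced (human grader or LLM-as-judge) is outside this sketch.

```python
# Hypothetical rubric: each dimension has an explicit criterion and a weight.
RUBRIC = {
    "correctness": {"criterion": "Code meets the stated requirements", "weight": 0.5},
    "readability": {"criterion": "Names and structure are clear", "weight": 0.3},
    "tests": {"criterion": "Change includes meaningful tests", "weight": 0.2},
}

def weighted_score(ratings: dict, rubric: dict, scale_max: int = 5) -> float:
    """Combine per-dimension ratings (1..scale_max) into a single 0-1 score."""
    total = sum(rubric[d]["weight"] * ratings[d] for d in rubric)
    return total / scale_max

ratings = {"correctness": 5, "readability": 4, "tests": 2}
overall = weighted_score(ratings, RUBRIC)
```

Keeping the criteria in the same structure as the weights means the text shown to the grader and the arithmetic applied to its ratings cannot drift apart.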
Structured extraction
/glossary/structured-extraction
Using an LLM to pull structured data — JSON fields, table rows, knowledge-graph triples — from unstructured input. Powered by structured output / function calling. The most common production AI pattern after RAG.
→
Structured output
/glossary/structured-output
An LLM API mode that constrains the model to emit a response conforming to a JSON schema (or other grammar). Largely eliminates parse failures, though truncated or refused responses still need handling, and makes LLM outputs safely consumable as data by downstream code.
→
System prompt
/glossary/system-prompt
An instruction message that establishes role, behaviour, constraints, and tools available to the model before any user input. Sent as a separate message type or API field at the start of the conversation and generally given higher priority by the model than user messages.
→
Tab completion
/glossary/tab-completion
Inline code suggestion that appears as ghost text and accepts on Tab. The original AI-assist UX, popularised by GitHub Copilot in 2021. Distinct from chat, edit, and agent modes which involve multi-turn interaction.
→
Temperature
/glossary/temperature
A sampling parameter that divides the model's logits before the softmax, sharpening or flattening the resulting token probabilities. Lower values (0.0–0.3) produce focused, repeatable outputs; higher values (0.7–1.2) produce more varied, creative outputs. Many coding tools default to 0.0–0.3.
→
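Temperature is commonly applied by dividing the logits by T before the softmax. The sketch below shows the effect on three made-up logits; the values 0.2 and 1.2 are chosen only to contrast a low and a high setting.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                              # subtract the max to avoid overflow
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]      # illustrative scores for three candidate tokens
sharp = softmax_with_temperature(logits, 0.2)   # mass concentrates on the top token
flat = softmax_with_temperature(logits, 1.2)    # mass spreads across tokens
```

A temperature of exactly 0 would divide by zero in this formula, so implementations generally treat it as a special case and pick the highest-scoring token directly.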
Token
/glossary/token
The atomic unit an LLM reads and emits. A token typically corresponds to ~3–4 characters of English text or a single short word; for code, tokenisation depends on the model's tokeniser. Pricing, context windows, and rate limits are all denominated in tokens.
→
Tool use
/glossary/tool-use
The capability of an LLM to call structured external functions during generation — file operations, web search, code execution. The model outputs a tool call with arguments; the runtime executes it and feeds results back. The foundational primitive for agents.
→
Top-p
/glossary/top-p
Nucleus sampling: a truncation method, usable alongside or instead of temperature, that restricts sampling to the smallest set of tokens whose cumulative probability reaches p. Top-p=0.9 keeps the most likely tokens covering 90% of the probability mass, renormalises, and samples from that subset.
→
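The nucleus-sampling cut can be sketched as a filter over a token distribution. The four tokens and their probabilities below are invented; the function keeps the smallest high-probability set whose mass reaches p and renormalises it, after which a sampler would draw from the result.

```python
def top_p_filter(probs: dict, p: float) -> dict:
    """Keep the smallest high-probability set whose mass reaches p, then renormalise."""
    kept, cumulative = {}, 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {token: prob / total for token, prob in kept.items()}

probs = {"return": 0.5, "print": 0.3, "yield": 0.15, "raise": 0.05}
nucleus = top_p_filter(probs, 0.9)   # "raise" falls outside the nucleus
```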
Vector database
/glossary/vector-database
A database optimised for storing embeddings and running approximate nearest-neighbour search at scale. Common options: pgvector (Postgres extension), Pinecone, Weaviate, Qdrant, Chroma, Milvus.
→
Vibe coding
/glossary/vibe-coding
Software development style where the engineer describes intent in natural language and an AI agent writes, edits, and runs the code with minimal manual review. Coined by Andrej Karpathy in February 2025 in a tweet describing the shift from typing code to guiding an agent.
→