RAG beyond chunks: graph RAG, agentic RAG, long-context RAG
Classic chunk-based RAG has limits. Graph RAG, agentic RAG, and long-context RAG each break those limits in different ways. When each is the right tool, how they actually work, and the production trade-offs that matter.
If you've built a classic RAG system, you know its strengths: it retrieves relevant chunks, the LLM generates grounded answers, performance is reasonable, costs are predictable. For most knowledge-base queries, this is fine.
But some queries break classic RAG. Multi-hop questions ("Which of our customers are using feature X and have churned in the last 6 months?"). Relationship-heavy questions ("How does our pricing compare across competitors A, B, and C?"). Synthesis questions ("Summarize everything we know about this customer's journey"). Classic chunk-based retrieval doesn't compose chunks well; the LLM ends up missing context that's spread across many places.
The "beyond chunks" approaches — graph RAG, agentic RAG, long-context RAG — each tackle these limits in different ways. This article covers what each is, when each is the right tool, how they actually work in production, and the trade-offs that matter.
The limits of chunk-based RAG
To understand what we're fixing, start with the limits:
Limit 1: No relationship structure. Chunks are independent units. The fact that chunk A is about customer X's account and chunk B is about a complaint customer X made is lost — they're just two chunks in a vector space, retrieved (or not) independently.
Limit 2: No multi-hop. "Customers who use feature X and complained about Y" requires combining information from two different sources, with set logic. Chunk-based retrieval doesn't do this.
Limit 3: Limited synthesis. "Summarize this account's relationship over time" needs to pull together many chunks in a coherent narrative. Chunks are presented as fragments; the LLM has to do the synthesis from scratch each time.
Limit 4: Fixed-pipeline rigidity. Classic RAG always does: embed query → retrieve K → generate. Complex queries that need iterative retrieval or multi-step reasoning don't fit this pipeline.
Limit 5: Context dilution. The top 5 chunks often include some that are similar to the query in embedding space but off-topic. The LLM has to wade through them, and answer quality suffers.
The variants we'll discuss each address subsets of these.
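To make Limit 4 concrete, here is a minimal sketch of the fixed pipeline. The bag-of-words `embed` function and the `generate` stub are stand-ins invented for illustration (a real system would call an embedding model and an LLM). The point is only that the control flow never branches or loops.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: a bag-of-words count vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, chunks: list[str], k: int) -> list[str]:
    # Score every chunk independently against the query; keep the top k.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def generate(query: str, context: list[str]) -> str:
    # Hypothetical LLM call, stubbed: just shows what the model would receive.
    return f"Q: {query}\nContext:\n" + "\n".join(f"- {c}" for c in context)

def classic_rag(query: str, chunks: list[str], k: int = 2) -> str:
    # embed query -> retrieve K -> generate, always in this order, exactly once.
    return generate(query, retrieve(query, chunks, k))

chunks = [
    "Acme uses the export feature daily",
    "Acme filed a complaint about billing",
    "Globex renewed its contract in March",
]
top = retrieve("complaint about billing", chunks, k=1)
```

Each chunk is scored on its own, so nothing in this pipeline knows that the first two chunks concern the same customer.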
Graph RAG
The idea: represent your data as a knowledge graph. Entities (people, products, documents, events) are nodes; relationships are edges. Queries traverse the graph rather than (or in addition to) doing vector search.
When graph RAG helps
Relationship-heavy domains. Customer-account-deal-interaction structures. Org hierarchies. Product taxonomies. Citation networks. Anything where the connections between entities matter as much as the entities themselves.
Multi-hop reasoning. "Who is the manager of the team that bought product X in Q3?" requires hopping: product → deal → team → manager. Graph RAG handles this naturally.
Aggregation. "How many customers in segment Y have integrated with system Z?" requires set operations across entities. A structured query over the graph (Cypher, or SQL on edge tables) beats text retrieval.
Citations and explanations. Graph relationships are explicit and auditable. The LLM can cite "John is the manager of Team Acme [edge: manages]" rather than "I think John manages Team Acme based on context."
How graph RAG actually works
The typical pipeline:
1. Extraction. Build the graph from your data. Two common approaches:
- Structured sources (databases, structured APIs): import directly. Customers, products, transactions are already in tables.
- Unstructured sources (documents, transcripts, emails): use an LLM to extract entities and relationships. "From this transcript, extract people, organizations, and the relationships between them."
The output: nodes (with types and properties) and edges (with types and properties).
2. Storage. A graph database — Neo4j, Memgraph, custom Postgres with edge tables. The choice depends on query patterns and scale.
3. Augment with embeddings. Each node also gets a text representation and an embedding. This enables hybrid: traverse the graph AND do semantic search.
4. Query at retrieval time. Three common patterns:
- Graph-only query. The LLM (or a routing logic) generates a graph query (Cypher, SQL). Execute. Return results to the LLM.
- Embedding-first, graph-expand. Find relevant entities by embedding. Then expand to their neighbors and related entities.
- Hybrid. Combine vector search + graph traversal in one pipeline.
5. Format for the LLM. Graph results are formatted as structured text the LLM can use. Entities with their properties; relationships explicit.
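The five steps above can be sketched end to end on a toy in-memory graph. Everything here is an illustrative assumption: the `Node` and `Edge` shapes, the word-overlap scoring used in place of real embeddings, and the one-hop expansion. It shows the embedding-first, graph-expand pattern, not a production design.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    id: str
    type: str
    text: str                                  # text representation used for matching
    props: dict = field(default_factory=dict)

@dataclass
class Edge:
    src: str
    type: str
    dst: str

# Steps 1-2: extracted entities and relationships, stored in memory.
nodes = {n.id: n for n in [
    Node("c1", "Customer", "Acme Corp enterprise customer"),
    Node("t1", "Ticket", "login failures after SSO migration"),
    Node("e1", "Employee", "Dana support engineer"),
]}
edges = [Edge("c1", "SUBMITTED", "t1"), Edge("t1", "ASSIGNED_TO", "e1")]

def score(query: str, text: str) -> int:
    # Step 3 stand-in for embedding similarity: count of shared lowercase words.
    return len(set(query.lower().split()) & set(text.lower().split()))

def seed_nodes(query: str, k: int = 1) -> list[str]:
    # Step 4, part 1: find entry points by (toy) semantic search.
    ranked = sorted(nodes.values(), key=lambda n: score(query, n.text), reverse=True)
    return [n.id for n in ranked[:k] if score(query, n.text) > 0]

def expand(seed_ids: list[str]) -> list[Edge]:
    # Step 4, part 2: pull in every edge touching a seed node (one hop).
    seeds = set(seed_ids)
    return [e for e in edges if e.src in seeds or e.dst in seeds]

def format_for_llm(found: list[Edge]) -> str:
    # Step 5: render relationships as explicit, citable lines of text.
    return "\n".join(
        f"{nodes[e.src].type}({nodes[e.src].text}) -[{e.type}]-> "
        f"{nodes[e.dst].type}({nodes[e.dst].text})"
        for e in found
    )

seeds = seed_nodes("SSO login failures")
context = format_for_llm(expand(seeds))
```

The query only matches the ticket by text, but the expansion step brings in the customer who submitted it and the employee it is assigned to.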
A concrete example
A SaaS company with customer data. Entities: customers, contracts, products, support tickets, interactions, employees.
Classic RAG approach: chunk customer documents, embed, retrieve. Loses relational structure.
Graph RAG approach:
- Nodes: customer, contract, product, ticket, interaction, employee.
- Edges: customer→has→contract, customer→subscribed_to→product, customer→submitted→ticket, ticket→assigned_to→employee, contract→sold_by→employee.
Query: "Which customers in the SaaS tier had more than 3 support tickets in Q1 and are up for renewal in Q2?"
This is naturally a graph query:
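// Cypher has no BETWEEN operator; chained comparisons express the range.
// Dates are assumed to be stored as ISO-8601 strings, which compare correctly as text.
MATCH (c:Customer)-[:HAS]->(contract:Contract)
WHERE contract.tier = "SaaS"
  AND "2026-04-01" <= contract.renewal_date <= "2026-06-30"
MATCH (c)-[:SUBMITTED]->(t:Ticket)
WHERE "2026-01-01" <= t.created <= "2026-03-31"
// DISTINCT avoids double-counting tickets when a customer has several matching contracts.
WITH c, count(DISTINCT t) AS ticket_count
WHERE ticket_count > 3
RETURN c, ticket_count
The LLM generates this query (or selects from templated queries). Execute it, format the results, and generate the response.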
Classic RAG can't easily answer this. Graph RAG does so cleanly.
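One way to keep LLM-driven queries safe is the templated route mentioned above: the model picks a template name and fills in parameters, and the application validates both before anything reaches the database. The sketch below assumes a hypothetical template registry and parameter names. It builds the query text and parameter map but leaves execution to whatever driver you use.

```python
from datetime import date

# Hypothetical registry: template name -> (Cypher text, required parameter types).
TEMPLATES = {
    "tickets_before_renewal": (
        "MATCH (c:Customer)-[:HAS]->(k:Contract) "
        "WHERE k.tier = $tier AND $renew_from <= k.renewal_date <= $renew_to "
        "MATCH (c)-[:SUBMITTED]->(t:Ticket) "
        "WHERE $tickets_from <= t.created <= $tickets_to "
        "WITH c, count(DISTINCT t) AS ticket_count "
        "WHERE ticket_count > $min_tickets "
        "RETURN c, ticket_count",
        {"tier": str, "renew_from": date, "renew_to": date,
         "tickets_from": date, "tickets_to": date, "min_tickets": int},
    ),
}

def build_query(name: str, params: dict) -> tuple[str, dict]:
    # Reject unknown templates so the model cannot run arbitrary Cypher.
    if name not in TEMPLATES:
        raise ValueError(f"unknown template: {name}")
    text, schema = TEMPLATES[name]
    if set(params) != set(schema):
        raise ValueError(f"expected parameters {sorted(schema)}")
    for key, expected in schema.items():
        if not isinstance(params[key], expected):
            raise TypeError(f"{key} must be {expected.__name__}")
    # Dates go out as ISO strings, assuming the graph stores them that way.
    clean = {k: v.isoformat() if isinstance(v, date) else v for k, v in params.items()}
    return text, clean

query, params = build_query("tickets_before_renewal", {
    "tier": "SaaS",
    "renew_from": date(2026, 4, 1), "renew_to": date(2026, 6, 30),
    "tickets_from": date(2026, 1, 1), "tickets_to": date(2026, 3, 31),
    "min_tickets": 3,
})
```

Passing values as `$parameters` rather than splicing them into the query string also keeps user-supplied text out of the Cypher itself.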
Trade-offs
Pros:
- Handles relational queries naturally.
- Explicit, auditable structure.
- Composable with embeddings.
Cons:
- Building the graph is real engineering work. Especially for unstructured sources, extraction is imperfect.
- Schema design matters; bad schemas constrain you.
- Maintenance: as data evolves, the graph evolves.
- Less mature tooling than vector search.
When to choose:
Choose graph RAG when relationships are first-class in your domain. Don't choose it just because it sounds sophisticated; for many document-heavy domains, classic RAG is simpler and just as good.
Microsoft's Graph RAG and related work
Microsoft's open-source GraphRAG project (2024) popularized a specific approach:
- Extract entities and relationships from documents (LLM-based).
- Cluster entities into communities.
- Generate summaries per community at multiple hierarchical levels.
- At query time: retrieve relevant community summaries; use them as context.
This works well for "global" questions that span a corpus (e.g., "what are the main themes in this customer's complaint history?") rather than specific lookups.
Variants include LightRAG, Graphiti, and others — each with specific architectural choices.
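The indexing half of the community-summary approach can be sketched in a few lines. Two loud assumptions: connected components stand in for the hierarchical community detection GraphRAG actually uses (the Leiden algorithm), and `summarize` is a stub where an LLM call would go. The entity names are made up.

```python
from collections import defaultdict

# Hypothetical extracted relationships: (entity, relation, entity).
triples = [
    ("Acme", "complained_about", "billing"),
    ("billing", "owned_by", "Finance team"),
    ("Globex", "requested", "SSO integration"),
]

def communities(triples):
    # Group entities into connected components with union-find.
    # (A simplification: GraphRAG itself uses hierarchical Leiden clustering.)
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, _, b in triples:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for entity in list(parent):
        groups[find(entity)].add(entity)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])

def summarize(members, triples):
    # Stub for an LLM call: list the facts that fall inside this community.
    return "; ".join(f"{a} {r} {b}" for a, r, b in triples if a in members)

def retrieve_summaries(query, summaries, k=1):
    # Query time: rank community summaries by word overlap with the query.
    words = set(query.lower().split())

    def overlap(summary):
        return len(words & set(summary.lower().replace(";", " ").split()))

    return sorted(summaries, key=overlap, reverse=True)[:k]

groups = communities(triples)
summaries = [summarize(g, triples) for g in groups]
best = retrieve_summaries("billing complaints", summaries)
```

The retrieval unit here is a summary of a whole cluster rather than a chunk, which is why the approach suits corpus-wide questions better than specific lookups.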
Agentic RAG
The idea: instead of a fixed retrieve-then-generate pipeline, use an LLM agent that decides what to retrieve, when, and how to refine. The agent can ask follow-up retrieval queries, look at results and decide they're insufficient, try different angles.
When agentic RAG helps
Complex queries needing iteration. "Help me understand why our churn went up in Q1" requires looking at many angles (which segment, which time, which features, which competitors). An agent can iteratively explore.
Queries where one retrieval isn't enough. If the answer requires combining info from multiple distinct retrievals, an agent handles this naturally.
Queries with conditional logic. "If X is true based on retrieval 1, then look up Y; otherwise look up Z." Agents handle branching; fixed pipelines don't.
Ambiguous queries. The agent can ask the user (or the data) for clarification.
How agentic RAG works
The pipeline:
User query
↓
Agent reasons about what it needs
↓
Agent calls retrieval tools (one or many)
↓
Agent reads results
↓
Agent decides: enough info? Or another retrieval?
↓
Loop until done
↓
Generate final answer
The implementation involves:
Retrieval as tools. Expose retrieval functions to the agent: search_documents(query), lookup_by_id(id), aggregate(field, filter). The agent calls them as needed.
Memory. The agent remembers what it's retrieved across calls. Avoids re-fetching the same content.
Decision-making. The agent reasons explicitly about whether it has enough information. "Do I know the answer to the user's question? If not, what else do I need to retrieve?"
Termination. The agent must know when to stop. A max-step budget. A confidence threshold. An "I've answered" condition.
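Those four pieces fit together in a loop like the one below. This is a sketch under assumptions: `decide` is a hard-coded stand-in for the LLM's reasoning step, and `search_documents` is a stub over an in-memory dictionary rather than a real retrieval backend. The loop structure (tool call, cached memory, sufficiency check, step budget) is the part that carries over.

```python
from typing import Optional

DOCS = {
    "d1": "Q1 complaint: pricing increased without notice",
    "d2": "Q1 complaint: support replies take days",
    "d3": "Q2 praise: new dashboard is great",
}

def search_documents(query: str) -> list[str]:
    # Stub retrieval tool: ids of docs sharing any word with the query.
    words = set(query.lower().split())
    return [
        doc_id for doc_id, text in DOCS.items()
        if words & set(text.lower().replace(":", "").split())
    ]

def decide(memory: dict, tried: list[str]) -> Optional[str]:
    # Stand-in for LLM reasoning: stop once two documents are in memory,
    # otherwise try the next untried angle.
    if len(memory) >= 2:
        return None
    for angle in ["pricing", "support", "integrations"]:
        if angle not in tried:
            return angle
    return None

def run_agent(max_steps: int = 5) -> tuple[dict, int]:
    memory: dict[str, str] = {}   # doc id -> text, so nothing is fetched twice
    tried: list[str] = []
    steps = 0
    while steps < max_steps:      # termination: hard step budget
        query = decide(memory, tried)
        if query is None:         # termination: the agent judges it has enough
            break
        steps += 1
        tried.append(query)
        for doc_id in search_documents(query):
            memory.setdefault(doc_id, DOCS[doc_id])
    return memory, steps

memory, steps = run_agent()
```

The step budget is the backstop: even if `decide` never returns `None`, the loop ends after `max_steps` retrievals.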
A concrete example
Query: "What were the top three customer concerns in Q1, with examples?"
Classic RAG approach: retrieve some customer feedback chunks, hope they cover the topic.
Agentic RAG approach:
Agent: I need to find Q1 customer concerns. Let me start by searching for customer complaints in that period.
> Tool: search_documents(query="customer complaints Q1 2026", filter={date_range: "Q1 2026"})
Agent: I got 25 results. Let me see what topics they cover.
> [reads results]
Agent: I see three main themes: pricing, slow support, and missing integrations. Let me get specific examples for each.
> Tool: search_documents(query="customer pricing complaints", filter={...})
> Tool: search_documents(query="customer support speed complaints", filter={...})
> Tool: search_documents(query="customer integration missing complaints", filter={...})
Agent: Now I have 3-5 specific examples per theme. Let me compile the answer.
Multiple retrievals, iteratively refined. The agent decides the structure based on what it finds.
Trade-offs
Pros:
- Handles complex, multi-step queries.
- Adapts to query complexity (simple queries don't trigger long agent runs).
- Can clarify ambiguity by asking.
Cons:
- Higher latency (multiple retrievals).
- Higher cost (multiple LLM calls).
- Agent reliability matters; bad agents loop or give up.
- Harder to evaluate (more varied execution paths).
- Harder to control (the agent might do unexpected things).
When to choose:
Choose agentic RAG when query complexity varies widely. Simple queries can use fast paths; complex queries get the agentic treatment. For uniformly simple queries, the overhead isn't worth it.
Patterns within agentic RAG
A few common patterns:
Pattern 1: ReAct (Reason + Act). Agent reasons explicitly, then acts (retrieves), then observes, then reasons again. Loops until done.
Pattern 2: Plan-and-execute. Agent first creates a multi-step plan (what to retrieve in what order), then executes the plan, possibly adjusting.
Pattern 3: Self-critique. After retrieval, agent evaluates whether the retrieved info is sufficient. If not, refines query and retrieves again.
Pattern 4: Tool-rich agent. Agent has many retrieval tools (full-text search, SQL query, graph query, API calls) and picks among them.
Different patterns suit different use cases. Tool-rich works for heterogeneous data sources; ReAct works for exploratory queries; plan-and-execute works when the structure of a complex query can be planned upfront.
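As one illustration, plan-and-execute separates deciding from doing. In the sketch below the planner is a hard-coded stub (an LLM would produce the plan in a real system) and the tool names are hypothetical. The tool-rich pattern shows up as the dispatch table.

```python
from typing import Callable

# Hypothetical tools; each returns a short text result.
def full_text_search(arg: str) -> str:
    return f"docs matching '{arg}'"

def sql_query(arg: str) -> str:
    return f"rows for '{arg}'"

TOOLS: dict[str, Callable[[str], str]] = {
    "full_text_search": full_text_search,
    "sql_query": sql_query,
}

def make_plan(question: str) -> list[tuple[str, str]]:
    # Stub planner: an LLM would emit these (tool, argument) steps.
    return [
        ("sql_query", "churned accounts in Q1"),
        ("full_text_search", "cancellation reasons"),
    ]

def execute(plan: list[tuple[str, str]]) -> list[str]:
    results = []
    for tool_name, arg in plan:
        tool = TOOLS.get(tool_name)
        if tool is None:
            # Record and skip steps that name a tool we never exposed.
            results.append(f"skipped unknown tool '{tool_name}'")
            continue
        results.append(tool(arg))
    return results

results = execute(make_plan("Why did churn go up in Q1?"))
```

Because the whole plan exists before anything runs, it can be logged, reviewed, or capped in length, which is harder with a ReAct loop that decides one step at a time.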
Long-context RAG
The idea: with context windows reaching 1M tokens on some models (Gemini, GPT-4.1), why retrieve chunks at all? Just put the whole corpus in the context.
When long-context RAG helps
Small corpora. A 100K-token corpus easily fits in a 1M-token window. No retrieval infrastructure needed.
Whole-document understanding. "Summarize this entire 500-page document." A long-context model handles this directly.
Cross-document queries on small sets. "Compare these 10 contracts." Easier to put them all in context than to retrieve carefully.
Prototyping. Long-context is the simplest path to a working system. Build the prototype with long context; optimize with retrieval later if needed.
How long-context RAG works
The pipeline is trivial:
[corpus, possibly 100K-1M tokens]
↓
+ [user query]
↓
LLM call
↓
[answer]
No vector store. No chunking. No reranking.
In practice, you might still do light retrieval to fit the corpus into context (e.g., for a 5M token corpus, retrieve a 500K-token subset). But the retrieval is coarse-grained — the LLM does the fine-grained "find the relevant parts" work.
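The coarse-grained step can be as simple as ranking whole documents and packing them until a token budget is hit. The sketch below assumes a crude whitespace word count as the token estimate (a real system would use the model's tokenizer) and word overlap as the relevance score. Both are placeholders.

```python
def estimate_tokens(text: str) -> int:
    # Crude assumption: one token per whitespace-separated word.
    return len(text.split())

def relevance(query: str, doc: str) -> int:
    # Placeholder relevance score: number of shared lowercase words.
    return len(set(query.lower().split()) & set(doc.lower().split()))

def pack_context(query: str, docs: list[str], budget: int) -> list[str]:
    # Rank whole documents, then add them greedily while they fit the budget.
    chosen, used = [], 0
    for doc in sorted(docs, key=lambda d: relevance(query, d), reverse=True):
        cost = estimate_tokens(doc)
        if used + cost <= budget:
            chosen.append(doc)
            used += cost
    return chosen

def build_prompt(query: str, docs: list[str], budget: int) -> str:
    # Corpus subset first, question last; the LLM does the fine-grained search.
    return "\n\n".join(pack_context(query, docs, budget)) + f"\n\nQuestion: {query}"

docs = [
    "contract A renewal terms and pricing schedule",
    "holiday party planning notes",
    "contract B pricing and renewal terms with discounts",
]
picked = pack_context("contract renewal pricing", docs, budget=15)
```

With a budget of 15 estimated tokens, both contracts fit and the unrelated notes are left out.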
The "context rot" problem
A 2025-2026 reality: long-context models don't actually use long contexts well.
Empirically:
- Quality is highest with 5-50K tokens of context.
- Quality drops noticeably at 100K+ tokens.
- At 500K+ tokens, important information is often missed or mis-applied.
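Models can technically accept long contexts, but simple "needle in a haystack" benchmarks, which test retrieval of a single planted fact, oversell how well they reason over them. Real-world long-context use suffers.
As a rule of thumb, this means long-context RAG works reliably for corpora up to roughly 50K tokens. Beyond that, quality degrades relative to good retrieval, and the exact threshold varies by model and task.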
Trade-offs
Pros:
- Simplest possible architecture.
- No retrieval pipeline to maintain.
- Best for whole-corpus understanding tasks.
Cons:
- Quality degradation at large context sizes.
- Cost per query is high (you're paying for the full context every time).
- Latency is high (large context = slower response).
- Doesn't scale beyond corpora that fit reliably.
When to choose:
Small corpora (under 50K tokens). One-off analyses. Prototypes. NOT general-purpose retrieval for large knowledge bases.
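The cost point is easy to quantify with back-of-envelope arithmetic. The price below is a made-up placeholder, not any provider's actual rate, and output tokens and caching discounts are ignored. Substitute your own numbers.

```python
def input_cost(context_tokens: int, queries: int, price_per_million: float) -> float:
    # Every query pays for its full input context again.
    return context_tokens * queries * price_per_million / 1_000_000

PRICE = 2.0  # hypothetical dollars per million input tokens

# Classic RAG: a handful of retrieved chunks per query.
classic = input_cost(context_tokens=3_000, queries=10_000, price_per_million=PRICE)
# Long-context: most of the corpus on every query.
long_ctx = input_cost(context_tokens=500_000, queries=10_000, price_per_million=PRICE)
ratio = long_ctx / classic
```

Whatever the price, the ratio between the two is just the ratio of context sizes, so a context that is over a hundred times larger costs over a hundred times more per query.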
Hybrid: retrieval + long context
A common pattern: retrieve a larger context (50-200K tokens of relevant content) than classic RAG would, but smaller than the full corpus. The LLM gets enough context to handle the query well without context rot.
Implementation: retrieve top-50 chunks (instead of top-5), include them all, let the LLM sort through.
This works well when:
- Queries need broad context.
- Your model handles medium contexts (50-200K tokens) well, which is worth verifying given the degradation at 100K+ noted above.
- Cost is acceptable.
A 2026 sweet spot: retrieve aggressively (50-100 chunks), include all, let the LLM use the relevant parts. Trades cost for simplicity and quality.
Choosing the right variant
A decision framework:
Use classic chunk-based RAG when:
- Documents are the primary data.
- Queries are mostly retrieval-style.
- Volume and cost matter (cheapest per query).
- You need predictable latency.
Use graph RAG when:
- Data has rich entity-relationship structure.
- Queries involve multi-hop reasoning, set operations, aggregation.
- You can invest in graph construction and maintenance.
Use agentic RAG when:
- Query complexity varies widely.
- Some queries need iterative exploration.
- You're willing to pay higher latency/cost for hard queries.
- You have observability to debug agent runs.
Use long-context RAG when:
- Corpus is small (under 50K tokens).
- You want simplest architecture.
- One-off analyses or prototypes.
Combine variants (most real systems do):
- Classic + graph for relational queries.
- Classic + agentic for complex queries.
- Classic with broader retrieval (medium context) for borderline cases.
The mature 2026 answer is "all of the above, applied per query." A router determines which variant to use based on the query's characteristics.
Production realities
A few observations from real deployments:
Complexity compounds. Each variant adds complexity. A system using all four is a significant engineering undertaking. Start with classic; add variants only when you hit clear limits.
Eval is harder. With multiple retrieval pathways, evaluation must cover them all. The test set should include queries that exercise each pathway.
Costs vary widely. Graph RAG might be cheap (a database query). Long-context RAG is expensive. Agentic RAG varies (simple = cheap, complex = expensive). Track per-query costs.
Latency varies similarly. A 30-second agentic RAG response is okay for some use cases; not for others. Pick variants per UX context.
Maintenance burden. Graph schemas drift, agent prompts need tuning, embedding models update. Each variant has its own maintenance cost. Plan accordingly.
The 80/20. Classic RAG handles 80% of queries adequately. Beyond-chunks variants handle the hard 20% that classic does badly. Don't replace; augment.
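Two of these points, per-query cost tracking and eval coverage across pathways, lend themselves to a small sketch. The record fields and path names are assumptions for illustration, not a prescribed schema.

```python
from collections import defaultdict
from dataclasses import dataclass

PATHS = {"classic", "graph", "agentic", "long_context"}

@dataclass
class QueryRecord:
    path: str          # which retrieval variant served the query
    latency_s: float
    cost_usd: float

def per_path_stats(records: list[QueryRecord]) -> dict[str, dict[str, float]]:
    # Aggregate count, mean latency, and total cost for each retrieval path.
    grouped = defaultdict(list)
    for r in records:
        grouped[r.path].append(r)
    return {
        path: {
            "count": len(rs),
            "mean_latency_s": sum(r.latency_s for r in rs) / len(rs),
            "total_cost_usd": sum(r.cost_usd for r in rs),
        }
        for path, rs in grouped.items()
    }

def uncovered_paths(eval_set: list[dict]) -> set[str]:
    # Paths that no eval query is expected to exercise.
    return PATHS - {case["expected_path"] for case in eval_set}

records = [
    QueryRecord("classic", 1.0, 0.001),
    QueryRecord("classic", 2.0, 0.001),
    QueryRecord("agentic", 30.0, 0.05),
]
stats = per_path_stats(records)
missing = uncovered_paths([
    {"q": "lookup X", "expected_path": "classic"},
    {"q": "why did churn rise?", "expected_path": "agentic"},
])
```

A non-empty `missing` set is a cheap signal that a pathway can regress without the eval suite noticing.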
A combined architecture
A practical architecture using multiple variants:
Query
↓
Router (classify the query)
├→ "Lookup" → Classic RAG (cheap, fast)
├→ "Relational" → Graph RAG
├→ "Complex / open-ended" → Agentic RAG
└→ "Whole-corpus / small corpus" → Long-context
Each variant produces an answer.
Observability tracks which path was used.
Eval suites cover all paths.
This is more complex than any single variant but handles the full range of queries well. For mature systems with diverse query types, this is the eventual architecture.
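A router does not have to be sophisticated to start with. The keyword rules below are a deliberately naive assumption (many systems would use an LLM or a trained classifier instead), and the phrases are invented examples. The labels mirror the diagram above.

```python
def route(query: str, corpus_tokens: int) -> str:
    q = query.lower()
    # Small corpora can skip retrieval entirely.
    if corpus_tokens < 50_000:
        return "long_context"
    # Counting, set logic, or following relationships suggests the graph.
    if any(p in q for p in ("how many", "which customers", "manager of", "reports to")):
        return "graph"
    # Open-ended "why" and exploratory questions need iteration.
    if any(p in q for p in ("why", "help me understand", "investigate")):
        return "agentic"
    # Default: cheap, fast lookup.
    return "classic"

decisions = {
    q: route(q, corpus_tokens=5_000_000)
    for q in [
        "What is our refund policy?",
        "How many customers in segment Y use system Z?",
        "Why did churn go up in Q1?",
    ]
}
```

Logging the returned label alongside each answer gives you the "which path was used" signal that observability and evals depend on.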
A practical buildout
If you're starting from a working classic RAG and want to expand:
Add long-context first. Lowest engineering cost. Useful for specific query types. Often produces wins immediately.
Add agentic for complex queries. Identify the queries classic RAG struggles with. Build an agent that handles them. Route to it conditionally.
Add graph RAG last. Highest engineering cost. Only worth it if you have clear relational query patterns.
This order typically matches the ROI. Long-context: low cost, real value. Agentic: moderate cost, addresses real gaps. Graph: high cost, specific use cases.
The takeaway
Classic chunk-based RAG is the workhorse, but it has limits. The "beyond chunks" variants — graph RAG, agentic RAG, long-context RAG — each unlock different capabilities.
The path forward isn't to replace classic RAG but to augment it. Mature systems use multiple variants, routed appropriately, with each handling the queries it's good at.
The investment is meaningful — each variant is real engineering work. But for systems where classic RAG plateaus, the gains in capability are real. A system that handles only "lookup" queries well is much less useful than one that handles relational, complex, and whole-corpus queries too.
Map your queries. Identify the ones classic RAG handles badly. Pick the variant that fits. Build the augmentation. Iterate.
That's how RAG systems grow from useful-for-some-queries to genuinely capable across the diversity of real-world questions.