
Building a RAG Pipeline from Your Sitecore Content
A comprehensive technical guide to building an AI-powered chatbot that uses your Sitecore CMS content as its knowledge base.
Disclaimer: Code snippets in this guide are simplified pseudocode. They illustrate patterns and architecture — not copy-paste solutions. You'll need to adapt them to your own Sitecore instance, tech stack, and domain.
Introduction
What if every page on your Sitecore site could answer questions in real time? Not with a generic chatbot spitting out canned responses, but with a custom AI assistant that actually knows your content — every article, every FAQ, every product description — and can cite the exact page where the answer lives?
That's what I built. And the secret isn't fine-tuning an LLM on your content (expensive, brittle, stale within days). It's Retrieval-Augmented Generation — RAG.
What is RAG?
RAG is an AI architecture pattern that separates knowledge from reasoning. Instead of baking your content into the model's weights through expensive fine-tuning, you keep the content external and feed it to the model at query time:
- Index your content into a vector database at ingestion time
- Retrieve the most relevant chunks when a user asks a question
- Generate a response using an LLM, grounded in the retrieved content
Think of it like giving someone an open-book exam instead of asking them to memorize a textbook. The LLM is the student — brilliant at reasoning and communication, but we don't trust it to remember specific facts. The vector database is the textbook — it holds the authoritative content. At query time, we find the right pages and hand them to the student.
The result: an AI that always has up-to-date answers, never hallucinates beyond your content, and can point users to the exact page on your site. For Sitecore teams, this means your existing CMS investment becomes the knowledge backbone of an intelligent conversational experience — no content migration, no dual authoring, no sync headaches.
What You'll Learn
By the end of this guide, you'll understand:
- Why Sitecore's Layout Service is the ideal data source for a RAG pipeline (and why scraping HTML is a trap)
- How to turn Sitecore's component/placeholder model into semantic content chunks
- The theory behind vector embeddings and why chunk quality determines everything
- How to design LLM prompts that keep the AI grounded, accurate, and citation-rich
- How multi-turn conversation design keeps users engaged on your site
- The architectural decisions that make this system production-ready
Architecture Overview
The Two Pipelines
The system is built around two distinct pipelines that operate on completely different timescales:
INGESTION PIPELINE (periodic)

  Sitemap ──▶ Layout Service ──▶ Component Parser ──▶ Chunker
      ──▶ Embedding Model ──▶ Qdrant

QUERY PIPELINE (real-time)

  User Query ──▶ Query Rewriter ──▶ Embedding ──▶ Vector Search
      ──▶ Retrieved Chunks ──▶ LLM + Context ──▶ Session
      ──▶ Response with Source Citations (links to Sitecore pages)
The ingestion pipeline runs periodically — every 48 hours in our case, triggered by a Kubernetes CronJob. It crawls your Sitecore content, breaks it into semantic chunks, converts those chunks into mathematical vectors, and stores them in a vector database. This is the "loading the textbook" step.
The query pipeline runs in real time, every time a user sends a message. It converts the user's question into a vector, finds the most similar content chunks, and feeds those chunks as context to an LLM that generates a natural-language response with citations. This is the "open-book exam" step.
This separation is fundamental. Your content authors keep working in Sitecore exactly as they always have. The AI automatically picks up changes on the next ingestion cycle. No retraining, no fine-tuning, no manual intervention. The CMS remains the single source of truth.
The Tech Stack
| Layer | Technology | Why This Choice |
|---|---|---|
| Content Source | Sitecore Layout Service (JSS) | Structured JSON instead of messy HTML |
| URL Discovery | Sitemap (XML) | Authoritative, complete, auto-updated |
| API Framework | FastAPI (Python 3.11) | Async-native, type-safe, SSE streaming support |
| Embeddings | BAAI/bge-large-en-v1.5 | Top-tier quality, self-hosted, 1024 dimensions |
| Vector Database | Qdrant | Cosine similarity, rich payloads, Kubernetes-native |
| LLM | Claude API (Anthropic) | Strong instruction following, low hallucination |
| Session Store | Redis | Fast, ephemeral, TTL-based conversation state |
| HTML Parsing | BeautifulSoup4 | Reliable rich text field cleaning |
| Orchestration | Docker + Kubernetes | Scalable, reproducible deployment |
Why RAG Over Fine-Tuning?
If you're a Sitecore developer evaluating how to bring AI into your CMS ecosystem, you might wonder whether you should fine-tune a model on your content instead. Here's why RAG is the better fit for CMS-driven content:
- Freshness: Your Sitecore content changes daily. Fine-tuning requires retraining, which is slow and expensive. A RAG re-ingestion runs in minutes and can be repeated as often as your content changes.
- Traceability: RAG can cite the exact URL where an answer came from. We store source metadata alongside every chunk we search, so each retrieved result points back to its page. Fine-tuned models can't tell you where they learned something.
- Cost: Fine-tuning requires GPU hours and specialized infrastructure. RAG runs on commodity hardware.
- Accuracy: By constraining the LLM to answer only from retrieved content, you eliminate hallucination. Fine-tuned models still hallucinate — they just hallucinate things that sound like your content, which is arguably worse.
- Composability: You can swap LLM providers (Claude, GPT, a local open-source model) without re-ingesting content. Your knowledge layer and reasoning layer are independent.
For a CMS like Sitecore, where content is structured, well-organized, and frequently updated, RAG is the natural architecture.
The Sitecore Advantage: Why the Layout Service Changes Everything
Most RAG tutorials start with "scrape your website." This is a mistake for Sitecore sites, and understanding why is the most important insight in this entire guide.
The Problem with Scraping HTML
When you scrape a rendered Sitecore page, you get everything: the navigation bar, the footer, the cookie consent banner, the hero image alt text, the breadcrumb trail, the sidebar CTAs, and — somewhere buried in there — the actual content. Your scraper doesn't know which text is the article body and which is navigational chrome. The result:
- Noisy chunks: Navigation text and boilerplate appear in your search results
- Duplicated content: Headers and footers are repeated on every page
- Lost structure: You can't tell the difference between an FAQ answer and a CTA button
- Fragile parsing: Any front-end redesign breaks your scraper
The Layout Service Alternative
Sitecore's Layout Service (the JSS REST API) solves all of these problems. Instead of rendered HTML, it returns the structured JSON representation of a page — every component, every field, every placeholder, with type metadata intact.
The endpoint follows this pattern:
GET /sitecore/api/layout/render/jss
    ?sc_site=your-site
    &item=/path/to/page
    &sc_lang=en
    &sc_apikey={YOUR_API_KEY}
What you get back is a JSON tree like this:
{
  "sitecore": {
    "route": {
      "displayName": "Sourdough Bread Recipe",
      "placeholders": {
        "jss-main": [
          {
            "componentName": "HeroBanner",
            "fields": { "title": { "value": "Sourdough Bread Recipe" } }
          },
          {
            "componentName": "Section",
            "placeholders": {
              "section-content": [
                {
                  "componentName": "ContentBlock",
                  "fields": {
                    "heading": { "value": "What Makes Sourdough Special?" },
                    "content": { "value": "<p>Sourdough uses a <strong>natural starter</strong>...</p>" }
                  }
                },
                {
                  "componentName": "FAQAccordion",
                  "fields": {
                    "items": [
                      {
                        "fields": {
                          "question": { "value": "How long does it take?" },
                          "answer": { "value": "<p>About 24 hours total...</p>" }
                        }
                      }
                    ]
                  }
                }
              ]
            }
          }
        ]
      }
    }
  }
}
This is transformative for RAG because:
- You know what each piece of content is. A ContentBlock is an article body. A FAQAccordion is a list of Q&A pairs. A HeroBanner is decorative. You can make intelligent decisions about how to chunk each one.
- You can skip non-content components entirely. Navigation, CTAs, forms, video embeds — skip them. They add noise, not knowledge.
- The structure is recursive and predictable. Placeholders contain components. Components may contain nested placeholders. You write one recursive parser and it handles every page on your site.
- Field values are clean and typed. You get the raw content field, not the rendered HTML with all its wrapper divs and CSS classes.
Discovering URLs via the Sitemap
Before you can fetch Layout Service data, you need to know which pages to fetch. Your Sitecore sitemap is the answer — one HTTP request gives you every indexable URL. Parse the XML, apply some exclusion filters (skip /contact-us, /privacy-policy, /login, etc.), and you have a clean list of content pages.
The key insight is that the sitemap is authoritative. Content authors control what appears in it through sitemap configuration. If a page shouldn't be in the AI's knowledge base, they exclude it at the source.
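As a concrete sketch of this discovery step (the sitemap URL and exclusion list below are placeholders, not the author's values):

# Sketch: discover indexable URLs from a standard sitemap.xml
import httpx
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://www.example.com/sitemap.xml"            # placeholder
EXCLUDED_PATHS = ("/contact-us", "/privacy-policy", "/login")  # adjust for your site
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def discover_urls() -> list[str]:
    xml = httpx.get(SITEMAP_URL, timeout=30).text
    root = ET.fromstring(xml)
    urls = [loc.text.strip() for loc in root.findall(".//sm:loc", NS)]
    # Drop pages that should never reach the AI's knowledge base
    return [u for u in urls if not any(path in u for path in EXCLUDED_PATHS)]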
Fetching at Scale
With a large set of pages, you need to be thoughtful about how you hit the Layout Service. We use:
- Controlled concurrency: Fetch 10 pages simultaneously, wait for the batch to complete, then fetch the next 10. This prevents overwhelming your Sitecore instance.
- Rate limiting: A small delay (100ms) between requests within each batch.
- Error tolerance: If a page fails, log it and move on. Don't let one 404 kill the entire ingestion.
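A minimal fetch helper, assuming the Layout Service endpoint pattern shown earlier; the host, site name, and API key are placeholders. The batch loop below calls it once per URL:

# Sketch: one Layout Service request (host, site name, and API key are placeholders)
import asyncio
import httpx

LAYOUT_SERVICE_URL = "https://www.example.com/sitecore/api/layout/render/jss"

async def fetch_layout_service(page_path: str) -> dict | None:
    params = {
        "sc_site": "your-site",
        "item": page_path,
        "sc_lang": "en",
        "sc_apikey": "YOUR_API_KEY",
    }
    try:
        async with httpx.AsyncClient(timeout=30) as http:
            await asyncio.sleep(0.1)  # ~100ms delay between requests
            resp = await http.get(LAYOUT_SERVICE_URL, params=params)
            resp.raise_for_status()
            return resp.json()
    except httpx.HTTPError as exc:
        print(f"Skipping {page_path}: {exc}")  # tolerate failures, keep ingesting
        return None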
# Pseudocode: batch fetching with controlled concurrency
async def fetch_all_pages(all_urls: list[str]) -> dict:
    page_data = {}
    for batch in chunk_urls(all_urls, batch_size=10):  # groups of 10 URLs at a time
        results = await asyncio.gather(*[
            fetch_layout_service(url) for url in batch
        ])
        for url, data in zip(batch, results):
            if data is not None:  # failed fetches return None; log and move on
                page_data[url] = data
    return page_data
Parsing Sitecore Components: Turning Structure into Semantics
This is the most Sitecore-specific step and where your domain knowledge as a Sitecore developer matters most. The Layout Service gives you a tree of components. You need to turn that tree into flat, semantic text chunks suitable for embedding.
The Component Taxonomy
Every Sitecore site has its own set of component renderings. You need to categorize yours into four groups:
1. Skip Components — No indexable content. These are visual, interactive, or navigational:
Navigation, Footer, HeroBanner, CTA, CTABlock, FormComponent,
VideoEmbed, ImageGallery, BreadcrumbNavigation, SocialMediaLinks...
2. Single-Chunk Components — Each instance produces one content chunk:
ContentBlock, HeadingBody, Bodytext, Testimonial, Table, Blog-Content...
3. Multi-Chunk Components — Each item within the component becomes its own chunk:
FAQAccordion, AccordionList, CollapsiblePanel...
4. Container Components — No content themselves, but they contain nested placeholders:
Section, MultiColumn, CollapsiblePanelWithIcon...
This categorization is the single most impactful design decision in the pipeline. Get it right, and your AI gives precise, relevant answers. Get it wrong, and it returns navigation text and CTA copy.
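In code, this taxonomy usually ends up as a simple dispatch table. The sketch below uses the example component names from this guide and a clean_html() helper (shown later in the rich text section); swap in the renderings your site actually uses:

# Sketch: routing components by category (example names, not a complete list)
SKIP_COMPONENTS = {"Navigation", "Footer", "HeroBanner", "CTA", "CTABlock",
                   "FormComponent", "VideoEmbed", "ImageGallery"}
SINGLE_CHUNK_COMPONENTS = {"ContentBlock", "HeadingBody", "Bodytext",
                           "Testimonial", "Table", "Blog-Content"}
MULTI_CHUNK_COMPONENTS = {"FAQAccordion", "AccordionList", "CollapsiblePanel"}

def parse_component(component: dict) -> list[str]:
    name = component.get("componentName", "")
    fields = component.get("fields", {})
    if name in SKIP_COMPONENTS:
        return []  # visual/navigational: noise, not knowledge
    if name in SINGLE_CHUNK_COMPONENTS:
        # Concatenate the component's field values into one chunk
        text = " ".join(clean_html(f.get("value", ""))
                        for f in fields.values() if isinstance(f, dict))
        return [text] if text.strip() else []
    if name in MULTI_CHUNK_COMPONENTS:
        # One chunk per item, formatted as a Q&A pair
        return [f"Q: {item['fields']['question']['value']}\n"
                f"A: {clean_html(item['fields']['answer']['value'])}"
                for item in fields.get("items", [])]
    return []  # container components have no fields of their own; recursion handles children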
Recursive Traversal
Sitecore's component tree is recursive. A Section component has a section-content placeholder, which contains ContentBlock components, which might be inside a MultiColumn that itself has nested placeholders. Your parser needs to handle arbitrary nesting depth.
The algorithm is straightforward:
# Pseudocode: recursive component tree traversal
def parse_placeholders(placeholders):
    chunks = []
    for name, components in placeholders.items():
        for component in components:
            # Extract content from this component based on its type
            chunks.extend(parse_component(component))
            # Recurse into any nested placeholders
            if "placeholders" in component:
                chunks.extend(parse_placeholders(component["placeholders"]))
    return chunks
Why FAQ Components Are RAG Gold
FAQ and accordion components deserve special attention. Each Q&A pair is a self-contained, question-answerable unit of knowledge — exactly what a RAG system needs. When a user asks a question that matches an FAQ, the retrieval is precise and the LLM response is grounded.
We format each Q&A pair as its own chunk: "Q: {question}\nA: {answer}". This means when a user asks "How long does sourdough take?", the retrieval system finds the exact Q&A pair, and the LLM can give a precise, cited answer.
Cleaning Rich Text Fields
Sitecore's rich text fields contain HTML. Before embedding, you need clean plaintext. The transformation is: strip all HTML tags, remove <script> and <style> elements, normalize whitespace, and remove invisible Unicode characters (zero-width spaces, byte order marks).
A library like BeautifulSoup handles this reliably. The key is not to over-engineer it — soup.get_text(separator=" ", strip=True) does 95% of the work.
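A minimal version of that cleaning step (the function name clean_html is ours):

# Sketch: strip tags, scripts, invisible characters, and extra whitespace
import re
from bs4 import BeautifulSoup

def clean_html(html: str) -> str:
    soup = BeautifulSoup(html or "", "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()  # drop script/style content entirely
    text = soup.get_text(separator=" ", strip=True)
    text = text.replace("\u200b", "").replace("\ufeff", "")  # zero-width space, BOM
    return re.sub(r"\s+", " ", text).strip()  # normalize whitespace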
Chunking Theory: Why Size Matters
Before you embed your content, you need to understand why chunk size is critical. This is one of the most underappreciated aspects of RAG pipeline design.
The Goldilocks Problem
An embedding is a mathematical summary of a piece of text. The quality of that summary depends entirely on how much text you're summarizing:
- Too short (< 50 characters): A heading like "Recipe Tips" has almost no semantic content. Its embedding will be generic and match far too many queries. You'll get false positives constantly.
- Too long (> 1000 characters): A 5,000-character article covers multiple topics. Its embedding becomes a blurry average — a little bit about everything, strongly about nothing. Relevant queries produce weak similarity scores.
- Just right (50–1000 characters): A focused paragraph or Q&A pair that covers one specific topic. The embedding is sharp and semantically meaningful. When a query matches, it matches strongly.
This is why the component-aware parsing matters. A Sitecore ContentBlock with a heading and two paragraphs is usually a perfect chunk. An FAQAccordion item is a perfect chunk. But a massive Blog-Content field needs to be split.
Splitting Strategy
When a chunk exceeds your maximum length, you need to split it. The naive approach — split every N characters — creates incoherent fragments. Instead, split at natural boundaries:
- Prefer sentence boundaries: Find the last . before the max length
- Fall back to word boundaries: Find the last space before the max length
- Add overlap: Include 50 characters of overlap between consecutive chunks so context isn't lost at the boundary
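A sketch of that boundary-aware splitter, using the limits discussed above (the exact heuristics in your pipeline may differ):

# Sketch: split oversized chunks at sentence/word boundaries with 50-char overlap
def split_chunk(text: str, max_len: int = 1000, overlap: int = 50) -> list[str]:
    chunks = []
    while len(text) > max_len:
        window = text[:max_len]
        cut = window.rfind(". ")      # prefer the last sentence boundary
        if cut <= overlap:
            cut = window.rfind(" ")   # fall back to the last word boundary
        if cut <= overlap:
            cut = max_len - 1         # last resort: hard cut
        chunks.append(text[:cut + 1].strip())
        text = text[cut + 1 - overlap:]  # carry overlap across the boundary
    if text.strip():
        chunks.append(text.strip())
    return chunks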
Deduplication
Sitecore sites often reuse components across pages — a shared disclaimer, a "related articles" sidebar, a boilerplate legal notice. Without deduplication, these appear dozens of times in your vector database, polluting search results.
The solution is simple: hash each chunk's content (MD5 is fine — this isn't cryptography) and skip duplicates. In our production system, this typically removes 5–10% of chunks.
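The dedup step itself is only a few lines; a minimal sketch of the hash-based approach described above:

# Sketch: hash-based deduplication (MD5 of the chunk text)
import hashlib

def deduplicate(chunks: list[str]) -> list[str]:
    seen, unique = set(), []
    for chunk in chunks:
        digest = hashlib.md5(chunk.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique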
Understanding Embeddings: The Math Behind Semantic Search
Embeddings are the bridge between human language and machine computation. Understanding how they work will help you make better design decisions throughout your pipeline.
What an Embedding Actually Is
An embedding model takes a piece of text and produces a fixed-size vector — in our case, 1024 floating-point numbers. This vector captures the semantic meaning of the text, not its keywords. Two texts about the same topic will have similar vectors, even if they use completely different words.
For example:
- "How do I make sourdough bread from scratch?" →
[0.0234, -0.0891, 0.1456, ...] - "Beginner guide to baking with natural starter" →
[0.0219, -0.0903, 0.1441, ...]
These vectors are very close in 1024-dimensional space (high cosine similarity), even though the two sentences share almost no words. The model has learned that "sourdough" and "natural starter" are related concepts, that "make from scratch" and "beginner guide" imply similar intent.
Cosine Similarity
We measure the "closeness" of two vectors using cosine similarity — the cosine of the angle between them in high-dimensional space. A score of 1.0 means identical meaning, 0.0 means completely unrelated.
In practice, for content retrieval:
- > 0.7: Strong match — very likely the right content
- 0.5–0.7: Moderate match — probably relevant
- < 0.5: Weak match — likely noise
We set a score threshold (0.5 in our system) below which results are discarded entirely. This prevents the AI from generating responses based on tangentially related content.
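For reference, the similarity score is just the normalized dot product of the two vectors; a NumPy version looks like this:

# Cosine similarity: dot product divided by the product of the vector norms
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))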
Why We Chose BAAI/bge-large-en-v1.5
The embedding model is the foundation of your retrieval quality. We chose BGE-large for several reasons:
- 1024 dimensions: High enough to capture nuanced semantic relationships without excessive storage cost
- MTEB benchmark leader: Consistently ranks among the top embedding models for retrieval tasks
- Self-hosted: Runs locally via the Sentence Transformers library — no API calls, no per-token costs, no data leaving your infrastructure
- Batch processing: Can embed thousands of chunks in seconds using batch encoding
The model is ~1.3 GB and runs well on CPU. No GPU required for the embedding step.
At query time, the same embedding model converts the user's question into a vector. We then search the vector database for the K most similar content chunks (we use K=5). Qdrant handles this search efficiently using approximate nearest neighbor algorithms — even with millions of vectors, the search takes milliseconds.
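A minimal sketch of both sides, assuming the sentence-transformers and qdrant-client libraries; the host and collection name are placeholders:

# Sketch: batch-encode chunks and run a top-5 similarity search
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient

embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")
qdrant = QdrantClient(url="http://localhost:6333")  # placeholder host

# Ingestion side: chunk_texts is the list of chunk strings from the parsing step
chunk_vectors = embedder.encode(chunk_texts, batch_size=64, show_progress_bar=True)

# Query side: encode the question and search for the 5 nearest chunks
query_vector = embedder.encode("How do I get started with sourdough?")
hits = qdrant.search(
    collection_name="sitecore_content",   # placeholder collection name
    query_vector=query_vector.tolist(),
    limit=5,
    score_threshold=0.5,                  # discard weak matches
)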
Vector Storage: More Than Just Vectors
Qdrant isn't just storing vectors — it's storing vectors with payloads. Every chunk carries metadata alongside its embedding:
{
  "content": "The actual text of the chunk",
  "page_url": "https://www.example.com/recipes/sourdough-bread",
  "page_title": "Sourdough Bread Recipe",
  "component_type": "ContentBlock",
  "content_type": "article",
  "position_on_page": 3,
  "total_chunks_on_page": 12
}
This metadata is what makes the system useful beyond basic search:
- page_url: Enables the LLM to cite the exact source page — this is what drives users back to your Sitecore site
- page_title: Provides human-readable context for the LLM's citations
- component_type: Enables filtered search (e.g., "only search FAQ chunks") and helps the LLM understand what kind of content it's reading
- position_on_page: Enables page context expansion — fetching sibling chunks from the same page for more coherent answers
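Writing a chunk with its payload is a single upsert; the point below uses the metadata fields just described (chunk_vector and chunk_text stand for the output of the previous steps):

# Sketch: upsert one chunk with its metadata payload
import uuid
from qdrant_client.models import PointStruct

point = PointStruct(
    id=str(uuid.uuid4()),
    vector=chunk_vector.tolist(),   # 1024-dim embedding from the previous step
    payload={
        "content": chunk_text,
        "page_url": "https://www.example.com/recipes/sourdough-bread",
        "page_title": "Sourdough Bread Recipe",
        "component_type": "ContentBlock",
        "content_type": "article",
        "position_on_page": 3,
        "total_chunks_on_page": 12,
    },
)
qdrant.upsert(collection_name="sitecore_content", points=[point])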
Page Context Expansion
A powerful retrieval enhancement: after finding the top-K chunks, identify which pages they came from and fetch additional chunks from those same pages. This gives the LLM broader context from the most relevant pages, producing more coherent and comprehensive answers.
The intuition: if a chunk from page /recipes/sourdough-bread scored highest for the user's query, the other chunks on that page are probably relevant too. By fetching them, we give the LLM the full picture instead of an isolated paragraph.
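One way to implement this with qdrant-client is a filtered scroll on page_url for the pages behind the top hits (a sketch, not the author's exact code):

# Sketch: fetch sibling chunks from the pages behind the top hits
from qdrant_client.models import Filter, FieldCondition, MatchValue

def expand_page_context(qdrant, hits, per_page: int = 10):
    top_urls = {hit.payload["page_url"] for hit in hits}
    expanded = []
    for url in top_urls:
        points, _next = qdrant.scroll(
            collection_name="sitecore_content",
            scroll_filter=Filter(must=[
                FieldCondition(key="page_url", match=MatchValue(value=url))
            ]),
            limit=per_page,
        )
        expanded.extend(points)
    return expanded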
The Ingestion Pipeline: Putting It All Together
The complete ingestion flow is a seven-step orchestration:
1. Discover URLs → Fetch sitemap, apply exclusion filters
2. Fetch Content → Hit Layout Service for each URL (batch, rate-limited)
3. Parse Components → Recursive traversal, component-type routing
4. Chunk & Clean → Validate lengths, split oversized, strip HTML
5. Deduplicate → Hash-based content deduplication
6. Generate Embeddings → Batch encode with BGE-large (1024-dim vectors)
7. Upload to Qdrant → Upsert vectors + metadata payloads
In our production deployment processing ~800 Sitecore pages, this pipeline runs in about 10 minutes and produces ~3,000 unique content chunks. We trigger it every 48 hours via a Kubernetes CronJob, with the option to run it manually when content authors make significant updates.
Each step is a separate, testable class. The orchestrator simply calls them in sequence. This modularity means you can swap out any component — different embedding model, different vector database, different content source — without touching the rest.
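Sketched as a single orchestrator function (helper names echo the earlier sketches and are illustrative, not a prescribed module layout; upload_to_qdrant is a hypothetical wrapper around the upsert shown above):

# Sketch: the seven steps in sequence
import asyncio

def run_ingestion():
    urls = discover_urls()                                 # 1. sitemap + exclusion filters
    pages = asyncio.run(fetch_all_pages(urls))             # 2. Layout Service, batched
    chunks = []
    for url, layout in pages.items():                      # 3-4. parse, chunk & clean
        for text in parse_placeholders(layout["sitecore"]["route"]["placeholders"]):
            chunks.extend(split_chunk(text))
    chunks = deduplicate(chunks)                           # 5. hash-based dedup
    vectors = embedder.encode(chunks, batch_size=64)       # 6. BGE-large embeddings
    upload_to_qdrant(chunks, vectors)                      # 7. upsert vectors + payloads (hypothetical helper)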
The Art of LLM Prompting: Making the AI Do What You Want
This section is about the most underestimated part of the system: prompt engineering. The LLM is the public face of your chatbot. How you instruct it determines whether users get helpful, accurate, well-cited responses — or a confidently wrong hallucination.
The System Prompt: Setting the Boundaries
The system prompt is the most important piece of text in your entire application. It runs before every conversation and defines the AI's identity, capabilities, and constraints. Here's the philosophy behind ours:
Rule 1: Ground the AI in context only.
Answer ONLY based on the provided context — do not use outside knowledge.
If you don't have relevant information, say so and ask a clarifying question.
This is the anti-hallucination rule. Without it, the LLM will happily fill in gaps with plausible-sounding but invented information. With it, the AI admits when it doesn't know something — which is far more trustworthy.
Rule 2: Enforce exact URL citation.
When citing a source, copy its URL EXACTLY as shown in the context.
Format: <a href="EXACT_URL_FROM_CONTEXT">Learn more here</a>
URLs must be copied character-for-character.
This is how you drive users back to your Sitecore pages. The retrieved context includes URLs alongside each chunk. By instructing the LLM to copy them exactly, you ensure every link works. Without this rule, LLMs will sometimes "improve" URLs by guessing at the structure — and broken links destroy trust.
Rule 3: Keep responses concise.
Keep responses under 100 words.
Use short paragraphs (2-3 sentences max).
Chat interfaces aren't blog posts. Walls of text feel like being lectured. Short, direct answers with a link to "learn more" respect the user's time and encourage them to click through to your site — which is the whole point.
Rule 4: Drive conversation forward.
End with ONE brief follow-up question to continue the conversation.
Questions should be specific to the user's situation.
Follow-up questions serve two purposes: they keep the conversation going (higher engagement), and they help the system gather context for better retrieval on the next turn. "Are you starting a new sourdough starter, or do you already have one?" is better than "Do you have any other questions?" because it elicits information that improves the next retrieval.
Rule 5: Control formatting.
Use HTML tags for formatting: <strong>, <em>, <a href="">, <ul>, <li>
Do NOT use markdown syntax.
If your front-end renders HTML (which most chat widgets do), you need the LLM to output HTML — not markdown. Without this explicit instruction, models will inconsistently switch between formats.
The Context Injection Pattern
The most critical architectural decision is how you feed retrieved content to the LLM. We format it as a structured block within the user message:
Context from our website:
Source: Sourdough Bread Recipe
URL: https://www.example.com/recipes/sourdough-bread
Sourdough bread uses a natural starter instead of commercial yeast,
resulting in a tangy flavor, chewy crumb, and crispy crust.
Q: How long does it take to make sourdough?
A: The full process takes about 24 hours, including 8-12 hours of
bulk fermentation and an overnight cold proof.
---
User Question: How do I get started with sourdough?
The structure matters. Each source is clearly labeled with a title and URL. The LLM can see exactly where each piece of information came from and cite it accurately. The --- separators prevent the model from conflating content from different pages.
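A sketch of how that block gets assembled and sent to Claude; the model name is only an example, and SYSTEM_PROMPT stands for the rules described above:

# Sketch: build the context block and call the LLM
import anthropic

llm = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def build_context(hits) -> str:
    blocks = []
    for hit in hits:
        p = hit.payload
        blocks.append(f"Source: {p['page_title']}\nURL: {p['page_url']}\n\n{p['content']}")
    return "Context from our website:\n\n" + "\n\n---\n\n".join(blocks)

def generate_answer(question: str, hits, history: list[dict]) -> str:
    user_message = f"{build_context(hits)}\n\n---\n\nUser Question: {question}"
    response = llm.messages.create(
        model="claude-sonnet-4-20250514",   # example model id
        max_tokens=800,
        temperature=0.3,
        system=SYSTEM_PROMPT,               # the grounding/citation/format rules
        messages=history + [{"role": "user", "content": user_message}],
    )
    return response.content[0].text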
Query Rewriting: Solving the Follow-Up Problem
Multi-turn conversations create a retrieval challenge. When a user asks "What about the cold proof?" after discussing sourdough, the raw query is nearly useless for vector search — "cold proof" alone doesn't embed well without context.
The solution: use the LLM itself to rewrite follow-up questions into standalone queries before retrieval. We send the last few conversation turns to the LLM with instructions like:
Given the conversation above, rewrite this follow-up question
into a standalone question. Replace pronouns with specific terms.
Make vague questions specific. Return ONLY the rewritten question.
The result: "What about the cold proof?" becomes "How does cold proofing work for sourdough bread?" — a much better retrieval query.
Key implementation details:
- Use temperature 0.0 for query rewriting. You want deterministic, precise output — not creative interpretation.
- Only rewrite when there's history. The first message in a conversation doesn't need rewriting.
- Send minimal history (last 2 exchanges). More history means more tokens and more potential for the rewriter to get confused.
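Putting those details together, a rewriter might look like this (again a sketch; the instruction text mirrors the prompt shown above, and the model id is an example):

# Sketch: rewrite a follow-up into a standalone retrieval query
REWRITE_INSTRUCTIONS = (
    "Given the conversation above, rewrite this follow-up question into a "
    "standalone question. Replace pronouns with specific terms. Make vague "
    "questions specific. Return ONLY the rewritten question."
)

def rewrite_query(question: str, history: list[dict]) -> str:
    if not history:
        return question                      # first message needs no rewriting
    recent = history[-4:]                    # last 2 user/assistant exchanges
    response = llm.messages.create(
        model="claude-sonnet-4-20250514",    # example model id
        max_tokens=200,
        temperature=0.0,                     # deterministic rewriting
        messages=recent + [{
            "role": "user",
            "content": f"{REWRITE_INSTRUCTIONS}\n\nFollow-up question: {question}",
        }],
    )
    return response.content[0].text.strip()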
Temperature and Token Limits
Two LLM parameters that dramatically affect output quality:
Temperature (0.3): Controls randomness. At 0.0, the model always picks the most likely next token — deterministic but sometimes robotic. At 1.0, it samples broadly — creative but unpredictable. We use 0.3: low enough to stay grounded in the context, high enough to produce natural-sounding prose. For a RAG chatbot, you want the model to be reliable, not creative.
Max tokens (800): The hard ceiling on response length. This isn't just about cost — it's a UX decision. Chat responses should be 2-3 short paragraphs plus a follow-up question. 800 tokens (~600 words) is more than enough. Without a cap, the LLM will sometimes produce essay-length responses that bury the answer.
Conversation History Management
We include the last 10 messages (5 user/assistant pairs) in each LLM call. This gives the model enough context for coherent multi-turn conversation without blowing up the context window or adding excessive latency.
Messages are stored in Redis with a 24-hour TTL. This is intentionally ephemeral — chat conversations are transient by nature, and permanent storage creates GDPR/privacy concerns. If a user closes their browser and comes back within 24 hours, their conversation picks up where they left off.
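A minimal Redis-backed history store along these lines (the key naming is ours):

# Sketch: session history in Redis with a 24-hour TTL, trimmed to the last 10 messages
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
SESSION_TTL = 60 * 60 * 24  # 24 hours

def save_message(session_id: str, role: str, content: str) -> None:
    key = f"chat:{session_id}"
    r.rpush(key, json.dumps({"role": role, "content": content}))
    r.ltrim(key, -10, -1)        # keep only the last 10 messages
    r.expire(key, SESSION_TTL)   # refresh the 24-hour expiry on every turn

def get_history(session_id: str) -> list[dict]:
    return [json.loads(m) for m in r.lrange(f"chat:{session_id}", 0, -1)]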
Serving the Chat: API Design for Real-Time AI
The chatbot needs a real-time API that feels responsive. Modern users expect the "typing" effect — they want to see tokens appearing as the AI generates them, not wait 5 seconds for a complete response.
Streaming with Server-Sent Events (SSE)
We use SSE rather than WebSockets because:
- It's simpler (one-directional server → client)
- It works through standard HTTP (no upgrade negotiation)
- It's natively supported in all modern browsers via EventSource
- It's compatible with load balancers and CDNs without special configuration
The streaming flow sends events in a specific order:
- session — The session ID (so the client can maintain state)
- content — Response text chunks, sent as they're generated by the LLM
- sources — Source citations
- done — Completion signal
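A stripped-down FastAPI endpoint that emits those events in order (generate_llm_tokens is a hypothetical async generator wrapping the streaming LLM call):

# Sketch: SSE streaming from FastAPI
import json
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.post("/chat")
async def chat(payload: dict):
    async def event_stream():
        yield f"event: session\ndata: {json.dumps({'session_id': payload.get('session_id')})}\n\n"
        async for token in generate_llm_tokens(payload):   # hypothetical streaming generator
            yield f"event: content\ndata: {json.dumps({'text': token})}\n\n"
        yield f"event: sources\ndata: {json.dumps({'sources': []})}\n\n"   # fill with citations
        yield "event: done\ndata: {}\n\n"
    return StreamingResponse(event_stream(), media_type="text/event-stream")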
Rate Limiting
Two tiers of rate limiting protect the system:
- Per-session: 20 requests per minute (prevents individual abuse)
- Per-IP: 200 requests per hour (prevents coordinated abuse)
Both are implemented with Redis counters and sliding windows. Simple, fast, and stateless.
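A simple per-session counter in Redis captures the idea; this is a fixed-window sketch reusing the Redis client from the session store above (a true sliding window needs a little more bookkeeping):

# Sketch: per-session rate limit, 20 requests per 60-second window
def allow_request(session_id: str, limit: int = 20, window_s: int = 60) -> bool:
    key = f"ratelimit:session:{session_id}"
    count = r.incr(key)              # atomic increment
    if count == 1:
        r.expire(key, window_s)      # start the window on the first request
    return count <= limit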
The Full Request Lifecycle
When a user sends a message, the system orchestrates 9 steps:
1. Load session from Redis (or create a new one)
2. Get conversation history for multi-turn context
3. Rewrite the query using conversation context (if follow-up)
4. Embed the query into a vector
5. Search Qdrant for the top-K most similar chunks
6. Format context with source URLs for the LLM prompt
7. Generate response from the LLM with context + history
8. Save messages to Redis for conversation continuity
9. Return response with source citations
Steps 4-5 typically take <100ms. Step 7 (LLM generation) takes 1-3 seconds, which is why we stream it.
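Chained together, the lifecycle reads as follows (helper names echo the earlier sketches; error handling and streaming are omitted):

# Sketch: the full request lifecycle in one function
def handle_message(session_id: str, question: str) -> dict:
    history = get_history(session_id)                         # 1-2. session + history
    standalone = rewrite_query(question, history)             # 3. query rewriting
    query_vector = embedder.encode(standalone)                # 4. embed the query
    hits = qdrant.search(collection_name="sitecore_content",  # 5. vector search
                         query_vector=query_vector.tolist(),
                         limit=5, score_threshold=0.5)
    reply = generate_answer(standalone, hits, history)        # 6-7. context + LLM
    save_message(session_id, "user", question)                # 8. persist the turn
    save_message(session_id, "assistant", reply)
    return {"response": reply,                                # 9. response + citations
            "sources": sorted({h.payload["page_url"] for h in hits})}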
Keeping Users Engaged on Your Site
The real power of this architecture isn't just answering questions — it's keeping users on your site. This is what separates a RAG chatbot from a search bar.
Every Response is a Page View Opportunity
Every response includes links to real pages on your Sitecore site. When a user asks about a topic, the AI doesn't just answer — it directs them to the authoritative page. A single chatbot conversation can generate 3-5 page views as users click through citations to explore related content. Traditional search gives you one click. The chatbot gives you a journey.
Conversational Discovery
The multi-turn design lets users explore your content conversationally. Instead of bouncing after a single search result, they ask follow-up questions, go deeper into topics, and discover pages they would never have found through menu navigation. The query rewriting system ensures that even vague follow-ups ("what about that?") produce relevant results. The conversation feels natural — like talking to a knowledgeable guide rather than fighting a search engine.
Surfacing the Long Tail
Many Sitecore sites have hundreds of pages organized in deep hierarchies. A niche FAQ answer buried three levels deep in your information architecture might be exactly what a user needs — but they'll never find it through navigation. The chatbot acts as a semantic search layer that surfaces content based on meaning, not URL structure. Every page in your vector database is equally accessible, regardless of where it sits in the content tree.
Session Continuity
Redis-backed sessions let users return to their conversation within 24 hours. This is important for complex topics where a user might research, leave to discuss with colleagues, and come back with follow-up questions. The conversation picks up exactly where it left off, with full context intact.
Deployment Architecture
Infrastructure
The system runs on three main workloads:
- Deployment: FastAPI chat API (2+ replicas, horizontal scaling)
- StatefulSet: Qdrant vector database (persistent volume, single replica)
- StatefulSet: Redis (persistent volume, single replica)
Plus a CronJob that runs the ingestion pipeline every 48 hours.
Docker
Multi-stage Docker builds keep images small. The first stage installs dependencies with Poetry and exports a requirements.txt. The second stage copies only the installed packages and application code. The final image is ~500MB (mostly the embedding model weights).
Configuration
All settings are managed through environment variables, making the application 12-factor compliant. Key settings include:
- Sitecore connection details (sitemap URL, Layout Service URL, API key)
- Embedding model name and dimension
- Qdrant connection URL and collection name
- LLM provider, model, temperature, max tokens
- Retrieval parameters (top-K, score threshold)
- Redis connection and session TTL
- Rate limiting thresholds
Everything is configurable without code changes. Staging uses one set of environment variables. Production uses another. The code is identical.
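One common way to wire this up in a FastAPI project is pydantic-settings (an assumption here, not necessarily the author's library); each field maps to an environment variable:

# Sketch: typed settings loaded from environment variables
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    sitemap_url: str
    layout_service_url: str
    sitecore_api_key: str
    embedding_model: str = "BAAI/bge-large-en-v1.5"
    embedding_dim: int = 1024
    qdrant_url: str = "http://qdrant:6333"
    qdrant_collection: str = "sitecore_content"
    llm_model: str = "claude-sonnet-4-20250514"   # example model id
    llm_temperature: float = 0.3
    llm_max_tokens: int = 800
    retrieval_top_k: int = 5
    retrieval_score_threshold: float = 0.5
    redis_url: str = "redis://redis:6379/0"
    session_ttl_seconds: int = 86400
    rate_limit_per_session: int = 20
    rate_limit_per_ip: int = 200

settings = Settings()  # values come from the environment (or a .env file)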
Conclusion
You've now seen the full architecture of a production-grade RAG pipeline built on top of Sitecore:
- Sitemap discovery to find every indexable URL automatically
- Layout Service integration to get clean, structured content instead of scraping HTML
- Component-aware parsing that understands Sitecore's rendering model
- Intelligent chunking with sentence-boundary splitting and deduplication
- Semantic embeddings that capture meaning, not keywords
- Vector storage with rich metadata payloads for citation and filtering
- Semantic retrieval with page context expansion for coherent answers
- Prompt engineering that keeps the LLM grounded, cited, and conversational
- Multi-turn conversation with query rewriting and session continuity
- A streaming API that feels responsive and real-time
The architecture is intentionally composable. You can swap Claude for GPT or a local model. You can replace Qdrant with Pinecone or Weaviate. You can extend the component parser to handle your custom Sitecore renderings. The patterns are what matter — the specific technologies are interchangeable.
What makes this approach powerful for Sitecore teams is that it respects your existing content workflow. Content authors keep working in Sitecore. The AI keeps itself updated through periodic re-ingestion. No dual authoring, no content migration, no sync jobs to maintain. Your CMS becomes the single source of truth for both your website and your AI.
The hardest part isn't any individual component — it's getting the details right. The chunking strategy. The system prompt. The query rewriting. The score threshold. These are the decisions that separate a demo from a product. I hope this guide gives you a head start.
Build it. Ship it. Let your content do the talking.
Have questions about this implementation? Email me at fcostoyaprograms@gmail.com.