Building a RAG Chatbot for Your Obsidian Vault: Ask Questions, Get Answers with Sources

We had 1,000+ notes indexed in Qdrant. Semantic search was working beautifully. Notes were auto-linked based on similarity. But there was still friction: to find information, you had to know what to search for.

What if you could just ask?

“What did that video say about RAG architectures?” “Which notes mention LangGraph agents?” “Summarize what I know about trading psychology.”

This is the promise of RAG (Retrieval-Augmented Generation): combine vector search with LLM generation to answer questions grounded in your actual content.

We built it. It works. Here’s how.

Figure 1 - Chat interface showing a user question, AI-generated answer with inline citations, and a sources panel listing the relevant notes used to generate the response


What RAG Actually Is#

RAG isn’t magic - it’s a straightforward pipeline:

Figure 2 - RAG pipeline diagram showing four steps: Embed the question, Retrieve similar chunks from Qdrant, Augment prompt with context, Generate answer with Claude Haiku

The key insight: because the LLM is grounded in your actual notes, it has very little room to hallucinate. No making things up - only synthesizing what's already in your vault.


Architecture: From Notes to Chat#

Building on the semantic note network we already had, we added three new components:

1. Note Chunker#

Full notes are too long for effective retrieval. A 10,000-character note would dominate search results even if only one paragraph was relevant.

Solution: split notes into overlapping chunks.

class NoteChunker:
    """Splits notes into overlapping chunks for embedding."""

    DEFAULT_CHUNK_SIZE = 2000  # chars (~500 tokens)
    DEFAULT_OVERLAP = 400      # chars (~100 tokens)

    def chunk_note(self, content: str) -> list[str]:
        frontmatter, body = self._extract_frontmatter(content)
        context_prefix = self._build_context_prefix(frontmatter)
        paragraphs = self._split_into_paragraphs(body)
        return self._combine_into_chunks(paragraphs, context_prefix)

The first chunk includes frontmatter context (title, description, tags) so the embedding captures the note’s overall topic.

For 1,051 notes, this produced approximately 5,000 chunks - an average of 4.8 chunks per note.
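A quick usage sketch (the note path here is illustrative, not a real file from the vault):

from pathlib import Path

# Hypothetical usage of the chunker above; the path is illustrative only.
chunker = NoteChunker()
content = Path("Notes/AI/Building RAG Systems.md").read_text(encoding="utf-8")
chunks = chunker.chunk_note(content)
print(f"{len(chunks)} chunks, avg {sum(len(c) for c in chunks) / len(chunks):.0f} chars")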

2. Chunk Embeddings#

A new Qdrant collection stores chunk embeddings:

# Collection: obsidian_chat_chunks
# Vectors: 1536 dimensions (text-embedding-3-small)
# Payload: file_path, title, chunk_index, chunk_text

This is separate from our note-level collection. Note-level embeddings are optimized for similarity between notes. Chunk-level embeddings are optimized for finding specific passages.
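Creating that collection is a one-time setup step. A minimal sketch with qdrant-client, assuming a local Qdrant instance and cosine distance (the distance metric is an assumption, not something stated above):

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

# One-time setup for the chunk collection; cosine distance is an assumption.
client = QdrantClient(url="http://localhost:6333")
client.create_collection(
    collection_name="obsidian_chat_chunks",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)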

3. Chat Service#

The orchestrator that ties everything together:

class ObsidianChatService:
    """RAG-based chat service for Obsidian vault."""

    DEFAULT_SEARCH_LIMIT = 5
    DEFAULT_SCORE_THRESHOLD = 0.50
    MAX_CONTEXT_CHARS = 8000

    async def chat(self, question: str) -> ChatResponse:
        # Step 1: Search for relevant chunks
        sources = self.search(
            query=question,
            limit=self.DEFAULT_SEARCH_LIMIT,
            threshold=self.DEFAULT_SCORE_THRESHOLD,
        )
        # Step 2: Build context from sources
        context = self._build_context(sources)
        # Step 3: Generate response with Haiku
        answer = await self._generate_response(question, context)
        return ChatResponse(
            answer=answer,
            sources=sources,
            success=True,
        )
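ChatSource and ChatResponse aren't shown above; here is a plausible shape for them, inferred from the fields used throughout this article (treat the exact fields as an assumption):

from dataclasses import dataclass, field

@dataclass
class ChatSource:
    # Fields inferred from the chunk payload and the UI described below.
    file_path: str
    title: str
    chunk_index: int
    chunk_text: str
    score: float

@dataclass
class ChatResponse:
    answer: str
    sources: list[ChatSource] = field(default_factory=list)
    success: bool = True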

The Chunking Strategy#

Chunking is where most RAG implementations go wrong. Chunk too large, and you miss specific passages. Chunk too small, and you lose context.

We chose 2,000 characters (~500 tokens) with 400-character overlap. Here’s why:

| Chunk Size | Pros | Cons |
| --- | --- | --- |
| 500 chars | Very specific retrieval | Loses context, many chunks |
| 2,000 chars | Good context, reasonable count | Slightly less precise |
| 5,000 chars | Full context | May retrieve irrelevant content |

The overlap means that if important information spans a chunk boundary, at least one chunk still contains it in full - as long as the passage is shorter than the overlap.

Paragraph-Aware Splitting#

We don’t just cut at character boundaries - we split at paragraph breaks:

import re  # module-level import used below

def _split_into_paragraphs(self, text: str) -> list[str]:
    """Split on double newlines while preserving structure."""
    text = text.replace("\r\n", "\n")
    paragraphs = re.split(r"\n\n+", text)
    return [p.strip() for p in paragraphs if p.strip()]

This produces chunks that end at natural boundaries, making them more coherent for both retrieval and generation.
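The combine step itself isn't shown above. One way it could work, packing paragraphs up to the chunk size and carrying the tail of each chunk forward as overlap (an illustrative sketch inside NoteChunker, not the exact implementation):

def _combine_into_chunks(self, paragraphs: list[str], context_prefix: str) -> list[str]:
    """Illustrative sketch: greedy packing with character-level overlap."""
    chunks: list[str] = []
    current = context_prefix  # only the first chunk starts with the frontmatter prefix
    for para in paragraphs:
        if current and len(current) + len(para) + 1 > self.DEFAULT_CHUNK_SIZE:
            chunks.append(current.strip())
            current = current[-self.DEFAULT_OVERLAP:]  # carry overlap into the next chunk
        current = f"{current}\n{para}" if current else para
    if current.strip():
        chunks.append(current.strip())
    return chunks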

Context-Rich First Chunks#

The first chunk of each note includes frontmatter metadata:

Title: Building RAG Systems | Description: A guide to retrieval-augmented
generation | Tags: ai/rag, ai/llm, coding/python
RAG combines the power of large language models with external knowledge
retrieval. Instead of relying solely on the model's training data...

This ensures the embedding captures the note’s overall topic, not just the opening paragraph.
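The prefix is essentially a join over the frontmatter fields. A possible implementation (the exact field handling is an assumption):

def _build_context_prefix(self, frontmatter: dict) -> str:
    """Assumed implementation: fold title, description, and tags into one line."""
    parts = []
    if frontmatter.get("title"):
        parts.append(f"Title: {frontmatter['title']}")
    if frontmatter.get("description"):
        parts.append(f"Description: {frontmatter['description']}")
    if frontmatter.get("tags"):
        parts.append("Tags: " + ", ".join(frontmatter["tags"]))
    return " | ".join(parts)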


Retrieval: Finding What Matters#

Search is straightforward - embed the question and find similar chunks:

def search(
    self,
    query: str,
    limit: int = 5,
    threshold: float = 0.50,
) -> list[ChatSource]:
    """Search for relevant chunks."""
    sources = self.embeddings.search_by_text(
        query_text=query,
        limit=limit,
        score_threshold=threshold,
    )
    # Deduplicate by file_path, keeping highest score per note
    unique_sources = {}
    for source in sources:
        if source.file_path not in unique_sources:
            unique_sources[source.file_path] = source
        elif source.score > unique_sources[source.file_path].score:
            unique_sources[source.file_path] = source
    return sorted(unique_sources.values(), key=lambda x: x.score, reverse=True)

The Deduplication Problem#

Without deduplication, a single very relevant note might return multiple chunks, drowning out other perspectives. We keep only the highest-scoring chunk per note, ensuring diverse sources.

Threshold Tuning#

| Threshold | Results |
| --- | --- |
| 0.70+ | Very few matches, misses relevant content |
| 0.60 | Conservative, high precision |
| 0.50 | Balanced - catches more relevant content |
| 0.40 | More results, some noise |

We use 0.50 for chat (lower than our 0.70 for note linking) because:

  • Chat benefits from more context
  • The LLM can filter out less relevant passages
  • Users can click sources to verify

Generation: The Prompt That Works#

The system prompt sets the tone:

system_prompt = """You are a helpful assistant that answers questions
about the user's Obsidian vault notes.
Use the provided context from their notes to answer questions accurately.
If the context doesn't contain relevant information, say so honestly.
Keep answers concise but informative.
When referencing information, mention which source note it came from."""

The key instructions:

  1. Use the context - Don’t make things up
  2. Be honest - Say when you don’t know
  3. Cite sources - Attribution builds trust

The user prompt includes the retrieved context:

user_message = f"""Context from your notes:
[Source 1: Building RAG Systems]
RAG combines the power of large language models with...
[Source 2: Vector Database Comparison]
For semantic search, Qdrant provides...
Question: {question}
Please answer based on the context above."""
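Neither _build_context nor _generate_response is shown here. A sketch of both as methods of ObsidianChatService, using the Anthropic Python SDK for the Haiku call (the model alias, max_tokens, and truncation behavior are assumptions):

from anthropic import AsyncAnthropic

def _build_context(self, sources: list[ChatSource]) -> str:
    # Assumed implementation: label each chunk and cap the total at MAX_CONTEXT_CHARS.
    blocks = [
        f"[Source {i}: {source.title}]\n{source.chunk_text}"
        for i, source in enumerate(sources, start=1)
    ]
    return "\n\n".join(blocks)[: self.MAX_CONTEXT_CHARS]

async def _generate_response(self, question: str, context: str) -> str:
    client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment
    response = await client.messages.create(
        model="claude-3-5-haiku-latest",  # assumed model alias
        max_tokens=1024,
        system=system_prompt,
        messages=[{
            "role": "user",
            "content": (
                f"Context from your notes:\n{context}\n\n"
                f"Question: {question}\n\n"
                "Please answer based on the context above."
            ),
        }],
    )
    return response.content[0].text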

The Frontend Experience#

The chat interface has three main areas:

Figure 3 - Chat dashboard layout mockup showing chat panel on left with conversation history, note preview panel on right showing selected source, and sources list with similarity scores

Source Attribution#

Every answer includes a collapsible list of sources. Click any source to preview the full note in the right panel.

function SourceList({ sources, onSourceClick }) {
  return (
    <Collapsible>
      <CollapsibleTrigger>
        Sources ({sources.length})
      </CollapsibleTrigger>
      <CollapsibleContent>
        {sources.map((source) => (
          <button
            key={source.file_path}
            onClick={() => onSourceClick(source)}
            className="text-left hover:bg-muted"
          >
            {source.title || source.file_name}
            <span className="text-muted">
              ({Math.round(source.score * 100)}%)
            </span>
          </button>
        ))}
      </CollapsibleContent>
    </Collapsible>
  );
}

The percentage shows the similarity score - users quickly learn that 80%+ means highly relevant, while 50-60% might be tangential.

Open in Obsidian#

The killer feature: click “Open in Obsidian” to jump directly to the source note:

function getObsidianUri(filePath: string, vaultName: string): string {
  const pathParts = filePath.replace(/\\/g, '/').split('/');
  const fileName = pathParts[pathParts.length - 1].replace(/\.md$/, '');
  const subFolder = pathParts[pathParts.length - 2] || '';
  const fileRef = subFolder ? `${subFolder}/${fileName}` : fileName;
  return `obsidian://open?vault=${encodeURIComponent(vaultName)}&file=${encodeURIComponent(fileRef)}`;
}

Clicking it opens Obsidian and navigates directly to the note. The chatbot becomes a discovery tool, not a replacement for your actual notes.


Auto-Indexing: Keeping the Index Fresh#

New notes should be searchable immediately. We hook into the save process:

@app.post("/api/save-note")
async def save_note(note_data: dict):
    # Save the file
    file_path = save_note_to_disk(note_data)
    # Index for similarity search
    index_note_for_similarity(file_path)
    # Index for chat
    chat_service = ObsidianChatService()
    chat_service.index_note(file_path)
    return {"success": True, "file_path": str(file_path)}

No manual re-indexing required. Ask a question about a video you just processed, and it’s already in the index.
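Under the hood, index_note does the same work as the bulk script below, but for a single file: chunk, embed, upsert. A sketch using the OpenAI and Qdrant clients (IDs, client wiring, and error handling simplified; treat the details as assumptions):

import uuid
from pathlib import Path

from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

def index_note(self, file_path: Path) -> int:
    """Chunk one note, embed each chunk, and upsert into Qdrant."""
    content = file_path.read_text(encoding="utf-8")
    chunks = NoteChunker().chunk_note(content)

    # Embed all chunks in a single request
    embeddings = OpenAI().embeddings.create(
        model="text-embedding-3-small",
        input=chunks,
    )

    points = [
        PointStruct(
            id=str(uuid.uuid4()),
            vector=item.embedding,
            payload={
                "file_path": str(file_path),
                "title": file_path.stem,
                "chunk_index": i,
                "chunk_text": chunk,
            },
        )
        for i, (chunk, item) in enumerate(zip(chunks, embeddings.data))
    ]
    QdrantClient(url="http://localhost:6333").upsert(
        collection_name="obsidian_chat_chunks",
        points=points,
    )
    return len(points)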


Bulk Indexing: The Initial Setup#

For existing vaults, we provide a bulk indexing script:

def main():
    service = ObsidianChatService()
    stats = service.index_all_notes()
    print(f"Indexed {stats['indexed_count']} notes")
    print(f"Total chunks: {stats['total_chunks']}")
    print(f"Errors: {len(stats['errors'])}")

if __name__ == "__main__":
    main()

Running it on our vault:

python scripts/index_vault_for_chat.py --verbose
Indexing vault: C:\Users\...\Obsidian\Vault\Notes
Found 1051 markdown files
...
Indexed 1051 notes
Total chunks: 5047
Errors: 0
Time elapsed: 8m 32s

~5,000 chunks indexed in under 9 minutes. That’s the entire vault ready for questions.


Performance Characteristics#

| Operation | Time | Notes |
| --- | --- | --- |
| Embed question | ~100ms | OpenAI API call |
| Search Qdrant | ~10ms | 5,000 vectors, local |
| Generate answer | ~2s | Haiku is fast |
| Total chat latency | ~2.5s | Feels instant |
| Index one note | ~500ms | Embed + upsert |
| Index full vault | ~9 minutes | 1,051 notes |

The bottleneck is LLM generation, not retrieval. Qdrant returns results in roughly 10ms across 5,000 vectors, so the experience feels snappy.


The Complete Stack#

Everything working together:

Figure 4 - Complete chat architecture diagram showing the full flow from user question through FastAPI, OpenAI embedding, Qdrant search, context building, Claude Haiku generation, and response display with source attribution


What Makes This Work#

1. Proper Chunking#

2,000-character chunks with overlap strike the balance between context and precision. Paragraph-aware splitting produces coherent text.

2. Frontmatter Context#

First chunks include title, description, and tags. Embeddings capture what the note is about, not just what it says.

3. Deduplication#

One chunk per note in results prevents a single document from dominating. Users see diverse perspectives.

4. Low Threshold#

A 0.50 threshold retrieves more content than the stricter 0.70 we use for note linking. The LLM filters noise; users verify via sources.

5. Source Attribution#

Every answer comes with clickable sources. Trust is built through transparency, not black-box responses.

6. Obsidian Integration#

“Open in Obsidian” completes the loop. The chatbot finds information; Obsidian lets you act on it.


Sample Interactions#

User: “What techniques help with trading discipline?”

Assistant: Based on your notes, several techniques can help with trading discipline:

  1. Pre-trade checklists - From “Trading Psychology Fundamentals”: Having a written checklist before each trade forces you to verify your setup meets all criteria.

  2. Journal reviews - “My Trading Journal Process” describes reviewing each trade weekly to identify emotional patterns.

  3. Position sizing rules - “Risk Management 101” emphasizes that fixed position sizing removes emotional decision-making.

Sources: Trading Psychology Fundamentals (78%), My Trading Journal Process (72%), Risk Management 101 (68%)


User: “How does LangGraph differ from CrewAI?”

Assistant: According to your notes, the key differences are:

LangGraph (from “LangGraph Architecture Deep Dive”):

  • Graph-based workflow definition
  • Explicit state management
  • Fine-grained control over agent transitions

CrewAI (from “CrewAI Framework Overview”):

  • Role-based agent definition
  • Automatic task delegation
  • Simpler setup for common patterns

Your note “Comparing Agent Frameworks” summarizes: “LangGraph when you need control, CrewAI when you need speed.”

Sources: LangGraph Architecture Deep Dive (82%), CrewAI Framework Overview (79%), Comparing Agent Frameworks (71%)


Key Takeaways#

  1. Chunking is crucial. 2,000 chars with 400 overlap works well. Include frontmatter in first chunk.

  2. Lower your threshold. 0.50 retrieves more than note-linking’s 0.70. The LLM handles noise.

  3. Deduplicate results. One chunk per note prevents monopolization.

  4. Show sources. Attribution builds trust. Let users verify.

  5. Complete the loop. “Open in Obsidian” makes the chatbot useful, not just interesting.

  6. Auto-index on save. Manual re-indexing kills the user experience.


The Full Journey#

This chatbot is the culmination of our knowledge management system:

  1. YouTube transcript extraction
  2. AI-powered note generation
  3. Semantic indexing in Qdrant
  4. Auto-linking related notes
  5. Tag curation and taxonomy
  6. Batch processing with Anthropic API
  7. Modern React frontend
  8. And now: RAG-powered chat

Each phase built on the previous. The semantic index that powers note similarity also powers chunk retrieval. The frontmatter that organizes tags also enriches embeddings. The infrastructure that serves the UI also serves the chatbot.

This is what compound progress looks like.



This article is part of our series on building AI-powered knowledge management tools. Written with assistance from Claude Code.
