Building a RAG Chatbot for Your Obsidian Vault: Ask Questions, Get Answers with Sources

We had 1,000+ notes indexed in Qdrant. Semantic search was working beautifully. Notes were auto-linked based on similarity. But there was still friction: to find information, you had to know what to search for.

What if you could just ask?

“What did that video say about RAG architectures?” “Which notes mention LangGraph agents?” “Summarize what I know about trading psychology.”

This is the promise of RAG (Retrieval-Augmented Generation): combine vector search with LLM generation to answer questions grounded in your actual content.

We built it. It works. Here’s how.

Figure 1 - Chat interface showing a user question, AI-generated answer with inline citations, and a sources panel listing the relevant notes used to generate the response


What RAG Actually Is#

RAG isn’t magic - it’s a straightforward pipeline:

Figure 2 - RAG pipeline diagram showing four steps: Embed the question, Retrieve similar chunks from Qdrant, Augment prompt with context, Generate answer with Claude Haiku

The key insight: because the LLM is grounded in your actual notes, it has very little room to hallucinate. No making things up - only synthesizing what's already in your vault.


Architecture: From Notes to Chat#

Building on the semantic note network we already had, we added three new components:

1. Note Chunker#

Full notes are too long for effective retrieval. A 10,000-character note would dominate search results even if only one paragraph was relevant.

Solution: split notes into overlapping chunks.

class NoteChunker:
    """Splits notes into overlapping chunks for embedding."""

    DEFAULT_CHUNK_SIZE = 2000  # chars (~500 tokens)
    DEFAULT_OVERLAP = 400      # chars (~100 tokens)

    def chunk_note(self, content: str) -> list[str]:
        frontmatter, body = self._extract_frontmatter(content)
        context_prefix = self._build_context_prefix(frontmatter)
        paragraphs = self._split_into_paragraphs(body)
        return self._combine_into_chunks(paragraphs, context_prefix)

The first chunk includes frontmatter context (title, description, tags) so the embedding captures the note’s overall topic.

For 1,051 notes, this produced approximately 5,000 chunks - an average of 4.8 chunks per note.
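A quick usage sketch (the note path here is illustrative, not a real file from the vault):

from pathlib import Path

# Hypothetical usage of the chunker above; the path is illustrative only.
chunker = NoteChunker()
content = Path("Notes/AI/Building RAG Systems.md").read_text(encoding="utf-8")
chunks = chunker.chunk_note(content)
print(f"{len(chunks)} chunks, avg {sum(len(c) for c in chunks) / len(chunks):.0f} chars")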

2. Chunk Embeddings#

A new Qdrant collection stores chunk embeddings:

# Collection: obsidian_chat_chunks
# Vectors: 1536 dimensions (text-embedding-3-small)
# Payload: file_path, title, chunk_index, chunk_text

This is separate from our note-level collection. Note-level embeddings are optimized for similarity between notes. Chunk-level embeddings are optimized for finding specific passages.
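Creating that collection is a one-time setup step. A minimal sketch with qdrant-client, assuming a local Qdrant instance and cosine distance (the distance metric is an assumption, not something stated above):

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

# One-time setup for the chunk collection; cosine distance is an assumption.
client = QdrantClient(url="http://localhost:6333")
client.create_collection(
    collection_name="obsidian_chat_chunks",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)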

3. Chat Service#

The orchestrator that ties everything together:

class ObsidianChatService:
    """RAG-based chat service for Obsidian vault."""

    DEFAULT_SEARCH_LIMIT = 5
    DEFAULT_SCORE_THRESHOLD = 0.50
    MAX_CONTEXT_CHARS = 8000

    async def chat(self, question: str) -> ChatResponse:
        # Step 1: Search for relevant chunks
        sources = self.search(
            query=question,
            limit=self.DEFAULT_SEARCH_LIMIT,
            threshold=self.DEFAULT_SCORE_THRESHOLD,
        )
        # Step 2: Build context from sources
        context = self._build_context(sources)
        # Step 3: Generate response with Haiku
        answer = await self._generate_response(question, context)
        return ChatResponse(
            answer=answer,
            sources=sources,
            success=True,
        )
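ChatSource and ChatResponse aren't shown above; here is a plausible shape for them, inferred from the fields used throughout this article (treat the exact fields as an assumption):

from dataclasses import dataclass, field

@dataclass
class ChatSource:
    # Fields inferred from the chunk payload and the UI described below.
    file_path: str
    title: str
    chunk_index: int
    chunk_text: str
    score: float

@dataclass
class ChatResponse:
    answer: str
    sources: list[ChatSource] = field(default_factory=list)
    success: bool = True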

The Chunking Strategy#

Chunking is where most RAG implementations go wrong. Chunk too large, and you miss specific passages. Chunk too small, and you lose context.

We chose 2,000 characters (~500 tokens) with 400-character overlap. Here’s why:

| Chunk Size | Pros | Cons |
| --- | --- | --- |
| 500 chars | Very specific retrieval | Loses context, many chunks |
| 2,000 chars | Good context, reasonable count | Slightly less precise |
| 5,000 chars | Full context | May retrieve irrelevant content |

The overlap means that if important information spans a chunk boundary, at least one chunk still contains it in full - as long as the passage is shorter than the overlap.

Paragraph-Aware Splitting#

We don’t just cut at character boundaries - we split at paragraph breaks:

import re  # module-level import used below

def _split_into_paragraphs(self, text: str) -> list[str]:
    """Split on double newlines while preserving structure."""
    text = text.replace("\r\n", "\n")
    paragraphs = re.split(r"\n\n+", text)
    return [p.strip() for p in paragraphs if p.strip()]

This produces chunks that end at natural boundaries, making them more coherent for both retrieval and generation.
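The combine step itself isn't shown above. One way it could work, packing paragraphs up to the chunk size and carrying the tail of each chunk forward as overlap (an illustrative sketch inside NoteChunker, not the exact implementation):

def _combine_into_chunks(self, paragraphs: list[str], context_prefix: str) -> list[str]:
    """Illustrative sketch: greedy packing with character-level overlap."""
    chunks: list[str] = []
    current = context_prefix  # only the first chunk starts with the frontmatter prefix
    for para in paragraphs:
        if current and len(current) + len(para) + 1 > self.DEFAULT_CHUNK_SIZE:
            chunks.append(current.strip())
            current = current[-self.DEFAULT_OVERLAP:]  # carry overlap into the next chunk
        current = f"{current}\n{para}" if current else para
    if current.strip():
        chunks.append(current.strip())
    return chunks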

Context-Rich First Chunks#

The first chunk of each note includes frontmatter metadata:

Title: Building RAG Systems | Description: A guide to retrieval-augmented
generation | Tags: ai/rag, ai/llm, coding/python
RAG combines the power of large language models with external knowledge
retrieval. Instead of relying solely on the model's training data...

This ensures the embedding captures the note’s overall topic, not just the opening paragraph.
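The prefix is essentially a join over the frontmatter fields. A possible implementation (the exact field handling is an assumption):

def _build_context_prefix(self, frontmatter: dict) -> str:
    """Assumed implementation: fold title, description, and tags into one line."""
    parts = []
    if frontmatter.get("title"):
        parts.append(f"Title: {frontmatter['title']}")
    if frontmatter.get("description"):
        parts.append(f"Description: {frontmatter['description']}")
    if frontmatter.get("tags"):
        parts.append("Tags: " + ", ".join(frontmatter["tags"]))
    return " | ".join(parts)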


Retrieval: Finding What Matters#

Search is straightforward - embed the question and find similar chunks:

def search(
    self,
    query: str,
    limit: int = 5,
    threshold: float = 0.50,
) -> list[ChatSource]:
    """Search for relevant chunks."""
    sources = self.embeddings.search_by_text(
        query_text=query,
        limit=limit,
        score_threshold=threshold,
    )
    # Deduplicate by file_path, keeping highest score per note
    unique_sources = {}
    for source in sources:
        if source.file_path not in unique_sources:
            unique_sources[source.file_path] = source
        elif source.score > unique_sources[source.file_path].score:
            unique_sources[source.file_path] = source
    return sorted(unique_sources.values(), key=lambda x: x.score, reverse=True)

The Deduplication Problem#

Without deduplication, a single very relevant note might return multiple chunks, drowning out other perspectives. We keep only the highest-scoring chunk per note, ensuring diverse sources.

Threshold Tuning#

| Threshold | Results |
| --- | --- |
| 0.70+ | Very few matches, misses relevant content |
| 0.60 | Conservative, high precision |
| 0.50 | Balanced - catches more relevant content |
| 0.40 | More results, some noise |

We use 0.50 for chat (lower than our 0.70 for note linking) because:

  • Chat benefits from more context
  • The LLM can filter out less relevant passages
  • Users can click sources to verify

Generation: The Prompt That Works#

The system prompt sets the tone:

system_prompt = """You are a helpful assistant that answers questions
about the user's Obsidian vault notes.
Use the provided context from their notes to answer questions accurately.
If the context doesn't contain relevant information, say so honestly.
Keep answers concise but informative.
When referencing information, mention which source note it came from."""

The key instructions:

  1. Use the context - Don’t make things up
  2. Be honest - Say when you don’t know
  3. Cite sources - Attribution builds trust

The user prompt includes the retrieved context:

user_message = f"""Context from your notes:
[Source 1: Building RAG Systems]
RAG combines the power of large language models with...
[Source 2: Vector Database Comparison]
For semantic search, Qdrant provides...
Question: {question}
Please answer based on the context above."""
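Neither _build_context nor _generate_response is shown here. A sketch of both as methods of ObsidianChatService, using the Anthropic Python SDK for the Haiku call (the model alias, max_tokens, and truncation behavior are assumptions):

from anthropic import AsyncAnthropic

def _build_context(self, sources: list[ChatSource]) -> str:
    # Assumed implementation: label each chunk and cap the total at MAX_CONTEXT_CHARS.
    blocks = [
        f"[Source {i}: {source.title}]\n{source.chunk_text}"
        for i, source in enumerate(sources, start=1)
    ]
    return "\n\n".join(blocks)[: self.MAX_CONTEXT_CHARS]

async def _generate_response(self, question: str, context: str) -> str:
    client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment
    response = await client.messages.create(
        model="claude-3-5-haiku-latest",  # assumed model alias
        max_tokens=1024,
        system=system_prompt,
        messages=[{
            "role": "user",
            "content": (
                f"Context from your notes:\n{context}\n\n"
                f"Question: {question}\n\n"
                "Please answer based on the context above."
            ),
        }],
    )
    return response.content[0].text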

The Frontend Experience#

The chat interface has three main areas:

Figure 3 - Chat dashboard layout mockup showing chat panel on left with conversation history, note preview panel on right showing selected source, and sources list with similarity scores

Source Attribution#

Every answer includes a collapsible list of sources. Click any source to preview the full note in the right panel.

function SourceList({ sources, onSourceClick }) {
  return (
    <Collapsible>
      <CollapsibleTrigger>
        Sources ({sources.length})
      </CollapsibleTrigger>
      <CollapsibleContent>
        {sources.map((source) => (
          <button
            key={source.file_path}
            onClick={() => onSourceClick(source)}
            className="text-left hover:bg-muted"
          >
            {source.title || source.file_name}
            <span className="text-muted">
              ({Math.round(source.score * 100)}%)
            </span>
          </button>
        ))}
      </CollapsibleContent>
    </Collapsible>
  );
}

The percentage shows the similarity score - users quickly learn that 80%+ means highly relevant, while 50-60% might be tangential.

Open in Obsidian#

The killer feature: click “Open in Obsidian” to jump directly to the source note:

function getObsidianUri(filePath: string, vaultName: string): string {
  const pathParts = filePath.replace(/\\/g, '/').split('/');
  const fileName = pathParts[pathParts.length - 1].replace(/\.md$/, '');
  const subFolder = pathParts[pathParts.length - 2] || '';
  const fileRef = subFolder ? `${subFolder}/${fileName}` : fileName;
  return `obsidian://open?vault=${encodeURIComponent(vaultName)}&file=${encodeURIComponent(fileRef)}`;
}

Clicking it opens Obsidian and navigates directly to the note. The chatbot becomes a discovery tool, not a replacement for your actual notes.


Auto-Indexing: Keeping the Index Fresh#

New notes should be searchable immediately. We hook into the save process:

@app.post("/api/save-note")
async def save_note(note_data: dict):
    # Save the file
    file_path = save_note_to_disk(note_data)
    # Index for similarity search
    index_note_for_similarity(file_path)
    # Index for chat
    chat_service = ObsidianChatService()
    chat_service.index_note(file_path)
    return {"success": True, "file_path": str(file_path)}

No manual re-indexing required. Ask a question about a video you just processed, and it’s already in the index.
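Under the hood, index_note does the same work as the bulk script below, but for a single file: chunk, embed, upsert. A sketch using the OpenAI and Qdrant clients (IDs, client wiring, and error handling simplified; treat the details as assumptions):

import uuid
from pathlib import Path

from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

def index_note(self, file_path: Path) -> int:
    """Chunk one note, embed each chunk, and upsert into Qdrant."""
    content = file_path.read_text(encoding="utf-8")
    chunks = NoteChunker().chunk_note(content)

    # Embed all chunks in a single request
    embeddings = OpenAI().embeddings.create(
        model="text-embedding-3-small",
        input=chunks,
    )

    points = [
        PointStruct(
            id=str(uuid.uuid4()),
            vector=item.embedding,
            payload={
                "file_path": str(file_path),
                "title": file_path.stem,
                "chunk_index": i,
                "chunk_text": chunk,
            },
        )
        for i, (chunk, item) in enumerate(zip(chunks, embeddings.data))
    ]
    QdrantClient(url="http://localhost:6333").upsert(
        collection_name="obsidian_chat_chunks",
        points=points,
    )
    return len(points)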


Bulk Indexing: The Initial Setup#

For existing vaults, we provide a bulk indexing script:

def main():
    service = ObsidianChatService()
    stats = service.index_all_notes()
    print(f"Indexed {stats['indexed_count']} notes")
    print(f"Total chunks: {stats['total_chunks']}")
    print(f"Errors: {len(stats['errors'])}")

if __name__ == "__main__":
    main()

Running it on our vault:

python scripts/index_vault_for_chat.py --verbose
Indexing vault: C:\Users\...\Obsidian\Vault\Notes
Found 1051 markdown files
...
Indexed 1051 notes
Total chunks: 5047
Errors: 0
Time elapsed: 8m 32s

~5,000 chunks indexed in under 9 minutes. That’s the entire vault ready for questions.


Performance Characteristics#

| Operation | Time | Notes |
| --- | --- | --- |
| Embed question | ~100ms | OpenAI API call |
| Search Qdrant | ~10ms | 5,000 vectors, local |
| Generate answer | ~2s | Haiku is fast |
| Total chat latency | ~2.5s | Feels instant |
| Index one note | ~500ms | Embed + upsert |
| Index full vault | ~9 minutes | 1,051 notes |

The bottleneck is LLM generation, not retrieval. Qdrant returns results in roughly 10ms across 5,000 vectors, so the experience feels snappy.


The Complete Stack#

Everything working together:

Figure 4 - Complete chat architecture diagram showing the full flow from user question through FastAPI, OpenAI embedding, Qdrant search, context building, Claude Haiku generation, and response display with source attribution


What Makes This Work#

1. Proper Chunking#

2,000-character chunks with overlap strike the balance between context and precision. Paragraph-aware splitting produces coherent text.

2. Frontmatter Context#

First chunks include title, description, and tags. Embeddings capture what the note is about, not just what it says.

3. Deduplication#

One chunk per note in results prevents a single document from dominating. Users see diverse perspectives.

4. Low Threshold#

A 0.50 threshold retrieves more content than the stricter 0.70 we use for note linking. The LLM filters noise; users verify via sources.

5. Source Attribution#

Every answer comes with clickable sources. Trust is built through transparency, not black-box responses.

6. Obsidian Integration#

“Open in Obsidian” completes the loop. The chatbot finds information; Obsidian lets you act on it.


Sample Interactions#

User: “What techniques help with trading discipline?”

Assistant: Based on your notes, several techniques can help with trading discipline:

  1. Pre-trade checklists - From “Trading Psychology Fundamentals”: Having a written checklist before each trade forces you to verify your setup meets all criteria.

  2. Journal reviews - “My Trading Journal Process” describes reviewing each trade weekly to identify emotional patterns.

  3. Position sizing rules - “Risk Management 101” emphasizes that fixed position sizing removes emotional decision-making.

Sources: Trading Psychology Fundamentals (78%), My Trading Journal Process (72%), Risk Management 101 (68%)


User: “How does LangGraph differ from CrewAI?”

Assistant: According to your notes, the key differences are:

LangGraph (from “LangGraph Architecture Deep Dive”):

  • Graph-based workflow definition
  • Explicit state management
  • Fine-grained control over agent transitions

CrewAI (from “CrewAI Framework Overview”):

  • Role-based agent definition
  • Automatic task delegation
  • Simpler setup for common patterns

Your note “Comparing Agent Frameworks” summarizes: “LangGraph when you need control, CrewAI when you need speed.”

Sources: LangGraph Architecture Deep Dive (82%), CrewAI Framework Overview (79%), Comparing Agent Frameworks (71%)


Key Takeaways#

  1. Chunking is crucial. 2,000 chars with 400 overlap works well. Include frontmatter in first chunk.

  2. Lower your threshold. 0.50 retrieves more than note-linking’s 0.70. The LLM handles noise.

  3. Deduplicate results. One chunk per note prevents monopolization.

  4. Show sources. Attribution builds trust. Let users verify.

  5. Complete the loop. “Open in Obsidian” makes the chatbot useful, not just interesting.

  6. Auto-index on save. Manual re-indexing kills the user experience.


The Full Journey#

This chatbot is the culmination of our knowledge management system:

  1. YouTube transcript extraction
  2. AI-powered note generation
  3. Semantic indexing in Qdrant
  4. Auto-linking related notes
  5. Tag curation and taxonomy
  6. Batch processing with Anthropic API
  7. Modern React frontend
  8. And now: RAG-powered chat

Each phase built on the previous. The semantic index that powers note similarity also powers chunk retrieval. The frontmatter that organizes tags also enriches embeddings. The infrastructure that serves the UI also serves the chatbot.

This is what compound progress looks like.



This article is part of our series on building AI-powered knowledge management tools. Written with assistance from Claude Code.
