Here’s the dirty secret of personal knowledge management: most notes are islands.
You create a note about RAG architectures. Another about vector databases. A third about LangChain. They’re clearly related - but unless you manually link them, they exist in isolation. Your “knowledge base” is really just a fancy folder of disconnected files.
Manual linking doesn’t scale. After 100 notes, you can’t remember what you’ve written. After 1,000 notes, you’re drowning.
We built a system that links notes automatically.
Figure 1 - Knowledge graph visualization showing the Obsidian vault after automated linking - yellow dots represent individual notes, blue lines show semantic connections discovered automatically
The Solution: Semantic Similarity
Two notes are related if they’re about similar things. “Similar things” is fuzzy for humans but precise for math: we convert notes to vectors (embeddings) and measure the distance between them.
```
Note A: "RAG architectures combine retrieval and generation..."
    ↓ embed
[0.23, -0.45, 0.78, 0.12, ...]  (1536 dimensions)

Note B: "Vector databases enable semantic search..."
    ↓ embed
[0.21, -0.42, 0.75, 0.15, ...]  (1536 dimensions)

Cosine similarity: 0.87  ← These notes are related!
```

If two notes have embeddings that point in similar directions (high cosine similarity), they’re about similar topics. Simple concept, powerful results.
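For readers who want the math made concrete, here’s a minimal sketch of cosine similarity in plain Python. In practice you’d let NumPy or the vector database compute this; the hand-rolled version just shows what the number means:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

The same formula applies at 1536 dimensions; only the vector length changes.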
Architecture Overview
The system has three phases: Index, Search, and Link.
Figure 2 - Note similarity pipeline architecture showing the indexing flow (note → extract metadata → build embedding text → OpenAI API → store in PostgreSQL and Qdrant) and search/linking flow
The Indexing Pipeline
Every note gets indexed when saved:
- Extract embedding text from frontmatter and body
- Generate embedding via OpenAI’s text-embedding-3-small
- Store in Qdrant for vector search
- Store metadata in PostgreSQL for relational queries
The Search Pipeline
When finding related notes:
- Query Qdrant with the note’s embedding
- Filter by threshold (0.70 minimum similarity)
- Return ranked results by similarity score
The Linking Pipeline
When linking notes:
- Find similar notes above threshold
- Add bidirectional links - if A relates to B, B relates to A
- Update Related Notes section in markdown
Phase 1: Building the Index
Step 1: Extract Embedding Text
We don’t embed the entire note - that would be wasteful and noisy. Instead, we extract the most semantically meaningful parts:
```python
def build_embedding_text(frontmatter: dict, body: str) -> str:
    """Build text for embedding from note components."""
    parts = []

    # Title carries heavy semantic weight
    if title := frontmatter.get("title"):
        parts.append(f"Title: {title}")

    # Description is a concise summary
    if desc := frontmatter.get("description"):
        parts.append(f"Description: {desc}")

    # Tags indicate topic areas
    if tags := frontmatter.get("tags"):
        tag_str = ", ".join(tags)
        parts.append(f"Topics: {tag_str}")

    # First 500 chars of body for context
    if body:
        preview = body[:500].strip()
        parts.append(f"Content: {preview}")

    return " | ".join(parts)
```

This produces text like:

```
Title: Building RAG Systems | Description: A guide to retrieval-augmented generation | Topics: ai/rag, ai/llm, coding/python | Content: RAG combines the power of large language models with external knowledge retrieval...
```

Step 2: Generate Embedding
We use OpenAI’s text-embedding-3-small model (1536 dimensions):
```python
from openai import OpenAI

client = OpenAI()

def generate_embedding(text: str) -> list[float]:
    """Generate embedding vector for text."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding
```

Step 3: Store in Qdrant
Qdrant is our vector database. Self-hosted, fast, and purpose-built for similarity search:
```python
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

qdrant = QdrantClient(host="192.168.1.110", port=6333)

def index_note(file_path: str, embedding: list[float], metadata: dict):
    """Store note embedding in Qdrant."""
    point_id = hash_path(file_path)

    qdrant.upsert(
        collection_name="obsidian_vault_notes",
        points=[
            PointStruct(
                id=point_id,
                vector=embedding,
                payload={
                    "file_path": file_path,
                    "title": metadata.get("title"),
                    "tags": metadata.get("tags", []),
                    "content_hash": metadata.get("content_hash"),
                },
            )
        ],
    )
```

We also store metadata in PostgreSQL for relational queries. The dual storage lets us use the right tool for each job.
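The `hash_path` helper isn’t shown in the article. One plausible implementation (our assumption, not necessarily the original) derives a deterministic UUID from the file path, since Qdrant accepts UUID strings as point IDs:

```python
import uuid

def hash_path(file_path: str) -> str:
    """Derive a stable point ID from a file path.

    uuid5 is deterministic: the same path always yields the same ID,
    so re-indexing a note overwrites its existing point instead of
    creating a duplicate.
    """
    return str(uuid.uuid5(uuid.NAMESPACE_URL, file_path))
```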
Incremental Indexing
We track content hashes to avoid re-indexing unchanged notes:
```python
def should_reindex(file_path: str, current_hash: str) -> bool:
    """Check if note needs re-indexing."""
    existing = get_note_metadata(file_path)
    if not existing:
        return True  # New note
    return existing.content_hash != current_hash  # Content changed
```

This makes re-indexing the entire vault fast - only changed notes get processed.
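The article doesn’t show how the content hash is computed. A straightforward choice (this `compute_content_hash` is our hypothetical name, not from the source) is a SHA-256 digest of the note body:

```python
import hashlib

def compute_content_hash(content: str) -> str:
    """SHA-256 hex digest of the note text.

    Stored in the Qdrant payload and compared on save: if the digest
    is unchanged, the note is skipped during re-indexing.
    """
    return hashlib.sha256(content.encode("utf-8")).hexdigest()
```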
Phase 2: Searching for Similar Notes
When we need to find related notes, we query Qdrant:
```python
from qdrant_client.models import FieldCondition, Filter, MatchValue

def search_similar(
    embedding: list[float],
    exclude_path: str | None = None,
    threshold: float = 0.70,
    limit: int = 10,
) -> list[SimilarNote]:
    """Find notes similar to the given embedding."""
    filter_condition = None
    if exclude_path:
        filter_condition = Filter(
            must_not=[
                FieldCondition(
                    key="file_path",
                    match=MatchValue(value=exclude_path),
                )
            ]
        )

    results = qdrant.search(
        collection_name="obsidian_vault_notes",
        query_vector=embedding,
        query_filter=filter_condition,
        limit=limit,
        score_threshold=threshold,
    )

    return [
        SimilarNote(
            file_path=r.payload["file_path"],
            title=r.payload["title"],
            score=r.score,
        )
        for r in results
    ]
```

The 0.70 Threshold
We use 0.70 as our similarity threshold. This was tuned empirically:
| Threshold | Results |
|---|---|
| 0.90+ | Almost no matches (too strict) |
| 0.80-0.90 | Very high quality, few matches |
| 0.70-0.80 | Good balance of quality and quantity |
| 0.60-0.70 | More matches, some noise |
| Below 0.60 | Too many irrelevant matches |
At 0.70, notes that match are genuinely related. You won’t get a note about Docker linked to a note about philosophy.
Figure 3 - Threshold tuning visualization showing the trade-off between precision and recall at different similarity thresholds
Phase 3: Bidirectional Linking
Here’s where things get interesting. When we link Note A to Note B, we also link Note B back to Note A:
```python
def update_backlinks(source_path: str, similar_notes: list[SimilarNote]):
    """Add bidirectional links between notes."""
    source_name = Path(source_path).stem

    for similar in similar_notes:
        target_path = similar.file_path
        target_name = Path(target_path).stem

        # Add link from source to target
        add_related_note(source_path, target_name)

        # Add link from target back to source (bidirectional)
        add_related_note(target_path, source_name)
```

The add_related_note function finds or creates a “Related Notes” section:
```python
def add_related_note(file_path: str, link_name: str):
    """Add a wiki link to the Related Notes section."""
    content = read_file(file_path)

    # Check if link already exists
    if f"[[{link_name}]]" in content:
        return  # Already linked

    # Find or create Related Notes section
    if "## Related Notes" in content:
        content = insert_after_header(content, "## Related Notes", f"- [[{link_name}]]")
    else:
        content += f"\n\n## Related Notes\n\n- [[{link_name}]]\n"

    write_file(file_path, content)
```

The result in markdown:
```markdown
## Related Notes

- [[Building RAG Systems]]
- [[Vector Database Comparison]]
- [[LangChain Deep Dive]]
```

These are standard Obsidian wiki links. Click one, and you navigate directly to the related note.
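The `insert_after_header` helper is referenced but never shown. Here is a minimal sketch of one reasonable behavior (an assumption on our part): add the bullet just below the named header, skipping any blank lines, and create the section if the header is missing:

```python
def insert_after_header(content: str, header: str, line: str) -> str:
    """Insert a line immediately below the given markdown header."""
    lines = content.splitlines()
    for i, existing in enumerate(lines):
        if existing.strip() == header:
            # Skip blank lines directly under the header
            j = i + 1
            while j < len(lines) and not lines[j].strip():
                j += 1
            lines.insert(j, line)
            return "\n".join(lines)
    # Header not found: append a new section
    return content + f"\n\n{header}\n\n{line}\n"
```

Appending at the end of the section instead of the top would work equally well; the only requirement is that the link lands inside the section.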
Self-Reference Detection
A subtle but important edge case: notes shouldn’t link to themselves.
This sounds obvious, but it’s tricky in practice. When processing a YouTube video called “Building RAG Systems”, the output file might be named Building-RAG-Systems.md or Building_RAG_Systems.md or Building RAG Systems.md (depending on sanitization rules).
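To make that concrete, here is a hypothetical `sanitize_filename` (the article doesn’t show its actual rules) under which all three filename variants collapse to the same string:

```python
import re

def sanitize_filename(name: str) -> str:
    """Normalize a title the way filenames commonly get sanitized:
    lowercase, with runs of non-alphanumerics collapsed to single hyphens."""
    name = name.lower()
    name = re.sub(r"[^a-z0-9]+", "-", name)
    return name.strip("-")
```

Under these rules, "Building RAG Systems", "Building-RAG-Systems", and "Building_RAG_Systems" all normalize to the same value, which is exactly why self-reference checks have to compare sanitized forms.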
If we search for similar notes, the note itself might come back as the #1 match!
We handle this with multiple checks:
```python
def is_self_reference(source_title: str, link_name: str) -> bool:
    """Check if a link would be a self-reference."""
    # Exact match
    if link_name == source_title:
        return True

    # Sanitized match (special chars removed)
    sanitized_source = sanitize_filename(source_title)
    sanitized_link = sanitize_filename(link_name)
    if sanitized_link == sanitized_source:
        return True

    # Truncated match (filenames might be shortened)
    if sanitized_link[:50] == sanitized_source[:50]:
        return True

    return False
```

Batch Linking
For the initial vault setup, we ran a batch script to add links to all existing notes:
```python
def batch_add_related_notes(vault_path: str, threshold: float = 0.70):
    """Add related notes to all files in vault."""
    files = list(Path(vault_path).glob("**/*.md"))
    total_links_added = 0

    for file_path in files:
        embedding = get_or_create_embedding(file_path)

        similar = search_similar(
            embedding=embedding,
            exclude_path=str(file_path),
            threshold=threshold,
        )

        similar = [
            s for s in similar
            if not is_self_reference(get_title(file_path), s.title)
        ]

        for note in similar:
            add_related_note(str(file_path), Path(note.file_path).stem)
            total_links_added += 1

    return total_links_added
```

Results for our vault:
| Metric | Value |
|---|---|
| Files processed | 1,024 |
| Links added | 2,757 |
| Average links per note | 2.7 |
| Processing time | ~15 minutes |
Auto-Linking on Save
New notes get linked automatically when saved:
```python
@app.post("/api/save-note")
async def save_note(note: NoteInput):
    """Save note and auto-link to similar content."""
    save_note_to_disk(note)

    # Build the embedding text from the note's metadata and body
    embedding = generate_embedding(build_embedding_text(note.metadata, note.body))
    index_note(note.file_path, embedding, note.metadata)

    similar = search_similar(embedding, exclude_path=note.file_path)
    update_backlinks(note.file_path, similar)

    return {"status": "saved", "related_notes": len(similar)}
```

Figure 4 - Auto-linking workflow showing the save flow: user saves note → index in Qdrant → find similar → add bidirectional links → update Related Notes section
The Knowledge Graph in Action
Some examples from our vault:
| Note | Auto-Linked To |
|---|---|
| “RAG Architecture Deep Dive” | “Vector Database Comparison”, “LangChain Retrieval”, “Embedding Models” |
| “Trading Psychology” | “Risk Management”, “Journal Review Process”, “Emotional Discipline” |
| “Docker Compose Patterns” | “Container Orchestration”, “Development Environment”, “PostgreSQL Setup” |
The system finds connections humans would miss. It links across topic boundaries - a note about “Testing AI Agents” might link to “Integration Testing Best Practices” even though they’re in different folders.
Why Qdrant?
We chose Qdrant for several reasons:
- Self-hosted - Runs on our Proxmox server, no cloud dependency
- Fast - Sub-millisecond searches even with thousands of vectors
- Purpose-built - Designed specifically for vector similarity search
- Simple API - Python client is clean and well-documented
- Payload storage - Can store metadata alongside vectors
- Filtering - Exclude specific documents, filter by tags, etc.
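One setup step the article doesn’t show: the collection has to exist, with the right dimensionality and distance metric, before any upserts. A sketch of that one-time setup, reusing the collection name and host from the article (the rest are standard qdrant-client calls):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

qdrant = QdrantClient(host="192.168.1.110", port=6333)

# One-time setup: 1536 dimensions to match text-embedding-3-small,
# cosine distance to match the similarity scores used throughout.
qdrant.create_collection(
    collection_name="obsidian_vault_notes",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)
```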
The self-hosted aspect matters. Our notes contain personal and business information. Keeping the vector database on-premise means the data never leaves our network.
Key Takeaways
- Embeddings make similarity searchable. Convert text to vectors, and “related content” becomes a mathematical operation.
- Build embedding text carefully. Title + description + tags + preview works better than raw body text.
- 0.70 is a good threshold. High enough to avoid noise, low enough to find meaningful connections.
- Bidirectional links matter. If A relates to B, B relates to A. Link both directions.
- Watch for self-references. Sanitization and truncation can make self-detection tricky.
- Dual storage works well. Vectors in Qdrant, metadata in PostgreSQL. Use the right tool for each job.
- Auto-link on save. Don’t make users manually trigger linking. Make it automatic.
What’s Next
The note similarity system sets the stage for something bigger: a RAG chatbot over your vault.
We have the indexed notes. We have the embeddings. We have the search infrastructure. The next step is adding a chat interface that:
- Takes a user question
- Retrieves relevant note chunks
- Generates an answer with source attribution
That’s our RAG chatbot - the final piece of the knowledge management puzzle.
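The chatbot deserves its own article, but the retrieval-to-prompt step can already be sketched with the pieces built here. This hypothetical `build_rag_prompt` (our naming, not from the system) shows how retrieved note chunks would be assembled with source attribution before being sent to a chat model:

```python
def build_rag_prompt(question: str, notes: list[tuple[str, str]]) -> str:
    """Assemble retrieved (title, chunk) pairs into a grounded prompt.

    Labeling each chunk with its source title lets the model cite
    which notes its answer came from.
    """
    context = "\n\n".join(f"[Source: {title}]\n{chunk}" for title, chunk in notes)
    return (
        "Answer the question using only the notes below. "
        "Cite the source titles you used.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
```

The chunks themselves would come from `search_similar` over the existing Qdrant index, with the question embedded the same way notes are.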
Related Articles
- From YouTube to Knowledge Graph - System overview
- Anthropic Batch API in Production - Cost-effective processing
- Obsidian Vault Curation at Scale - Cleaning up the vault first
- Building a RAG Chatbot for Your Obsidian Vault - What comes next
This article is part of our series on building AI-powered knowledge management tools. Written with assistance from Claude Code.