Building a Semantic Note Network: How Vector Search Turns Isolated Notes into a Knowledge Graph

Here’s the dirty secret of personal knowledge management: most notes are islands.

You create a note about RAG architectures. Another about vector databases. A third about LangChain. They’re clearly related - but unless you manually link them, they exist in isolation. Your “knowledge base” is really just a fancy folder of disconnected files.

Manual linking doesn’t scale. After 100 notes, you can’t remember what you’ve written. After 1,000 notes, you’re drowning.

We built a system that links notes automatically.

Figure 1 - Knowledge graph visualization showing the Obsidian vault after automated linking - yellow dots represent individual notes, blue lines show semantic connections discovered automatically


The Solution: Semantic Similarity#

Two notes are related if they’re about similar things. “Similar things” is fuzzy for humans but precise for math: we convert notes to vectors (embeddings) and measure the distance between them.

Note A: "RAG architectures combine retrieval and generation..."
↓ embed
[0.23, -0.45, 0.78, 0.12, ...] (1536 dimensions)
Note B: "Vector databases enable semantic search..."
↓ embed
[0.21, -0.42, 0.75, 0.15, ...] (1536 dimensions)
Cosine similarity: 0.87 ← These notes are related!

If two notes have embeddings that point in similar directions (high cosine similarity), they’re about similar topics. Simple concept, powerful results.
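If you want to see that measurement concretely: cosine similarity is just the dot product of the two vectors divided by the product of their lengths. Here's a minimal sketch with NumPy - the short vectors are stand-ins for the 1536-dimensional embeddings above, not real ones:

import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of the magnitudes."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for the 1536-dimensional embeddings
note_a = [0.23, -0.45, 0.78, 0.12]
note_b = [0.21, -0.42, 0.75, 0.15]
print(cosine_similarity(note_a, note_b))  # close to 1.0 → related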


Architecture Overview#

The system has three phases: Index, Search, and Link.

Figure 2 - Note similarity pipeline architecture showing the indexing flow (note → extract metadata → build embedding text → OpenAI API → store in PostgreSQL and Qdrant) and search/linking flow

The Indexing Pipeline#

Every note gets indexed when saved:

  1. Extract embedding text from frontmatter and body
  2. Generate embedding via OpenAI’s text-embedding-3-small
  3. Store in Qdrant for vector search
  4. Store metadata in PostgreSQL for relational queries

The Search Pipeline#

When finding related notes:

  1. Query Qdrant with the note’s embedding
  2. Filter by threshold (0.70 minimum similarity)
  3. Return ranked results by similarity score

The Linking Pipeline#

When linking notes:

  1. Find similar notes above threshold
  2. Add bidirectional links - if A relates to B, B relates to A
  3. Update Related Notes section in markdown

Phase 1: Building the Index#

Step 1: Extract Embedding Text#

We don’t embed the entire note - that would be wasteful and noisy. Instead, we extract the most semantically meaningful parts:

def build_embedding_text(frontmatter: dict, body: str) -> str:
    """Build text for embedding from note components."""
    parts = []
    # Title carries heavy semantic weight
    if title := frontmatter.get("title"):
        parts.append(f"Title: {title}")
    # Description is a concise summary
    if desc := frontmatter.get("description"):
        parts.append(f"Description: {desc}")
    # Tags indicate topic areas
    if tags := frontmatter.get("tags"):
        tag_str = ", ".join(tags)
        parts.append(f"Topics: {tag_str}")
    # First 500 chars of body for context
    if body:
        preview = body[:500].strip()
        parts.append(f"Content: {preview}")
    return " | ".join(parts)

This produces text like:

Title: Building RAG Systems | Description: A guide to retrieval-augmented
generation | Topics: ai/rag, ai/llm, coding/python | Content: RAG combines
the power of large language models with external knowledge retrieval...

Step 2: Generate Embedding#

We use OpenAI’s text-embedding-3-small model (1536 dimensions):

from openai import OpenAI

client = OpenAI()

def generate_embedding(text: str) -> list[float]:
    """Generate embedding vector for text."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding

Step 3: Store in Qdrant#

Qdrant is our vector database. Self-hosted, fast, and purpose-built for similarity search:

from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

qdrant = QdrantClient(host="192.168.1.110", port=6333)

def index_note(file_path: str, embedding: list[float], metadata: dict):
    """Store note embedding in Qdrant."""
    point_id = hash_path(file_path)
    qdrant.upsert(
        collection_name="obsidian_vault_notes",
        points=[
            PointStruct(
                id=point_id,
                vector=embedding,
                payload={
                    "file_path": file_path,
                    "title": metadata.get("title"),
                    "tags": metadata.get("tags", []),
                    "content_hash": metadata.get("content_hash"),
                },
            )
        ],
    )

We also store metadata in PostgreSQL for relational queries. The dual storage lets us use the right tool for each job.
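The PostgreSQL side isn't shown in this article, but the idea is a plain metadata table keyed by file path. A rough sketch with psycopg2 - the table name and columns here are our illustration, not the actual schema:

import psycopg2

conn = psycopg2.connect("dbname=notes user=notes host=192.168.1.110")

def store_note_metadata(file_path: str, title: str, tags: list[str], content_hash: str):
    """Upsert note metadata for relational queries (hypothetical schema)."""
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO note_metadata (file_path, title, tags, content_hash)
            VALUES (%s, %s, %s, %s)
            ON CONFLICT (file_path)
            DO UPDATE SET title = EXCLUDED.title,
                          tags = EXCLUDED.tags,
                          content_hash = EXCLUDED.content_hash
            """,
            (file_path, title, tags, content_hash),
        )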

Incremental Indexing#

We track content hashes to avoid re-indexing unchanged notes:

def should_reindex(file_path: str, current_hash: str) -> bool:
    """Check if note needs re-indexing."""
    existing = get_note_metadata(file_path)
    if not existing:
        return True  # New note
    return existing.content_hash != current_hash  # Content changed

This makes re-indexing the entire vault fast - only changed notes get processed.
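The content hash itself can be as simple as a SHA-256 of the file's bytes. A minimal sketch (the helper name is illustrative):

import hashlib
from pathlib import Path

def compute_content_hash(file_path: str) -> str:
    """Hash the note's raw bytes so unchanged files are skipped on re-index."""
    return hashlib.sha256(Path(file_path).read_bytes()).hexdigest()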


Phase 2: Searching for Similar Notes#

When we need to find related notes, we query Qdrant:

from qdrant_client.models import FieldCondition, Filter, MatchValue

def search_similar(
    embedding: list[float],
    exclude_path: str | None = None,
    threshold: float = 0.70,
    limit: int = 10,
) -> list[SimilarNote]:
    """Find notes similar to the given embedding."""
    # Optionally exclude the source note itself from the results
    filter_condition = None
    if exclude_path:
        filter_condition = Filter(
            must_not=[
                FieldCondition(
                    key="file_path",
                    match=MatchValue(value=exclude_path),
                )
            ]
        )
    results = qdrant.search(
        collection_name="obsidian_vault_notes",
        query_vector=embedding,
        query_filter=filter_condition,
        limit=limit,
        score_threshold=threshold,
    )
    # SimilarNote is a small model holding file_path, title, and score
    return [
        SimilarNote(
            file_path=r.payload["file_path"],
            title=r.payload["title"],
            score=r.score,
        )
        for r in results
    ]

The 0.70 Threshold#

We use 0.70 as our similarity threshold. This was tuned empirically:

Threshold     Results
0.90+         Almost no matches (too strict)
0.80-0.90     Very high quality, few matches
0.70-0.80     Good balance of quality and quantity
0.60-0.70     More matches, some noise
Below 0.60    Too many irrelevant matches

At 0.70, notes that match are genuinely related. You won’t get a note about Docker linked to a note about philosophy.

Figure 3 - Threshold tuning visualization showing the trade-off between precision and recall at different similarity thresholds


Phase 3: Bidirectional Linking#

Here’s where things get interesting. When we link Note A to Note B, we also link Note B back to Note A:

from pathlib import Path

def update_backlinks(source_path: str, similar_notes: list[SimilarNote]):
    """Add bidirectional links between notes."""
    source_name = Path(source_path).stem
    for similar in similar_notes:
        target_path = similar.file_path
        target_name = Path(target_path).stem
        # Add link from source to target
        add_related_note(source_path, target_name)
        # Add link from target back to source (bidirectional)
        add_related_note(target_path, source_name)

The add_related_note function finds or creates a “Related Notes” section:

def add_related_note(file_path: str, link_name: str):
    """Add a wiki link to the Related Notes section."""
    content = read_file(file_path)
    # Check if link already exists
    if f"[[{link_name}]]" in content:
        return  # Already linked
    # Find or create Related Notes section
    if "## Related Notes" in content:
        content = insert_after_header(content, "## Related Notes", f"- [[{link_name}]]")
    else:
        content += f"\n\n## Related Notes\n\n- [[{link_name}]]\n"
    write_file(file_path, content)
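The insert_after_header helper isn't shown above; a minimal sketch of what it does - add the new link right after the header line - might look like this:

def insert_after_header(content: str, header: str, line: str) -> str:
    """Insert a line right after the given header (illustrative helper)."""
    lines = content.splitlines()
    for i, existing in enumerate(lines):
        if existing.strip() == header:
            lines.insert(i + 1, line)
            break
    return "\n".join(lines) + "\n"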

The result in markdown:

## Related Notes
- [[Building RAG Systems]]
- [[Vector Database Comparison]]
- [[LangChain Deep Dive]]

These are standard Obsidian wiki links. Click one, and you navigate directly to the related note.


Self-Reference Detection#

A subtle but important edge case: notes shouldn’t link to themselves.

This sounds obvious, but it’s tricky in practice. When processing a YouTube video called “Building RAG Systems”, the output file might be named Building-RAG-Systems.md or Building_RAG_Systems.md or Building RAG Systems.md (depending on sanitization rules).

If we search for similar notes, the note itself might come back as the #1 match!

We handle this with multiple checks:

def is_self_reference(source_title: str, link_name: str) -> bool:
    """Check if a link would be a self-reference."""
    # Exact match
    if link_name == source_title:
        return True
    # Sanitized match (special chars removed)
    sanitized_source = sanitize_filename(source_title)
    sanitized_link = sanitize_filename(link_name)
    if sanitized_link == sanitized_source:
        return True
    # Truncated match (filenames might be shortened)
    if sanitized_link[:50] == sanitized_source[:50]:
        return True
    return False
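The sanitize_filename helper mirrors whatever rules produced the file name in the first place. As a rough approximation (not the exact rules from our pipeline):

import re

def sanitize_filename(name: str) -> str:
    """Normalize a title the same way file names are normalized (approximate)."""
    name = re.sub(r"[^A-Za-z0-9 _-]", "", name)   # drop special characters
    name = re.sub(r"[\s_-]+", "-", name.strip())  # collapse separators to hyphens
    return name.lower()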

Batch Linking#

For the initial vault setup, we ran a batch script to add links to all existing notes:

def batch_add_related_notes(vault_path: str, threshold: float = 0.70):
    """Add related notes to all files in vault."""
    files = list(Path(vault_path).glob("**/*.md"))
    total_links_added = 0
    for file_path in files:
        embedding = get_or_create_embedding(file_path)
        similar = search_similar(
            embedding=embedding,
            exclude_path=str(file_path),
            threshold=threshold,
        )
        # Drop any match that is really the note itself
        similar = [
            s for s in similar
            if not is_self_reference(get_title(file_path), s.title)
        ]
        for note in similar:
            add_related_note(str(file_path), Path(note.file_path).stem)
            total_links_added += 1
    return total_links_added

Results for our vault:

Metric                    Value
Files processed           1,024
Links added               2,757
Average links per note    2.7
Processing time           ~15 minutes

Auto-Linking on Save#

New notes get linked automatically when saved:

@app.post("/api/save-note")
async def save_note(note: NoteInput):
    """Save note and auto-link to similar content."""
    save_note_to_disk(note)
    embedding = generate_embedding(build_embedding_text(note.frontmatter, note.body))
    index_note(note.file_path, embedding, note.metadata)
    similar = search_similar(embedding, exclude_path=note.file_path)
    update_backlinks(note.file_path, similar)
    return {"status": "saved", "related_notes": len(similar)}

Figure 4 - Auto-linking workflow showing the save flow: user saves note → index in Qdrant → find similar → add bidirectional links → update Related Notes section


The Knowledge Graph in Action#

Some examples from our vault:

Note                            Auto-Linked To
"RAG Architecture Deep Dive"    "Vector Database Comparison", "LangChain Retrieval", "Embedding Models"
"Trading Psychology"            "Risk Management", "Journal Review Process", "Emotional Discipline"
"Docker Compose Patterns"       "Container Orchestration", "Development Environment", "PostgreSQL Setup"

The system finds connections humans would miss. It links across topic boundaries - a note about “Testing AI Agents” might link to “Integration Testing Best Practices” even though they’re in different folders.


Why Qdrant?#

We chose Qdrant for several reasons:

  1. Self-hosted - Runs on our Proxmox server, no cloud dependency
  2. Fast - Sub-millisecond searches even with thousands of vectors
  3. Purpose-built - Designed specifically for vector similarity search
  4. Simple API - Python client is clean and well-documented
  5. Payload storage - Can store metadata alongside vectors
  6. Filtering - Exclude specific documents, filter by tags, etc.

The self-hosted aspect matters. Our notes contain personal and business information. Keeping the vector database on-premise means the data never leaves our network.
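One piece of setup not shown earlier: the collection has to exist before the first upsert. With a recent qdrant-client that's a one-time call, using cosine distance to match the similarity scores used throughout:

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

qdrant = QdrantClient(host="192.168.1.110", port=6333)

# Create the collection once, sized for text-embedding-3-small vectors
if not qdrant.collection_exists("obsidian_vault_notes"):
    qdrant.create_collection(
        collection_name="obsidian_vault_notes",
        vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    )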


Key Takeaways#

  1. Embeddings make similarity searchable. Convert text to vectors, and “related content” becomes a mathematical operation.

  2. Build embedding text carefully. Title + description + tags + preview works better than raw body text.

  3. 0.70 is a good threshold. High enough to avoid noise, low enough to find meaningful connections.

  4. Bidirectional links matter. If A relates to B, B relates to A. Link both directions.

  5. Watch for self-references. Sanitization and truncation can make self-detection tricky.

  6. Dual storage works well. Vectors in Qdrant, metadata in PostgreSQL. Use the right tool for each job.

  7. Auto-link on save. Don’t make users manually trigger linking. Make it automatic.


What’s Next#

The note similarity system sets the stage for something bigger: a RAG chatbot over your vault.

We have the indexed notes. We have the embeddings. We have the search infrastructure. The next step is adding a chat interface that:

  1. Takes a user question
  2. Retrieves relevant note chunks
  3. Generates an answer with source attribution

That’s our RAG chatbot - the final piece of the knowledge management puzzle.



This article is part of our series on building AI-powered knowledge management tools. Written with assistance from Claude Code.

https://fuwari.vercel.app/articles/semantic-note-network/
Author: Katrina Dotzlaw, Ryan Dotzlaw
Published: 2025-12-17
License: CC BY-NC-SA 4.0