Here’s the dirty secret of personal knowledge management: most notes are islands.
You create a note about RAG architectures. Another about vector databases. A third about LangChain. They’re clearly related - but unless you manually link them, they exist in isolation. Your “knowledge base” is really just a fancy folder of disconnected files.
Manual linking doesn’t scale. After 100 notes, you can’t remember what you’ve written. After 1,000 notes, you’re drowning.
We built a system that links notes automatically.
Figure 1 - Knowledge graph visualization showing the Obsidian vault after automated linking - yellow dots represent individual notes, blue lines show semantic connections discovered automatically
The Solution: Semantic Similarity
Two notes are related if they’re about similar things. “Similar things” is fuzzy for humans but precise for math: we convert notes to vectors (embeddings) and measure the distance between them.
```
Note A: "RAG architectures combine retrieval and generation..."
    ↓ embed
[0.23, -0.45, 0.78, 0.12, ...]  (1536 dimensions)

Note B: "Vector databases enable semantic search..."
    ↓ embed
[0.21, -0.42, 0.75, 0.15, ...]  (1536 dimensions)

Cosine similarity: 0.87  ← These notes are related!
```

If two notes have embeddings that point in similar directions (high cosine similarity), they’re about similar topics. Simple concept, powerful results.
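For readers who want the math made concrete, here’s a minimal sketch of cosine similarity in plain Python. In practice you’d let NumPy or the vector database compute this; the hand-rolled version just shows what the number means:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

The same formula applies at 1536 dimensions; only the vector length changes.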
Architecture Overview
The system has three phases: Index, Search, and Link.
Figure 2 - Note similarity pipeline architecture showing the indexing flow (note → extract metadata → build embedding text → OpenAI API → store in PostgreSQL and Qdrant) and search/linking flow
The Indexing Pipeline
Every note gets indexed when saved:
- Extract embedding text from frontmatter and body
- Generate embedding via OpenAI’s text-embedding-3-small
- Store in Qdrant for vector search
- Store metadata in PostgreSQL for relational queries
The Search Pipeline
When finding related notes:
- Query Qdrant with the note’s embedding
- Filter by threshold (0.70 minimum similarity)
- Return ranked results by similarity score
The Linking Pipeline
When linking notes:
- Find similar notes above threshold
- Add bidirectional links - if A relates to B, B relates to A
- Update Related Notes section in markdown
Phase 1: Building the Index
Step 1: Extract Embedding Text
We don’t embed the entire note - that would be wasteful and noisy. Instead, we extract the most semantically meaningful parts:
```python
def build_embedding_text(frontmatter: dict, body: str) -> str:
    """Build text for embedding from note components."""
    parts = []

    # Title carries heavy semantic weight
    if title := frontmatter.get("title"):
        parts.append(f"Title: {title}")

    # Description is a concise summary
    if desc := frontmatter.get("description"):
        parts.append(f"Description: {desc}")

    # Tags indicate topic areas
    if tags := frontmatter.get("tags"):
        tag_str = ", ".join(tags)
        parts.append(f"Topics: {tag_str}")

    # First 500 chars of body for context
    if body:
        preview = body[:500].strip()
        parts.append(f"Content: {preview}")

    return " | ".join(parts)
```

This produces text like:

```
Title: Building RAG Systems | Description: A guide to retrieval-augmented generation | Topics: ai/rag, ai/llm, coding/python | Content: RAG combines the power of large language models with external knowledge retrieval...
```

Step 2: Generate Embedding
We use OpenAI’s text-embedding-3-small model (1536 dimensions):
```python
from openai import OpenAI

client = OpenAI()

def generate_embedding(text: str) -> list[float]:
    """Generate embedding vector for text."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding
```

Step 3: Store in Qdrant
Qdrant is our vector database. Self-hosted, fast, and purpose-built for similarity search:
```python
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

qdrant = QdrantClient(host="192.168.1.110", port=6333)

def index_note(file_path: str, embedding: list[float], metadata: dict):
    """Store note embedding in Qdrant."""
    point_id = hash_path(file_path)

    qdrant.upsert(
        collection_name="obsidian_vault_notes",
        points=[
            PointStruct(
                id=point_id,
                vector=embedding,
                payload={
                    "file_path": file_path,
                    "title": metadata.get("title"),
                    "tags": metadata.get("tags", []),
                    "content_hash": metadata.get("content_hash"),
                },
            )
        ],
    )
```

We also store metadata in PostgreSQL for relational queries. The dual storage lets us use the right tool for each job.
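The `hash_path` helper isn’t shown in the article. One plausible implementation (our assumption, not necessarily the original) derives a deterministic UUID from the file path, since Qdrant accepts UUID strings as point IDs:

```python
import uuid

def hash_path(file_path: str) -> str:
    """Derive a stable point ID from a file path.

    uuid5 is deterministic: the same path always yields the same ID,
    so re-indexing a note overwrites its existing point instead of
    creating a duplicate.
    """
    return str(uuid.uuid5(uuid.NAMESPACE_URL, file_path))
```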
Incremental Indexing
We track content hashes to avoid re-indexing unchanged notes:
```python
def should_reindex(file_path: str, current_hash: str) -> bool:
    """Check if note needs re-indexing."""
    existing = get_note_metadata(file_path)
    if not existing:
        return True  # New note
    return existing.content_hash != current_hash  # Content changed
```

This makes re-indexing the entire vault fast - only changed notes get processed.
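The article doesn’t show how the content hash is computed. A straightforward choice (this `compute_content_hash` is our hypothetical name, not from the source) is a SHA-256 digest of the note body:

```python
import hashlib

def compute_content_hash(content: str) -> str:
    """SHA-256 hex digest of the note text.

    Stored in the Qdrant payload and compared on save: if the digest
    is unchanged, the note is skipped during re-indexing.
    """
    return hashlib.sha256(content.encode("utf-8")).hexdigest()
```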
Phase 2: Searching for Similar Notes
When we need to find related notes, we query Qdrant:
```python
from qdrant_client.models import FieldCondition, Filter, MatchValue

def search_similar(
    embedding: list[float],
    exclude_path: str | None = None,
    threshold: float = 0.70,
    limit: int = 10,
) -> list[SimilarNote]:
    """Find notes similar to the given embedding."""
    filter_condition = None
    if exclude_path:
        filter_condition = Filter(
            must_not=[
                FieldCondition(
                    key="file_path",
                    match=MatchValue(value=exclude_path),
                )
            ]
        )

    results = qdrant.search(
        collection_name="obsidian_vault_notes",
        query_vector=embedding,
        query_filter=filter_condition,
        limit=limit,
        score_threshold=threshold,
    )

    return [
        SimilarNote(
            file_path=r.payload["file_path"],
            title=r.payload["title"],
            score=r.score,
        )
        for r in results
    ]
```

The 0.70 Threshold
We use 0.70 as our similarity threshold. This was tuned empirically:
| Threshold | Results |
|---|---|
| 0.90+ | Almost no matches (too strict) |
| 0.80-0.90 | Very high quality, few matches |
| 0.70-0.80 | Good balance of quality and quantity |
| 0.60-0.70 | More matches, some noise |
| Below 0.60 | Too many irrelevant matches |
At 0.70, notes that match are genuinely related. You won’t get a note about Docker linked to a note about philosophy.
Figure 3 - Threshold tuning visualization showing the trade-off between precision and recall at different similarity thresholds
Phase 3: Bidirectional Linking
Here’s where things get interesting. When we link Note A to Note B, we also link Note B back to Note A:
```python
def update_backlinks(source_path: str, similar_notes: list[SimilarNote]):
    """Add bidirectional links between notes."""
    source_name = Path(source_path).stem

    for similar in similar_notes:
        target_path = similar.file_path
        target_name = Path(target_path).stem

        # Add link from source to target
        add_related_note(source_path, target_name)

        # Add link from target back to source (bidirectional)
        add_related_note(target_path, source_name)
```

The add_related_note function finds or creates a “Related Notes” section:
```python
def add_related_note(file_path: str, link_name: str):
    """Add a wiki link to the Related Notes section."""
    content = read_file(file_path)

    # Check if link already exists
    if f"[[{link_name}]]" in content:
        return  # Already linked

    # Find or create Related Notes section
    if "## Related Notes" in content:
        content = insert_after_header(content, "## Related Notes", f"- [[{link_name}]]")
    else:
        content += f"\n\n## Related Notes\n\n- [[{link_name}]]\n"

    write_file(file_path, content)
```

The result in markdown:
```markdown
## Related Notes

- [[Building RAG Systems]]
- [[Vector Database Comparison]]
- [[LangChain Deep Dive]]
```

These are standard Obsidian wiki links. Click one, and you navigate directly to the related note.
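The `insert_after_header` helper is referenced but never shown. Here is a minimal sketch of one reasonable behavior (an assumption on our part): add the bullet just below the named header, skipping any blank lines, and create the section if the header is missing:

```python
def insert_after_header(content: str, header: str, line: str) -> str:
    """Insert a line immediately below the given markdown header."""
    lines = content.splitlines()
    for i, existing in enumerate(lines):
        if existing.strip() == header:
            # Skip blank lines directly under the header
            j = i + 1
            while j < len(lines) and not lines[j].strip():
                j += 1
            lines.insert(j, line)
            return "\n".join(lines)
    # Header not found: append a new section
    return content + f"\n\n{header}\n\n{line}\n"
```

Appending at the end of the section instead of the top would work equally well; the only requirement is that the link lands inside the section.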
Self-Reference Detection
A subtle but important edge case: notes shouldn’t link to themselves.
This sounds obvious, but it’s tricky in practice. When processing a YouTube video called “Building RAG Systems”, the output file might be named Building-RAG-Systems.md or Building_RAG_Systems.md or Building RAG Systems.md (depending on sanitization rules).
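To make that concrete, here is a hypothetical `sanitize_filename` (the article doesn’t show its actual rules) under which all three filename variants collapse to the same string:

```python
import re

def sanitize_filename(name: str) -> str:
    """Normalize a title the way filenames commonly get sanitized:
    lowercase, with runs of non-alphanumerics collapsed to single hyphens."""
    name = name.lower()
    name = re.sub(r"[^a-z0-9]+", "-", name)
    return name.strip("-")
```

Under these rules, "Building RAG Systems", "Building-RAG-Systems", and "Building_RAG_Systems" all normalize to the same value, which is exactly why self-reference checks have to compare sanitized forms.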
If we search for similar notes, the note itself might come back as the #1 match!
We handle this with multiple checks:
```python
def is_self_reference(source_title: str, link_name: str) -> bool:
    """Check if a link would be a self-reference."""
    # Exact match
    if link_name == source_title:
        return True

    # Sanitized match (special chars removed)
    sanitized_source = sanitize_filename(source_title)
    sanitized_link = sanitize_filename(link_name)
    if sanitized_link == sanitized_source:
        return True

    # Truncated match (filenames might be shortened)
    if sanitized_link[:50] == sanitized_source[:50]:
        return True

    return False
```

Batch Linking
For the initial vault setup, we ran a batch script to add links to all existing notes:
```python
def batch_add_related_notes(vault_path: str, threshold: float = 0.70):
    """Add related notes to all files in vault."""
    files = list(Path(vault_path).glob("**/*.md"))
    total_links_added = 0

    for file_path in files:
        embedding = get_or_create_embedding(file_path)

        similar = search_similar(
            embedding=embedding,
            exclude_path=str(file_path),
            threshold=threshold,
        )

        similar = [
            s for s in similar
            if not is_self_reference(get_title(file_path), s.title)
        ]

        for note in similar:
            add_related_note(str(file_path), Path(note.file_path).stem)
            total_links_added += 1

    return total_links_added
```

Results for our vault:
| Metric | Value |
|---|---|
| Files processed | 1,024 |
| Links added | 2,757 |
| Average links per note | 2.7 |
| Processing time | ~15 minutes |
Auto-Linking on Save
New notes get linked automatically when saved:
```python
@app.post("/api/save-note")
async def save_note(note: NoteInput):
    """Save note and auto-link to similar content."""
    save_note_to_disk(note)

    # Build the embedding text from the note's metadata and body
    embedding = generate_embedding(build_embedding_text(note.metadata, note.body))
    index_note(note.file_path, embedding, note.metadata)

    similar = search_similar(embedding, exclude_path=note.file_path)
    update_backlinks(note.file_path, similar)

    return {"status": "saved", "related_notes": len(similar)}
```

Figure 4 - Auto-linking workflow showing the save flow: user saves note → index in Qdrant → find similar → add bidirectional links → update Related Notes section
The Knowledge Graph in Action
Some examples from our vault:
| Note | Auto-Linked To |
|---|---|
| “RAG Architecture Deep Dive” | “Vector Database Comparison”, “LangChain Retrieval”, “Embedding Models” |
| “Trading Psychology” | “Risk Management”, “Journal Review Process”, “Emotional Discipline” |
| “Docker Compose Patterns” | “Container Orchestration”, “Development Environment”, “PostgreSQL Setup” |
The system finds connections humans would miss. It links across topic boundaries - a note about “Testing AI Agents” might link to “Integration Testing Best Practices” even though they’re in different folders.
Why Qdrant?
We chose Qdrant for several reasons:
- Self-hosted - Runs on our Proxmox server, no cloud dependency
- Fast - Sub-millisecond searches even with thousands of vectors
- Purpose-built - Designed specifically for vector similarity search
- Simple API - Python client is clean and well-documented
- Payload storage - Can store metadata alongside vectors
- Filtering - Exclude specific documents, filter by tags, etc.
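One setup step the article doesn’t show: the collection has to exist, with the right dimensionality and distance metric, before any upserts. A sketch of that one-time setup, reusing the collection name and host from the article (the rest are standard qdrant-client calls):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

qdrant = QdrantClient(host="192.168.1.110", port=6333)

# One-time setup: 1536 dimensions to match text-embedding-3-small,
# cosine distance to match the similarity scores used throughout.
qdrant.create_collection(
    collection_name="obsidian_vault_notes",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)
```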
The self-hosted aspect matters. Our notes contain personal and business information. Keeping the vector database on-premise means the data never leaves our network.
Key Takeaways
- Embeddings make similarity searchable. Convert text to vectors, and “related content” becomes a mathematical operation.
- Build embedding text carefully. Title + description + tags + preview works better than raw body text.
- 0.70 is a good threshold. High enough to avoid noise, low enough to find meaningful connections.
- Bidirectional links matter. If A relates to B, B relates to A. Link both directions.
- Watch for self-references. Sanitization and truncation can make self-detection tricky.
- Dual storage works well. Vectors in Qdrant, metadata in PostgreSQL. Use the right tool for each job.
- Auto-link on save. Don’t make users manually trigger linking. Make it automatic.
What’s Next
The note similarity system sets the stage for something bigger: a RAG chatbot over your vault.
We have the indexed notes. We have the embeddings. We have the search infrastructure. The next step is adding a chat interface that:
- Takes a user question
- Retrieves relevant note chunks
- Generates an answer with source attribution
That’s our RAG chatbot - the final piece of the knowledge management puzzle.
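The chatbot deserves its own article, but the retrieval-to-prompt step can already be sketched with the pieces built here. This hypothetical `build_rag_prompt` (our naming, not from the system) shows how retrieved note chunks would be assembled with source attribution before being sent to a chat model:

```python
def build_rag_prompt(question: str, notes: list[tuple[str, str]]) -> str:
    """Assemble retrieved (title, chunk) pairs into a grounded prompt.

    Labeling each chunk with its source title lets the model cite
    which notes its answer came from.
    """
    context = "\n\n".join(f"[Source: {title}]\n{chunk}" for title, chunk in notes)
    return (
        "Answer the question using only the notes below. "
        "Cite the source titles you used.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
```

The chunks themselves would come from `search_similar` over the existing Qdrant index, with the question embedded the same way notes are.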
Related Articles
- From YouTube to Knowledge Graph - System overview
- Anthropic Batch API in Production - Cost-effective processing
- Obsidian Vault Curation at Scale - Cleaning up the vault first
- Building a RAG Chatbot for Your Obsidian Vault - What comes next
This article is part of our series on building AI-powered knowledge management tools. Written with assistance from Claude Code.