Obsidian Vault Curation at Scale: How We Transformed 1,000+ Notes in Under an Hour

Everyone loves the idea of a personal knowledge base. The reality? After three years of religiously dumping notes into Obsidian, we had accumulated over 1,000 notes and 1,280 tags. What started as an organized system had devolved into something that would make Marie Kondo weep:

#ai
#AI
#artificialIntelligence
#agentic-systems
#agenticFrameworks
#3498db ← A hex color somehow became a tag
#5year ← What does this even mean?

The inconsistencies ran deeper than just tags. Our YAML frontmatter was a mess:

---
Title: Some Note # Should be lowercase
Created: 2024-12-27T07:02:00 # ISO format breaks some tools
tags:
- ai
- AI # Duplicate!
---

Some older notes used Obsidian’s legacy inline property format. And about 80 notes had no frontmatter at all.

The manual fix would have taken weeks. We needed automation.


Why Batch API?#

When you’re processing 1,000+ items, API costs add up fast. Anthropic’s Batch API offers a compelling alternative:

| Feature | Standard API | Batch API |
|---|---|---|
| Cost | Full price | 50% discount |
| Rate limits | Per-minute limits | Up to 100,000 requests |
| Processing | Immediate | Within 24 hours |
| Use case | Interactive | Background processing |

For our vault, the math was simple:

  • Standard API: ~$3.00 estimated cost
  • Batch API: ~$1.50 estimated cost

Since we didn’t need immediate results, the 24-hour processing window was perfect for an overnight job. We kicked it off before dinner and had results by morning coffee.
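For a rough sense of where those numbers come from, here is a back-of-envelope sketch. The per-file token counts and per-million-token rates below are illustrative assumptions, not quoted pricing; the one firm fact is the flat 50% batch discount.

```python
def estimate_cost(n_files: int, in_tokens: int, out_tokens: int,
                  in_rate: float, out_rate: float, batch: bool = False) -> float:
    """Rates are USD per million tokens; returns total USD.
    The Batch API applies a flat 50% discount."""
    cost = n_files * (in_tokens * in_rate + out_tokens * out_rate) / 1_000_000
    return cost * 0.5 if batch else cost

# Illustrative numbers: ~1,000 notes, ~2,000 input / 400 output tokens per
# note, and ASSUMED Haiku-class rates of $0.80 / $4.00 per million tokens.
standard = estimate_cost(1028, 2000, 400, 0.80, 4.00)
batched = estimate_cost(1028, 2000, 400, 0.80, 4.00, batch=True)
print(f"standard ~ ${standard:.2f}, batch ~ ${batched:.2f}")  # ~ $3.29 vs $1.64
```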


Architecture: A Three-Stage Pipeline#

We built a Python pipeline with three distinct stages:

Figure 1 - Batch processing pipeline diagram showing three stages: PREPARE (scan vault, build prompts, output JSONL), SUBMIT & POLL (submit to API, poll for completion, receive results), APPLY (load results, apply frontmatter, update files)

Stage 1: Intelligent File Analysis#

The preparation script analyzes each markdown file and determines what needs fixing:

```python
from pathlib import Path

def needs_processing(file_path: Path, frontmatter: str) -> tuple[bool, str]:
    """Determine if a file needs tag curation."""
    if 'title:' not in frontmatter.lower():
        return (True, "missing_title")
    if 'description:' not in frontmatter.lower():
        return (True, "missing_description")
    # Check for uppercase property names
    for line in frontmatter.split('\n'):
        if ':' in line:
            key = line.split(':')[0]
            if key and key[0].isupper():
                return (True, "uppercase_property")
    # Check for ISO date format (T separator)
    if 'T' in frontmatter and 'created:' in frontmatter.lower():
        return (True, "iso_date_format")
    return (False, "tags_ok")
```
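The `frontmatter` string this check receives can be peeled off a note with a helper along these lines (a sketch; the authors' actual parsing may differ):

```python
def split_frontmatter(text: str) -> tuple[str, str]:
    """Split a note into (frontmatter, body). Notes without a leading
    '---' block return ('', body) so the pipeline can route them to
    the no-frontmatter prompt."""
    if text.startswith('---\n'):
        end = text.find('\n---', 4)
        if end != -1:
            fm = text[4:end]
            body = text[end + 4:].lstrip('\n')
            return fm, body
    return '', text

fm, body = split_frontmatter("---\ntitle: Example\n---\nContent starts here...")
# fm == "title: Example", body == "Content starts here..."
```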

This produced a sobering breakdown of issues:

| Issue Type | Count | Description |
|---|---|---|
| missing_title | ~400 | No title field |
| missing_description | ~600 | No description |
| uppercase_property | ~150 | `Title:` instead of `title:` |
| iso_date_format | ~200 | ISO timestamps |
| inline_to_yaml | ~50 | Legacy format conversion |
| no_frontmatter | ~80 | Needs frontmatter created |

Stage 2: Format-Specific Prompts#

Here’s where the real insight came: one-size-fits-all prompts produce mediocre results. Different file formats need different prompts.

**Prompt 1: Files with YAML Frontmatter**

````python
prompt = f"""You are a tag curation assistant for an Obsidian vault.

## Current Frontmatter
```yaml
{existing_frontmatter}
```

## IMPORTANT RULES
1. ALL property names MUST be lowercase
2. Date format MUST be: YYYY-MM-DD HH:MM
3. Preserve ALL emojis and special characters

## Required Fields (add if missing)
1. title: - Generate from filename
2. created: - Convert existing or use current date
3. tags: - Array with hierarchical format
4. description: - 1-2 sentence summary

Return ONLY valid YAML (without --- markers)."""
````

**Prompt 2: Files with Inline Properties**

```python
prompt = f"""This file uses OLD Obsidian inline property format.

## Current Inline Properties
Type:: #note/seedling
Tags:: #area/personal
Links:: [[2022-11-23]]

## Task
Convert to proper YAML frontmatter, preserving:
- Emoji values exactly
- Wikilinks as quoted strings
- ALL existing properties (lowercase keys)
Add: title, created, description, and curated tags.
"""
```

**Prompt 3: Files with No Frontmatter**

````python
prompt = f"""This file has NO frontmatter.

## Content Preview
```markdown
{content_preview[:8000]}
```

## Task
Create YAML frontmatter including:
1. Descriptive title: based on content
2. tags: array with 3-5 relevant hierarchical tags
3. Brief description:
4. created: date if found in content"""
````
The Batch Request Format#

Each request follows Anthropic's JSONL format:

```json
{
  "custom_id": "file_00042_My_Note_Title",
  "params": {
    "model": "claude-3-5-haiku-20241022",
    "max_tokens": 2048,
    "messages": [
      {
        "role": "user",
        "content": "You are a tag curation assistant..."
      }
    ]
  }
}
```

The `custom_id` is crucial: it's how you match results back to source files.
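That matching can be sketched as follows; the `file_{index:05d}_{name}` scheme mirrors the example above, while the sanitization details are assumptions:

```python
def make_custom_id(index: int, filename: str) -> str:
    # Embed the GLOBAL file index plus a sanitized name for debuggability.
    stem = "".join(c if c.isalnum() else "_" for c in filename)[:40]
    return f"file_{index:05d}_{stem}"

def parse_index(custom_id: str) -> int:
    # "file_00042_My_Note_Title" -> 42
    return int(custom_id.split("_")[1])

cid = make_custom_id(42, "My Note Title.md")
# cid == "file_00042_My_Note_Title_md"
assert parse_index(cid) == 42
```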


The Progressive Testing Strategy#

Never trust your first batch. We used a progressive testing approach:

| Run | Files | Purpose |
|---|---|---|
| 1 | 2 | Validate pipeline works |
| 2 | 6 | Test edge cases |
| 3 | 52 | First real folder (Trading) |
| 4 | 72 | Parallel test (AI folder) |
| 5 | 122 | Multi-batch test (100+22) |
| 6 | 782 | Full production (8 batches) |

This approach caught two critical bugs:

Bug 1: Multi-Batch Custom ID Mismatch#

When processing 122 files in two batches, the second batch’s metadata indexed files 0-21, but the API results had custom IDs file_00100 through file_00121.

The Fix: Store the global file index in metadata, not batch-local indices.
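In code, the fix amounts to numbering files once, globally, before chunking (a sketch with a hypothetical helper name):

```python
def build_batches(files: list[str], batch_size: int = 100) -> list[list[tuple[int, str]]]:
    """Assign global indices first, then chunk into batches.
    Numbering inside each chunk (the original bug) would restart at 0
    and break the custom_id -> file mapping for every batch after the first."""
    numbered = [(i, f) for i, f in enumerate(files)]  # global index
    return [numbered[i:i + batch_size] for i in range(0, len(numbered), batch_size)]

batches = build_batches([f"note_{n}.md" for n in range(122)])
# The second batch starts at global index 100, matching custom_id file_00100.
assert batches[1][0][0] == 100
```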

Bug 2: Unicode Console Output on Windows#

Filenames containing emojis crashed the console output:

UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f4da'

The Fix: Graceful ASCII fallback for console output.
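A sketch of the workaround (not the authors' exact code): wrap console writes and degrade to ASCII only when the encoder fails.

```python
def safe_print(text: str) -> None:
    """Print, falling back to ASCII with replacement characters when the
    console encoding (e.g. cp1252 on Windows) can't handle emoji."""
    try:
        print(text)
    except UnicodeEncodeError:
        print(text.encode("ascii", errors="replace").decode("ascii"))

safe_print("Processing 📚 reading-notes.md")
```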

Both bugs would have caused silent data corruption at scale. Progressive testing is essential.

Figure 2 - Progressive testing visualization showing the ramp-up from 2 files to 782 files across 6 test runs


Submitting to the Batch API#

For 782 files, this created 8 batches:

Submitted batch 1/8: msgbatch_01SdR31... (100 requests)
Submitted batch 2/8: msgbatch_01UhFNq... (100 requests)
...
Submitted batch 8/8: msgbatch_01XGJWLe... (82 requests)

All 8 batches completed in approximately 25 minutes with 100% success rate.
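The chunking behind those 8 batches can be sketched as follows. The `submit_all` helper shows the shape of the loop; the `client.messages.batches.create` call follows the anthropic Python SDK's Message Batches API, but treat the exact fields as something to verify against the SDK docs.

```python
def plan_batches(n_files: int, batch_size: int = 100) -> list[int]:
    """Request count per batch: 782 files -> seven batches of 100 plus one of 82."""
    full, rem = divmod(n_files, batch_size)
    return [batch_size] * full + ([rem] if rem else [])

def submit_all(client, batches: list[list[dict]]) -> list[str]:
    """Submit each chunk of request dicts; returns batch ids for later polling.
    `client` is assumed to be an anthropic.Anthropic instance (not constructed here)."""
    ids = []
    for i, requests in enumerate(batches, 1):
        batch = client.messages.batches.create(requests=requests)
        print(f"Submitted batch {i}/{len(batches)}: {batch.id} ({len(requests)} requests)")
        ids.append(batch.id)
    return ids

print(plan_batches(782))  # [100, 100, 100, 100, 100, 100, 100, 82]
```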


Sample Transformations#

Before: Inline Properties#

Type:: #note/seedling
Tags:: #area/personal , #area/health
Links:: [[2022-11-23]]
up:: [[Try This Before You Get Angry]]
Content starts here...

After: Clean YAML Frontmatter#

---
title: Anger Only Makes Things Worse
created: 2022-11-23 10:30
type: "#note/seedling"
links: "[[2022-11-23]]"
up: "[[Try This Before You Get Angry]]"
tags:
- "personal/health"
- "philosophy/stoicism"
description: A reflection on managing anger and its negative consequences.
---
Content starts here...

Before: Messy YAML#

---
Title: 2023 Money Ideas
Created: 2022-10-21T06:45:00
tags:
- money
- ideas
---

After: Standardized Format#

---
title: 2023 Money Making Ideas
created: 2022-10-21 06:45
tags:
- "personal/finance"
- "business/ideas"
- "business/entrepreneurship"
description: A brainstorm list of money-making ideas and business opportunities.
---

Figure 3 - Side-by-side comparison showing before/after transformation of a note with messy YAML to clean standardized format


The Tag Taxonomy#

This is where Claude Code truly shined. We handed it a CSV file with 1,280 chaotic tags, and within minutes, it had:

  1. Analyzed the entire tag list for patterns and relationships
  2. Identified logical categories (AI, trading, coding, personal, etc.)
  3. Designed a 4-level hierarchy that made semantic sense
  4. Generated a complete tag_mapping.csv with old-to-new mappings
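Applying such a mapping file is mechanical; a sketch (the `old_tag`/`new_tag` column names are assumptions about the generated CSV):

```python
import csv
import io

def load_mapping(csv_text: str) -> dict[str, str]:
    """Read tag_mapping.csv into an old-tag -> new-tag dict."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["old_tag"]: row["new_tag"] for row in reader}

def curate(tags: list[str], mapping: dict[str, str]) -> list[str]:
    """Map each tag, then de-duplicate while preserving order."""
    return list(dict.fromkeys(mapping.get(t, t) for t in tags))

mapping = load_mapping(
    "old_tag,new_tag\nai,ai\nAI,ai\nagenticFrameworks,ai/agents/frameworks\n"
)
assert curate(["ai", "AI", "agenticFrameworks"], mapping) == ["ai", "ai/agents/frameworks"]
```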

The result? A beautiful nested taxonomy of 1,040 hierarchical tags:

#ai/ # Level 1: Top-level category
#ai/agents/ # Level 2: Subcategory
#ai/agents/frameworks/ # Level 3: Specific area
#ai/agents/frameworks/crewai # Level 4: Individual item
#trading/ # Another top-level
#trading/patterns/ # Trading patterns
#trading/patterns/abcd # Specific pattern
#trading/psychology/ # Trading psychology
#trading/psychology/discipline
#infrastructure/ # DevOps & Infrastructure
#infrastructure/docker/ # Container tech
#infrastructure/docker/compose
#infrastructure/kubernetes/

This hierarchy enables powerful Obsidian searches:

  • #ai/ - All AI-related notes (200+ notes)
  • #ai/agents/ - Just agent-related notes
  • #ai/agents/frameworks/ - Notes about specific frameworks

Results Summary#

| Metric | Value |
|---|---|
| Total Files Processed | 1,028 |
| API Success Rate | 100% |
| Files Successfully Updated | 1,023 (5 were renamed during processing) |
| Tags Curated | 1,280 → 1,040 hierarchical |
| Processing Time | ~30 minutes |
| Total Cost | ~$1.50 |
| Development Time | ~4 hours |

The Magic Moment#

After the final batch completed and all files were updated, we opened Obsidian.

The tag pane, which had been an unusable mess of 1,280 unrelated tags, was now a beautifully organized hierarchy:

▼ ai (247 notes)
    ▼ agents (89 notes)
        ▼ frameworks (34 notes)
            crewai (5)
            langgraph (8)
            autogen (4)
        patterns (12 notes)
        workflows (15 notes)
    ▼ llm (67 notes)
        claude (23)
        gpt (18)
        prompting (26)
▼ trading (186 notes)
    ▼ patterns (45 notes)
        abcd (8)
        harmonics (12)
        candlesticks (15)
    ▼ psychology (28 notes)
        discipline (9)
        emotions (11)

Every document was now linked to the right tags. Clicking on #ai/agents/frameworks/ showed exactly the 34 notes about agent frameworks.

Three years of accumulated tag chaos, fixed in under an hour.

Figure 4 - Obsidian tag pane showing the organized hierarchical taxonomy with expandable nested tags and note counts


Key Takeaways#

  1. Start small, scale up. Progressive testing (2 → 6 → 52 → 782) catches bugs before they affect your full dataset.

  2. Use custom_id wisely. Include enough information to debug issues.

  3. Design format-specific prompts. One-size-fits-all prompts produce mediocre results.

  4. Handle multi-batch carefully. Track global indices, not batch-local ones.

  5. The 50% discount is real. For background processing tasks, Batch API makes large-scale LLM processing economically viable.

  6. AI-assisted development is a force multiplier. Having an AI architect the solution, write the code, and debug issues in real-time turned a multi-day project into a few hours of collaborative work.


What We Built Next#

With clean, hierarchical tags in place, this vault curation became the foundation for a larger knowledge-management system.

The Batch API turned a weeks-long manual task into a 30-minute automated job. For anyone managing large content repositories, it’s a game-changer.


This article is part of our series on building AI-powered knowledge management tools. Written with assistance from Claude Code.

Authors: Katrina Dotzlaw, Ryan Dotzlaw
Published: 2025-12-19
License: CC BY-NC-SA 4.0
Source: https://fuwari.vercel.app/articles/obsidian-vault-curation/