Obsidian Vault Curation at Scale: How We Transformed 1,000+ Notes in Under an Hour

Everyone loves the idea of a personal knowledge base. The reality? After three years of religiously dumping notes into Obsidian, we had accumulated over 1,000 notes and 1,280 tags. What started as an organized system had devolved into something that would make Marie Kondo weep:

#ai
#AI
#artificialIntelligence
#agentic-systems
#agenticFrameworks
#3498db ← A hex color somehow became a tag
#5year ← What does this even mean?

The inconsistencies ran deeper than just tags. Our YAML frontmatter was a mess:

---
Title: Some Note # Should be lowercase
Created: 2024-12-27T07:02:00 # ISO format breaks some tools
tags:
- ai
- AI # Duplicate!
---

Some older notes used Obsidian’s legacy inline property format. And about 80 notes had no frontmatter at all.

The manual fix would have taken weeks. We needed automation.


Why Batch API?#

When you’re processing 1,000+ items, API costs add up fast. Anthropic’s Batch API offers a compelling alternative:

| Feature | Standard API | Batch API |
|---|---|---|
| Cost | Full price | 50% discount |
| Rate limits | Per-minute limits | Up to 100,000 requests |
| Processing | Immediate | Within 24 hours |
| Use case | Interactive | Background processing |

For our vault, the math was simple:

  • Standard API: ~$3.00 estimated cost
  • Batch API: ~$1.50 estimated cost

Since we didn’t need immediate results, the 24-hour processing window was perfect for an overnight job. We kicked it off before dinner and had results by morning coffee.
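For a rough sense of where those numbers come from, here is a back-of-envelope sketch. The per-file token counts and per-million-token rates below are illustrative assumptions, not quoted pricing; the one firm fact is the flat 50% batch discount.

```python
def estimate_cost(n_files: int, in_tokens: int, out_tokens: int,
                  in_rate: float, out_rate: float, batch: bool = False) -> float:
    """Rates are USD per million tokens; returns total USD.
    The Batch API applies a flat 50% discount."""
    cost = n_files * (in_tokens * in_rate + out_tokens * out_rate) / 1_000_000
    return cost * 0.5 if batch else cost

# Illustrative numbers: ~1,000 notes, ~2,000 input / 400 output tokens per
# note, and ASSUMED Haiku-class rates of $0.80 / $4.00 per million tokens.
standard = estimate_cost(1028, 2000, 400, 0.80, 4.00)
batched = estimate_cost(1028, 2000, 400, 0.80, 4.00, batch=True)
print(f"standard ~ ${standard:.2f}, batch ~ ${batched:.2f}")  # ~ $3.29 vs $1.64
```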


Architecture: A Three-Stage Pipeline#

We built a Python pipeline with three distinct stages:

Figure 1 - Batch processing pipeline diagram showing three stages: PREPARE (scan vault, build prompts, output JSONL), SUBMIT & POLL (submit to API, poll for completion, receive results), APPLY (load results, apply frontmatter, update files)

Stage 1: Intelligent File Analysis#

The preparation script analyzes each markdown file and determines what needs fixing:

```python
from pathlib import Path

def needs_processing(file_path: Path, frontmatter: str) -> tuple[bool, str]:
    """Determine if a file needs tag curation."""
    if 'title:' not in frontmatter.lower():
        return (True, "missing_title")
    if 'description:' not in frontmatter.lower():
        return (True, "missing_description")
    # Check for uppercase property names
    for line in frontmatter.split('\n'):
        if ':' in line:
            key = line.split(':')[0]
            if key and key[0].isupper():
                return (True, "uppercase_property")
    # Check for ISO date format (T separator)
    if 'T' in frontmatter and 'created:' in frontmatter.lower():
        return (True, "iso_date_format")
    return (False, "tags_ok")
```
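The `frontmatter` string this check receives can be peeled off a note with a helper along these lines (a sketch; the authors' actual parsing may differ):

```python
def split_frontmatter(text: str) -> tuple[str, str]:
    """Split a note into (frontmatter, body). Notes without a leading
    '---' block return ('', body) so the pipeline can route them to
    the no-frontmatter prompt."""
    if text.startswith('---\n'):
        end = text.find('\n---', 4)
        if end != -1:
            fm = text[4:end]
            body = text[end + 4:].lstrip('\n')
            return fm, body
    return '', text

fm, body = split_frontmatter("---\ntitle: Example\n---\nContent starts here...")
# fm == "title: Example", body == "Content starts here..."
```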

This produced a sobering breakdown of issues:

| Issue Type | Count | Description |
|---|---|---|
| missing_title | ~400 | No title field |
| missing_description | ~600 | No description |
| uppercase_property | ~150 | `Title:` instead of `title:` |
| iso_date_format | ~200 | ISO timestamps |
| inline_to_yaml | ~50 | Legacy format conversion |
| no_frontmatter | ~80 | Needs frontmatter created |

Stage 2: Format-Specific Prompts#

Here’s where the real insight came: one-size-fits-all prompts produce mediocre results. Different file formats need different prompts.

**Prompt 1: Files with YAML Frontmatter**

````python
prompt = f"""You are a tag curation assistant for an Obsidian vault.

## Current Frontmatter
```yaml
{existing_frontmatter}
```

## IMPORTANT RULES
1. ALL property names MUST be lowercase
2. Date format MUST be: YYYY-MM-DD HH:MM
3. Preserve ALL emojis and special characters

## Required Fields (add if missing)
1. title: - Generate from filename
2. created: - Convert existing or use current date
3. tags: - Array with hierarchical format
4. description: - 1-2 sentence summary

Return ONLY valid YAML (without --- markers)."""
````

**Prompt 2: Files with Inline Properties**

```python
prompt = f"""This file uses OLD Obsidian inline property format.

## Current Inline Properties
Type:: #note/seedling
Tags:: #area/personal
Links:: [[2022-11-23]]

## Task
Convert to proper YAML frontmatter, preserving:
- Emoji values exactly
- Wikilinks as quoted strings
- ALL existing properties (lowercase keys)
Add: title, created, description, and curated tags.
"""
```

**Prompt 3: Files with No Frontmatter**

````python
prompt = f"""This file has NO frontmatter.

## Content Preview
```markdown
{content_preview[:8000]}
```

## Task
Create YAML frontmatter including:
1. Descriptive title: based on content
2. tags: array with 3-5 relevant hierarchical tags
3. Brief description:
4. created: date if found in content"""
````
The Batch Request Format#

Each request follows Anthropic's JSONL format:

```json
{
  "custom_id": "file_00042_My_Note_Title",
  "params": {
    "model": "claude-3-5-haiku-20241022",
    "max_tokens": 2048,
    "messages": [
      {
        "role": "user",
        "content": "You are a tag curation assistant..."
      }
    ]
  }
}
```

The `custom_id` is crucial: it's how you match results back to source files.
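That matching can be sketched as follows; the `file_{index:05d}_{name}` scheme mirrors the example above, while the sanitization details are assumptions:

```python
def make_custom_id(index: int, filename: str) -> str:
    # Embed the GLOBAL file index plus a sanitized name for debuggability.
    stem = "".join(c if c.isalnum() else "_" for c in filename)[:40]
    return f"file_{index:05d}_{stem}"

def parse_index(custom_id: str) -> int:
    # "file_00042_My_Note_Title" -> 42
    return int(custom_id.split("_")[1])

cid = make_custom_id(42, "My Note Title.md")
# cid == "file_00042_My_Note_Title_md"
assert parse_index(cid) == 42
```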


The Progressive Testing Strategy#

Never trust your first batch. We used a progressive testing approach:

| Run | Files | Purpose |
|---|---|---|
| 1 | 2 | Validate pipeline works |
| 2 | 6 | Test edge cases |
| 3 | 52 | First real folder (Trading) |
| 4 | 72 | Parallel test (AI folder) |
| 5 | 122 | Multi-batch test (100+22) |
| 6 | 782 | Full production (8 batches) |

This approach caught two critical bugs:

Bug 1: Multi-Batch Custom ID Mismatch#

When processing 122 files in two batches, the second batch’s metadata indexed files 0-21, but the API results had custom IDs file_00100 through file_00121.

The Fix: Store the global file index in metadata, not batch-local indices.
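In code, the fix amounts to numbering files once, globally, before chunking (a sketch with a hypothetical helper name):

```python
def build_batches(files: list[str], batch_size: int = 100) -> list[list[tuple[int, str]]]:
    """Assign global indices first, then chunk into batches.
    Numbering inside each chunk (the original bug) would restart at 0
    and break the custom_id -> file mapping for every batch after the first."""
    numbered = [(i, f) for i, f in enumerate(files)]  # global index
    return [numbered[i:i + batch_size] for i in range(0, len(numbered), batch_size)]

batches = build_batches([f"note_{n}.md" for n in range(122)])
# The second batch starts at global index 100, matching custom_id file_00100.
assert batches[1][0][0] == 100
```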

Bug 2: Unicode Console Output on Windows#

Filenames containing emojis crashed the console output:

UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f4da'

The Fix: Graceful ASCII fallback for console output.
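A sketch of the workaround (not the authors' exact code): wrap console writes and degrade to ASCII only when the encoder fails.

```python
def safe_print(text: str) -> None:
    """Print, falling back to ASCII with replacement characters when the
    console encoding (e.g. cp1252 on Windows) can't handle emoji."""
    try:
        print(text)
    except UnicodeEncodeError:
        print(text.encode("ascii", errors="replace").decode("ascii"))

safe_print("Processing 📚 reading-notes.md")
```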

Both bugs would have caused silent data corruption at scale. Progressive testing is essential.

Figure 2 - Progressive testing visualization showing the ramp-up from 2 files to 782 files across 6 test runs


Submitting to the Batch API#

For 782 files, this created 8 batches:

Submitted batch 1/8: msgbatch_01SdR31... (100 requests)
Submitted batch 2/8: msgbatch_01UhFNq... (100 requests)
...
Submitted batch 8/8: msgbatch_01XGJWLe... (82 requests)

All 8 batches completed in approximately 25 minutes with 100% success rate.
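The chunking behind those 8 batches can be sketched as follows. The `submit_all` helper shows the shape of the loop; the `client.messages.batches.create` call follows the anthropic Python SDK's Message Batches API, but treat the exact fields as something to verify against the SDK docs.

```python
def plan_batches(n_files: int, batch_size: int = 100) -> list[int]:
    """Request count per batch: 782 files -> seven batches of 100 plus one of 82."""
    full, rem = divmod(n_files, batch_size)
    return [batch_size] * full + ([rem] if rem else [])

def submit_all(client, batches: list[list[dict]]) -> list[str]:
    """Submit each chunk of request dicts; returns batch ids for later polling.
    `client` is assumed to be an anthropic.Anthropic instance (not constructed here)."""
    ids = []
    for i, requests in enumerate(batches, 1):
        batch = client.messages.batches.create(requests=requests)
        print(f"Submitted batch {i}/{len(batches)}: {batch.id} ({len(requests)} requests)")
        ids.append(batch.id)
    return ids

print(plan_batches(782))  # [100, 100, 100, 100, 100, 100, 100, 82]
```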


Sample Transformations#

Before: Inline Properties#

Type:: #note/seedling
Tags:: #area/personal , #area/health
Links:: [[2022-11-23]]
up:: [[Try This Before You Get Angry]]
Content starts here...

After: Clean YAML Frontmatter#

---
title: Anger Only Makes Things Worse
created: 2022-11-23 10:30
type: "#note/seedling"
links: "[[2022-11-23]]"
up: "[[Try This Before You Get Angry]]"
tags:
- "personal/health"
- "philosophy/stoicism"
description: A reflection on managing anger and its negative consequences.
---
Content starts here...

Before: Messy YAML#

---
Title: 2023 Money Ideas
Created: 2022-10-21T06:45:00
tags:
- money
- ideas
---

After: Standardized Format#

---
title: 2023 Money Making Ideas
created: 2022-10-21 06:45
tags:
- "personal/finance"
- "business/ideas"
- "business/entrepreneurship"
description: A brainstorm list of money-making ideas and business opportunities.
---

Figure 3 - Side-by-side comparison showing before/after transformation of a note with messy YAML to clean standardized format


The Tag Taxonomy#

This is where Claude Code truly shined. We handed it a CSV file with 1,280 chaotic tags, and within minutes, it had:

  1. Analyzed the entire tag list for patterns and relationships
  2. Identified logical categories (AI, trading, coding, personal, etc.)
  3. Designed a 4-level hierarchy that made semantic sense
  4. Generated a complete tag_mapping.csv with old-to-new mappings
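Applying such a mapping file is mechanical; a sketch (the `old_tag`/`new_tag` column names are assumptions about the generated CSV):

```python
import csv
import io

def load_mapping(csv_text: str) -> dict[str, str]:
    """Read tag_mapping.csv into an old-tag -> new-tag dict."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["old_tag"]: row["new_tag"] for row in reader}

def curate(tags: list[str], mapping: dict[str, str]) -> list[str]:
    """Map each tag, then de-duplicate while preserving order."""
    return list(dict.fromkeys(mapping.get(t, t) for t in tags))

mapping = load_mapping(
    "old_tag,new_tag\nai,ai\nAI,ai\nagenticFrameworks,ai/agents/frameworks\n"
)
assert curate(["ai", "AI", "agenticFrameworks"], mapping) == ["ai", "ai/agents/frameworks"]
```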

The result? A beautiful nested taxonomy of 1,040 hierarchical tags:

#ai/ # Level 1: Top-level category
#ai/agents/ # Level 2: Subcategory
#ai/agents/frameworks/ # Level 3: Specific area
#ai/agents/frameworks/crewai # Level 4: Individual item
#trading/ # Another top-level
#trading/patterns/ # Trading patterns
#trading/patterns/abcd # Specific pattern
#trading/psychology/ # Trading psychology
#trading/psychology/discipline
#infrastructure/ # DevOps & Infrastructure
#infrastructure/docker/ # Container tech
#infrastructure/docker/compose
#infrastructure/kubernetes/

This hierarchy enables powerful Obsidian searches:

  • #ai/ - All AI-related notes (200+ notes)
  • #ai/agents/ - Just agent-related notes
  • #ai/agents/frameworks/ - Notes about specific frameworks

Results Summary#

| Metric | Value |
|---|---|
| Total Files Processed | 1,028 |
| API Success Rate | 100% |
| Files Successfully Updated | 1,023 (5 were renamed during processing) |
| Tags Curated | 1,280 → 1,040 hierarchical |
| Processing Time | ~30 minutes |
| Total Cost | ~$1.50 |
| Development Time | ~4 hours |

The Magic Moment#

After the final batch completed and all files were updated, we opened Obsidian.

The tag pane, which had been an unusable mess of 1,280 unrelated tags, was now a beautifully organized hierarchy:

▼ ai (247 notes)
    ▼ agents (89 notes)
        ▼ frameworks (34 notes)
            crewai (5)
            langgraph (8)
            autogen (4)
        patterns (12 notes)
        workflows (15 notes)
    ▼ llm (67 notes)
        claude (23)
        gpt (18)
        prompting (26)
▼ trading (186 notes)
    ▼ patterns (45 notes)
        abcd (8)
        harmonics (12)
        candlesticks (15)
    ▼ psychology (28 notes)
        discipline (9)
        emotions (11)

Every document was now linked to the right tags. Clicking on #ai/agents/frameworks/ showed exactly the 34 notes about agent frameworks.

Three years of accumulated tag chaos, fixed in under an hour.

Figure 4 - Obsidian tag pane showing the organized hierarchical taxonomy with expandable nested tags and note counts


Key Takeaways#

  1. Start small, scale up. Progressive testing (2 → 6 → 52 → 782) catches bugs before they affect your full dataset.

  2. Use custom_id wisely. Include enough information to debug issues.

  3. Design format-specific prompts. One-size-fits-all prompts produce mediocre results.

  4. Handle multi-batch carefully. Track global indices, not batch-local ones.

  5. The 50% discount is real. For background processing tasks, Batch API makes large-scale LLM processing economically viable.

  6. AI-assisted development is a force multiplier. Having an AI architect the solution, write the code, and debug issues in real-time turned a multi-day project into a few hours of collaborative work.


What We Built Next#

With clean, hierarchical tags in place, this vault curation became the foundation for a larger knowledge-management system.

The Batch API turned a weeks-long manual task into a 30-minute automated job. For anyone managing large content repositories, it’s a game-changer.


This article is part of our series on building AI-powered knowledge management tools. Written with assistance from Claude Code.

Authors: Katrina Dotzlaw, Ryan Dotzlaw
Published: 2025-12-19
License: CC BY-NC-SA 4.0
Source: https://fuwari.vercel.app/articles/obsidian-vault-curation/