6 Defensive Strategies for Building Reliable LLM Applications: Lessons from pdf2anki

By Nino, Senior Tech Editor

In the world of LLM integration, there is a hard truth every developer eventually faces: LLMs are probabilistic, not deterministic. When I set out to build pdf2anki—a CLI tool designed to transform dense academic PDFs into high-quality Anki flashcards—I quickly realized that a simple API call wasn't enough. To create something truly useful, I had to build a fortress of defensive code around the Claude API.

This tutorial breaks down the six core defenses I implemented. Whether you are using n1n.ai to access Claude 3.5 Sonnet or OpenAI's latest models, these patterns will help you build more resilient AI-powered tools.

1. The "Partial Success" Parser

When you ask an LLM to return JSON, it usually complies. At scale, however, things break. Claude might wrap the output in markdown code fences (```json ... ```), add conversational filler, or hallucinate a trailing comma that breaks standard parsers.

Initially, my tool used an "all-or-nothing" approach. If a batch of 10 cards had one formatting error, the entire batch failed, and I had to retry the whole request. This was a waste of tokens and money.

The Defense: Granular Validation

Instead of parsing the whole response, I implemented a loop that validates individual items using Pydantic. If one card fails, we log it and keep the rest.

import json
import re

from pydantic import BaseModel, ValidationError

class AnkiCard(BaseModel):
    front: str
    back: str
    tags: list[str]

def robust_parse(raw_output: str):
    # Strip markdown fences using regex
    cleaned_json = re.sub(r'^```json\s*|\s*```$', '', raw_output.strip(), flags=re.MULTILINE)
    data = json.loads(cleaned_json)

    valid_cards = []
    for i, item in enumerate(data):
        try:
            # Validate individual object
            card = AnkiCard.model_validate(item)
            valid_cards.append(card)
        except (ValidationError, TypeError) as e:
            print(f"Skipping invalid card at index {i}: {e}")
    return valid_cards

Pro Tip: Always assume the LLM will fail at some point. By using n1n.ai, you can easily switch between models like Claude 3.5 Sonnet and Haiku to test which one follows your JSON schema more strictly under load.

2. Heuristic Quality Filtering (The Cost-Quality Tradeoff)

LLMs sometimes generate "lazy" content. For flashcards, this means cards that are too long, too vague, or contain multiple concepts. Asking the LLM to "self-critique" every card is an option, but it doubles your API costs.

The Defense: Code-Based Scoring

I built a multi-layer filtering system. Layer 1 is pure Python—no LLM involved. It scores cards based on heuristics:

Axis         Weight   Logic
Atomicity    25%      Penalize if the 'back' has > 3 sentences or uses conjunctions like 'furthermore'.
Length       25%      Ideal length is 10–200 characters.
Formatting   25%      Check for question marks or specific keywords.
Uniqueness   25%      Use Jaccard similarity to detect duplicate cards in a batch.

Only cards that score below a 0.90 threshold are sent back to the LLM for a "critique and rewrite" step. This reduced my API usage by nearly 60%.
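To make the idea concrete, here is a minimal sketch of what such a Layer-1 scorer could look like. The weights and thresholds mirror the table above; `score_card`, `jaccard`, and the exact cutoffs are illustrative assumptions, not pdf2anki's actual API:

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def score_card(front: str, back: str, batch_fronts: list[str]) -> float:
    """Score a card on four equally weighted axes, returning 0.0-1.0."""
    # Atomicity: penalize backs with > 3 sentences or 'furthermore'
    sentences = [s for s in back.split('.') if s.strip()]
    atomicity = 1.0 if len(sentences) <= 3 and 'furthermore' not in back.lower() else 0.5

    # Length: ideal range is 10-200 characters
    length = 1.0 if 10 <= len(back) <= 200 else 0.5

    # Formatting: a front should read like a question
    formatting = 1.0 if front.rstrip().endswith('?') else 0.5

    # Uniqueness: near-duplicate fronts in the batch drag the score to zero
    dupes = [f for f in batch_fronts if f != front and jaccard(front, f) > 0.8]
    uniqueness = 1.0 if not dupes else 0.0

    return 0.25 * (atomicity + length + formatting + uniqueness)
```

Anything below the 0.90 threshold would then be routed to the LLM's critique-and-rewrite step; everything else passes through for free.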

3. Financial Guardrails: Cost Transparency

API costs can spiral out of control, especially when processing 500-page PDFs. If your tool doesn't show the user the price before execution, it's a liability.

The Defense: Pre-flight Estimates and Hard Caps

I implemented a CostTracker that calculates estimated costs based on character counts before any API call is made.

from dataclasses import dataclass

@dataclass  # not frozen: current_spend must be updated after each call
class CostTracker:
    budget_limit: float = 1.00
    current_spend: float = 0.0

    def check_budget(self, additional_cost: float) -> None:
        # Fail fast *before* the API call is dispatched
        if self.current_spend + additional_cost > self.budget_limit:
            raise RuntimeError("Budget exceeded! Stopping process.")

    def record_spend(self, actual_cost: float) -> None:
        self.current_spend += actual_cost

By leveraging the high-speed infrastructure of n1n.ai, developers can get consistent latency for these checks, ensuring that the budget logic doesn't slow down the user experience.
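The pre-flight estimate itself can be a simple characters-to-tokens heuristic multiplied by a price sheet. A rough sketch, where the ~4 characters-per-token ratio and the per-million-token prices are illustrative assumptions (check your provider's actual pricing):

```python
# Illustrative per-million-token prices; substitute your provider's price sheet.
PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}

def estimate_cost(prompt_chars: int, expected_output_chars: int) -> float:
    """Rough pre-flight cost estimate using the common ~4 chars/token heuristic."""
    input_tokens = prompt_chars / 4
    output_tokens = expected_output_chars / 4
    return (input_tokens * PRICE_PER_MTOK["input"]
            + output_tokens * PRICE_PER_MTOK["output"]) / 1_000_000
```

The flow is then: compute the estimate, call the tracker's budget check with it, and only dispatch the API request if no exception is raised.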

4. Semantic Document Splitting

Feeding a whole PDF into an LLM is a recipe for context loss. However, splitting a PDF every 2000 characters is equally bad because it might cut a sentence or a chapter in half.

The Defense: Markdown Heading Breadcrumbs

I switched to a heading-aware splitting strategy. The tool tracks the current # H1, ## H2, and ### H3. Every chunk of text sent to the LLM is prepended with a "breadcrumb" string: Context: Chapter 2 > Section 2.1 > Topic A.

This ensures that even if a chunk is small, the LLM knows exactly where it sits in the document's hierarchy, leading to much more accurate flashcards.
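A stripped-down version of the breadcrumb tracker might look like this, assuming the PDF text has already been converted to markdown-style headings (the function name and chunking rule are illustrative, not pdf2anki's exact implementation):

```python
def chunk_with_breadcrumbs(markdown: str, max_chars: int = 2000) -> list[str]:
    """Split markdown text into chunks, prefixing each with its heading trail."""
    trail = {1: "", 2: "", 3: ""}   # current H1 / H2 / H3
    chunks: list[str] = []
    buffer: list[str] = []

    def flush() -> None:
        if buffer:
            crumbs = " > ".join(t for t in (trail[1], trail[2], trail[3]) if t)
            chunks.append(f"Context: {crumbs}\n" + "\n".join(buffer))
            buffer.clear()

    for line in markdown.splitlines():
        stripped = line.lstrip("#")
        level = len(line) - len(stripped)
        if line.startswith("#") and 1 <= level <= 3:
            flush()  # a new heading closes the current chunk
            trail[level] = stripped.strip()
            for deeper in range(level + 1, 4):
                trail[deeper] = ""  # a new heading invalidates deeper levels
        else:
            buffer.append(line)
            if sum(len(l) for l in buffer) >= max_chars:
                flush()
    flush()
    return chunks
```

Each chunk now carries its own `Context:` line, so the LLM never sees body text detached from its chapter and section.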

5. Pragmatic Vision Integration

Vision models (like Claude 3.5 Sonnet) are powerful but expensive. Converting every page of a PDF to an image can increase costs by 7x compared to text extraction.

The Defense: The 20% Coverage Rule

My tool uses pymupdf to analyze the page layout first.

  1. If a page contains < 20% image area, we only extract text.
  2. If it exceeds 20%, we render the page at 150 DPI (a sweet spot for legibility vs. token cost) and send it to the Vision API.
  3. We limit the system to a maximum of 5 images per page to prevent token bloat.
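The decision in step 2 boils down to comparing summed image areas against the page area. Here is a sketch of that core check, kept as a pure function for testability; in practice the rectangles would come from pymupdf's layout analysis, and note that overlapping images are double-counted in this simplified version:

```python
def needs_vision(image_rects: list[tuple[float, float, float, float]],
                 page_width: float, page_height: float,
                 threshold: float = 0.20) -> bool:
    """Return True if images cover more than `threshold` of the page area.

    Each rect is (x0, y0, x1, y1) in page coordinates.
    """
    page_area = page_width * page_height
    if page_area <= 0:
        return False
    image_area = sum((x1 - x0) * (y1 - y0) for x0, y0, x1, y1 in image_rects)
    return image_area / page_area > threshold
```

Pages that fail this check stay on the cheap text-extraction path; only the image-heavy minority pays the Vision API premium.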

6. Automated Evals: Measuring Prompt Changes

If you change your prompt, how do you know it's actually better? "Vibes" are not a metric.

The Defense: Keyword-Based Eval Framework

I created a YAML-based dataset of "Gold Standard" examples. When I update a prompt, the tool runs an automated evaluation that compares the LLM output against the expected keywords using Recall and Precision metrics.

- id: 'concept-01'
  text: 'Photosynthesis is the process by which plants use sunlight to synthesize nutrients...'
  expected_keywords: ['sunlight', 'nutrients', 'chlorophyll']

Even a simple keyword match is enough to detect if a prompt change caused a regression in quality.
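The scoring itself is a few lines of set arithmetic. A sketch of how recall and precision could be computed per example (illustrative, not pdf2anki's exact implementation; "precision" here is deliberately crude, penalizing verbose output by dividing keyword hits by total output words):

```python
def keyword_eval(output: str, expected_keywords: list[str]) -> dict[str, float]:
    """Keyword-level recall and precision for one eval example."""
    out_words = set(output.lower().split())
    expected = {k.lower() for k in expected_keywords}
    hits = expected & out_words
    # Recall: fraction of expected keywords present in the output
    recall = len(hits) / len(expected) if expected else 1.0
    # Precision (crude): fraction of output words that are expected keywords
    precision = len(hits) / len(out_words) if out_words else 0.0
    return {"recall": recall, "precision": precision}
```

Run this over every entry in the YAML dataset before and after a prompt change; a drop in average recall is a regression you caught for free.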

Conclusion

Building pdf2anki taught me that the LLM is just the engine; the surrounding code is the chassis, brakes, and dashboard. By implementing validation, heuristic filtering, and semantic splitting, you can turn a fragile script into a robust professional tool.

Ready to build your own resilient LLM applications? Get a free API key at n1n.ai.