Extracting Structured JSON from Incompatible LLM APIs
- Authors

- Name
- Nino
- Occupation
- Senior Tech Editor
Building production-grade applications on top of Large Language Models (LLMs) requires moving beyond simple text completion. When your system relies on the model's output to drive UI components, trigger CI/CD pipelines, or populate databases, you need a strict contract. You need structured JSON. However, the current landscape of LLM providers is a fragmented ecosystem where every vendor has a different idea of how 'structure' should be enforced.
At n1n.ai, we see developers struggling with this fragmentation daily. Whether you are using Claude 3.5 Sonnet, OpenAI o3, or DeepSeek-V3, the underlying mechanism to guarantee a specific JSON shape varies wildly. This guide explores how to build a unified pipeline that targets a single schema across five (or more) incompatible APIs, and how to handle the inevitable moments when those models ignore your instructions.
The Single Source of Truth: The Schema
Before diving into the API-specific implementations, you must define a rigid internal schema. In our code review system, every finding must adhere to a specific Go struct. This ensures that downstream consumers—like a card renderer or a JSON reporter—receive predictable data.
type Finding struct {
Severity Severity `json:"severity"` // critical, high, medium, low, info
File string `json:"file"`
Line int `json:"line"`
LineEnd int `json:"line_end,omitempty"`
Title string `json:"title"`
Description string `json:"description"`
Suggestion string `json:"suggestion"`
Language string `json:"language,omitempty"`
Snippet string `json:"snippet,omitempty"`
}
The envelope for this data is always {"findings": [ ... ]}. This fixed vocabulary is the wire contract. While a user might customize the logic of the review via a config file, the shape of the output is non-negotiable. This is where the complexity begins: forcing different LLMs to honor this shape.
The Four Dialects of Structured Output
Implementing this schema across multiple providers requires mapping it to four distinct 'dialects'.
1. Anthropic: Forced Tool Use
Anthropic’s Claude models do not have a native 'JSON Mode' in the same way OpenAI does. Instead, the most reliable way to get structured data is to define the schema as a tool and then use tool_choice to force the model to call it.
In this approach, the model isn't 'answering' a question; it is 'executing a function' where the arguments are your findings. By setting the tool as mandatory, you eliminate the preamble text (e.g., "Here are your findings:") that often breaks parsers.
2. OpenAI: Strict JSON Schema
OpenAI provides the gold standard with their Strict mode. When strict: true is set, the API validates the model's output against your schema server-side during generation. If the model attempts to output a field not in the schema, the generation is constrained.
func buildResponseFormat() sdk.ChatCompletionNewParamsResponseFormatUnion {
return sdk.ChatCompletionNewParamsResponseFormatUnion{
OfJSONSchema: &shared.ResponseFormatJSONSchemaParam{
JSONSchema: shared.ResponseFormatJSONSchemaJSONSchemaParam{
Name: "code_review_report",
Strict: sdk.Bool(true),
Schema: responseSchema,
},
},
}
}
3. Gemini: Response Schema & MIME Types
Google’s Gemini API takes a middle ground. You provide a ResponseSchema (a *genai.Schema) and set the ResponseMIMEType to application/json. This is generally robust, though it lacks some of the strict enforcement seen in OpenAI’s implementation.
4. Ollama and Others: Syntax-Only Constraints
Local providers like Ollama often support a format: "json" flag. This guarantees that the output will be valid JSON syntax, but it offers zero guarantees regarding the keys or the data types. For these models, and for others like DeepSeek or Mistral accessed via generic endpoints, the schema must be enforced via the system prompt.
The Reliability Spectrum
It is helpful to visualize these providers on a spectrum of reliability:
| Mechanism | Constraint Level | Primary Providers |
|---|---|---|
| Forced Tool / Strict Schema | Exact Shape Enforced | Anthropic, OpenAI, Gemini |
| format: "json" | Syntax Only | Ollama |
| Prompt Instruction | No API Enforcement | DeepSeek, Mistral, Cohere |
When using n1n.ai, you can toggle between these providers to find the right balance between cost and structural reliability.
Building the Parser Backstop
Since not all providers guarantee the schema, your application needs a central validation function. We call this ParseFindings. It doesn't just check if the output is JSON; it validates the business logic of the data.
If a model returns a severity of "urgent" (which is not in our Severity enum) or forgets the file field, the parser must reject it. This unified validation ensures that downstream logic—like a CI gate that fails on high severity findings—works identically regardless of whether the model was Claude 3.5 Sonnet or a local Qwen instance.
Resilience: Retry and Degrade
What happens when the model fails? A robust pipeline should never crash. We implement a "Retry-Once-then-Degrade" strategy.
- First Attempt: Request structured output.
- Validation: Run
ParseFindings. If it passes, cache and return. - Retry: If validation fails, trigger one more attempt. This second attempt often fixes hallucinations or formatting glitches.
- Degradation: If the second attempt also fails, do not return an error. Instead, treat the raw text as Markdown. The UI will render the text as a fallback, and the system logs a warning.
This approach ensures that the user always gets their code review results, even if the structured 'cards' aren't available. When using n1n.ai, this multi-model resilience becomes even more powerful, as you can switch providers on the fly if one consistently fails to meet the schema.
Pro Tip: Schema vs. Correctness
Always remember: A schema makes output parseable, not correct. A model can perfectly adhere to your JSON structure while inventing line numbers that don't exist or hallucinating bugs in the code. Structured output is the foundation of machine readability, but it is not a substitute for human judgment or rigorous automated testing.
To measure the actual quality of the findings, you should implement an evaluation harness that compares model output against a ground-truth corpus. Only then can you determine if a model is truly production-ready for your specific use case.
Get a free API key at n1n.ai