The Ontological Gap: Why Error Type Matters More Than Error Count in AI Code Generation

Florinel Chis · March 2026


Code generation evaluation obsesses over how often models fail — pass@k, syntax validity, test pass rates. But there's a dimension nobody measures: how they fail.

We found that shifting from natural language prompts to structured JSON specifications barely reduced our error count (5 → 4). But it fundamentally changed the error type — from semantic hallucinations that require runtime debugging to mechanical gaps caught by a compiler in under 1 millisecond.

That shift matters more than the count.

The Setup

We fine-tuned Qwen2.5-Coder-7B-Instruct (4-bit quantized) with LoRA on an Apple M2 Pro with 16GB RAM to generate Laravel 13.x PHP code. Two pipelines, same model, same 3-app benchmark (26 PHP files):

                      Natural Language → PHP    BuildSpec JSON → PHP
Training examples     308                       49
PHP syntax valid      26/26                     26/26
Manual fixes needed   5                         4
Error type            Semantic hallucination    Mechanical/spec gap

The numbers look similar. The errors are completely different.

What Went Wrong with Natural Language

With NL prompts like "Create a Post model with an author relationship and soft deletes", the model produced 5 bugs:

  1. Invented a closure-based eager loading pattern in EventController that doesn't exist in the codebase
  2. Dropped a BelongsTo relationship on Book model despite being explicitly asked for it
  3. Used ->load(['user']) on a model with no user relationship — hallucinated a relationship from pretraining
  4. Generated ->withHttpStatus() — a method that doesn't exist in Laravel
  5. Omitted the JsonResource import in SubscriberResource

Every one of these is a semantic hallucination: the model generated something that doesn't match the developer's intent, and the only way to catch it is to run the code and debug the failure.

What Went Wrong with BuildSpec

With structured specs like this:

{
  "artifact": "model",
  "class": "Book",
  "table": "books",
  "has_factory": true,
  "fillable": ["title", "isbn", "year", "author_id"],
  "relationships": [
    {"type": "BelongsTo", "model": "Author", "method": "author"}
  ]
}

The model produced 4 bugs:

  1. Used string-based unique:books,isbn,... instead of Rule::unique()->ignore() — wrong PHP pattern for a correct concept
  2. Excluded author_id from Book::create() unnecessarily — wrong code pattern
  3. Migration spec said published_year, model spec said year — our spec was inconsistent
  4. Factory included max_attendees but migration didn't have that column — our test harness was wrong

Zero semantic hallucinations. The model never invented a relationship, never used a nonexistent method, never hallucinated a pattern. Bugs 1-2 are wrong code patterns for correct concepts. Bugs 3-4 are our own spec inconsistencies.

Why? The Ontological Gap

Here's our hypothesis:

Semantic hallucinations are caused by ontological misalignment. The model has its own implicit "ontology" — a domain model learned from pretraining on millions of PHP files. When prompted with natural language, gaps in the prompt are filled from this implicit ontology. Where it diverges from the developer's intent, hallucinations occur.

The model's pretraining ontology says:

  - "Models usually have a user() relationship"
  - "Validation includes 'optional'" (not a real Laravel rule)
  - "Controllers use closure-based eager loading"

The developer's ontology says:

  - "Posts belong to Authors, not Users"
  - "Validation uses 'nullable'"
  - "Simple array-based eager loading suffices"

BuildSpec closes this gap. When you write "relationships": [{"type": "BelongsTo", "model": "Author"}], there's no room for the model to substitute its own prior about what relationships a Post "should" have.
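
To make that concrete, here is a minimal sketch of how a spec can be expanded into a fully explicit prompt. It is illustrative only (the real pipeline_spec.py internals are not shown in this post), but it captures the idea: every field and relationship is spelled out, so nothing is left for the model's prior to fill in.

# Illustrative sketch only; the real pipeline_spec.py is not reproduced here.
def spec_to_prompt(spec: dict) -> str:
    """Render a BuildSpec dict as a fully explicit generation instruction.

    Every field and relationship is stated outright, so the model never
    has to fill a gap from its pretraining prior.
    """
    lines = [
        f"Generate a Laravel {spec['artifact']} class {spec['class']} "
        f"for table {spec['table']}.",
        "Fillable fields: " + ", ".join(spec["fillable"]) + ".",
    ]
    if spec.get("has_factory"):
        lines.append("Include the HasFactory trait.")
    for rel in spec.get("relationships", []):
        lines.append(
            f"Define a {rel['type']} relationship {rel['method']}() "
            f"to {rel['model']}. Do not add any other relationships."
        )
    return "\n".join(lines)


book_spec = {
    "artifact": "model",
    "class": "Book",
    "table": "books",
    "has_factory": True,
    "fillable": ["title", "isbn", "year", "author_id"],
    "relationships": [
        {"type": "BelongsTo", "model": "Author", "method": "author"}
    ],
}

print(spec_to_prompt(book_spec))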

The Spec Compiler: An Ontological Reasoner

Before any code is generated, the spec compiler validates every spec in <1ms:

$ python3 spec_compiler.py event_request.json

SpecCompileError: rules['venue_id'] contains conditional token
'required_on_post'. Use 'conditional_rules' dict instead.
Example: {"conditional_rules": {"venue_id": {"POST": ["required"],
"PUT": ["sometimes"]}}}

The compiler catches ontological violations — wrong field names, missing required properties, invalid constraint expressions — before the model ever sees the spec. Generation takes ~30 seconds per file. Validation takes <1ms. Validate aggressively; generate only from validated specs.
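
For a feel of what those checks look like, here is a minimal validator sketch in the same spirit. The specific rule set below is assumed for illustration; the real spec_compiler.py applies its own, larger set of checks.

# Minimal sketch of a BuildSpec validator; the rule set here is
# illustrative, not the full set used by spec_compiler.py.
import json
import sys

REQUIRED_KEYS = {"artifact", "class", "table", "fillable"}
RELATIONSHIP_TYPES = {"BelongsTo", "HasMany", "HasOne", "BelongsToMany"}
CONDITIONAL_TOKENS = {"required_on_post", "required_on_put"}


class SpecCompileError(Exception):
    pass


def compile_spec(spec: dict) -> dict:
    """Reject structurally invalid specs before the model ever sees them."""
    missing = REQUIRED_KEYS - spec.keys()
    if missing:
        raise SpecCompileError(f"missing required properties: {sorted(missing)}")

    for rel in spec.get("relationships", []):
        if rel.get("type") not in RELATIONSHIP_TYPES:
            raise SpecCompileError(f"unknown relationship type {rel.get('type')!r}")

    for field, rules in spec.get("rules", {}).items():
        bad = CONDITIONAL_TOKENS.intersection(rules)
        if bad:
            raise SpecCompileError(
                f"rules[{field!r}] contains conditional token {bad.pop()!r}. "
                "Use 'conditional_rules' dict instead."
            )
    return spec  # structurally valid; safe to hand to the model


if __name__ == "__main__":
    with open(sys.argv[1]) as f:
        compile_spec(json.load(f))
    print("spec OK")

Every check here is a plain dictionary or set lookup, which is why validating a small spec stays comfortably under a millisecond.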

Data Efficiency: 49 vs 308 Examples

The spec pipeline needed roughly 6x fewer training examples for equivalent results. Why?

With natural language, the model must learn two things:

  1. What to generate (the domain ontology — which entities, relationships, rules)
  2. How to generate it (the code mapping — PHP syntax, Laravel patterns)

With BuildSpec, the ontology is given. The model only learns the mapping. Half the learning problem is eliminated by making the input explicit.
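
As a rough illustration of what that means for the training data, compare what a single example has to teach in each pipeline. The pair format and field names below are assumed for illustration, not taken from the actual fine-tuning dataset.

import json

# Hypothetical training pairs; format and field names are assumed.

nl_example = {
    # From this one sentence the model must infer the domain ontology
    # (which entities and relationships exist) AND the Laravel mapping.
    "prompt": "Create a Book model with an author relationship and a factory",
    "completion": "<?php ... full Book model class ...",
}

spec_example = {
    # The ontology is already explicit in the input; the only thing left
    # to learn is the mapping from spec fields to Laravel code patterns.
    "prompt": json.dumps({
        "artifact": "model",
        "class": "Book",
        "table": "books",
        "has_factory": True,
        "fillable": ["title", "isbn", "year", "author_id"],
        "relationships": [
            {"type": "BelongsTo", "model": "Author", "method": "author"}
        ],
    }),
    "completion": "<?php ... full Book model class ...",
}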

The Takeaway

If you're building domain-specific code generation:

  1. Measure error type, not just error count. 4 mechanical bugs are better than 4 semantic hallucinations.
  2. Make the developer's ontology explicit. Structured specs remove the model's ability to hallucinate about what to generate.
  3. Validate inputs, not just outputs. A spec compiler catches errors 30,000x faster than generating code and running tests.
  4. You need fewer examples than you think. Structured input is more data-efficient because the model isn't learning domain concepts — just code patterns.

Try It

pip install mlx-lm
python3 pipeline_spec.py "Create a REST API for managing blog posts with tags"

Runs entirely on Apple Silicon. No cloud GPU needed.


This post summarizes findings from: "The Ontological Gap: Why Error Type Matters More Than Error Count in AI Code Generation" (Chis, 2026).