Florinel Chis · March 2026
Code generation evaluation obsesses over how often models fail — pass@k, syntax validity, test pass rates. But there's a dimension nobody measures: how they fail.
We found that shifting from natural language prompts to structured JSON specifications didn't reduce our error count (5 → 4). But it fundamentally changed the error type — from semantic hallucinations that require runtime debugging to mechanical gaps caught by a compiler in under 1 millisecond.
That shift matters more than the count.
We fine-tuned Qwen2.5-Coder-7B-Instruct (4-bit quantized) with LoRA on an Apple M2 Pro with 16GB RAM to generate Laravel 13.x PHP code. Two pipelines, same model, same 3-app benchmark (26 PHP files):
| | Natural Language → PHP | BuildSpec JSON → PHP |
|---|---|---|
| Training examples | 308 | 49 |
| PHP syntax valid | 26/26 | 26/26 |
| Manual fixes needed | 5 | 4 |
| Error type | Semantic hallucination | Mechanical/spec gap |
The numbers look similar. The errors are completely different.
With NL prompts like "Create a Post model with an author relationship and soft deletes", the model produced 5 bugs, including:

- `->load(['user'])` on a model with no `user` relationship — a relationship hallucinated from pretraining
- `->withHttpStatus()` — a method that doesn't exist in Laravel

Every one of these is a semantic hallucination: the model generated something that doesn't match the developer's intent, and the only way to catch it is to run the code and debug the failure.
With structured specs like this:
```json
{
  "artifact": "model",
  "class": "Book",
  "table": "books",
  "has_factory": true,
  "fillable": ["title", "isbn", "year", "author_id"],
  "relationships": [
    {"type": "BelongsTo", "model": "Author", "method": "author"}
  ]
}
```
The model produced 4 bugs:
- Generated `unique:books,isbn,...` instead of `Rule::unique()->ignore()` — wrong PHP pattern for a correct concept
- Omitted `author_id` from `Book::create()` unnecessarily — wrong code pattern
- Used `published_year` where the model spec said `year` — our spec was inconsistent
- Validated `max_attendees`, but the migration didn't have that column — our test harness was wrong

Zero semantic hallucinations. The model never invented a relationship, never used a nonexistent method, never hallucinated a pattern. Bugs 1-2 are wrong code patterns for correct concepts. Bugs 3-4 are our own spec inconsistencies.
Here's our hypothesis:
Semantic hallucinations are caused by ontological misalignment. The model has its own implicit "ontology" — a domain model learned from pretraining on millions of PHP files. When prompted with natural language, gaps in the prompt are filled from this implicit ontology. Where it diverges from the developer's intent, hallucinations occur.
The model's pretraining ontology says:
- "Models usually have a user() relationship"
- "Validation includes 'optional'" (not a real Laravel rule)
- "Controllers use closure-based eager loading"
The developer's ontology says:
- "Posts belong to Authors, not Users"
- "Validation uses 'nullable'"
- "Simple array-based eager loading suffices"
BuildSpec closes this gap. When you write `"relationships": [{"type": "BelongsTo", "model": "Author"}]`, there's no room for the model to substitute its own prior about what relationships a Post "should" have.
Before any code is generated, the spec compiler validates every spec in <1ms:
```
$ python3 spec_compiler.py event_request.json
SpecCompileError: rules['venue_id'] contains conditional token
'required_on_post'. Use 'conditional_rules' dict instead.
Example: {"conditional_rules": {"venue_id": {"POST": ["required"],
"PUT": ["sometimes"]}}}
```
The compiler catches ontological violations — wrong field names, missing required properties, invalid constraint expressions — before the model ever sees the spec. Generation takes ~30 seconds per file. Validation takes <1ms. Validate aggressively; generate only from validated specs.
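The core of such a compiler is small. Here is a minimal sketch — not the actual `spec_compiler.py`, and with an assumed minimal schema — of the kind of check that produces the error above:

```python
import re

class SpecCompileError(Exception):
    """Raised when a spec violates the schema before generation."""

# Assumed minimal schema: real BuildSpec specs have more required keys.
REQUIRED_KEYS = {"artifact", "class", "table"}
# Conditional tokens like 'required_on_post' don't belong in flat rules.
CONDITIONAL_TOKEN = re.compile(r"required_on_\w+")

def compile_spec(spec: dict) -> dict:
    # Reject specs missing required top-level properties.
    missing = REQUIRED_KEYS - spec.keys()
    if missing:
        raise SpecCompileError(f"missing required properties: {sorted(missing)}")
    # Reject conditional tokens embedded in flat validation rules;
    # per-HTTP-method behavior belongs in a 'conditional_rules' dict.
    for field, rules in spec.get("rules", {}).items():
        for rule in rules:
            if CONDITIONAL_TOKEN.fullmatch(rule):
                raise SpecCompileError(
                    f"rules[{field!r}] contains conditional token {rule!r}. "
                    "Use 'conditional_rules' dict instead."
                )
    return spec  # validated: safe to hand to the generator
```

A malformed spec fails in microseconds, long before the ~30-second generation step is ever invoked.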
The spec pipeline needed 6x fewer training examples for equivalent results. Why?
With natural language, the model must learn two things:

1. What to generate (the domain ontology — which entities, relationships, rules)
2. How to generate it (the code mapping — PHP syntax, Laravel patterns)
With BuildSpec, the ontology is given. The model only learns the mapping. Half the learning problem is eliminated by making the input explicit.
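To see how much of the mapping becomes mechanical once the ontology is explicit, here is a hypothetical sketch (not part of the actual pipeline) that renders the `relationships` section of a spec into Laravel method stubs deterministically:

```python
# Every relationship the generated model may have is enumerated in the
# spec, so nothing is left for a language model's pretraining prior to
# fill in. Mapping table is an assumed subset of relationship types.
RELATION_CALL = {"BelongsTo": "belongsTo", "HasMany": "hasMany"}

def render_relationships(spec: dict) -> str:
    """Render spec relationships as Laravel Eloquent method stubs."""
    methods = []
    for rel in spec.get("relationships", []):
        call = RELATION_CALL[rel["type"]]
        methods.append(
            f"    public function {rel['method']}()\n"
            f"    {{\n"
            f"        return $this->{call}({rel['model']}::class);\n"
            f"    }}"
        )
    return "\n\n".join(methods)

spec = {"relationships": [{"type": "BelongsTo", "model": "Author", "method": "author"}]}
print(render_relationships(spec))
```

In practice the fine-tuned model still does the generation, but the spec pins the "what" so tightly that the model's remaining job is close to this kind of mechanical translation.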
If you're building domain-specific code generation:
```
pip install mlx-lm
python3 pipeline_spec.py "Create a REST API for managing blog posts with tags"
```
Runs entirely on Apple Silicon. No cloud GPU needed.
This post summarizes findings from: "The Ontological Gap: Why Error Type Matters More Than Error Count in AI Code Generation" (Chis, 2026).