
Public R&D Journal

What does it take to make AI-generated software systematically reliable?

I'm Stephen Deslate. Over 93 trials and four experiments, I've been testing whether AI-generated software can be made systematically reliable. The methodology is called Convergence Engineering Development. The data is public.

CED Research Program

93+

Trials

218+

Failure Modes Tracked

4

Experiments

Problem

AI-generated code fails unpredictably.

Code that looks right but breaks in production. No tests. Wrong patterns. Security holes. The industry treats this as the cost of doing business. I wanted to know if it had to be.

LLM vs Tools

99.2%

LLM says

vs

46.7%

Tools say

Method

Specify first, then generate, then converge.

Convergence Engineering Development specifies contracts, interfaces, and integration points before generating any code — then scores the output, tracks every failure mode that slips through, and iterates the methodology until convergence. Each failure discovered in one trial is permanently eliminated in the next.

Calibration
Test Exec
Container
Runtime
Edge Cases

5 Phases Converged, 24 Dimensions
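
In code terms, the loop looks roughly like the sketch below. It is an illustration only; the names, signatures, and thresholds are hypothetical, not the actual CED tooling. The properties it tries to capture are the ones described above: the spec exists before generation, scoring is external to the model, and every failure mode found becomes a standing constraint on later trials.

    from dataclasses import dataclass

    @dataclass
    class Spec:
        contracts: list[str]      # interfaces and integration points, written before any code
        dimensions: list[str]     # quality dimensions the output will be scored against

    @dataclass
    class TrialResult:
        scores: dict[str, float]  # one score per dimension
        failures: list[str]       # failure modes observed in this trial

    def converge(spec: Spec, generate, score, threshold: float = 0.9, max_trials: int = 10):
        """Generate, score, and iterate until every dimension clears the threshold."""
        known_failures: set[str] = set()                   # failures found so far become permanent checks
        for trial in range(1, max_trials + 1):
            code = generate(spec, avoid=known_failures)    # generation steered away from past failures
            result = score(code, spec.dimensions)          # external measurement, not model self-report
            known_failures.update(result.failures)
            if not result.failures and all(s >= threshold for s in result.scores.values()):
                return code, trial, known_failures         # converged
        raise RuntimeError(f"no convergence after {max_trials} trials; "
                           f"{len(known_failures)} failure modes on record")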

Result

Four experiments. Two open questions.

Phases 1–3 focused on measurement: layered convergence across 10 layers, deterministic tool scoring, and 5 epistemic layers mapped to ISO/IEC 25010, which surfaced a Goodhart's Law problem. Phase 4 shifted to architecture: can a guided agent loop with 99.7% local compute match a 19K-line orchestration pipeline? Both approaches continue.

Layered Convergence — 44 Trials


Latest Experiment

Two Roads to Deployment

In Progress

Phase 4 of the CED research program. Comparing two architectures for AI code generation: a gated pipeline (19K lines, 8 phases, 40 scoring dimensions) and a guided agent loop (833 new lines, model cascade). 13 pipeline runs produced no complete output. 6 agent loop runs produced 1 working app (54 files, 33 tests) in 3.5 hours with 99.7% local compute. Both approaches continue development.

19

Trials

43

Failure Modes

99.7%

Local Compute

1/2

Approaches With Working Output
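
The model cascade behind the local-compute figure is, in outline, a local-first escalation policy. The sketch below is an illustration under that assumption, with hypothetical function names, not the experiment's implementation: a local model drafts and revises against tool feedback, and a remote model is consulted only when local attempts keep failing, which is what keeps compute overwhelmingly local.

    def cascade_generate(task: str, local_model, remote_model, validate, max_local_attempts: int = 3):
        """Model cascade sketch: prefer the local model, escalate only on repeated failure."""
        for _ in range(max_local_attempts):
            draft = local_model(task)              # cheap, runs locally
            ok, feedback = validate(draft)         # tests and linters decide, not the model
            if ok:
                return draft
            task = f"{task}\n\nPrevious attempt failed:\n{feedback}"
        return remote_model(task)                  # rare escalation keeps compute mostly local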

View full experiment →

Normative Convergence

Complete

Phase 3 of the CED research program. 5-layer epistemic scorer mapped to ISO/IEC 25010:2023 across 40 quality dimensions. Trial01 archived (9 scorer bugs discovered). Trial02 scored 9.79/10.0 across all 5 layers. Central finding: Goodhart's Law. The scorer and the code co-evolved, raising the question of whether convergence reflects genuine quality or gaming of the metric.

2

Trials

40

Dimensions

9

Scorer Bugs

5/5

Layers Passed
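
As a rough picture of layered scoring (the layer groupings and threshold below are placeholders, not the experiment's configuration): each layer aggregates a slice of the 40 ISO/IEC 25010-aligned dimensions, and a trial passes only if every layer clears the bar. The Goodhart's Law concern follows directly: when the same process fixes scorer bugs and improves the code, a rising score can reflect either.

    # Placeholder layer groupings, for illustration only.
    LAYERS = {
        "layer_1": ["functional_correctness", "functional_completeness"],
        "layer_2": ["test_adequacy", "static_analysis"],
        "layer_3": ["security", "reliability"],
        "layer_4": ["maintainability", "documentation"],
        "layer_5": ["performance_efficiency", "portability"],
    }

    def score_layers(dimension_scores: dict[str, float], threshold: float = 9.0):
        """Average each layer's dimensions; the trial passes only if all layers clear the threshold."""
        layer_scores = {}
        for layer, dims in LAYERS.items():
            vals = [dimension_scores[d] for d in dims if d in dimension_scores]
            layer_scores[layer] = sum(vals) / len(vals) if vals else 0.0
        return layer_scores, all(v >= threshold for v in layer_scores.values())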

View full experiment →

Discrete Convergence

Complete

Phase 2 of the CED research program. Replaced LLM-based scoring with automated tool measurement across 24 dimensions and 5 progressive phases. All phases converged across 28 trials, discovering 64 failure modes.

28

Trials

64

Failure Modes

5/5

Phases Converged
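
The "LLM says 99.2% vs tools say 46.7%" gap near the top of this page is what this phase was built around: a model grading its own output reports near-perfect quality, while tools measuring the same code do not. A minimal sketch of the tool side, with example commands and a pass/fail reduction standing in for the real 24-dimension measurement:

    import subprocess

    def tool_score(repo: str) -> dict[str, float]:
        """Score a repo from tool exit codes instead of asking the model to rate its own code."""
        checks = {
            "tests_pass":  ["python", "-m", "pytest", "-q"],
            "lint_clean":  ["python", "-m", "ruff", "check", "."],
            "types_clean": ["python", "-m", "mypy", "."],
        }
        scores = {}
        for name, cmd in checks.items():
            proc = subprocess.run(cmd, cwd=repo, capture_output=True)
            scores[name] = 1.0 if proc.returncode == 0 else 0.0   # deterministic pass/fail
        return scores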

View full experiment →

Layered Convergence

Complete

All 10 layers converged. 102 failure modes tracked and resolved across 44 trials, progressing from backend API through cross-layer integration.

44

Trials

102

Failure Modes

10/10

Layers Converged

Convergence Curve

View full experiment →

Production work

Applications built as the methodology evolved, from hand-architected to CED-generated.

About

Stephen Deslate

Engineer and applied researcher. I run a public research program testing whether AI-generated code can be made systematically reliable — using Claude as the primary engineering tool. Four experiments, 93 trials, and open questions about measurement and architecture.

Full bio →

Role

Engineer & Applied Researcher

Focus

CED Research

Process

AI-First Development

Project

SJD Labs