Public R&D Journal
What does it take to make AI-generated software systematically reliable?
I'm Stephen Deslate. Across 93 trials and four experiments, I've been testing whether AI-generated software can be made systematically reliable. The methodology is called Convergence Engineering Development. The data is public.
Normative Convergence — Complete
Score Trajectory — Trial 02, 6 Iterations
CED Research Program
93+
Trials
218+
Failure Modes Tracked
4
Experiments
Problem
AI-generated code fails unpredictably.
AI-generated code that looks right but breaks in production. No tests. Wrong patterns. Security holes. The industry treats this as the cost of doing business. I wanted to know if it had to be.
LLM vs Tools
99.2%
LLM says
46.7%
Tools say
Method
Specify first, then generate, then converge.
Convergence Engineering Development specifies contracts, interfaces, and integration points before generating any code — then scores the output, tracks every failure mode that slips through, and iterates the methodology until convergence. Each failure discovered in one trial is permanently eliminated in the next.
5 Phases Converged, 24 Dimensions
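A minimal sketch of that loop, using illustrative names (FailureMode, Methodology, run_trial, converge) and a placeholder 9.5 threshold rather than the actual CED tooling:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class FailureMode:
    """A failure discovered in a trial; later trials must guard against it."""
    name: str
    guard: str  # the check added to the methodology so it cannot recur

@dataclass
class Methodology:
    """The evolving spec: contracts and interfaces plus the failure-mode registry."""
    contracts: list[str]
    known_failures: set = field(default_factory=set)

def run_trial(method, generate, score, threshold=9.5):
    """One trial: generate against the spec, score the output, register new failures."""
    output = generate(method.contracts)                 # code generated against the spec
    value, new_failures = score(output, method.known_failures)
    method.known_failures |= set(new_failures)          # guarded against in every later trial
    return value >= threshold and not new_failures

def converge(method, generate, score, max_trials=50):
    """Iterate trials until one passes the threshold with no new failure modes."""
    for trial in range(1, max_trials + 1):
        if run_trial(method, generate, score):
            return trial                                # trials needed to converge
    return None                                         # did not converge within budget
```

The important property is the registry: a failure mode, once recorded, is checked in every subsequent trial, which is what "permanently eliminated in the next" means in practice.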
Result
Four experiments. Two open questions.
Phases 1–3 focused on measurement: layered convergence across 10 layers, deterministic tool scoring, and 5 epistemic layers mapped to ISO/IEC 25010 — surfacing a Goodhart’s Law problem. Phase 4 shifted to architecture: can a guided agent loop with 99.7% local compute match a 19K-line orchestration pipeline? Both approaches continue.
Layered Convergence — 44 Trials
Latest Experiment
Two Roads to Deployment
In Progress. Phase 4 of the CED research program compares two architectures for AI code generation: a gated pipeline (19K lines, 8 phases, 40 scoring dimensions) and a guided agent loop (833 new lines, model cascade). 13 pipeline runs produced no complete output; 6 agent-loop runs produced 1 working app (54 files, 33 tests) in 3.5 hours with 99.7% local compute. Both approaches continue development (the cascade pattern is sketched after the stats below).
19
Trials
43
Failure Modes
99.7%
Local Compute
1/2
Approaches Tested
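The agent loop's high local-compute share comes from its model cascade: try the cheapest local model first and escalate only when the output fails validation. A minimal sketch of that pattern, with placeholder model names rather than the actual Phase 4 configuration:

```python
# Hypothetical cascade: placeholder model names, not the Phase 4 configuration.
CASCADE = ["local-small", "local-large", "remote-frontier"]  # cheapest first

def generate_with_cascade(task, generate, validate):
    """generate(model, task) -> candidate; validate(candidate) -> bool.

    Returns the first candidate that passes validation, plus the model that
    produced it. Remote models are only reached when local ones fail, which
    is what keeps the local-compute share high.
    """
    for model in CASCADE:
        candidate = generate(model, task)
        if validate(candidate):                # e.g. tests, lint, type checks
            return candidate, model
    raise RuntimeError(f"no model produced a valid result for task: {task!r}")
```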
Normative Convergence
Complete. Phase 3 of the CED research program. A 5-layer epistemic scorer mapped to ISO/IEC 25010:2023 across 40 quality dimensions. Trial 01 was archived after 9 scorer bugs were discovered; Trial 02 scored 9.79/10.0 across all 5 layers (an illustrative rollup follows the stats below). Central finding: Goodhart's Law. The scorer and the code co-evolved, raising the question of whether convergence proves quality or gaming.
2
Trials
40
Dimensions
9
Scorer Bugs
5/5
Layers Passed
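How 40 dimensions across 5 layers roll up into a single figure like 9.79/10.0: the sketch below averages dimensions within each layer, then averages the layers. The layer names and scores are placeholders, not the Trial 02 data, and the actual scorer may weight layers differently.

```python
# Placeholder layers and per-dimension scores (0-10); not the Trial 02 data.
LAYERS = {
    "layer_1": [9.8, 10.0, 9.7],
    "layer_2": [9.9, 9.6],
    "layer_3": [9.7, 9.8, 10.0],
    "layer_4": [9.9, 9.5],
    "layer_5": [9.8, 9.7],
}

def layer_score(dimensions):
    """Average the dimension scores within one layer."""
    return sum(dimensions) / len(dimensions)

def overall_score(layers):
    """Average the layer scores into one headline number."""
    return sum(layer_score(d) for d in layers.values()) / len(layers)

print(round(overall_score(LAYERS), 2))  # 9.77 with these placeholder numbers
```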
Discrete Convergence
Complete. Phase 2 of the SDD research program. Replaced LLM-based scoring with automated tool measurement across 24 dimensions and 5 progressive phases; all 5 phases converged across 28 trials, discovering 64 failure modes (the tool-scoring pattern is sketched after the stats below).
28
Trials
64
Failure Modes
5/5
Phases Converged
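Replacing LLM judgment with tool measurement means each dimension maps to a deterministic command whose exit status is scored directly. A minimal sketch with hypothetical dimensions and tools (pytest, mypy, ruff), not the actual Phase 2 harness:

```python
import subprocess

# Hypothetical dimension -> command mapping; the real harness covered 24 dimensions.
DIMENSIONS = {
    "tests_pass":  ["pytest", "-q"],
    "types_clean": ["mypy", "src"],
    "lint_clean":  ["ruff", "check", "src"],
}

def measure(dimensions):
    """Run each tool and record pass/fail from its exit code. No LLM in the loop."""
    results = {}
    for name, cmd in dimensions.items():
        proc = subprocess.run(cmd, capture_output=True)
        results[name] = (proc.returncode == 0)
    return results

def converged(results):
    """A phase converges only when every measured dimension passes."""
    return all(results.values())
```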
Layered Convergence
Complete. All 10 layers converged. 102 failure modes tracked and resolved across 44 trials, progressing from backend API through cross-layer integration.
44
Trials
102
Failure Modes
10/10
Layers Converged
Convergence Curve
Production work
Applications built as the methodology evolved, from hand-architected to CED-generated.
About
Stephen Deslate
Engineer and applied researcher. I run a public research program testing whether AI-generated code can be made systematically reliable — using Claude as the primary engineering tool. Four experiments, 93 trials, and open questions about measurement and architecture.
Full bio →
Role
Engineer & Applied Researcher
Focus
CED Research
Process
AI-First Development
Project
SJD Labs