Experiment (Complete)
Normative Convergence
Can a 5-layer epistemic scorer, mapped to ISO/IEC 25010:2023, measure real code quality — or does the model just learn to pass the scorer?
Hypothesis
Phase 2 (Discrete Convergence) proved that automated tools catch what LLMs miss. But its scoring was binary — pass or fail, per tool. It couldn't answer how well the code handled adversarial input, survived infrastructure faults, or maintained stability under sustained load.
The hypothesis: a multi-layer scoring architecture — where each layer adds epistemic depth and later layers can only lower scores, never raise them — will expose quality gaps that tool-based pass/fail scoring misses. Mapping all dimensions to ISO/IEC 25010:2023 provides a normative standard against which convergence can be objectively measured.
Method
Five epistemic layers, applied sequentially. Each layer adds runtime depth. The final score for any dimension is the minimum across all applicable layers — a monotonic constraint that prevents early layers from masking deficiencies found later.
40 scoring dimensions derived from ISO/IEC 25010:2023 cover 9 quality characteristics: functional suitability, performance efficiency, compatibility, interaction capability, reliability, security, maintainability, flexibility, and safety — plus cross-cutting quality assurance.
Phases gate which layers are active. Phase 0 runs only Layer 1 (structural). Phase 4 runs all five layers including a 30-minute endurance test. Each phase must converge before the next begins.
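The phase gating described above can be sketched in a few lines. This is an illustrative model, not the experiment's actual implementation: the text only specifies that Phase 0 runs Layer 1 and Phase 4 runs all five, so the assumption that intermediate phases activate layers cumulatively is mine.

```python
# Illustrative sketch of phase gating. Layer names come from the
# architecture table; the cumulative activation rule for phases 1-3
# is an assumption, since the text only pins down phases 0 and 4.
LAYERS = ["structural", "behavioral", "adversarial", "resilience", "endurance"]

def active_layers(phase: int) -> list[str]:
    """Return the layers active in a given phase (0-4)."""
    if not 0 <= phase <= 4:
        raise ValueError(f"unknown phase: {phase}")
    return LAYERS[: phase + 1]

assert active_layers(0) == ["structural"]   # Phase 0: structural only
assert active_layers(4) == LAYERS           # Phase 4: all five layers
```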
Results
2 trials · 40 dimensions · 9 scorer bugs found · 5/5 layers passed
Trial01 — Archived
Trial01 revealed 9 systematic bugs in the scorer itself: fake proof templates, grep fallback violations, dimension mapping mismatches, phantom penalties, and non-deterministic fault injection. The same codebase scored anywhere from 6.4 to 7.5 across iterations with no code changes. Trial01 data is archived as unreliable.
Trial01 score: 6.88 · Trial02 score: 9.79
[Charts: Trial02 score trajectory; layer score breakdown]
Trial02 achieved 9.79/10.0 across all 40 dimensions and 5 layers. All dimensions pass. The lowest scores came from the adversarial layer (minor logging gaps) and the endurance layer (latency variance under sustained load).
The 5-Layer Architecture
Each layer adds epistemic depth. Later layers can only lower scores (monotonic constraint). The final score for each dimension is the minimum across all applicable layers.
| Layer | Name | Focus | Status |
|---|---|---|---|
| 1 | Structural | Static analysis: extract claims from source tree. No runtime. | Passed |
| 2 | Behavioral | Runtime proof of each claim. Unproven claims capped at 6/10. | Passed |
| 3 | Adversarial | Blind attacks, no manifest access. Discovers unclaimed vulnerabilities. | Passed |
| 4 | Resilience | Fault injection: kill DB, exhaust pools, crash processes. | Passed |
| 5 | Endurance | Sustained load over time. Measures drift, leaks, and stability. | Passed |
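The monotonic constraint reduces to a minimum over per-layer scores. A minimal sketch, with invented layer scores for one dimension (the 7.8 standing in for the kind of adversarial finding the results describe):

```python
# Sketch of the monotonic constraint: a dimension's final score is the
# minimum across all layers that scored it, so a later layer can lower
# a score but never raise it. The example scores are invented.
def final_score(layer_scores: dict[str, float]) -> float:
    """Aggregate per-layer scores for one dimension (0-10 scale)."""
    return min(layer_scores.values())

scores = {
    "structural": 9.5,
    "behavioral": 9.0,
    "adversarial": 7.8,   # e.g. a logging gap found only under attack
    "resilience": 9.2,
}
assert final_score(scores) == 7.8  # the lowest layer wins
```

The design choice matters: with a mean instead of a minimum, strong static-analysis scores could mask a weakness that only fault injection or sustained load reveals.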
The Goodhart's Law Problem
The central finding of this experiment is not the 9.79 score. It's the question the score raises: did the code get better, or did it just learn to pass the scorer?
Between Trial01 and Trial02, three things changed simultaneously: the scorer was fixed (9 bugs), the scoring formula was rewritten, and the application code was regenerated. The 2.91-point improvement cannot be cleanly decomposed into these three factors. They are confounded.
Evidence of Scorer-Application Co-Evolution
100% behavioral proof rate. 641/641 claims proven in Trial02. A perfect proof rate across 40 dimensions is more suspicious than reassuring — it suggests the proofs are testing what the code was built to pass.
Structural claim inflation. Claims grew from 316 (Trial01) to 641 (Trial02). The cross-cutting extractor maps common NestJS patterns to many dimensions simultaneously, inflating coverage metrics.
Proof templates accept weak evidence. Some behavioral proofs accept HTTP 401/403 as “proven” (the endpoint exists and is auth-protected), which conflates deployment with behavioral verification.
Single-subject design. One project (telehealth-booking), one tech stack (NestJS/Prisma/PostgreSQL), one AI system (Claude). No cross-project or cross-model validation.
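The weak-evidence pattern in the proof templates can be sketched as a contrast between two checks. The helper names and status logic below are hypothetical, not the scorer's actual templates:

```python
# Hypothetical contrast between a weak proof template and a stricter one.
# Function names and acceptance rules are invented for illustration.
def weak_proof(status: int) -> bool:
    # Accepts 401/403 as "proven": this only shows the endpoint is
    # deployed and auth-protected, not that the claimed behavior works.
    return status in (200, 401, 403)

def strict_proof(status: int, body_matches_claim: bool) -> bool:
    # Requires a successful authenticated response whose body actually
    # exercises the claimed behavior.
    return 200 <= status < 300 and body_matches_claim

assert weak_proof(403) is True            # deployment conflated with proof
assert strict_proof(403, False) is False  # stricter check rejects it
```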
This is Goodhart's Law applied to code quality measurement: “When a measure becomes a target, it ceases to be a good measure.” The scorer and the application co-evolved — each iteration of the scorer shaped what the next generation of code optimized for. The question this experiment leaves open is how to design a scorer that resists gaming by the system it measures.
Transparency
Honest accounting of what this experiment can and cannot claim.
Protocol Deviations
Endurance test was 30 minutes, not multi-hour
The methodology specifies multi-hour sustained load for Layer 5. Trial02 ran 30 minutes. Memory leaks and connection exhaustion often manifest only after hours of operation.
Single trial, not multiple consecutive clean trials
Convergence criteria require multiple consecutive clean trials. Only one trial (Trial02) has been scored with the corrected scorer.
Single project, not three across different domains
The methodology calls for three independent projects across different business domains. Only one project (telehealth-booking) was scored.
Known Limitations
Confounded variables between trials
Scorer fixes, formula changes, and code regeneration all happened between Trial01 and Trial02. The 2.91-point improvement cannot be attributed to any single factor.
Adversarial layer has limited attack surface
The adversarial layer tested 53 endpoints with fuzzing, concurrency, and input attacks. A real penetration test would be broader and more creative.
Same tech stack as all prior experiments
NestJS + Prisma + PostgreSQL remains the only stack tested. ISO/IEC 25010 dimensions may manifest differently in other ecosystems.
Relationship to Prior Phases
Builds on Phase 1 and Phase 2 methodology
The CED methodology from Phase 1 (layered convergence) and Phase 2 (discrete convergence) is the foundation. Phase 3 does not start from scratch — it starts from a methodology that converged across 10 layers and 5 phases, and asks whether deeper runtime measurement changes the picture.
Scorer is a new artifact, not inherited
The 5-layer scorer was built for this experiment. Unlike Phase 2's tool-based scoring, this scorer uses a claims-proof-attack architecture that introduces its own complexity and potential for bugs — as Trial01 demonstrated.
Data Access
Trial application source code is public. The scoring methodology, per-dimension breakdowns, and failure mode definitions remain private.