
Discrete Convergence

Experiment status: Complete

Does code that an LLM thinks is high-quality actually pass real tools? Phase 2 replaced subjective LLM scoring with automated tool measurement.

Hypothesis

Phase 1 (Layered Convergence) demonstrated that a specification-first methodology could converge across 10 quality layers using LLM-based scoring. But LLM scoring is subjective — three Claude sessions evaluating the same code may agree it “looks right” while the code fails to compile, tests don't pass, or Docker builds break.

The hypothesis: LLM-based scoring systematically overstates quality. Automated tools — compilers, linters, test runners, container builds, security scanners — will reveal gaps that subjective evaluation missed. The delta between “LLM says yes” and “tools say yes” is the measure of methodology blind spots.

Method

Five progressive phases, each requiring the infrastructure of prior phases. For each trial, three independent full-stack applications are built from the methodology across different business domains (analytics, event management, field dispatch). The trial score for each dimension is the minimum across all three projects — a rule that works in one domain but fails in another does not converge.
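The minimum-across-projects rule can be sketched as below. This is a minimal illustration, not the experiment's actual scorer; the dimension names and score values are hypothetical.

```typescript
// Per-project scores for each dimension (0-10 scale assumed).
type DimensionScores = Record<string, number>;

// One score map per project domain (values are illustrative).
const projects: DimensionScores[] = [
  { lint: 10, types: 9, tests: 8 },  // analytics
  { lint: 9, types: 10, tests: 4 },  // event management
  { lint: 10, types: 8, tests: 9 },  // field dispatch
];

// The trial score for each dimension is the minimum across all
// projects: a rule that fails in any one domain does not converge.
function trialScores(projects: DimensionScores[]): DimensionScores {
  const result: DimensionScores = {};
  for (const dim of Object.keys(projects[0])) {
    result[dim] = Math.min(...projects.map((p) => p[dim]));
  }
  return result;
}

console.log(trialScores(projects)); // { lint: 9, types: 8, tests: 4 }
```

Note how the weak `tests` score in one domain caps the trial score for that dimension, which is why trial scores run lower than a cross-project average would.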

Twenty-four scoring dimensions are measured by concrete tools: ESLint, the TypeScript compiler, the Prisma validator, Docker builds, and more. Each dimension produces a discrete score derived directly from tool output — no subjective interpretation.
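As a rough illustration of discrete tool-based scoring, here is a sketch of how a lint dimension might map tool output to a score. The band thresholds are assumptions (the actual rubric is not published); the point is that identical tool output always yields an identical score.

```typescript
// Shape of an ESLint-style per-run summary.
interface LintResult {
  errorCount: number;
  warningCount: number;
}

// Map tool output to a discrete 0-10 score with no human judgment:
// the bands below are hypothetical, but deterministic by construction.
function lintScore(r: LintResult): number {
  if (r.errorCount > 0) return 0;      // any error fails outright
  if (r.warningCount === 0) return 10; // clean output scores full marks
  if (r.warningCount <= 5) return 7;   // a few warnings: partial credit
  return 4;                            // many warnings: low score
}
```

Because the mapping is a pure function of tool output, two scoring runs on the same code cannot disagree — the property LLM-based scoring lacks.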

Phases are sequential. Each must converge before the next begins, progressively building from static analysis through build verification, container orchestration, runtime validation, and finally edge-case completeness.
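The sequential gating can be sketched as a simple loop. This assumes a hypothetical `runTrial` callback and a uniform convergence threshold; the real experiment's threshold and trial mechanics are not published.

```typescript
type Trial = { scores: number[] };

// Phase names from the experiment, in execution order.
const PHASES = [
  "Calibration",
  "Test Execution",
  "Container Verification",
  "Runtime Validation",
  "Edge Case & Completeness",
];

// A phase converges when every active dimension clears the threshold
// (threshold value is an assumption for illustration).
function converged(trial: Trial, threshold = 9): boolean {
  return trial.scores.every((s) => s >= threshold);
}

// Phases run strictly in order; a later phase never starts until the
// current one has produced a converged trial.
function runExperiment(runTrial: (phase: string) => Trial): void {
  for (const phase of PHASES) {
    let trial: Trial;
    do {
      trial = runTrial(phase); // revise methodology, build 3 apps, score
    } while (!converged(trial));
  }
}
```

The gate explains the trial timeline's shape: repeated trials within a phase until scores clear the bar, then the next phase's dimensions activate.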

Results

- 28 trials
- 64 failure modes
- 5/5 phases converged

Calibration Finding

Trial 0 established the calibration baseline. The methodology's builder (LLM) self-assessed near-perfect quality. Discrete tool scoring measured actual quality at less than half that level — confirming the core hypothesis that LLM scoring systematically overstates code quality.

- LLM self-assessment: 99.2%
- Discrete tool score: 46.7%

Score Trajectory


Failure Mode Discovery


64 failure modes were discovered across 28 trials, and all 5 phases converged, with all 24 dimensions passing. Trial 27 (methodology v3.0-dc) scored 9.29 on average — the second consecutive clean Phase 4 trial, converging the final phase (Edge Case & Completeness). The experiment is complete.

Phase Progression

Each phase adds infrastructure requirements and activates new scoring dimensions. Phases are sequential — each must converge before the next begins.

| Phase | Name | Focus | Dimensions | Status |
|-------|------|-------|------------|--------|
| 0 | Calibration | Static analysis + type checking | 14 | Converged |
| 1 | Test Execution | Tests must pass, coverage thresholds | 2 | Converged |
| 2 | Container Verification | Docker builds, healthchecks, security scans | 1 | Converged |
| 3 | Runtime Validation | Performance, accessibility, active security | 5 | Converged |
| 4 | Edge Case & Completeness | Business logic, feature completeness, UX | 2 | Converged |

Transparency

Honest accounting of the experiment's design choices, limitations, and relationship to Phase 1.

Design Choices

Minimum-across-projects scoring

Each dimension score is the minimum across all three project domains. This is deliberately conservative — a methodology rule that works for analytics but fails for event management does not pass. This means scores are lower than averages would suggest.

Scorer bug fixes applied retroactively

Trial 0 was re-scored after four scorer implementation bugs were fixed; the re-scored results are canonical. This is expected during calibration — the scorer itself is being validated alongside the methodology.

Known Limitations

Same tech stack as Phase 1

Only the NestJS + Next.js + Prisma + PostgreSQL + Turborepo stack is tested. Results may not generalize to other frameworks or languages.

Single AI system

All building is performed by Claude. The scorer is automated tooling, but the builder remains a single AI system without cross-model validation.

Progressive infrastructure dependency

Later phases require Docker, running containers, and network access. Environment differences between scoring runs could affect reproducibility.

Relationship to Phase 1

Builds on layered-convergence methodology

The master methodology from Phase 1 (v1.0) is the starting point. Phase 2 does not start from scratch — it starts from the methodology that converged across 10 layers with LLM scoring, and measures how well that methodology holds up under tool-based verification.

Data Access

Trial source code and aggregate results are open source. Scoring tooling, methodology documents, and per-dimension breakdowns are not published.

Trial Timeline

| Trial | Phase | Score | New Failure Modes |
|-------|-------|-------|-------------------|
| 0 | Calibration | 4.67 | 6 |
| 1 | Calibration | 8.83 | 3 |
| 2 | Calibration | 8.50 | 1 |
| 3 | Calibration | 5.67 | 0 |
| 4 | Calibration | 5.29 | 0 |
| 5 | Calibration | 4.17 | 0 |
| 6 | Calibration | 5.71 | 2 |
| 7 | Calibration | 5.50 | 0 |
| 8 | Calibration | 5.54 | 6 |
| 9 | Calibration | 6.88 | 1 |
| 10 | Calibration | 6.17 | 5 |
| 11 | Calibration | 6.50 | 2 |
| 12 | Calibration | 6.33 | 2 |
| 13 | Calibration | 6.88 | 0 |
| 14 | Calibration | 6.79 | 0 |
| 15 | Test Execution | 9.31 | 1 |
| 16 | Test Execution | 9.31 | 0 |
| 17 | Container Verification | 8.94 | 6 |
| 18 | Container Verification | 8.94 | 0 |
| 19 | Runtime Validation | 6.55 | 5 |
| 20 | Runtime Validation | 6.32 | 3 |
| 21 | Runtime Validation | 7.36 | 0 |
| 22 | Runtime Validation | 7.64 | 10 |
| 23 | Runtime Validation | 4.17 | 11 |
| 24 | Runtime Validation | 9.21 | 0 |
| 25 | Runtime Validation | 9.13 | 0 |
| 26 | Edge Case & Completeness | 9.17 | 0 |
| 27 | Edge Case & Completeness | 9.29 | 0 |