
Discrete Convergence

Experiment status: Complete

Does code that an LLM thinks is high-quality actually pass real tools? Phase 2 replaced subjective LLM scoring with automated tool measurement.

Hypothesis

Phase 1 (Layered Convergence) demonstrated that a specification-first methodology could converge across 10 quality layers using LLM-based scoring. But LLM scoring is subjective — three Claude sessions evaluating the same code may agree it “looks right” while the code fails to compile, tests don't pass, or Docker builds break.

The hypothesis: LLM-based scoring systematically overstates quality. Automated tools — compilers, linters, test runners, container builds, security scanners — will reveal gaps that subjective evaluation missed. The delta between “LLM says yes” and “tools say yes” is the measure of methodology blind spots.

Method

Five progressive phases, each requiring the infrastructure of prior phases. For each trial, three independent full-stack applications are built from the methodology across different business domains (analytics, event management, field dispatch). The trial score for each dimension is the minimum across all three projects — a rule that works in one domain but fails in another does not converge.
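The minimum-across-projects rule can be sketched as below. This is a minimal illustration, not the experiment's actual scorer; the dimension names and score values are hypothetical.

```typescript
// Per-project scores for each dimension (0-10 scale assumed).
type DimensionScores = Record<string, number>;

// One score map per project domain (values are illustrative).
const projects: DimensionScores[] = [
  { lint: 10, types: 9, tests: 8 },  // analytics
  { lint: 9, types: 10, tests: 4 },  // event management
  { lint: 10, types: 8, tests: 9 },  // field dispatch
];

// The trial score for each dimension is the minimum across all
// projects: a rule that fails in any one domain does not converge.
function trialScores(projects: DimensionScores[]): DimensionScores {
  const result: DimensionScores = {};
  for (const dim of Object.keys(projects[0])) {
    result[dim] = Math.min(...projects.map((p) => p[dim]));
  }
  return result;
}

console.log(trialScores(projects)); // { lint: 9, types: 8, tests: 4 }
```

Note how the weak `tests` score in one domain caps the trial score for that dimension, which is why trial scores run lower than a cross-project average would.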

Twenty-four scoring dimensions are measured by concrete tools: ESLint, the TypeScript compiler, the Prisma validator, Docker builds, and more. Each dimension produces a discrete score derived directly from tool output — no subjective interpretation.
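As a rough illustration of discrete tool-based scoring, here is a sketch of how a lint dimension might map tool output to a score. The band thresholds are assumptions (the actual rubric is not published); the point is that identical tool output always yields an identical score.

```typescript
// Shape of an ESLint-style per-run summary.
interface LintResult {
  errorCount: number;
  warningCount: number;
}

// Map tool output to a discrete 0-10 score with no human judgment:
// the bands below are hypothetical, but deterministic by construction.
function lintScore(r: LintResult): number {
  if (r.errorCount > 0) return 0;      // any error fails outright
  if (r.warningCount === 0) return 10; // clean output scores full marks
  if (r.warningCount <= 5) return 7;   // a few warnings: partial credit
  return 4;                            // many warnings: low score
}
```

Because the mapping is a pure function of tool output, two scoring runs on the same code cannot disagree — the property LLM-based scoring lacks.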

Phases are sequential. Each must converge before the next begins, progressively building from static analysis through build verification, container orchestration, runtime validation, and finally edge-case completeness.
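The sequential gating can be sketched as a simple loop. This assumes a hypothetical `runTrial` callback and a uniform convergence threshold; the real experiment's threshold and trial mechanics are not published.

```typescript
type Trial = { scores: number[] };

// Phase names from the experiment, in execution order.
const PHASES = [
  "Calibration",
  "Test Execution",
  "Container Verification",
  "Runtime Validation",
  "Edge Case & Completeness",
];

// A phase converges when every active dimension clears the threshold
// (threshold value is an assumption for illustration).
function converged(trial: Trial, threshold = 9): boolean {
  return trial.scores.every((s) => s >= threshold);
}

// Phases run strictly in order; a later phase never starts until the
// current one has produced a converged trial.
function runExperiment(runTrial: (phase: string) => Trial): void {
  for (const phase of PHASES) {
    let trial: Trial;
    do {
      trial = runTrial(phase); // revise methodology, build 3 apps, score
    } while (!converged(trial));
  }
}
```

The gate explains the trial timeline's shape: repeated trials within a phase until scores clear the bar, then the next phase's dimensions activate.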

Results

- 28 trials
- 64 failure modes
- 5/5 phases converged

Calibration Finding

Trial 0 established the calibration baseline. The methodology's builder (LLM) self-assessed near-perfect quality. Discrete tool scoring measured actual quality at less than half that level — confirming the core hypothesis that LLM scoring systematically overstates code quality.

- LLM self-assessment: 99.2%
- Discrete tool score: 46.7%

Score Trajectory


Failure Mode Discovery


64 failure modes were discovered across 28 trials, and all 5 phases converged, with all 24 dimensions passing. Trial 27 (methodology v3.0-dc) scored 9.29 on average — the second consecutive clean Phase 4 trial, converging the final phase (Edge Case & Completeness). The experiment is complete.

Phase Progression

Each phase adds infrastructure requirements and activates new scoring dimensions. Phases are sequential — each must converge before the next begins.

| Phase | Name | Focus | Dimensions | Status |
|-------|------|-------|------------|--------|
| 0 | Calibration | Static analysis + type checking | 14 | Converged |
| 1 | Test Execution | Tests must pass, coverage thresholds | 2 | Converged |
| 2 | Container Verification | Docker builds, healthchecks, security scans | 1 | Converged |
| 3 | Runtime Validation | Performance, accessibility, active security | 5 | Converged |
| 4 | Edge Case & Completeness | Business logic, feature completeness, UX | 2 | Converged |

Transparency

Honest accounting of the experiment's design choices, limitations, and relationship to Phase 1.

Design Choices

Minimum-across-projects scoring

Each dimension score is the minimum across all three project domains. This is deliberately conservative — a methodology rule that works for analytics but fails for event management does not pass. This means scores are lower than averages would suggest.

Scorer bug fixes applied retroactively

Trial 0 was re-scored after four scorer implementation bugs were fixed; the re-scored results are canonical. This is expected during calibration — the scorer itself is being validated alongside the methodology.

Known Limitations

Same tech stack as Phase 1

Only the NestJS + Next.js + Prisma + PostgreSQL + Turborepo stack is tested. Results may not generalize to other frameworks or languages.

Single AI system

All building is performed by Claude. The scorer is automated tooling, but the builder remains a single AI system without cross-model validation.

Progressive infrastructure dependency

Later phases require Docker, running containers, and network access. Environment differences between scoring runs could affect reproducibility.

Relationship to Phase 1

Builds on layered-convergence methodology

The master methodology from Phase 1 (v1.0) is the starting point. Phase 2 does not start from scratch — it starts from the methodology that converged across 10 layers with LLM scoring, and measures how well that methodology holds up under tool-based verification.

Data Access

Trial source code and aggregate results are open source. Scoring tooling, methodology documents, and per-dimension breakdowns are not published.

Trial Timeline

| Trial | Phase | Score | New Failure Modes |
|-------|-------|-------|-------------------|
| 0 | Calibration | 4.67 | 6 |
| 1 | Calibration | 8.83 | 3 |
| 2 | Calibration | 8.50 | 1 |
| 3 | Calibration | 5.67 | 0 |
| 4 | Calibration | 5.29 | 0 |
| 5 | Calibration | 4.17 | 0 |
| 6 | Calibration | 5.71 | 2 |
| 7 | Calibration | 5.50 | 0 |
| 8 | Calibration | 5.54 | 6 |
| 9 | Calibration | 6.88 | 1 |
| 10 | Calibration | 6.17 | 5 |
| 11 | Calibration | 6.50 | 2 |
| 12 | Calibration | 6.33 | 2 |
| 13 | Calibration | 6.88 | 0 |
| 14 | Calibration | 6.79 | 0 |
| 15 | Test Execution | 9.31 | 1 |
| 16 | Test Execution | 9.31 | 0 |
| 17 | Container Verification | 8.94 | 6 |
| 18 | Container Verification | 8.94 | 0 |
| 19 | Runtime Validation | 6.55 | 5 |
| 20 | Runtime Validation | 6.32 | 3 |
| 21 | Runtime Validation | 7.36 | 0 |
| 22 | Runtime Validation | 7.64 | 10 |
| 23 | Runtime Validation | 4.17 | 11 |
| 24 | Runtime Validation | 9.21 | 0 |
| 25 | Runtime Validation | 9.13 | 0 |
| 26 | Edge Case & Completeness | 9.17 | 0 |
| 27 | Edge Case & Completeness | 9.29 | 0 |