
Target Repositories

Selection throughline: recognizable brand + acknowledged engineering excellence + mature test suite + commits after the model cutoff. v1 also requires Python + pytest (single-command harness, isolable deps) and — importantly — rich, idiosyncratic test infrastructure, because that is exactly where generic grep struggles and semantic tools should shine. A repo with trivial tests will show no gap no matter how good the tooling.

Test-infrastructure depth (the independent variable)

Test-file discoverability is the variable the gradient is built on, so it is worth stating concretely. “Root” is the repo’s test directory (tests/ for pydantic and dbt-core, test/ for SQLAlchemy). “Depth” counts directory levels below that root, so depth 0 means a test file lives directly in the root.

| Repo | Test root | Dirs under root | Max test-file depth | Test files | Depth distribution |
|---|---|---|---|---|---|
| pydantic | tests/ | 18 (17 subdirs) | 1 | 90 | depth 0: 71 · depth 1: 19 |
| dbt-core | tests/ | 105 † | — † | — † | — † |
| SQLAlchemy | test/ | 34 (33 subdirs) | 2 | 225 | depth 1: 152 · depth 2: 73 |

Reading the rows:

† dbt-core is not re-measurable at HEAD. The “105 test directories” figure is the one recorded during the study. Three things block re-measurement: dbt-core’s main has since been rewritten in Rust (crates/, lib/), so no Python tests/ tree remains; src/atw/config.py clones HEAD rather than pinning a commit SHA; and the study’s data/ directory (including data/commits/dbt-core/) is git-ignored and was not retained. The depth and file-count figures therefore cannot be reconstructed without the study-era SHA. This is a disclosed reproducibility gap; pinning a SHA per repo in config.py is the fix for any re-run.

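The pydantic and SQLAlchemy figures were measured at current HEAD via find <root> -type f \( -name 'test_*.py' -o -name '*_test.py' \), bucketed by path depth below the root. The parentheses are needed so that -type f applies to both name patterns; without them, -o binds more loosely and the second pattern is matched without the file-type check.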
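The depth bucketing described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the script used in the study: the function names `is_test_file` and `depth_distribution` are hypothetical, and the two filename patterns are taken from the find command above.

```python
from collections import Counter
from fnmatch import fnmatchcase
from pathlib import Path


def is_test_file(name: str) -> bool:
    """Match the same two filename patterns as the find command."""
    return fnmatchcase(name, "test_*.py") or fnmatchcase(name, "*_test.py")


def depth_distribution(root: Path) -> Counter:
    """Count test files per directory level below root.

    Depth 0 means the file sits directly in root, depth 1 means it sits
    in an immediate subdirectory, and so on.
    """
    counts: Counter = Counter()
    for path in root.rglob("*.py"):
        if path.is_file() and is_test_file(path.name):
            # The relative path has one part per directory plus the
            # filename itself, so subtract 1 to get the directory depth.
            depth = len(path.relative_to(root).parts) - 1
            counts[depth] += 1
    return counts
```

In this sketch, the "Test files" column corresponds to `sum(counts.values())` and the "Max test-file depth" column to `max(counts)`, assuming the checkout is at the same commit the table was measured on.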

v1 default (set in config.py)

Fallback if deps are painful: Pydantic (immaculate tests, trivial deps). Other strong Python/pytest options, swappable via one config line: Sentry backend (Sentry), scikit-learn (NumFOCUS), FastAPI (tiangolo) / Starlette (Encode).
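To make the "one config line" swap and the per-repo SHA pinning concrete, here is a minimal sketch of what such a config could look like. It is not the actual contents of config.py: `RepoTarget`, `TARGETS`, `ACTIVE_REPO`, and `PLACEHOLDER_SHA` are hypothetical names, the active repo shown is illustrative rather than the study's default, and the all-zero SHA is a placeholder that would have to be replaced with a real study-era commit.

```python
import re
from dataclasses import dataclass

_FULL_SHA = re.compile(r"[0-9a-f]{40}")


@dataclass(frozen=True)
class RepoTarget:
    """One target repository, pinned to an exact commit (hypothetical schema)."""

    name: str
    url: str
    test_root: str
    sha: str  # full 40-character commit SHA, never a branch name or "HEAD"

    def __post_init__(self) -> None:
        # Reject anything that is not a full SHA so a re-run cannot
        # silently drift to whatever HEAD happens to be.
        if not _FULL_SHA.fullmatch(self.sha):
            raise ValueError(f"{self.name}: expected a full commit SHA, got {self.sha!r}")

    def checkout_commands(self, dest: str) -> list[list[str]]:
        """Git commands that clone the repo and detach at the pinned commit."""
        return [
            ["git", "clone", self.url, dest],
            ["git", "-C", dest, "checkout", "--detach", self.sha],
        ]


# Placeholder only: replace with the real commit recorded for each repo.
PLACEHOLDER_SHA = "0" * 40

TARGETS = {
    "pydantic": RepoTarget(
        "pydantic", "https://github.com/pydantic/pydantic", "tests/", PLACEHOLDER_SHA
    ),
    "dbt-core": RepoTarget(
        "dbt-core", "https://github.com/dbt-labs/dbt-core", "tests/", PLACEHOLDER_SHA
    ),
    "sqlalchemy": RepoTarget(
        "sqlalchemy", "https://github.com/sqlalchemy/sqlalchemy", "test/", PLACEHOLDER_SHA
    ),
}

# The one-line swap: change this key to target a different repo.
# (Illustrative value, not the study's actual default.)
ACTIVE_REPO = "pydantic"
```

Validating the SHA at construction time is a design choice in this sketch: it turns the "clones HEAD" failure mode into an immediate error instead of a silent reproducibility gap.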

Phase-2 diversity portfolio (one marquee repo per language)

Supports the strongest claim — “consistent across repos and languages.” Each non-Python repo needs its own runner adapter.

| Language | Repo | Org |
|---|---|---|
| Python | scikit-learn / Sentry | NumFOCUS / Sentry |
| TypeScript/JS | TypeScript / React+Jest | Microsoft / Meta |
| Go | Terraform or Vault | HashiCorp |
| Java | Guava or Elasticsearch | Google / Elastic |
| Rust | Polars or ripgrep | Polars / BurntSushi |

All have prod+test co-change commits after the cutoff, so contamination control holds.
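The "prod+test co-change after the cutoff" criterion can be expressed as a small predicate. The following is a sketch under assumptions, not the study's actual filter: the `Commit` record, the function names, and the `suffixes` parameter are hypothetical, and "prod" is taken to mean any source file outside the test root.

```python
from dataclasses import dataclass
from datetime import datetime
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Commit:
    """Minimal commit record (hypothetical shape)."""

    sha: str
    committed_at: datetime   # timezone-aware
    paths: tuple[str, ...]   # repo-relative paths with forward slashes


def _is_source(path: str, suffixes: tuple[str, ...]) -> bool:
    """True if the file extension marks it as source code."""
    return PurePosixPath(path).suffix in suffixes


def _under(path: str, root: str) -> bool:
    """True if path lies inside the root directory (component-wise match)."""
    return PurePosixPath(path).is_relative_to(PurePosixPath(root))


def is_cochange_after_cutoff(
    commit: Commit,
    cutoff: datetime,
    test_root: str,
    suffixes: tuple[str, ...] = (".py",),
) -> bool:
    """True if the commit postdates the cutoff and touches both test and non-test source."""
    if commit.committed_at <= cutoff:
        return False
    sources = [p for p in commit.paths if _is_source(p, suffixes)]
    touches_test = any(_under(p, test_root) for p in sources)
    touches_prod = any(not _under(p, test_root) for p in sources)
    return touches_test and touches_prod
```

For a non-Python repo, the same predicate would need that language's source suffixes and test layout, which is one reason each language in the portfolio needs its own adapter.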

Runner notes