How we measured
This page exists so you don’t have to take our word for the firewall numbers. It documents what was measured, how, on what corpus, with what assumptions, and what we are explicitly not claiming. The 57.6% average and 93.4% strict-mode peak quoted on the landing page come from the analysis below.
One paragraph
We took 20 real production codebases (12,346 commit/range pairs, 232,242 changed records), and for each change we estimated two numbers: how much existing source a competent text-tool agent would need to read to make that change, and how much existing source pandō’s handle-first firewall would disclose for the same change. The reduction between the two is what we report. These are modeled estimates against an efficient text-tool baseline, not measurements of live agent sessions. Newly-authored code is excluded from the savings — it has to cross the boundary in every system.
What “20 real repos” means
The corpus is weighted toward large, active, recognizable codebases rather than toy projects. Names you will know:
| Repository | Records | Avg baseline | Avg reduction |
|---|---|---|---|
| ReactiveX/RxJava | 59,642 | 3,105,925 | 60.48% |
| apache/iceberg | 30,750 | 1,424,775 | 59.15% |
| StackExchange/StackExchange.Redis | 20,125 | 1,207,656 | 56.27% |
| AppFlowy-IO/AppFlowy | 24,086 | 1,204,865 | 55.93% |
| AvaloniaUI/Avalonia | 14,271 | 776,249 | 55.23% |
| Azure/azure-functions-host | 14,689 | 774,267 | 55.55% |
| FirebaseExtended/flutterfire | 12,830 | 696,173 | 56.48% |
| TanStack/query | 12,287 | 674,807 | 55.41% |
| HangfireIO/Hangfire | 10,205 | 573,846 | 52.82% |
| Azure/azure-powershell | 11,401 | 522,168 | 56.38% |
| alibaba/arthas | 6,214 | 279,215 | 57.90% |
| Baseflow/flutter-plugins | 3,688 | 201,819 | 56.66% |
| OrchardCMS/OrchardCore | 2,617 | 120,846 | 62.05% |
| apache/camel | 2,251 | 118,495 | 56.71% |
| Azure/azure-sdk-for-net | 1,938 | 116,446 | 69.49% |
Other repos contributing to the corpus include apache/druid, ClickHouse/ClickHouse, apache/arrow, bitcoin/bitcoin, and GopeedLab/gopeed. Totals: 20 repositories, 12,346 commit/range pairs, 232,242 changed file/hunk records. Of the 20 repositories, 19 had a nonzero baseline; the remaining one was excluded as a corner case.
How much existing source never has to leave
| Band | Reduction | Read as |
|---|---|---|
| Min | 26.43% | Conservative floor: skeptical text baseline, higher pandō disclosure. |
| Avg | 57.61% | Central modeled estimate: competent text-tool agent, metadata-first pandō with selective escalation. |
| Max | 93.40% | Strict handle-first mode: raw reads blocked, mechanical ops disclose zero bodies, redacted edits charged small nonzero disclosure. |
By language
| Language | Min | Avg | Max |
|---|---|---|---|
| C++ | 27.37% | 62.12% | 92.51% |
| C# | 23.36% | 55.97% | 93.03% |
| Dart | 25.48% | 56.03% | 92.68% |
| Java | 29.70% | 59.63% | 94.04% |
| TS / JS | 23.91% | 55.75% | 93.18% |
By operation
Structural operations — rename, move, delete, signature changes, imports — reduce existing-source disclosure by about 85–100% in the avg band because they need handles, signatures, counts, and dependency metadata, not bodies. Semantic edits land lower because the model genuinely needs to inspect or transform logic.
| Operation | Min | Avg | Max |
|---|---|---|---|
| delete | 83.33% | 97.50% | 100.00% |
| import / namespace | 80.00% | 100.00% | 100.00% |
| rename / move | 77.78% | 96.43% | 100.00% |
| change-signature | 63.64% | 85.29% | 95.59% |
| replace / edit | 30.16% | 70.00% | 87.88% |
| replace-body | 16.06% | 50.00% | 92.66% |
| insert / add (whole new file) | 0.00% | 0.00% | 0.00% |
| other / no-claim | 0.00% | 0.00% | 0.00% |
insert / add here means whole new language files where the existing-source baseline is zero by definition (nothing to avoid disclosing). Within-file insertions are classified under replace / edit or replace-body.
What actually crosses the boundary
The firewall is a disclosure ladder. Operations start at the lowest tier that can satisfy them and escalate only on explicit, logged need.
| Tier | What crosses | Source exposed |
|---|---|---|
| 0 — Handle only | Opaque ID, e.g. cap_7f3a91 | None |
| 1 — Structural metadata | Kind, signature, arity, modifiers, reference and caller counts, handles of dependents | None |
| 2 — Semantic body | Source for one node, with identifiers obfuscated (e.g. parseCard → fn_a) | One node only, never its neighborhood |
Per-operation minimum tier
| Operation | Min tier | What is disclosed | Existing source sent |
|---|---|---|---|
| rename | 0 | Handle + new name | Zero; pandō rewrites refs locally |
| delete | 0 → 1 | Handle + dependency metadata if repair decisions needed | Zero by default; dependent signatures only on escalation |
| find-references / callers | 1 | Handle, kind, signature, file hint, counts | Zero; Tier 2 only for selected escalated call sites |
| find-nodes | 1 | Handles, names, kinds, signatures, arity, counts | Zero until escalation |
| change-signature | 1 | Current signature, reference count, call-site signatures | Zero for mechanical changes; targeted Tier 2 where logic changes |
| insert-code | 0 or 1 | Handle anchor, or ordered child kinds for agent-chosen anchor | Zero or structure only |
| replace-node | 0 or 2 | Handle + new code, or the one old node for transform-in-place | Zero for fresh replacement; one node for transform |
| replace-body | 1 or 2 | Signature plus authored body; old body only when modifying existing logic | Signature only, or one body |
| filter-map-reduce | 1 | Match metadata, counts, sample handles/signatures | Zero; pandō applies matches locally |
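As a rough illustration of the ladder and the per-operation minimum tiers above, here is a minimal Python sketch. The names `Tier`, `MIN_TIER`, and `request_tier` are hypothetical and are not pandō's API. The tier values are copied from the table, taking the lower tier wherever the table gives a range.

```python
from enum import IntEnum


class Tier(IntEnum):
    HANDLE_ONLY = 0          # opaque ID only
    STRUCTURAL_METADATA = 1  # kind, signature, arity, counts
    SEMANTIC_BODY = 2        # source for one node, identifiers obfuscated


# Starting tier per operation, copied from the table above.
# Where the table gives a range ("0 → 1", "0 or 2"), the lower value is used.
MIN_TIER = {
    "rename": Tier.HANDLE_ONLY,
    "delete": Tier.HANDLE_ONLY,
    "find-references": Tier.STRUCTURAL_METADATA,
    "find-nodes": Tier.STRUCTURAL_METADATA,
    "change-signature": Tier.STRUCTURAL_METADATA,
    "insert-code": Tier.HANDLE_ONLY,
    "replace-node": Tier.HANDLE_ONLY,
    "replace-body": Tier.STRUCTURAL_METADATA,
    "filter-map-reduce": Tier.STRUCTURAL_METADATA,
}


def request_tier(operation, wanted, reason=None, audit_log=None):
    """Return the tier granted; going above the starting tier needs a reason."""
    floor = MIN_TIER[operation]
    if wanted <= floor:
        # Operations start at their lowest sufficient tier.
        return floor
    if not reason:
        raise PermissionError(
            f"{operation}: escalation to {wanted.name} needs a logged reason"
        )
    if audit_log is not None:
        audit_log.append((operation, wanted.name, reason))
    return wanted
```

In this sketch a `rename` stays at Tier 0 unless the caller supplies a reason, which mirrors the "escalate only on explicit, logged need" rule.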
The policy file
A .pando-policy.toml at the project root can mark paths or handles that may never escalate to Tier 2 — enforced by the firewall, not requested of the model:
never_escalate = ["crypto/**", "secrets/**"]
Any escalation requires an explicit logged reason. The audit log records every byte that crossed the boundary and the operation that required it.
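To make the enforcement idea concrete, here is a small Python sketch of a never-escalate check with an audit entry per attempt. It is an assumption-laden illustration: the function names are hypothetical, the pattern list simply mirrors the line above rather than parsing the TOML file, and pandō's real glob semantics may differ from Python's `fnmatch`.

```python
from fnmatch import fnmatchcase

# Mirrors the never_escalate line shown above; loading the TOML file is omitted.
NEVER_ESCALATE = ["crypto/**", "secrets/**"]


def may_escalate_to_tier2(path, patterns=NEVER_ESCALATE):
    """Return False if the path matches any never-escalate glob."""
    # fnmatchcase lets "*" match "/" as well, so "crypto/**" covers the
    # whole subtree. This is an assumption about the intended semantics.
    return not any(fnmatchcase(path, pattern) for pattern in patterns)


def escalate(path, reason, audit_log):
    """Log an escalation attempt; refuse protected paths and missing reasons."""
    if not reason:
        raise ValueError("escalation requires an explicit reason")
    allowed = may_escalate_to_tier2(path)
    audit_log.append({"path": path, "reason": reason, "allowed": allowed})
    return allowed
```

The check runs on the firewall side, so a refused path never reaches Tier 2 regardless of what the model asks for.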
Bug-fix escalation, minimized
When the model genuinely needs to see logic (e.g. a bug fix in a single function), the firewall still minimizes disclosure:
- Send one function body, not the file.
- Keep callees as handles instead of inlining their source.
- Do not send neighboring code.
- Obfuscate identifiers before sending: parseCard → fn_a, customerLedger → v_7.
- Preserve type structure; strip or tokenize strings and comments unless the bug is plausibly inside them.
- Map the returned patch back locally using pandō’s symbol map.
In the worst case the model sees one obfuscated function body with types preserved and names tokenized. Domain meaning does not need to cross.
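The obfuscate-then-map-back steps above can be sketched in a few lines of Python. This is a simplified illustration, not pandō's implementation: `obfuscate` and `restore` are hypothetical names, the token scheme (`fn_0`, `v_0`) is made up for the example, and the sketch ignores strings, comments, and the possibility that a token already occurs in the source.

```python
import re


def obfuscate(source, symbols):
    """Replace known identifiers with opaque tokens.

    symbols maps identifier -> kind ("fn" or "v"). Returns the obfuscated
    text and the token -> identifier map needed to reverse it locally.
    """
    forward = {}
    counters = {}
    for name, kind in symbols.items():
        index = counters.get(kind, 0)
        counters[kind] = index + 1
        forward[name] = f"{kind}_{index}"
    if not forward:
        return source, {}
    # Word boundaries keep "parse" from matching inside "parseCard".
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, forward)) + r")\b")
    obfuscated = pattern.sub(lambda m: forward[m.group(1)], source)
    return obfuscated, {token: name for name, token in forward.items()}


def restore(patch, reverse):
    """Map tokens in a returned patch back to the original identifiers."""
    if not reverse:
        return patch
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, reverse)) + r")\b")
    return pattern.sub(lambda m: reverse[m.group(1)], patch)
```

Only the obfuscated text would cross the boundary in this sketch; the reverse map stays local and is applied to whatever patch comes back.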
How the numbers were produced
The baseline
The comparison is not against the original human diff — human authors have prior knowledge, IDE state, and review context that the model doesn’t. The baseline is a competent coding agent with shell/text tools only: rg, git grep, git diff, sed, nl, head / tail, and targeted test/build commands. It is assumed to search efficiently and inspect targeted source windows, not dump whole files. This is a deliberately realistic and capable baseline — we are not measuring against a strawman.
What is estimated per change
- text_agent_existing_source: existing source a competent text-tool agent would need to inspect.
- pando_firewall_existing_source: existing source disclosed by pandō under handle-first behavior with progressive escalation.
- existing_source_avoided: the difference. Newly authored code is excluded.
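A minimal Python sketch of how the three quantities relate. It assumes the reported reduction is the avoided amount divided by the text-agent baseline, and that a zero baseline yields 0% (as with whole-new-file inserts); both function names are illustrative.

```python
def existing_source_avoided(text_agent_existing_source,
                            pando_firewall_existing_source):
    """Existing source the firewall kept from crossing the boundary."""
    return text_agent_existing_source - pando_firewall_existing_source


def reduction_percent(text_agent_existing_source,
                      pando_firewall_existing_source):
    """Avoided source as a percentage of the text-agent baseline."""
    if text_agent_existing_source == 0:
        # Nothing existing to avoid disclosing, so no savings are claimed.
        return 0.0
    avoided = existing_source_avoided(
        text_agent_existing_source, pando_firewall_existing_source
    )
    return 100.0 * avoided / text_agent_existing_source
```

For example, a change where the text agent would read 40 lines and pandō would disclose 20 gives `reduction_percent(40, 20) == 50.0`.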
Pipeline
- Select task: start from the pre-change commit; use commit, PR, or issue text as the prompt. The final human diff is used only as oracle evidence for classification, not shown to the modeled agent.
- Classify each changed unit: via git diff, --name-status --find-renames, --numstat, and AST diff where possible. Buckets: insert/add, replace/edit, replace-body, delete, rename/move, change-signature, import/namespace, filter-map-reduce, discovery, other/no-claim.
- Estimate text-agent inspection: targeted source windows (e.g. 3–10 lines per relevant reference; declaration plus representative call-site windows for renames; import blocks per affected file).
- Estimate pandō inspection (handle-first): opaque handle, kind/name, signature/arity, file hint, reference counts, structural outline. Mechanical structural operations disclose zero body lines by default; semantic edits escalate only as needed.
- Compare: avoided = text_agent − pando; aggregate by language, repo, operation, confidence, sample, and repo-size bucket.
- Report metrics separately: existing-source reduction, total token reduction, newly authored volume, diagnostics volume, and metadata volume are not collapsed into one number.
- Assign confidence labels: high = deterministic file/AST evidence; medium = strong diff/message evidence without AST proof; low = ambiguous, classified as other/no-claim.
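As a sketch of the classification step above, the following Python parses the text that `git diff --name-status --find-renames` prints (one status letter, an optional similarity score, then tab-separated paths) and assigns a first-pass bucket. The status-to-bucket mapping is an assumption made for illustration; the real pipeline also draws on --numstat and AST diffs to separate buckets such as replace-body and change-signature.

```python
from collections import Counter

# Assumed first-pass mapping from git status letters to buckets.
STATUS_TO_BUCKET = {
    "A": "insert/add",
    "D": "delete",
    "R": "rename/move",
    "M": "replace/edit",
}


def classify_name_status(output):
    """Bucket each line of `git diff --name-status --find-renames` output."""
    records = []
    for line in output.splitlines():
        if not line.strip():
            continue
        status, *paths = line.split("\t")
        if not paths:
            continue  # malformed line, nothing to classify
        # Renames look like "R100"; the digits are the similarity score.
        bucket = STATUS_TO_BUCKET.get(status[0], "other/no-claim")
        # For renames the last path is the new location.
        records.append((bucket, paths[-1]))
    return records


def bucket_counts(records):
    """Count classified records per bucket."""
    return Counter(bucket for bucket, _ in records)
```

Anything the mapping does not recognize falls into other/no-claim, which matches the low-confidence rule in the last step.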
What the three bands mean
The min / avg / max bands are deterministic sensitivity bands over the same classified corpus — not statistical confidence intervals. They model what happens when you vary the disclosure-policy aggressiveness, not what happens when you resample the corpus.
| Band | Policy assumption |
|---|---|
| Min | Skeptical text baseline; higher pandō disclosure per operation. |
| Avg | Competent text-tool agent; metadata-first pandō with selective escalation. |
| Max | Strict handle-first mode: raw reads blocked, mechanical ops disclose zero bodies, redacted edits charged small nonzero source-equivalent cost. |
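To show why these are deterministic bands rather than intervals, here is a small Python sketch that recomputes one fixed workload under each band. The per-operation rates are copied from the "By operation" table; the baseline line counts are hypothetical, and weighting by baseline size is an assumption about how aggregation works.

```python
# Per-operation reduction rates (percent), copied from the "By operation" table.
RATES = {
    "delete": {"min": 83.33, "avg": 97.50, "max": 100.00},
    "rename / move": {"min": 77.78, "avg": 96.43, "max": 100.00},
    "replace-body": {"min": 16.06, "avg": 50.00, "max": 92.66},
}

# Hypothetical workload: existing-source lines a text agent would read.
BASELINE_LINES = {"delete": 100, "rename / move": 100, "replace-body": 200}


def band_reduction(band, baseline=BASELINE_LINES, rates=RATES):
    """Baseline-weighted reduction (percent) for one band."""
    total = sum(baseline.values())
    avoided = sum(
        lines * rates[op][band] / 100.0 for op, lines in baseline.items()
    )
    return 100.0 * avoided / total


bands = {band: band_reduction(band) for band in ("min", "avg", "max")}
```

The same inputs always produce the same three numbers; only the policy assumption changes between bands, and nothing is resampled.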
Corpus filtering
Excluded before classification: generated outputs, reference baselines, snapshots, fixtures and examples, vendor drops, temporary/output paths, generated bundles, hashed static assets, bundled/minified JS, and generated C#/Dart convention files. Normal test source is retained and reported separately from non-test source.
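A path filter of this kind might look like the Python sketch below. The glob list and the test-path heuristic are illustrative assumptions, not the filter actually used for the corpus; they only show the shape of "exclude first, then report test and non-test source separately".

```python
from fnmatch import fnmatchcase

# Illustrative exclusion globs only; the actual filter list is not given here.
EXCLUDE_GLOBS = [
    "vendor/*", "*/vendor/*",      # vendor drops
    "*/__snapshots__/*",           # snapshots
    "*/fixtures/*",                # fixtures
    "*.min.js",                    # minified JS
    "*.g.cs", "*.Designer.cs",     # generated C# convention files
    "*.g.dart", "*.freezed.dart",  # generated Dart convention files
]


def is_excluded(path):
    """True if the path matches any exclusion glob."""
    return any(fnmatchcase(path, glob) for glob in EXCLUDE_GLOBS)


def is_test_source(path):
    """Crude heuristic (an assumption): a directory named test or tests."""
    directories = path.lower().split("/")[:-1]
    return any(part in ("test", "tests") for part in directories)


def partition(paths):
    """Drop excluded paths, then split the rest into test and non-test."""
    kept = [p for p in paths if not is_excluded(p)]
    return {
        "test": [p for p in kept if is_test_source(p)],
        "non_test": [p for p in kept if not is_test_source(p)],
    }
```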
What we are not claiming
- These are modeled existing-source exposure estimates, not direct measurements of live agent sessions.
- The comparison estimates what a competent non-pandō agent would inspect — not what the original human author inspected. Human prior knowledge is not counted.
- The text baseline is deliberately realistic and efficient; results would look better against a strawman, but we don’t measure against one.
- Savings apply to existing-source disclosure only. Newly authored code crosses the boundary in every system, including pandō.
- Handle-first firewall mode is a policy model unless explicitly measured in a current implementation.
- Some semantic edits require source disclosure; those are counted as progressive escalation, not as zero.
- The min / avg / max bands are sensitivity bands, not statistical confidence intervals.
Re-run the analysis
The full rebuild entrypoint is:
results/repro_full_v5_command.sh
It reconstructs the corpus from public git history rather than expanding a prior normalized dataset. Saved manifests, commit ranges, raw diffs, logs, scripts, JSONL, and per-shard SQLite ledgers are sufficient for reviewer replay without storing full repository clones.
Questions about the methodology?
Numbers should survive scrutiny. If something here doesn’t add up, we want to hear about it.
Contact