How we measured

pandō Firewall · Results & Methodology

This page exists so you don’t have to take our word for the firewall numbers. It documents what was measured, how, on what corpus, with what assumptions, and what we are explicitly not claiming. The 57.6% average and 93.4% strict-mode peak quoted on the landing page come from the analysis below.

One paragraph

We took 20 real production codebases (12,346 commit/range pairs, 232,242 changed file/hunk records), and for each change we estimated two numbers: how much existing source a competent text-tool agent would need to read to make that change, and how much existing source pandō’s handle-first firewall would disclose for the same change. The reduction between the two is what we report. These are modeled estimates against an efficient text-tool baseline, not measurements of live agent sessions. Newly authored code is excluded from the savings because it has to cross the boundary in every system.

the corpus

What “20 real repos” means

The corpus is weighted toward large, active, recognizable codebases rather than toy projects. Names you will know:

Largest repo contributions, avg band
Repository	Records	Avg baseline	Avg reduction
ReactiveX/RxJava	59,642	3,105,925	60.48%
apache/iceberg	30,750	1,424,775	59.15%
StackExchange/StackExchange.Redis	20,125	1,207,656	56.27%
AppFlowy-IO/AppFlowy	24,086	1,204,865	55.93%
AvaloniaUI/Avalonia	14,271	776,249	55.23%
Azure/azure-functions-host	14,689	774,267	55.55%
FirebaseExtended/flutterfire	12,830	696,173	56.48%
TanStack/query	12,287	674,807	55.41%
HangfireIO/Hangfire	10,205	573,846	52.82%
Azure/azure-powershell	11,401	522,168	56.38%
alibaba/arthas	6,214	279,215	57.90%
Baseflow/flutter-plugins	3,688	201,819	56.66%
OrchardCMS/OrchardCore	2,617	120,846	62.05%
apache/camel	2,251	118,495	56.71%
Azure/azure-sdk-for-net	1,938	116,446	69.49%

Other repos contributing to the corpus include apache/druid, ClickHouse/ClickHouse, apache/arrow, bitcoin/bitcoin, and GopeedLab/gopeed. Totals: 20 repositories, 12,346 commit/range pairs, and 232,242 changed file/hunk records. 19 of the 20 repositories had a nonzero baseline; the remaining one was excluded as a corner case.

headline results

How much existing source never has to leave

Reduction in existing-source disclosure vs. text-tool baseline
Band	Reduction	Read as
Min	26.43%	Conservative floor: skeptical text baseline, higher pandō disclosure.
Avg	57.61%	Central modeled estimate: competent text-tool agent, metadata-first pandō with selective escalation.
Max	93.40%	Strict handle-first mode: raw reads blocked, mechanical ops disclose zero bodies, redacted edits charged small nonzero disclosure.

By language

Language	Min	Avg	Max
C++	27.37%	62.12%	92.51%
C#	23.36%	55.97%	93.03%
Dart	25.48%	56.03%	92.68%
Java	29.70%	59.63%	94.04%
TS / JS	23.91%	55.75%	93.18%

By operation

Structural operations — rename, move, delete, signature changes, imports — reduce existing-source disclosure by about 85–100% in the avg band because they need handles, signatures, counts, and dependency metadata, not bodies. Semantic edits land lower because the model genuinely needs to inspect or transform logic.

Operation	Min	Avg	Max
delete	83.33%	97.50%	100.00%
import / namespace	80.00%	100.00%	100.00%
rename / move	77.78%	96.43%	100.00%
change-signature	63.64%	85.29%	95.59%
replace / edit	30.16%	70.00%	87.88%
replace-body	16.06%	50.00%	92.66%
insert / add (whole new file)	0.00%	0.00%	0.00%
other / no-claim	0.00%	0.00%	0.00%

insert / add here means whole new language files where the existing-source baseline is zero by definition (nothing to avoid disclosing). Within-file insertions are classified under replace / edit or replace-body.

the mechanism

What actually crosses the boundary

The firewall is a disclosure ladder. Operations start at the lowest tier that can satisfy them and escalate only on explicit, logged need.

Disclosure tiers
Tier	What crosses	Source exposed
0 — Handle only	Opaque ID, e.g. `cap_7f3a91`	None
1 — Structural metadata	Kind, signature, arity, modifiers, reference and caller counts, handles of dependents	None
2 — Semantic body	Source for one node, with identifiers obfuscated (e.g. `parseCard → fn_a`)	One node only, never its neighborhood

Per-operation minimum tier

Operation	Min tier	What is disclosed	Existing source sent
rename	0	Handle + new name	Zero; pandō rewrites refs locally
delete	0 → 1	Handle + dependency metadata if repair decisions needed	Zero by default; dependent signatures only on escalation
find-references / callers	1	Handle, kind, signature, file hint, counts	Zero; Tier 2 only for selected escalated call sites
find-nodes	1	Handles, names, kinds, signatures, arity, counts	Zero until escalation
change-signature	1	Current signature, reference count, call-site signatures	Zero for mechanical changes; targeted Tier 2 where logic changes
insert-code	0 or 1	Handle anchor, or ordered child kinds for agent-chosen anchor	Zero or structure only
replace-node	0 or 2	Handle + new code, or the one old node for transform-in-place	Zero for fresh replacement; one node for transform
replace-body	1 or 2	Signature plus authored body; old body only when modifying existing logic	Signature only, or one body
filter-map-reduce	1	Match metadata, counts, sample handles/signatures	Zero; pandō applies matches locally
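
The ladder in the two tables above can be sketched as a small lookup plus an escalation check. This is a minimal illustration, not pandō's implementation: the `Tier` enum, the `START_TIER` mapping, and `resolve_tier` are hypothetical names, and the starting tier for each operation is simply the lowest value shown in the "Min tier" column.

```python
from enum import IntEnum


class Tier(IntEnum):
    HANDLE_ONLY = 0
    STRUCTURAL_METADATA = 1
    SEMANTIC_BODY = 2


# Starting tier per operation: the lowest value in the "Min tier" column.
START_TIER = {
    "rename": Tier.HANDLE_ONLY,
    "delete": Tier.HANDLE_ONLY,
    "find-references": Tier.STRUCTURAL_METADATA,
    "find-nodes": Tier.STRUCTURAL_METADATA,
    "change-signature": Tier.STRUCTURAL_METADATA,
    "insert-code": Tier.HANDLE_ONLY,
    "replace-node": Tier.HANDLE_ONLY,
    "replace-body": Tier.STRUCTURAL_METADATA,
    "filter-map-reduce": Tier.STRUCTURAL_METADATA,
}


def resolve_tier(operation: str, requested: Tier, reason: str = "") -> Tier:
    """Serve the starting tier unless a higher one is requested with a reason."""
    start = START_TIER[operation]
    if requested <= start:
        return start
    if not reason.strip():
        raise PermissionError(
            f"{operation}: escalation to tier {int(requested)} requires a reason"
        )
    return requested
```

A mechanical rename resolves to `Tier.HANDLE_ONLY` with no reason supplied, while asking for `Tier.SEMANTIC_BODY` on the same operation without a reason raises instead of silently disclosing source.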

The policy file

A .pando-policy.toml at the project root can mark paths or handles that may never escalate to Tier 2 — enforced by the firewall, not requested of the model:

never_escalate = ["crypto/**", "secrets/**"]

Any escalation requires an explicit logged reason. The audit log records every byte that crossed the boundary and the operation that required it.
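
To make the enforcement idea concrete, here is a hedged sketch of a check that refuses Tier 2 for protected paths and logs every decision. The pattern list is copied from the example line above; everything else (`may_escalate_to_body`, `escalate`, the log record shape, and the use of fnmatch-style globbing on project-relative paths) is an assumption for illustration, and pandō's actual matching rules may differ.

```python
from fnmatch import fnmatchcase

# Patterns copied from the example policy line above.
NEVER_ESCALATE = ["crypto/**", "secrets/**"]


def may_escalate_to_body(path: str, patterns=NEVER_ESCALATE) -> bool:
    """Return False if the path matches any never_escalate pattern.

    Assumption: fnmatch-style globbing against project-relative POSIX paths.
    """
    return not any(fnmatchcase(path, pattern) for pattern in patterns)


def escalate(path: str, reason: str, audit_log: list) -> bool:
    """Record the decision and return whether Tier 2 disclosure is allowed."""
    if not reason.strip():
        raise ValueError("escalation requires an explicit reason")
    allowed = may_escalate_to_body(path)
    # Denied requests are logged too, so the audit trail shows what was asked.
    audit_log.append({"path": path, "reason": reason, "allowed": allowed})
    return allowed
```

The point of the design is that the decision is made by code on the local side of the boundary: a model can ask for `crypto/aes/keys.c`, but the request is denied and recorded regardless of how it is phrased.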

Bug-fix escalation, minimized

When the model genuinely needs to see logic (e.g. a bug fix in a single function), the firewall still minimizes disclosure:

Send one function body, not the file.
Keep callees as handles instead of inlining their source.
Do not send neighboring code.
Obfuscate identifiers before sending — parseCard → fn_a, customerLedger → v_7.
Preserve type structure; strip or tokenize strings and comments unless the bug is plausibly inside them.
Map the returned patch back locally using pandō’s symbol map.

In the worst case the model sees one obfuscated function body with types preserved and names tokenized. Domain meaning does not need to cross.
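
The obfuscate-then-map-back steps in the list above can be sketched with a symbol map and whole-word substitution. This is a deliberate simplification under stated assumptions: `obfuscate` and `restore` are hypothetical helpers, the symbol map is supplied by hand, and a regex is used where a real tool would rename through the syntax tree. It does not strip or tokenize strings and comments.

```python
import re


def obfuscate(body: str, symbol_map: dict) -> str:
    """Replace each known identifier with its opaque token (whole words only)."""
    if not symbol_map:
        return body
    # Longest names first keeps matching predictable when one name prefixes another.
    names = sorted(symbol_map, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, names)) + r")\b")
    return pattern.sub(lambda m: symbol_map[m.group(1)], body)


def restore(patch: str, symbol_map: dict) -> str:
    """Map a returned patch back to real names using the inverted symbol map."""
    inverse = {token: name for name, token in symbol_map.items()}
    return obfuscate(patch, inverse)


# Mapping taken from the examples in the list above.
SYMBOLS = {"parseCard": "fn_a", "customerLedger": "v_7"}
original = "def parseCard(raw):\n    return customerLedger.lookup(raw)"
sent = obfuscate(original, SYMBOLS)
```

Here `sent` is what would cross the boundary, and `restore(sent, SYMBOLS)` recovers the original text locally. Only the tokens leave; the map from tokens back to domain names stays on the local side.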

methodology

How the numbers were produced

The baseline

The comparison is not against the original human diff — human authors have prior knowledge, IDE state, and review context that the model doesn’t. The baseline is a competent coding agent with shell/text tools only: rg, git grep, git diff, sed, nl, head / tail, and targeted test/build commands. It is assumed to search efficiently and inspect targeted source windows, not dump whole files. This is a deliberately realistic and capable baseline — we are not measuring against a strawman.

What is estimated per change

text_agent_existing_source: existing source a competent text-tool agent would need to inspect.
pando_firewall_existing_source: existing source disclosed by pandō under handle-first behavior with progressive escalation.
existing_source_avoided: text_agent_existing_source minus pando_firewall_existing_source. Newly authored code is excluded.
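
As a minimal sketch of how these three quantities combine into a percentage, assume each is a nonnegative count in the same unit. The `ChangeEstimate` class and `reduction_pct` helper are hypothetical names. Pooling totals across changes before dividing is also an assumption: this page does not state whether the published figures are pooled or averaged per change.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeEstimate:
    """Per-change estimates; field names mirror the definitions above."""
    text_agent_existing_source: int      # what the text-tool baseline would inspect
    pando_firewall_existing_source: int  # what the firewall would disclose

    @property
    def existing_source_avoided(self) -> int:
        return self.text_agent_existing_source - self.pando_firewall_existing_source


def reduction_pct(changes) -> float:
    """Pooled reduction: total avoided over total baseline, as a percentage.

    Assumption: a zero total baseline yields 0.0 (nothing to avoid disclosing).
    """
    baseline = sum(c.text_agent_existing_source for c in changes)
    if baseline == 0:
        return 0.0
    avoided = sum(c.existing_source_avoided for c in changes)
    return 100.0 * avoided / baseline
```

For example, a change with a baseline of 100 and a disclosure of 40 gives a 60% reduction, and adding a whole-new-file change with a zero baseline leaves that pooled figure unchanged.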

Pipeline

Select task — start from the pre-change commit; use commit, PR, or issue text as the prompt. The final human diff is used only as oracle evidence for classification, not shown to the modeled agent.
Classify each changed unit, using git diff --name-status --find-renames, git diff --numstat, and AST diff where possible. Buckets: insert/add, replace/edit, replace-body, delete, rename/move, change-signature, import/namespace, filter-map-reduce, discovery, other/no-claim.
Estimate text-agent inspection — targeted source windows (e.g. 3–10 lines per relevant reference; declaration plus representative call-site windows for renames; import blocks per affected file).
Estimate pandō inspection — handle-first: opaque handle, kind/name, signature/arity, file hint, reference counts, structural outline. Mechanical structural operations disclose zero body lines by default; semantic edits escalate only as needed.
Compare — avoided = text_agent − pando; aggregate by language, repo, operation, confidence, sample, and repo-size bucket.
Report metrics separately — existing-source reduction, total token reduction, newly authored volume, diagnostics volume, and metadata volume are not collapsed into one number.
Assign confidence labels — high = deterministic file/AST evidence; medium = strong diff/message evidence without AST proof; low = ambiguous, classified as other/no-claim.
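
The classification step can be illustrated with the file-level part alone. The sketch below parses the tab-separated output of `git diff --name-status --find-renames` into coarse buckets. The `classify_name_status` function and its status-to-bucket mapping are assumptions for illustration: every modification is lumped into replace/edit here, whereas the pipeline above splits modifications further using AST diff.

```python
def classify_name_status(output: str) -> list:
    """Map `git diff --name-status --find-renames` lines to coarse buckets."""
    buckets = {
        "A": "insert/add",
        "D": "delete",
        "R": "rename/move",
        "M": "replace/edit",
    }
    records = []
    for line in output.splitlines():
        if not line.strip():
            continue
        status, *paths = line.split("\t")
        if not status or not paths:
            continue  # skip malformed lines rather than guess
        # Rename statuses carry a similarity score, e.g. "R100", so key on
        # the first letter only. Unknown letters fall through to no-claim.
        bucket = buckets.get(status[0], "other/no-claim")
        # For renames the last path is the new location.
        records.append({"path": paths[-1], "bucket": bucket})
    return records
```

Feeding it the text captured from the git command for one commit range yields one record per changed file, which can then be refined with per-hunk and AST evidence before any estimate is attached.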

What the three bands mean

The min / avg / max bands are deterministic sensitivity bands over the same classified corpus — not statistical confidence intervals. They model what happens when you vary the disclosure-policy aggressiveness, not what happens when you resample the corpus.

Band	Policy assumption
Min	Skeptical text baseline; higher pandō disclosure per operation.
Avg	Competent text-tool agent; metadata-first pandō with selective escalation.
Max	Strict handle-first mode: raw reads blocked, mechanical ops disclose zero bodies, redacted edits charged small nonzero source-equivalent cost.
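
The difference between a sensitivity band and a confidence interval is easiest to see in code: the corpus is held fixed and only the policy parameters change. In the toy sketch below, every number is invented for illustration and none of it is the parameter set behind the tables on this page. It also varies only the pandō side, whereas the Min band described above additionally uses a more skeptical text baseline.

```python
# Toy classified corpus: (operation, existing source the text-tool agent inspects).
CORPUS = [("rename/move", 40), ("replace-body", 60), ("delete", 20)]

# INVENTED per-operation fractions of the baseline disclosed under each band.
# Placeholders only; not the parameters used for the published results.
DISCLOSED_FRACTION = {
    "min": {"rename/move": 0.25, "replace-body": 0.80, "delete": 0.20},
    "avg": {"rename/move": 0.05, "replace-body": 0.50, "delete": 0.00},
    "max": {"rename/move": 0.00, "replace-body": 0.10, "delete": 0.00},
}


def band_reductions(corpus) -> dict:
    """Re-evaluate the same corpus under each policy; no resampling involved."""
    baseline = sum(size for _, size in corpus)
    results = {}
    for band, fractions in DISCLOSED_FRACTION.items():
        if baseline == 0:
            results[band] = 0.0
            continue
        disclosed = sum(size * fractions[op] for op, size in corpus)
        results[band] = 100.0 * (baseline - disclosed) / baseline
    return results
```

Calling `band_reductions(CORPUS)` twice returns identical values, which is what "deterministic" means here: the spread between bands reflects policy choices, not sampling noise.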

Corpus filtering

Excluded before classification: generated outputs, reference baselines, snapshots, fixtures and examples, vendor drops, temporary/output paths, generated bundles, hashed static assets, bundled/minified JS, and generated C#/Dart convention files. Normal test source is retained and reported separately from non-test source.
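
A path filter of this kind might look like the sketch below. The glob lists are hypothetical stand-ins covering only some of the categories named above, not the actual exclusion rules, and `partition` is an illustrative helper that drops excluded paths and keeps test source as its own group.

```python
from fnmatch import fnmatchcase

# Hypothetical exclusion globs for a few of the categories named above.
EXCLUDE_GLOBS = [
    "*/generated/*", "*.g.cs", "*.g.dart",   # generated outputs, convention files
    "*/__snapshots__/*", "*.snap",           # snapshots
    "*/fixtures/*", "*/examples/*",          # fixtures and examples
    "vendor/*", "*/vendor/*",                # vendor drops
    "*.min.js", "*.bundle.js",               # bundled/minified JS
]

# Hypothetical markers for normal test source, which is retained.
TEST_GLOBS = [
    "test/*", "tests/*", "*/test/*", "*/tests/*",
    "*Test.java", "*Tests.cs", "*_test.dart", "*.test.ts",
]


def matches_any(path: str, globs) -> bool:
    return any(fnmatchcase(path, glob) for glob in globs)


def partition(paths):
    """Drop excluded paths, then split the rest into (non_test, test)."""
    kept = [p for p in paths if not matches_any(p, EXCLUDE_GLOBS)]
    test = [p for p in kept if matches_any(p, TEST_GLOBS)]
    non_test = [p for p in kept if not matches_any(p, TEST_GLOBS)]
    return non_test, test
```

Running the exclusion before the test split matters: a snapshot that lives under a test directory is dropped, not counted as test source.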

standing caveats

What we are not claiming

These are modeled existing-source exposure estimates, not direct measurements of live agent sessions.
The comparison estimates what a competent non-pandō agent would inspect — not what the original human author inspected. Human prior knowledge is not counted.
The text baseline is deliberately realistic and efficient; results would look better against a strawman, but we don’t measure against one.
Savings apply to existing-source disclosure only. Newly authored code crosses the boundary in every system, including pandō.
Handle-first firewall mode is modeled here as a policy; treat its figures as projections unless they have been explicitly measured against a current implementation.
Some semantic edits require source disclosure; those are counted as progressive escalation, not as zero.
The min / avg / max bands are sensitivity bands, not statistical confidence intervals.

reproduce it

Re-run the analysis

The full rebuild entrypoint is:

results/repro_full_v5_command.sh

It reconstructs the corpus from public git history rather than expanding a prior normalized dataset. Saved manifests, commit ranges, raw diffs, logs, scripts, JSONL, and per-shard SQLite ledgers are sufficient for reviewer replay without storing full repository clones.

Questions about the methodology?

Numbers should survive scrutiny. If something here doesn’t add up, we want to hear about it.
