Maximize Pwn, Minimize Tokens
how we built a per-prompt model evaluation system for a continuous, real-world security testing framework

Authors: Sitaraman Subramanian · Aditya Peela · Dhruva Goyal
TL;DR
We're building Pentest Copilot: AI agents for real-world penetration testing. Behind the scenes, the system is not one big LLM call. It is a set of distinct prompt types: validating possible vulnerabilities, reasoning about authentication, summarising evidence, and so on.
Those prompts should not all use the same model.
A prompt that makes a final vulnerability call has a high cost of being wrong. A prompt that runs inside a test-generation loop may be called thousands of times per scan, so a small per-call cost increase becomes expensive quickly. A prompt that only formats already-known evidence mostly needs to be cheap, valid JSON, and consistent.
So the question became:
Which model should each prompt use?
The AI-maxxing answer would be: "just use Claude Opus 4.6 or GPT-5.4High for everything."
The answer we ended up with: build an end-to-end eval system that chooses a model separately for each prompt type. It tests candidate models on real production traces, weights quality/cost/latency based on what that prompt actually does, grades semantic output with an LLM judge, and uses a custom Claude Code skill to turn raw promptfoo JSON into a readable ranking.
This post walks through that system. It is meant as a reproducible evaluation method, not a one-off benchmark.
The pipeline
Eight stages, two phases, one feedback loop:
1. Shortlist models. Use a calibrated cost gate so we do not run every model against every prompt.
2. Profile the scan. Pull production telemetry per prompt type.
3. Set prompt-specific weights. Decide how much quality, cost, latency, and throughput matter for this prompt.
4. Write assertions and rubrics. Define what "good" means for this prompt.
5. Curate the dataset. Pull production traces from Langfuse and keep rows that match the current prompt/schema.
6. Lock candidates. Review the candidate set and assertion mix in version-controlled config.
7. Run the eval. Use promptfoo plus Codex-judged rubric grading.
8. Analyze the output. Use a Claude Code skill to turn raw eval JSON into rankings and recommendations.
Each stage gets its own section below. The math matters, but the useful part is how the pieces work together on production prompts.
1. Shortlisting models
There are easily 25+ models worth considering across OpenAI, Anthropic, Google, xAI, DeepSeek, Qwen, Meta, and open-weights providers. Multiplied by 20 prompt types and 10 fixtures per cell, the naive eval matrix is 5,000 graded cells per refresh. At realistic per-call costs and judge overhead, that is a $300+ run and about 12 hours of wall clock.
So the first step is boring and useful: throw away models that are obviously too expensive for a given prompt. This is where we use a power-law cost gate. It is only a shortlist; quality scoring happens later.
Synthetic normalized costs. The shape matters more than the exact numbers: α = 0.85 keeps plausible challengers while cutting the expensive tail before the eval run.
The cost gate
Let B_p be the per-call cost of the current production model on prompt type p. For every candidate model m, the projected per-call cost is:
$$\text{cost}_p(m) = T^{in}_p \cdot c^{in}_m + T^{out}_p \cdot c^{out}_m$$
where T^in_p, T^out_p are average input/output tokens for the prompt type, measured from production telemetry, and c^in_m, c^out_m are the model's per-token rates.
Before applying the exponent, we normalize costs by a fixed reference cost B_ref, so the power is applied to a dimensionless number:
$$b_p = \frac{B_p}{B_{ref}}, \quad q_p(m) = \frac{\text{cost}_p(m)}{B_{ref}}$$
A candidate is cost-eligible for prompt type p iff:
$$\boxed{q_p(m) \le C \cdot b_p^{\alpha}}$$
There are two dials: C, which controls the overall budget, and α, which controls the shape of the curve.
- α = 1 gives a linear gate.
- α < 1 gives a sublinear gate: cheap and mid-cost candidates still get through, but the high-cost tail is compressed.
- α > 1 gives a superlinear gate, which rejected too many plausible candidates in our runs.
We use α = 0.85. This was not derived from first principles. We tuned it by running the gate over historical prompt profiles and inspecting the candidate sets. At 0.85, cheaper-than-baseline models get admitted freely, 1.5-3x candidates still get a chance, and models above roughly 5x baseline usually drop out before the expensive eval stage.
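As a concrete sketch, the gate reduces to a few lines. Everything fed into these functions (token counts, per-token rates, the reference cost) is an illustrative placeholder, not our production numbers:

```python
# Power-law cost gate from this section. The shape of the check is the point;
# the inputs are placeholders.

def projected_cost(avg_in_tokens: float, avg_out_tokens: float,
                   rate_in: float, rate_out: float) -> float:
    """Projected per-call cost: T_in * c_in + T_out * c_out."""
    return avg_in_tokens * rate_in + avg_out_tokens * rate_out


def is_cost_eligible(candidate_cost: float, baseline_cost: float,
                     ref_cost: float, budget: float = 1.0,
                     alpha: float = 0.85) -> bool:
    """Admit a candidate iff q_p(m) <= C * b_p**alpha on B_ref-normalized costs."""
    q = candidate_cost / ref_cost
    b = baseline_cost / ref_cost
    return q <= budget * b ** alpha
```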
The shortlist gets sealed into config, bracketed by sentinel comments so manual edits and automated re-runs co-exist without git-merge pain.
2. Profiling the scan
Before scoring anything, we need to know what each prompt type does in production. We pull a scan profile per prompt type from our LLM gateway's trace logs.
| Field | What it tells us |
|---|---|
| `calls_count` | How often this prompt runs in a real workflow. Drives cost weight. |
| `peak_concurrency` | Is provider RPM the binding constraint? Drives whether throughput is its own axis. |
| `avg_input_tokens` / `avg_output_tokens` | Cost projection per call. Also: context-window headroom check. |
| `avg_latency_ms` / `p95_latency_ms` | How tail-heavy this prompt is. Shapes latency weight. |
| `wall_clock_ms` | Time-per-scan contribution: the actual seconds this prompt adds to a real assessment. |
This is the data layer that everything else conditions on. A model selection decision made without scan-profile context is a guess. A "$0.50/scan" cost projection in your spec doc is fiction unless you measured calls/scan from real traces.
Two of these fields are decision-shaping:
- `calls_count` drives the cost weight (a high-volume prompt makes cost dominate).
- `peak_concurrency` decides whether the throughput axis is included or dropped (concurrency ≤ 20 means provider RPM is not binding, so drop the axis).
Everything else is supporting evidence.
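For orientation, a scan profile for one prompt type looks roughly like the following. Field names mirror the table above; every number here is fabricated for the example:

```python
# Illustrative scan profile for a single prompt type (synthetic values).
scan_profile = {
    "calls_count": 1800,        # how often this prompt runs in a real workflow
    "peak_concurrency": 12,     # <= 20 here, so the throughput axis gets dropped
    "avg_input_tokens": 6200,
    "avg_output_tokens": 900,
    "avg_latency_ms": 2400,
    "p95_latency_ms": 7100,
    "wall_clock_ms": 310_000,   # time this prompt adds to a real scan, in ms
}
```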
3. Setting prompt-specific weights
Call frequency is a useful starting point, but it is not enough. Two prompt types in the same volume bucket can deserve very different quality/cost tradeoffs because the failure modes are different.
The bucket-default starting point
| `calls_count` | Bucket | Default weights (intel / cost / lat / throughput) |
|---|---|---|
| < 50 | rare | 70 / 10 / 15 / 5 |
| 50–500 | middle | 55 / 20 / 20 / 5 |
| 500–2000 | common | 45 / 30 / 20 / 5 |
| ≥ 2000 | hot path | 35 / 40 / 15 / 10 |
The inversion is intentional: rare prompts can spend more budget on quality because they barely move the bill; high-volume prompts have to care about cost because it scales with every scan.
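The defaults are simple enough to express as a lookup. A sketch that mirrors the table above (weights are intel / cost / latency / throughput percentages, before any overrides):

```python
# Bucket defaults keyed by calls_count, matching the table above.
def default_weights(calls_count: int) -> tuple[int, int, int, int]:
    if calls_count < 50:        # rare
        return (70, 10, 15, 5)
    if calls_count < 500:       # middle
        return (55, 20, 20, 5)
    if calls_count < 2000:      # common
        return (45, 30, 20, 5)
    return (35, 40, 15, 10)     # hot path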
Synthetic prompt archetypes. The names are generic, but the point is real: two prompts can sit in the same volume bucket and still deserve opposite weight shapes.
Throughput-drop rule
When peak_concurrency ≤ 20, drop throughput and redistribute proportionally across the remaining three axes:
$$w_i' = w_i \cdot \frac{100}{100 - w_t} \quad \text{for } i \in \{\text{intel}, \text{cost}, \text{lat}\}$$
Without this, a 5%-weight axis that does not matter in production still affects the composite score.
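In code, the redistribution is a single rescale; the example numbers are the rare-bucket defaults from above:

```python
# Throughput-drop redistribution: scale the surviving weights by
# 100 / (100 - w_thr) so they still sum to 100.
def drop_throughput(w_intel: float, w_cost: float, w_lat: float,
                    w_thr: float) -> tuple[float, float, float]:
    scale = 100.0 / (100.0 - w_thr)
    return (w_intel * scale, w_cost * scale, w_lat * scale)

# e.g. drop_throughput(70, 10, 15, 5) -> (~73.7, ~10.5, ~15.8)
```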
Per-prompt overrides
Bucket defaults are tuned for "generic LLM workload at this call frequency." Real prompts are more specific. Two prompt types in the same bucket can have opposite priorities:
- PT_α (loop-iteration idea generator, hot path): weights 35 / 45 / 20 / 0. Why: hot path × big tokens makes it the biggest cost amplifier in the suite; diversity > polish (per the inverse-correlation analysis); throughput dropped because concurrency ≤ 20.
- PT_β (security verdict rendering): weights 70 / 15 / 15 / 0. Why: a wrong verdict ships into a customer report, so F1 dominates; cost is a tax we pay; latency tolerance is high.
Same bucket, inverted priorities. A bucket-only system would mis-rank both prompt types.
We store each weight tuple with a short rationale string. If someone asks, "Why is intelligence weighted 80% here?", the config answers in plain English.
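Our real config is YAML; as an illustration only, using the synthetic archetypes from this section rather than production prompt names, each entry carries the tuple plus the rationale:

```python
# Illustrative per-prompt weight overrides with rationale strings.
weight_overrides = {
    "PT_alpha": {
        "weights": {"intel": 35, "cost": 45, "lat": 20, "thr": 0},
        "rationale": "hot path x big tokens; diversity > polish; concurrency <= 20",
    },
    "PT_beta": {
        "weights": {"intel": 70, "cost": 15, "lat": 15, "thr": 0},
        "rationale": "a wrong verdict ships into a customer report; F1 dominates",
    },
}
```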
The composite
Per-axis sub-scores are min-max normalized over the candidate field:
$$\widehat{\text{rubric}}_m = \frac{\bar s_m - \min_k \bar s_k}{\max_k \bar s_k - \min_k \bar s_k}$$
$$\widehat{\text{cost}}_m = \frac{\max_k \kappa_k - \kappa_m}{\max_k \kappa_k - \min_k \kappa_k}, \quad \widehat{\text{lat}}_m = \frac{\max_k \ell^{p95}_k - \ell^{p95}_m}{\max_k \ell^{p95}_k - \min_k \ell^{p95}_k}$$
Intelligence further blends rubric mean and pass rate:
$$\widehat{\text{intel}}_m = 0.7 \cdot \widehat{\text{rubric}}_m + 0.3 \cdot \rho_m$$
The 0.7 / 0.3 split keeps the continuous judge score as the main signal, while pass rate catches models that score well on average but repeatedly fail one important criterion (more on this in §4).
The composite:
$$\boxed{\Sigma_m = w_{\text{intel}} \cdot \widehat{\text{intel}}_m + w_{\text{cost}} \cdot \widehat{\text{cost}}_m + w_{\text{lat}} \cdot \widehat{\text{lat}}_m + w_{\text{thr}} \cdot \widehat{\text{thr}}_m}$$
The weights are stored as percentages in config (70 / 10 / 15 / 5) and normalized to sum to 1.0 before computing Σ_m. If an axis is dropped, its weight is redistributed across the remaining active axes.
Min-max normalization is local to the candidate field because the question is not "which model is globally best?" The question is "which candidate is best for this prompt, among the models we are willing to run?"
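A sketch of the composite, assuming the throughput axis has already been dropped and the weight dict has been renormalized to sum to 1:

```python
# Composite-score sketch. Inputs are per-model dicts over the candidate field;
# weights is e.g. {"intel": 0.55, "cost": 0.20, "lat": 0.25}, summing to 1.
def minmax(values: dict[str, float], invert: bool = False) -> dict[str, float]:
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:                      # degenerate axis; handled by the drop rule
        return {m: 0.0 for m in values}
    return {m: ((hi - v) if invert else (v - lo)) / (hi - lo)
            for m, v in values.items()}


def composite(rubric_mean, pass_rate, cost, p95_latency, weights):
    rubric_n = minmax(rubric_mean)
    intel = {m: 0.7 * rubric_n[m] + 0.3 * pass_rate[m] for m in rubric_mean}
    cost_n = minmax(cost, invert=True)        # lower cost is better
    lat_n = minmax(p95_latency, invert=True)  # lower p95 is better
    return {m: weights["intel"] * intel[m]
               + weights["cost"] * cost_n[m]
               + weights["lat"] * lat_n[m]
            for m in rubric_mean}
```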
4. Designing per-prompt assertions and rubrics
Quality grading is where naive design fails hardest. We use a layered approach per prompt type:
| Layer | Assertion | What it catches | Cost |
|---|---|---|---|
| 1 | `json` | "Is this even valid JSON?" | Free (Python) |
| 2 | `pydantic` | Schema/type/enum violations | Free (Python, against snapshotted JSON Schema) |
| 3 | Fixture comparators | Exact-match expected values from ground-truth fixtures | Free |
| 4 | LLM-judge rubric | Multi-criterion semantic grading | LLM call (judge model) |
Layers 1 through 3 are deterministic and cheap. Layer 4 is the expensive one and captures the semantic behavior that schema checks cannot see.
Multi-criterion rubrics
Each prompt type has a rubric file with 5 to 10 named criteria (e.g., INSTRUCTION_FOLLOWING, INTERNAL_CONSISTENCY, FACTUAL_ALIGNMENT, STRUCTURE_COMPLIANCE, CALIBRATION, …). The judge scores each criterion as a continuous number in [0, 1]. The rubric file declares two thresholds in YAML front-matter:
```yaml
pass_threshold: 0.6
min_dimension_threshold: 0.3
```
A row passes the judge iff:
$$\bar r_m \ge \tau_{\text{mean}} \quad \land \quad \min_c r_{m,c} \ge \tau_{\text{min}}$$
The min-dimension floor is what makes this non-trivial. Without τ_min, a model that scores 0.95 on every criterion except a single 0.05 (e.g., it leaks raw secrets in a redaction-required field) would still pass with a mean of 0.86. The floor lets one critical criterion veto the whole row: redaction, factual alignment, format-compliance.
This is conceptually similar to a chain-of-survival check: the system is only as safe as its weakest critical assertion. Mean-based grading silently averages over those.
Fully synthetic rubric scores. The average can look fine while one critical dimension fails; the floor prevents that failure from being averaged away.
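The pass rule itself is two comparisons; the default thresholds here are the front-matter values quoted above:

```python
# Judge pass rule: the mean rubric score must clear pass_threshold AND every
# criterion must clear min_dimension_threshold (the veto floor).
def judge_pass(criterion_scores: dict[str, float],
               pass_threshold: float = 0.6,
               min_dimension_threshold: float = 0.3) -> bool:
    mean_score = sum(criterion_scores.values()) / len(criterion_scores)
    floor = min(criterion_scores.values())
    return mean_score >= pass_threshold and floor >= min_dimension_threshold
```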
When two criteria measure opposite skills
A subtle but important case: for some prompt types, two grading axes that appear compatible turn out to be inversely correlated.
Synthetic normalized points. The observed pattern is the important part: when two useful axes move in opposite directions, one averaged score hides the tradeoff.
Concrete example from our data:
- Polish: judged by an LLM rubric on one output.
- Diversity: judged on a sequence of outputs from the same model, scored against `repetition_avoidance` / `signal_utilization` criteria.
Reasoning-heavy models commit to one hypothesis early and elaborate on it. They top single-iteration polish and bomb cross-iteration diversity. Lighter non-reasoning models hop around hypothesis space more freely. They top diversity and underperform on polish. Pearson correlation we measured: r ≈ -0.7. No model wins both axes.
composite = 0.5 × polish + 0.5 × diversity produces a meaningless middle that hides both signals. The right presentation is two separate top-3 tables, one per axis, with the cross-axis value shown as a context column. We let the analysis skill (§8) detect this case automatically and emit dual rankings.
Rule: before averaging multi-criterion scores, prove they measure the same thing. If two criteria correlate negatively across your field, you have two distinct objectives, not one objective with two flawed measurements.
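The check the analysis skill runs before averaging is roughly the following; the -0.5 cut-off is an illustrative threshold, not a tuned constant:

```python
# Inverse-correlation check over the candidate field: if two criteria move in
# opposite directions, emit dual rankings instead of one composite.
from statistics import correlation  # Pearson correlation, Python 3.10+

def should_split_axes(polish_scores: list[float],
                      diversity_scores: list[float],
                      threshold: float = -0.5) -> bool:
    return correlation(polish_scores, diversity_scores) <= threshold
```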
Non-discriminative axis dropping
If every candidate scores within ε of every other on a given axis, that axis carries no information for ranking. Including it just adds a constant offset and slightly compresses the spread on the other axes.
We drop an axis when:
- Intelligence: `max_m \bar s_m - min_m \bar s_m < ε_q` (we use ε_q = 0.02)
- Cost: relative spread `(max - min) / max < ε_c` (we use ε_c = 0.02)
- Latency: same relative-spread test on `p95_latency`
Dropped weight is redistributed proportionally across remaining axes via the same mechanic as the throughput-drop rule. A dead axis should not keep 5 to 10% of the composite budget.
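A sketch of both rules together (spread test, then weight redistribution), using the ε values quoted above:

```python
# Non-discriminative-axis dropping: absolute spread for quality, relative
# spread for cost and latency; surviving weights are renormalized to sum to 1.
def active_axes(rubric_means: list[float], costs: list[float],
                p95_latencies: list[float],
                eps_q: float = 0.02, eps_c: float = 0.02,
                eps_l: float = 0.02) -> set[str]:
    axes = set()
    if max(rubric_means) - min(rubric_means) >= eps_q:
        axes.add("intel")
    if (max(costs) - min(costs)) / max(costs) >= eps_c:
        axes.add("cost")
    if (max(p95_latencies) - min(p95_latencies)) / max(p95_latencies) >= eps_l:
        axes.add("lat")
    return axes


def renormalize(weights: dict[str, float], active: set[str]) -> dict[str, float]:
    total = sum(w for axis, w in weights.items() if axis in active)
    return {axis: w / total for axis, w in weights.items() if axis in active}
```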
5. Curating the dataset
"test on synthetic data, ship on real data" is how production LLM systems quietly degrade.
We pull real production traces as the eval dataset. Two stages:
Stage A: raw fetch from Langfuse
Every LLM call our pipeline makes is traced into Langfuse, tagged with the prompt type. We pull the recent traces by tag and time window into a per-prompt dataset, including the full prompt sent (system, user, any prefill), the model's actual production output, the tags, and metadata.
Stage B: curate
Raw traces are noisy. A trace pulled today might have been generated against a prior version of the prompt; the input shape may not match the current Pydantic class. The curate step validates each row's original output against the current snapshotted JSON Schema (auto-generated from the production Pydantic class). Rows that don't validate are dropped because they're either stale or from a different prompt type entirely. Survivors land in the final dataset.
This gives us fixtures grounded in actual production traffic rather than synthetic test data. The eval is therefore ranking models on inputs they'd genuinely see.
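The curate step is essentially a schema filter. A minimal sketch using the `jsonschema` package; the file path and row shape are illustrative, not our actual dataset format:

```python
# Stage-B curation sketch: keep only rows whose original production output
# still validates against the current snapshotted JSON Schema.
import json
from jsonschema import Draft202012Validator

def curate(rows: list[dict], schema_path: str) -> list[dict]:
    with open(schema_path) as f:
        validator = Draft202012Validator(json.load(f))
    kept = []
    for row in rows:
        try:
            output = json.loads(row["output"])   # the model's production output
        except (json.JSONDecodeError, TypeError):
            continue                             # not even JSON: stale or noisy
        if validator.is_valid(output):           # matches the current schema
            kept.append(row)
    return kept
```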
Fixtures: the ground truth file
Alongside the curated dataset, we hand-author a fixtures file per prompt type that declares the expected values for fixture-comparator assertions. The skeleton is auto-derived from the original output; we then verify and tweak. This is the closest thing to "labeled training data" the eval has, and it is why deterministic comparators (Layer 3 in §4) are tractable.
6. Final candidate selection
After shortlisting (§1) and curating (§5), we lock the candidate set per prompt type in a single hand-authored config. Conceptually each prompt entry has two blocks:
```yaml
candidate_models:
  - <model A>
  - <model B>
  - ...
checks:
  - json
  - pydantic
  - llm_judge
  - <prompt-specific custom assertions>
```
The checks block declares which assertions run for this prompt. Every prompt can have its own assertion mix (some need a CVSS-score comparator, some need a goal-coverage check, etc.).
This config is hand-authored. We want changes to be deliberate and reviewable. The candidate-shortlist sub-block is auto-rewritten by the cost gate, but bracketed by sentinel comments so manual edits to other parts survive.
7. Evaluation setup (promptfoo + codex)
The eval harness is built on promptfoo. We chose it for three reasons:
- Mature assertion ecosystem. Built-in `llm-rubric`, custom Python assertions, fixture-based comparators.
- Caching and replay. Re-running `promptfoo eval` re-uses cached cells, so fixing one assertion doesn't re-bill us for the entire grid.
- Scriptability. The config is YAML, the test cases are JSON, the result is JSON. Easy to wrap.
A thin translation layer turns the hand-authored config plus the curated dataset into the inputs promptfoo expects (a per-prompt promptfoo config and a test-cases file), then invokes promptfoo eval per prompt type.
The judge model
The LLM-judge for llm-rubric assertions runs through codex locally rather than a cloud proxy. Why?
- Schema enforcement. Codex's structured-output mode lets us force the judge to emit a strict JSON shape: no regex parsing, no "the judge sometimes returns markdown" gotchas.
- Cost. The local codex CLI runs against the user's ChatGPT auth, which is cheaper than per-token cloud invocation when you're refreshing eval grids weekly.
- Reproducibility. Pinning the judge to a specific model + reasoning effort + seed makes runs comparable across time.
Promptfoo lets you swap in a custom provider for any assertion. We wrote a thin wrapper that dispatches each rubric cell to the local codex CLI, captures the structured {score, pass, reason} JSON it returns, and records it back into the promptfoo result stream.
The custom provider also let us fix an easy-to-miss judge failure mode: rubric grading that sees the model output and rubric, but not the original input the rubric is supposed to ground against. If the judge cannot see the evidence, it can false-fail correct outputs for not proving claims that were only visible in the hidden input.
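A sketch of the dispatch path, stripped of the promptfoo plumbing. `codex exec` is the CLI's non-interactive mode; the flags we use to pin the model, set reasoning effort, and enforce the output schema are wrapper configuration and are omitted here:

```python
# Judge dispatch sketch: include the original input so the judge can ground its
# verdict, run the grading prompt through the local codex CLI, parse the result.
import json
import subprocess

def grade_cell(original_input: str, model_output: str, rubric: str) -> dict:
    prompt = (
        f"Rubric:\n{rubric}\n\n"
        f"Original input (ground your verdict in this):\n{original_input}\n\n"
        f"Model output to grade:\n{model_output}\n\n"
        'Return only JSON: {"score": float, "pass": bool, "reason": str}'
    )
    # Assumes the CLI is configured to print only the JSON verdict on stdout.
    result = subprocess.run(["codex", "exec", prompt],
                            capture_output=True, text=True, check=True)
    return json.loads(result.stdout)
```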
Custom assertions beyond what ships with promptfoo
For deterministic checks beyond promptfoo's built-ins, we wrote a small library of assertion modules covering:
- JSON parsability
- Pydantic schema conformance against the snapshotted JSON Schema
- A generic fixture-comparator engine (`equals`, `list_unordered_equals`, `regex`, `gte`, `lte`, …) keyed by fixture key
- Domain-specific comparators (e.g., CVSS v3.1 score comparison with tolerance)
- Judge-based goal-coverage scoring
- Group-level cross-iteration grading (the polish-vs-diversity case from §4)
Each is a standalone module promptfoo invokes per cell. They share a common configuration shape so the eval-config entry stays declarative.
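The fixture-comparator engine is the least glamorous of these but carries most of the deterministic checks. A sketch of the shared shape; the fixture dict layout here is illustrative, not our exact on-disk format:

```python
# Generic fixture-comparator engine: each fixture key names an operator and an
# expected value; operator names mirror the list above.
import re

COMPARATORS = {
    "equals": lambda got, want: got == want,
    "list_unordered_equals": lambda got, want: sorted(got) == sorted(want),
    "regex": lambda got, want: re.search(want, str(got)) is not None,
    "gte": lambda got, want: got >= want,
    "lte": lambda got, want: got <= want,
}

def run_fixture_checks(output: dict, fixture: dict) -> dict[str, bool]:
    """fixture: {field_name: {"op": "equals", "expected": ...}, ...}"""
    return {field: COMPARATORS[spec["op"]](output.get(field), spec["expected"])
            for field, spec in fixture.items()}
```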
Caching the candidate calls
Two layers of caching keep refresh costs down:
- Promptfoo's built-in eval cache. The same (model, prompt) pair re-uses the prior result.
- A "replay" mode in our wrapper. When only the rubric or assertion changed, we replay candidate outputs from the prior run and re-grade with new assertions, never re-billing the candidate model calls.
This is what makes "tweak the rubric, re-run" actually tractable. Without replay, every rubric edit costs another full eval grid.
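Replay is conceptually just "re-grade cached outputs". A sketch, with a hypothetical `grade(output, fixture)` assertion interface standing in for our assertion modules:

```python
# Replay-mode sketch: when only rubrics/assertions changed, re-grade the cached
# candidate outputs instead of re-calling (and re-billing) the candidate models.
def replay(prior_cells: list[dict], assertions: list) -> list[dict]:
    regraded = []
    for cell in prior_cells:                  # one cell = (model, fixture) result
        output = cell["candidate_output"]     # cached output, never re-billed
        results = [a.grade(output, cell["fixture"]) for a in assertions]
        regraded.append({**cell, "assertion_results": results})
    return regraded
```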
8. Making the output readable
This is where the raw eval output becomes something a person can act on.
After promptfoo finishes, we have a giant JSON per prompt type containing every (model × fixture) cell result with per-assertion pass/fail, scores, latencies, costs, and judge reasons. Reading this file manually is the wrong interface.
We wrote a custom Claude Code skill that does the analysis. The skill:
- Loads the rolled-up per-candidate aggregates plus the raw cell-level results
- Reads the eval config, the rubric file, the scan profile, and the model cost/context-window catalog
- Picks the discriminating intelligence assertion dynamically by walking a priority list (rubric/judge metrics first, then custom fixture-based, then schema-validity as a last resort) and taking the first one with spread ≥ ε_q
- Computes the composite with per-prompt weights from §3
- Detects degenerate cases automatically (non-discriminative axes, inverse correlations, single-fixture coverage gaps) and adjusts the presentation
- Renders a markdown report with the top-3 ranking, per-axis breakdowns, an intelligence grade out of 10, scan-time impact, cost deltas, and concrete swap recommendations
The skill runs entirely within Claude Code, has access to the filesystem, and emits a per-prompt markdown report dated by run. The structure is stable across reports, but the prose reflects the actual data shape. When an inverse correlation triggers, for example, the report switches to dual top-3 tables instead of forcing one composite ranking.
Why a skill, not a deterministic script?
A pure script would have to encode every special case in advance:
- Detecting that an axis is non-discriminative
- Choosing which assertion to display as primary
- Spotting that two criteria are inversely correlated
- Writing prose explanations of why the ranking came out the way it did
A skill handles these as judgment calls, informed by deterministic numbers but reasoning about them in context. The output is markdown, not JSON, so it is directly consumable by a human.
The skill is versioned as a markdown file in the repo and edited as we discover new failure modes. It is the analysis layer that makes the rest of the pipeline usable.
A second skill runs across all prompt types with a collated summary view, identifies suspicious top-model scores (where the eval is doing its job but the production prompt has issues), and surfaces concrete prompt-fix actions. Two skills, two zoom levels, same underlying data.
Findings
Running the eval end-to-end gave us a few findings that are useful beyond our own pipeline:
- Per-prompt model selection beats global model selection. The same model can be the right choice for a vulnerability verdict prompt and the wrong choice for a high-volume loop prompt.
- Call frequency is not enough. It is a good prior for cost sensitivity, but the prompt's failure mode decides how much quality should dominate.
- Some rubric axes should not be averaged. In one prompt family, polish and diversity had Pearson correlation r ≈ -0.7. A single composite hid the useful signal; dual rankings made the tradeoff visible.
- LLM judges need deterministic scaffolding. JSON checks, Pydantic validation, fixture comparators, min-dimension floors, and input-visible judging made the judge useful as one signal instead of the only source of truth.
- The output has to be analyzed at the right level. Raw promptfoo JSON is too low-level; a per-prompt report with top-3 rankings, cost deltas, latency deltas, and failure reasons is the interface people actually use.
Operationally, this also gave us:
- A defended `current_model` choice per prompt type, with a one-line rationale traceable to a measured ranking.
- Cost projections based on real calls/scan, which makes per-scan and per-year dollar figures for model swaps concrete.
- An overnight refresh loop over 20 prompt types and 5,000 graded cells.
- A diagnostic loop that flags cases like "all candidates pass < 50%", "inverse correlation detected", and "cost axis dropped, relative spread 0.3%".
The takeaway
The main lesson is that model choice is not global.
The same model can be optimal for one prompt and wasteful for another. Once we measured prompts separately, many decisions became obvious: pay more where mistakes are expensive, save money where volume dominates, and keep the eval cheap enough to rerun.
The process is simple:
1. Shortlist with a calibrated cost gate.
2. Profile the production load before scoring.
3. Weight axes per prompt type, not only by volume bucket.
4. Layer assertions: cheap deterministic checks first, expensive LLM judge last, with min-dimension floors.
5. Curate from real production traces, not synthetic examples.
6. Lock candidate sets in version-controlled config.
7. Use promptfoo as the harness; pin the judge through Codex for schema-enforced outputs.
8. Use a Claude Code skill to analyze the raw eval output and produce the ranking.
Maximize pwn. Minimize tokens. Pick the right model for the right job.
Notation summary
| Symbol | Meaning |
|---|---|
| M | number of candidate models |
| N | number of prompt types |
| K | fixtures per (prompt type, model) cell |
| B_p | per-call cost of baseline model on prompt type p |
| B_ref | fixed reference cost used to normalize costs before the power-law gate |
| b_p | normalized baseline cost, B_p / B_ref |
| q_p(m) | normalized candidate cost, cost_p(m) / B_ref |
| T^in_p, T^out_p | avg input/output tokens for prompt type p |
| c^in_m, c^out_m | per-token cost of model m |
| \bar s_m | rubric mean score of model m on the primary intelligence assertion |
| \rho_m | overall pass rate of model m |
| \bar r_m, r_{m,c} | rubric mean and per-criterion score of model m |
| τ_mean, τ_min | rubric pass-mean threshold and per-criterion minimum |
| κ_m, ℓ^p95_m | cost per scan and p95 latency for model m |
| w_intel, w_cost, w_lat, w_thr | composite weights, summing to 1 |
| Σ_m | composite score for model m |
| ε_q, ε_c, ε_l | non-discriminative-axis ε thresholds |
Composite formula (single source of truth)
For the active-axis set A ⊆ {intel, cost, lat, thr} (after applying throughput-drop and non-discriminative-drop rules) and normalized weights w' summing to 1:
$$\Sigma_m = \sum_{a \in A} w'_a \cdot \widehat{a}_m$$
where \widehat{a}_m is the min-max normalized value on axis a for model m.
Intelligence:
$$\widehat{\text{intel}}_m = 0.7 \cdot \frac{\bar s_m - \min_k \bar s_k}{\max_k \bar s_k - \min_k \bar s_k} + 0.3 \cdot \rho_m$$
Cost, where lower cost is better:
$$\widehat{\text{cost}}_m = \frac{\max_k \kappa_k - \kappa_m}{\max_k \kappa_k - \min_k \kappa_k}$$
Latency, where lower p95 latency is better:
$$\widehat{\text{lat}}_m = \frac{\max_k \ell^{p95}_k - \ell^{p95}_m}{\max_k \ell^{p95}_k - \min_k \ell^{p95}_k}$$
Throughput is normalized analogously on time_per_scan_s.
When max_k = min_k for an axis, the axis is degenerate. We drop it and redistribute its weight across the remaining active axes.
This evaluation process backs model selection across Pentest Copilot's 20 production prompt types.
