<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Hackerbone's Blog]]></title><description><![CDATA[Hackerbone's Blog]]></description><link>https://blog.ssitaraman.com</link><generator>RSS for Node</generator><lastBuildDate>Fri, 15 May 2026 01:01:38 GMT</lastBuildDate><atom:link href="https://blog.ssitaraman.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Maximize Pwn, Minimize Tokens]]></title><description><![CDATA[Authors: Sitaraman Subramanian (LinkedIn, GitHub) · Aditya Peela (LinkedIn, GitHub) · Dhruva Goyal (LinkedIn, GitHub)

TL;DR
We're building Pentest Copilot: AI agents for real-world penetration testin]]></description><link>https://blog.ssitaraman.com/maximize-pwn-minimize-tokens</link><guid isPermaLink="true">https://blog.ssitaraman.com/maximize-pwn-minimize-tokens</guid><category><![CDATA[pentesting]]></category><category><![CDATA[LLM's ]]></category><category><![CDATA[Promptfoo]]></category><category><![CDATA[evaluation metrics]]></category><category><![CDATA[tokens]]></category><category><![CDATA[cybersecurity]]></category><category><![CDATA[AI]]></category><dc:creator><![CDATA[Sitaraman Subramanian]]></dc:creator><pubDate>Wed, 29 Apr 2026 21:10:04 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/631816f12cef8b07c2f1e630/038e3320-0949-44a5-9887-547f4d897274.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Authors:</strong> <a href="https://www.ssitaraman.com/">Sitaraman Subramanian</a> (<a href="https://in.linkedin.com/in/sitaraman-s">LinkedIn</a>, <a href="https://github.com/Hackerbone">GitHub</a>) · <a href="https://adityapeela.com/">Aditya Peela</a> (<a href="https://in.linkedin.com/in/adityapeela">LinkedIn</a>, <a href="https://github.com/adityamhn">GitHub</a>) · <a href="https://dhruvagoyal.com/">Dhruva Goyal</a> (<a href="https://www.linkedin.com/in/dhruvagoyal">LinkedIn</a>, <a href="https://github.com/shero4">GitHub</a>)</p>
<hr />
<h2>TL;DR</h2>
<p>We're building <a href="https://copilot.bugbase.ai"><strong>Pentest Copilot</strong></a>: AI agents for real-world penetration testing. Behind the scenes, the system is not one big LLM call. It is a set of distinct prompt types: validating possible vulnerabilities, reasoning about authentication, summarising evidence, and so on.</p>
<p>Those prompts should not all use the same model.</p>
<p>A prompt that makes a final vulnerability call has a high cost of being wrong. A prompt that runs inside a test-generation loop may be called thousands of times per scan, so a small per-call cost increase becomes expensive quickly. A prompt that only formats already-known evidence mostly needs to be cheap, valid JSON, and consistent.</p>
<p>So the question became:</p>
<blockquote>
<p>Which model should each prompt use?</p>
</blockquote>
<p>The AI-maxxing answer would be: <em>"just use Claude Opus 4.6 or GPT-5.4High for everything."</em></p>
<p>The answer we ended up with: build an end-to-end eval system that chooses a model separately for each prompt type. It tests candidate models on real production traces, weights quality/cost/latency based on what that prompt actually does, grades semantic output with an LLM judge, and uses a custom Claude Code skill to turn raw promptfoo JSON into a readable ranking.</p>
<p>This post walks through that system. It is meant as a reproducible evaluation method, not a one-off benchmark.</p>
<hr />
<h2>The pipeline</h2>
<img src="https://cdn.hashnode.com/uploads/covers/631816f12cef8b07c2f1e630/37bde1ee-0a05-4484-bc12-abd62f077839.png" alt="" style="display:block;margin:0 auto" />

<p>Eight stages, two phases, one feedback loop:</p>
<ol>
<li><p><strong>Shortlist models.</strong> Use a calibrated cost gate so we do not run every model against every prompt.</p>
</li>
<li><p><strong>Profile the scan.</strong> Pull production telemetry per prompt type.</p>
</li>
<li><p><strong>Set prompt-specific weights.</strong> Decide how much quality, cost, latency, and throughput matter for this prompt.</p>
</li>
<li><p><strong>Write assertions and rubrics.</strong> Define what "good" means for this prompt.</p>
</li>
<li><p><strong>Curate the dataset.</strong> Pull production traces from Langfuse and keep rows that match the current prompt/schema.</p>
</li>
<li><p><strong>Lock candidates.</strong> Review the candidate set and assertion mix in version-controlled config.</p>
</li>
<li><p><strong>Run the eval.</strong> Use promptfoo plus Codex-judged rubric grading.</p>
</li>
<li><p><strong>Analyze the output.</strong> Use a Claude Code skill to turn raw eval JSON into rankings and recommendations.</p>
</li>
</ol>
<p>Each stage gets its own section below. The math matters, but the useful part is how the pieces work together on production prompts.</p>
<hr />
<h2>1. Shortlisting models</h2>
<p>There are easily 25+ models worth considering across OpenAI, Anthropic, Google, xAI, DeepSeek, Qwen, Meta, and open-weights providers. Multiplied by 20 prompt types and 10 fixtures per cell, the naive eval matrix is <strong>5,000 graded cells per refresh</strong>. At realistic per-call costs and judge overhead, that is a $300+ run and about 12 hours of wall clock.</p>
<p>So the first step is boring and useful: throw away models that are obviously too expensive for a given prompt. This is where we use a <strong>power-law cost gate</strong>. It is only a shortlist; quality scoring happens later.</p>
<img src="https://cdn.hashnode.com/uploads/covers/631816f12cef8b07c2f1e630/4870d07c-72d6-4d47-96a1-0994a77c5230.png" alt="" style="display:block;margin:0 auto" />

<img src="https://cdn.hashnode.com/uploads/covers/631816f12cef8b07c2f1e630/b87b75e8-da4a-420e-aeb1-a0adc23486b4.png" alt="" style="display:block;margin:0 auto" />

<p><em>Synthetic normalized costs. The shape matters more than the exact numbers:</em> <code>α = 0.85</code> <em>keeps plausible challengers while cutting the expensive tail before the eval run.</em></p>
<h3>The cost gate</h3>
<p>Let <code>B_p</code> be the per-call cost of the current production model on prompt type <code>p</code>. For every candidate model <code>m</code>, the projected per-call cost is:</p>
<p>$$\text{cost}_p(m) = T^{in}_p \cdot c^{in}_m + T^{out}_p \cdot c^{out}_m$$</p>
<p>where <code>T^in_p, T^out_p</code> are average input/output tokens for the prompt type, measured from production telemetry, and <code>c^in_m, c^out_m</code> are the model's per-token rates.</p>
<p>Before applying the exponent, we normalize costs by a fixed reference cost <code>B_ref</code>, so the power is applied to a dimensionless number:</p>
<p>$$b_p = \frac{B_p}{B_{ref}}, \quad q_p(m) = \frac{\text{cost}_p(m)}{B_{ref}}$$</p>
<p>A candidate is cost-eligible for prompt type <code>p</code> iff:</p>
<p>$$\boxed{q_p(m) \le C \cdot b_p^{\alpha}}$$</p>
<p>There are two dials: <code>C</code>, which controls the overall budget, and <code>α</code>, which controls the shape of the curve.</p>
<ul>
<li><p><code>α = 1</code> gives a linear gate.</p>
</li>
<li><p><code>α &lt; 1</code> gives a sublinear gate: cheap and mid-cost candidates still get through, but the high-cost tail is compressed.</p>
</li>
<li><p><code>α &gt; 1</code> gives a superlinear gate, which rejected too many plausible candidates in our runs.</p>
</li>
</ul>
<p>We use <code>α = 0.85</code>. This was not derived from first principles. We tuned it by running the gate over historical prompt profiles and inspecting the candidate sets. At <code>0.85</code>, cheaper-than-baseline models get admitted freely, candidates at 1.5-3x baseline still get a chance, and models above roughly 5x baseline usually drop out before the expensive eval stage.</p>
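<p>As a minimal sketch (the token counts and per-token rates below are made up, not our production numbers), the gate reduces to a few lines of Python:</p>

```python
# Sketch of the power-law cost gate. Rates and token counts are illustrative.

def projected_cost(t_in: float, t_out: float, c_in: float, c_out: float) -> float:
    """Projected per-call cost: cost_p(m) = T_in * c_in + T_out * c_out."""
    return t_in * c_in + t_out * c_out

def is_cost_eligible(candidate_cost: float, baseline_cost: float,
                     b_ref: float, C: float = 1.0, alpha: float = 0.85) -> bool:
    """Admit candidate m for prompt p iff q_p(m) <= C * b_p ** alpha."""
    b_p = baseline_cost / b_ref
    q_pm = candidate_cost / b_ref
    return q_pm <= C * b_p ** alpha

# Hypothetical prompt profile: 3000 input / 800 output tokens per call.
baseline = projected_cost(3000, 800, 3e-6, 15e-6)    # current production model
cheap = projected_cost(3000, 800, 0.5e-6, 2e-6)      # budget candidate
pricey = projected_cost(3000, 800, 30e-6, 150e-6)    # 10x-baseline candidate

print(is_cost_eligible(cheap, baseline, b_ref=baseline))   # → True
print(is_cost_eligible(pricey, baseline, b_ref=baseline))  # → False
```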
<p>The shortlist gets sealed into config, bracketed by sentinel comments so manual edits and automated re-runs co-exist without git-merge pain.</p>
<hr />
<h2>2. Profiling the scan</h2>
<p>Before scoring anything, we need to know what each prompt type does in production. We pull a <strong>scan profile</strong> per prompt type from our LLM gateway's trace logs.</p>
<table>
<thead>
<tr>
<th>Field</th>
<th>What it tells us</th>
</tr>
</thead>
<tbody><tr>
<td><code>calls_count</code></td>
<td>How often this prompt runs in a real workflow. Drives cost weight.</td>
</tr>
<tr>
<td><code>peak_concurrency</code></td>
<td>Is provider RPM the binding constraint? Drives whether throughput is its own axis.</td>
</tr>
<tr>
<td><code>avg_input_tokens</code> / <code>avg_output_tokens</code></td>
<td>Cost projection per call. Also: context-window headroom check.</td>
</tr>
<tr>
<td><code>avg_latency_ms</code> / <code>p95_latency_ms</code></td>
<td>How tail-heavy this prompt is. Shapes latency weight.</td>
</tr>
<tr>
<td><code>wall_clock_ms</code></td>
<td>Time-per-scan contribution: the actual seconds this prompt adds to a real assessment.</td>
</tr>
</tbody></table>
<p>This is the data layer that everything else conditions on. <strong>A model selection decision made without scan-profile context is a guess.</strong> A "$0.50/scan" cost projection in your spec doc is fiction unless you measured calls/scan from real traces.</p>
<p>Two of these fields are decision-shaping:</p>
<ul>
<li><p><code>calls_count</code> drives the cost weight (a high-volume prompt makes cost dominate).</p>
</li>
<li><p><code>peak_concurrency</code> decides throughput-axis include/drop (concurrency ≤ 20 means provider RPM is not binding, so drop the axis).</p>
</li>
</ul>
<p>Everything else is supporting evidence.</p>
<hr />
<h2>3. Setting prompt-specific weights</h2>
<p>Call frequency is a useful starting point, but it is not enough. Two prompt types in the same volume bucket can deserve very different quality/cost tradeoffs because the failure modes are different.</p>
<img src="https://cdn.hashnode.com/uploads/covers/631816f12cef8b07c2f1e630/e52fb91e-a658-48f1-b93c-43218e0e8ec3.png" alt="" style="display:block;margin:0 auto" />

<h3>The bucket-default starting point</h3>
<table>
<thead>
<tr>
<th><code>calls_count</code></th>
<th>Bucket</th>
<th>Default weights (intel / cost / lat / throughput)</th>
</tr>
</thead>
<tbody><tr>
<td><code>&lt; 50</code></td>
<td>rare</td>
<td><strong>70 / 10 / 15 / 5</strong></td>
</tr>
<tr>
<td><code>50–500</code></td>
<td>middle</td>
<td>55 / 20 / 20 / 5</td>
</tr>
<tr>
<td><code>500–2000</code></td>
<td>common</td>
<td>45 / 30 / 20 / 5</td>
</tr>
<tr>
<td><code>≥ 2000</code></td>
<td>hot path</td>
<td><strong>35 / 40 / 15 / 10</strong></td>
</tr>
</tbody></table>
<p>The inversion is intentional: rare prompts can spend more budget on quality because they barely move the bill; high-volume prompts have to care about cost because it scales with every scan.</p>
<img src="https://cdn.hashnode.com/uploads/covers/631816f12cef8b07c2f1e630/ec32a484-2b0f-4325-a658-94d4687a3f07.png" alt="" style="display:block;margin:0 auto" />

<p><em>Synthetic prompt archetypes. The names are generic, but the point is real: two prompts can sit in the same volume bucket and still deserve opposite weight shapes.</em></p>
<h3>Throughput-drop rule</h3>
<p>When <code>peak_concurrency ≤ 20</code>, drop throughput and <strong>redistribute proportionally</strong> across the remaining three axes:</p>
<p>$$w_i' = w_i \cdot \frac{100}{100 - w_t} \quad \text{for } i \in \{\text{intel}, \text{cost}, \text{lat}\}$$</p>
<p>Without this, a 5%-weight axis that does not matter in production still affects the composite score.</p>
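<p>A sketch of the redistribution, assuming weights are stored as percentages summing to 100:</p>

```python
# When an axis is dropped, scale the remaining weights so they again sum to 100.

def redistribute(weights: dict, dropped: str) -> dict:
    """w_i' = w_i * 100 / (100 - w_dropped) for the surviving axes."""
    w_t = weights[dropped]
    return {k: v * 100 / (100 - w_t) for k, v in weights.items() if k != dropped}

w = {"intel": 35, "cost": 40, "lat": 15, "thr": 10}   # hot-path bucket defaults
print(redistribute(w, "thr"))
# intel 35 -> ~38.9, cost 40 -> ~44.4, lat 15 -> ~16.7 (sums back to 100)
```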
<h3>Per-prompt overrides</h3>
<p>Bucket defaults are tuned for "generic LLM workload at this call frequency." Real prompts are more specific. Two prompt types in the same bucket can have opposite priorities:</p>
<pre><code class="language-plaintext">PT_α  (loop-iteration idea generator, hot path)
  weights: 35 / 45 / 20 / 0
  why: hot path × big tokens = biggest cost amplifier in the suite.
       Diversity &gt; polish (per inverse-correlation analysis).
       Throughput dropped because concurrency ≤ 20.

PT_β  (security verdict-rendering)
  weights: 70 / 15 / 15 / 0
  why: a wrong verdict ships into a customer report.
       F1 dominates. Cost is a tax we pay; latency tolerance is high.
</code></pre>
<p>Same bucket, inverted priorities. A bucket-only system would mis-rank both prompt types.</p>
<p>We store each weight tuple with a short rationale string. If someone asks, "Why is intelligence weighted 80% here?", the config answers in plain English.</p>
<h3>The composite</h3>
<p>Per-axis sub-scores are <strong>min-max normalized</strong> over the candidate field:</p>
<p>$$\widehat{\text{rubric}}_m = \frac{\bar s_m - \min_k \bar s_k}{\max_k \bar s_k - \min_k \bar s_k}$$</p>
<p>$$\widehat{\text{cost}}_m = \frac{\max_k \kappa_k - \kappa_m}{\max_k \kappa_k - \min_k \kappa_k}, \quad \widehat{\text{lat}}_m = \frac{\max_k \ell^{p95}_k - \ell^{p95}_m}{\max_k \ell^{p95}_k - \min_k \ell^{p95}_k}$$</p>
<p>Intelligence further blends rubric mean and pass rate:</p>
<p>$$\widehat{\text{intel}}_m = 0.7 \cdot \widehat{\text{rubric}}_m + 0.3 \cdot \rho_m$$</p>
<p>The 0.7 / 0.3 split keeps the continuous judge score as the main signal, while pass rate catches models that score well on average but repeatedly fail one important criterion (more on this in §4).</p>
<p>The composite:</p>
<p>$$\boxed{\Sigma_m = w_{\text{intel}} \cdot \widehat{\text{intel}}_m + w_{\text{cost}} \cdot \widehat{\text{cost}}_m + w_{\text{lat}} \cdot \widehat{\text{lat}}_m + w_{\text{thr}} \cdot \widehat{\text{thr}}_m}$$</p>
<p>The weights are stored as percentages in config (<code>70 / 10 / 15 / 5</code>) and normalized to sum to <code>1.0</code> before computing <code>Σ_m</code>. If an axis is dropped, its weight is redistributed across the remaining active axes.</p>
<p>Min-max normalization is local to the candidate field because the question is not "which model is globally best?" The question is "which candidate is best for this prompt, among the models we are willing to run?"</p>
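<p>Putting the min-max normalization, the 0.7/0.3 intelligence blend, and the weighted sum together, a minimal sketch with entirely synthetic candidate numbers (model names and values are made up):</p>

```python
# Composite scoring over a candidate field: normalize each axis locally,
# blend rubric mean with pass rate for intelligence, take the weighted sum.

def minmax(values, invert=False):
    lo, hi = min(values), max(values)
    if hi == lo:                       # non-discriminative axis (see §4)
        return [0.0] * len(values)
    norm = [(v - lo) / (hi - lo) for v in values]
    return [1 - v for v in norm] if invert else norm

candidates = {
    #          rubric mean, pass rate, cost/scan, p95 latency ms
    "model_a": (0.91, 0.95, 2.40, 4200),
    "model_b": (0.84, 0.90, 0.60, 1800),
    "model_c": (0.78, 0.70, 0.15, 1100),
}
weights = {"intel": 0.45, "cost": 0.30, "lat": 0.25}   # throughput dropped

names = list(candidates)
rubric = minmax([candidates[n][0] for n in names])
cost = minmax([candidates[n][2] for n in names], invert=True)  # cheaper is better
lat = minmax([candidates[n][3] for n in names], invert=True)   # faster is better

scores = {}
for i, n in enumerate(names):
    intel = 0.7 * rubric[i] + 0.3 * candidates[n][1]   # blend with pass rate
    scores[n] = (weights["intel"] * intel
                 + weights["cost"] * cost[i]
                 + weights["lat"] * lat[i])

print(max(scores, key=scores.get))  # → model_b with these synthetic numbers
```

<p>Note how the winner is neither the highest-quality nor the cheapest model: local normalization plus prompt-specific weights is exactly what surfaces the middle candidate.</p>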
<hr />
<h2>4. Designing per-prompt assertions and rubrics</h2>
<p>Quality grading is where naive design fails hardest. We use a <strong>layered approach</strong> per prompt type:</p>
<table>
<thead>
<tr>
<th>Layer</th>
<th>Assertion</th>
<th>What it catches</th>
<th>Cost</th>
</tr>
</thead>
<tbody><tr>
<td>1</td>
<td><code>json</code></td>
<td>"is this even valid JSON?"</td>
<td>Free (Python)</td>
</tr>
<tr>
<td>2</td>
<td><code>pydantic</code></td>
<td>Schema/type/enum violations</td>
<td>Free (Python, against snapshotted JSON Schema)</td>
</tr>
<tr>
<td>3</td>
<td>Fixture comparators</td>
<td>Exact-match expected values from ground-truth fixtures</td>
<td>Free</td>
</tr>
<tr>
<td>4</td>
<td>LLM-judge rubric</td>
<td>Multi-criterion semantic grading</td>
<td>LLM call (judge model)</td>
</tr>
</tbody></table>
<p>Layers 1 through 3 are deterministic and cheap. Layer 4 is the expensive one and captures the semantic behavior that schema checks cannot see.</p>
<h3>Multi-criterion rubrics</h3>
<p>Each prompt type has a rubric file with 5 to 10 named criteria (e.g., <code>INSTRUCTION_FOLLOWING</code>, <code>INTERNAL_CONSISTENCY</code>, <code>FACTUAL_ALIGNMENT</code>, <code>STRUCTURE_COMPLIANCE</code>, <code>CALIBRATION</code>, …). The judge scores each criterion as a continuous number in <code>[0, 1]</code>. The rubric file declares two thresholds in YAML front-matter:</p>
<pre><code class="language-yaml">pass_threshold: 0.6
min_dimension_threshold: 0.3
</code></pre>
<p>A row passes the judge iff:</p>
<p>$$\bar r_m \ge \tau_{\text{mean}} \quad \land \quad \min_c r_{m,c} \ge \tau_{\text{min}}$$</p>
<p>The <strong>min-dimension floor is what makes this non-trivial</strong>. Without <code>τ_min</code>, a model that scores 0.95 on every criterion <em>except</em> a single 0.05 (e.g., it leaks raw secrets in a redaction-required field) would still pass with a mean of 0.86. The floor lets one critical criterion veto the whole row: redaction, factual alignment, format-compliance.</p>
<p>This is conceptually similar to a chain-of-survival check: the system is only as safe as its weakest critical assertion. Mean-based grading silently averages over those.</p>
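<p>The pass rule is small enough to sketch directly; criterion names and scores below are illustrative:</p>

```python
# A row passes only if the mean clears pass_threshold AND every single
# criterion clears min_dimension_threshold (the veto floor).

def row_passes(criterion_scores: dict, pass_threshold: float = 0.6,
               min_dimension_threshold: float = 0.3) -> bool:
    scores = list(criterion_scores.values())
    mean = sum(scores) / len(scores)
    return mean >= pass_threshold and min(scores) >= min_dimension_threshold

# Mean is 0.77, comfortably above 0.6 — but the 0.05 on REDACTION vetoes it.
scores = {"INSTRUCTION_FOLLOWING": 0.95, "INTERNAL_CONSISTENCY": 0.95,
          "FACTUAL_ALIGNMENT": 0.95, "STRUCTURE_COMPLIANCE": 0.95,
          "REDACTION": 0.05}
print(row_passes(scores))  # → False: the floor catches the leak
```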
<img src="https://cdn.hashnode.com/uploads/covers/631816f12cef8b07c2f1e630/b8779086-d175-4720-85a1-c9fe6689fd5e.png" alt="" style="display:block;margin:0 auto" />

<p><em>Fully synthetic rubric scores. The average can look fine while one critical dimension fails; the floor prevents that failure from being averaged away.</em></p>
<h3>When two criteria measure opposite skills</h3>
<p>A subtle but important case: for some prompt types, two grading axes that <em>appear</em> compatible turn out to be inversely correlated.</p>
<img src="https://cdn.hashnode.com/uploads/covers/631816f12cef8b07c2f1e630/91cca075-5830-4341-b8d8-248d04bd819c.png" alt="" style="display:block;margin:0 auto" />

<img src="https://cdn.hashnode.com/uploads/covers/631816f12cef8b07c2f1e630/082594d7-80bc-422c-9357-bbcfb3c32eb3.png" alt="" style="display:block;margin:0 auto" />

<p><em>Synthetic normalized points. The observed pattern is the important part: when two useful axes move in opposite directions, one averaged score hides the tradeoff.</em></p>
<p>Concrete example from our data:</p>
<ul>
<li><p><strong>Polish</strong> (judged by an LLM rubric on <strong>one</strong> output)</p>
</li>
<li><p><strong>Diversity</strong> (judged on a <strong>sequence</strong> of outputs from the same model, scored against repetition_avoidance / signal_utilization criteria)</p>
</li>
</ul>
<p>Reasoning-heavy models commit to one hypothesis early and elaborate on it. They top single-iteration polish and bomb cross-iteration diversity. Lighter non-reasoning models hop around hypothesis space more freely. They top diversity and underperform on polish. Pearson correlation we measured: <strong>r ≈ -0.7</strong>. <strong>No model wins both axes.</strong></p>
<p><code>composite = 0.5 × polish + 0.5 × diversity</code> produces a meaningless middle that hides both signals. The right presentation is <strong>two separate top-3 tables</strong>, one per axis, with the cross-axis value shown as a context column. We let the analysis skill (§8) detect this case automatically and emit dual rankings.</p>
<blockquote>
<p><strong>Rule:</strong> before averaging multi-criterion scores, prove they measure the same thing. If two criteria correlate negatively across your field, you have <em>two distinct objectives</em>, not one objective with two flawed measurements.</p>
</blockquote>
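<p>The detection itself is just a Pearson correlation over per-candidate axis scores. The points below are synthetic, and the <code>-0.5</code> trigger threshold is an illustrative choice, not a derived constant:</p>

```python
# If two rubric criteria anti-correlate across the candidate field,
# report them as separate rankings instead of averaging them.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Synthetic per-candidate scores: heavy reasoners top polish,
# lighter models top diversity.
polish =    [0.95, 0.90, 0.70, 0.55, 0.40]
diversity = [0.35, 0.45, 0.60, 0.80, 0.90]

r = pearson(polish, diversity)
if r < -0.5:   # illustrative threshold for "two distinct objectives"
    print(f"inverse correlation (r={r:.2f}): emit dual rankings")
```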
<h3>Non-discriminative axis dropping</h3>
<p>If every candidate scores within <code>ε</code> of every other on a given axis, <strong>that axis carries no information for ranking</strong>. Including it just adds a constant offset and slightly compresses the spread on the <em>other</em> axes.</p>
<p>We drop an axis when:</p>
<ul>
<li><p>Intelligence: <code>max_m \bar s_m - min_m \bar s_m &lt; ε_q</code> (we use <code>ε_q = 0.02</code>)</p>
</li>
<li><p>Cost: relative spread <code>(max - min) / max &lt; ε_c</code> (we use <code>ε_c = 0.02</code>)</p>
</li>
<li><p>Latency: same relative-spread test on <code>p95_latency</code></p>
</li>
</ul>
<p>Dropped weight is redistributed proportionally across remaining axes via the same mechanic as the throughput-drop rule. A dead axis should not keep 5 to 10% of the composite budget.</p>
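<p>A sketch of the two drop tests, with made-up candidate values:</p>

```python
# An axis with (near-)zero spread across the candidate field carries
# no ranking signal, so it is removed and its weight redistributed.

def drop_intel_axis(rubric_means, eps_q: float = 0.02) -> bool:
    """Absolute-spread test on rubric means."""
    return max(rubric_means) - min(rubric_means) < eps_q

def drop_cost_axis(costs, eps_c: float = 0.02) -> bool:
    """Relative-spread test; the same test applies to p95 latency."""
    return (max(costs) - min(costs)) / max(costs) < eps_c

print(drop_intel_axis([0.81, 0.80, 0.81, 0.80]))  # → True: 0.01 spread
print(drop_cost_axis([1.00, 0.40, 0.15]))         # → False: 85% relative spread
```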
<hr />
<h2>5. Curating the dataset</h2>
<blockquote>
<p>"test on synthetic data, ship on real data" is how production LLM systems quietly degrade.</p>
</blockquote>
<img src="https://cdn.hashnode.com/uploads/covers/631816f12cef8b07c2f1e630/f36ad7fd-fbe9-43fe-9285-f5cfb06ad00a.png" alt="" style="display:block;margin:0 auto" />

<p>We pull <strong>real production traces</strong> as the eval dataset. Two stages:</p>
<h3>Stage A: raw fetch from Langfuse</h3>
<p>Every LLM call our pipeline makes is traced into Langfuse, tagged with the prompt type. We pull the recent traces by tag and time window into a per-prompt dataset, including the full prompt sent (system, user, any prefill), the model's actual production output, the tags, and metadata.</p>
<h3>Stage B: curate</h3>
<p>Raw traces are noisy. A trace pulled today might have been generated against a <em>prior version</em> of the prompt; the input shape may not match the current Pydantic class. The curate step validates each row's original output against the current snapshotted JSON Schema (auto-generated from the production Pydantic class). Rows that don't validate are dropped because they're either stale or from a different prompt type entirely. Survivors land in the final dataset.</p>
<p>This gives us <strong>fixtures grounded in actual production traffic</strong> rather than synthetic test data. The eval is therefore ranking models on inputs they'd genuinely see.</p>
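<p>A stripped-down sketch of the curate filter. Real validation runs against the full snapshotted JSON Schema auto-generated from the Pydantic class; here the "schema" is reduced to a required-keys/type check so the shape of the step stays visible:</p>

```python
import json

# Keep only trace rows whose recorded output still validates against the
# current schema snapshot; stale or mis-tagged rows are dropped.

SCHEMA = {  # hypothetical snapshot; fields are illustrative
    "required": {"verdict": str, "confidence": float, "evidence": list},
}

def row_is_current(raw_output: str) -> bool:
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return all(isinstance(data.get(k), t) for k, t in SCHEMA["required"].items())

traces = [
    '{"verdict": "vulnerable", "confidence": 0.9, "evidence": ["poc"]}',
    '{"verdict": "vulnerable"}',   # stale: generated by a prior prompt version
    'not json at all',             # noise
]
curated = [t for t in traces if row_is_current(t)]
print(len(curated))  # → 1 survivor
```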
<h3>Fixtures: the ground truth file</h3>
<p>Alongside the curated dataset, we hand-author a fixtures file per prompt type that declares the <strong>expected</strong> values for fixture-comparator assertions. The skeleton is auto-derived from the original output; we then verify and tweak. This is the closest thing to "labeled training data" the eval has, and it is why deterministic comparators (Layer 3 in §4) are tractable.</p>
<hr />
<h2>6. Final candidate selection</h2>
<p>After shortlisting (§1) and curating (§5), we lock the candidate set per prompt type in a single hand-authored config. Conceptually each prompt entry has two blocks:</p>
<pre><code class="language-yaml">candidate_models:
  - &lt;model A&gt;
  - &lt;model B&gt;
  - ...

checks:
  - json
  - pydantic
  - llm_judge
  - &lt;prompt-specific custom assertions&gt;
</code></pre>
<p>The <code>checks</code> block declares which assertions run for this prompt. Every prompt can have its own assertion mix (some need a CVSS-score comparator, some need a goal-coverage check, etc.).</p>
<p>This config is <strong>hand-authored</strong>. We want changes to be deliberate and reviewable. The candidate-shortlist sub-block is auto-rewritten by the cost gate, but bracketed by sentinel comments so manual edits to other parts survive.</p>
<hr />
<h2>7. Evaluation setup (promptfoo + codex)</h2>
<p>The eval harness is built on <a href="https://www.promptfoo.dev/"><strong>promptfoo</strong></a>. We chose it for three reasons:</p>
<ol>
<li><p><strong>Mature assertion ecosystem.</strong> Built-in <code>llm-rubric</code>, custom Python assertions, fixture-based comparators.</p>
</li>
<li><p><strong>Caching and replay.</strong> Re-running <code>promptfoo eval</code> re-uses cached cells, so fixing one assertion doesn't re-bill us for the entire grid.</p>
</li>
<li><p><strong>Scriptability.</strong> The config is YAML, the test cases are JSON, the result is JSON. Easy to wrap.</p>
</li>
</ol>
<p>A thin translation layer turns the hand-authored config plus the curated dataset into the inputs promptfoo expects (a per-prompt promptfoo config and a test-cases file), then invokes <code>promptfoo eval</code> per prompt type.</p>
<h3>The judge model</h3>
<p>The LLM-judge for <code>llm-rubric</code> assertions runs through <a href="https://github.com/openai/codex"><strong>codex</strong></a> locally rather than a cloud proxy. Why?</p>
<ul>
<li><p><strong>Schema enforcement.</strong> Codex's structured-output mode lets us force the judge to emit a strict JSON shape: no regex parsing, no "the judge sometimes returns markdown" gotchas.</p>
</li>
<li><p><strong>Cost.</strong> Local codex CLI runs against the user's ChatGPT auth, which is cheaper than per-token cloud invocation when you're refreshing eval grids weekly.</p>
</li>
<li><p><strong>Reproducibility.</strong> Pinning the judge to a specific model + reasoning effort + seed makes runs comparable across time.</p>
</li>
</ul>
<p>Promptfoo lets you swap in a custom provider for any assertion. We wrote a thin wrapper that dispatches each rubric cell to the local codex CLI, captures the structured <code>{score, pass, reason}</code> JSON it returns, and records it back into the promptfoo result stream.</p>
<img src="https://cdn.hashnode.com/uploads/covers/631816f12cef8b07c2f1e630/75029430-a4b6-40e6-b10b-e67269d2ba55.png" alt="" style="display:block;margin:0 auto" />

<p>The custom provider also let us fix an easy-to-miss judge failure mode: rubric grading that sees the model output and rubric, but not the original input the rubric is supposed to ground against. If the judge cannot see the evidence, it can false-fail correct outputs for not proving claims that were only visible in the hidden input.</p>
<h3>Custom assertions beyond what ships with promptfoo</h3>
<p>For deterministic checks beyond promptfoo's built-ins, we wrote a small library of assertion modules covering:</p>
<ul>
<li><p>JSON-parsability</p>
</li>
<li><p>Pydantic schema conformance against the snapshotted JSON Schema</p>
</li>
<li><p>A generic fixture-comparator engine (<code>equals</code>, <code>list_unordered_equals</code>, <code>regex</code>, <code>gte</code>, <code>lte</code>, …) keyed by fixture key</p>
</li>
<li><p>Domain-specific comparators (e.g., CVSS v3.1 score comparison with tolerance)</p>
</li>
<li><p>Judge-based goal-coverage scoring</p>
</li>
<li><p>Group-level cross-iteration grading (the polish-vs-diversity case from §4)</p>
</li>
</ul>
<p>Each is a standalone module promptfoo invokes per cell. They share a common configuration shape so the eval-config entry stays declarative.</p>
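<p>As one illustrative module: a tolerance comparator in the shape promptfoo's Python assertions use (a <code>get_assert(output, context)</code> entry point returning a <code>{pass, score, reason}</code> grading result). The field names <code>cvss_score</code> and <code>expected_cvss</code> are hypothetical, as is the tolerance value:</p>

```python
import json

def get_assert(output: str, context: dict) -> dict:
    """Compare the model's CVSS score against the fixture value, with tolerance."""
    tolerance = 0.5  # illustrative; real comparators take this from config
    try:
        actual = float(json.loads(output)["cvss_score"])
        expected = float(context["vars"]["expected_cvss"])  # from fixtures file
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return {"pass": False, "score": 0.0,
                "reason": "missing or unparseable cvss_score"}
    ok = abs(actual - expected) <= tolerance
    return {"pass": ok, "score": 1.0 if ok else 0.0,
            "reason": f"cvss {actual} vs expected {expected} (tolerance {tolerance})"}
```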
<h3>Caching the candidate calls</h3>
<p>Two layers of caching keep refresh costs down:</p>
<ol>
<li><p><strong>Promptfoo's built-in eval cache.</strong> Same (model, prompt) pair re-uses prior result.</p>
</li>
<li><p><strong>A "replay" mode</strong> in our wrapper. When only the rubric or assertion changed, we replay candidate outputs from the prior run and re-grade with new assertions, never re-billing the candidate model calls.</p>
</li>
</ol>
<p>This is what makes "tweak the rubric, re-run" actually tractable. Without replay, every rubric edit costs another full eval grid.</p>
<hr />
<h2>8. Making the output readable</h2>
<p>This is where the raw eval output becomes something a person can act on.</p>
<p>After promptfoo finishes, we have a giant JSON per prompt type containing every (model × fixture) cell result with per-assertion pass/fail, scores, latencies, costs, and judge reasons. Reading this file manually is the wrong interface.</p>
<p>We wrote a custom <strong>Claude Code skill</strong> that does the analysis. The skill:</p>
<ol>
<li><p>Loads the rolled-up per-candidate aggregates plus the raw cell-level results</p>
</li>
<li><p>Reads the eval config, the rubric file, the scan profile, and the model cost/context-window catalog</p>
</li>
<li><p><strong>Picks the discriminating intelligence assertion dynamically</strong> by walking a priority list (rubric/judge metrics first, then custom fixture-based, then schema-validity last resort), picking the first one with spread <code>≥ ε_q</code></p>
</li>
<li><p><strong>Computes the composite</strong> with per-prompt weights from §3</p>
</li>
<li><p><strong>Detects degenerate cases</strong> automatically (non-discriminative axes, inverse correlations, single-fixture coverage gaps) and adjusts the presentation</p>
</li>
<li><p><strong>Renders a markdown report</strong> with the top-3 ranking, per-axis breakdowns, intelligence-as-/10 grading, scan-time impact, cost deltas, and concrete swap recommendations</p>
</li>
</ol>
<p>The skill runs entirely within Claude Code, has access to the filesystem, and emits a per-prompt markdown report dated by run. The structure is stable across reports, but the prose reflects the actual data shape. When an inverse correlation triggers, for example, the report switches to dual top-3 tables instead of forcing one composite ranking.</p>
<h3>Why a skill, not a deterministic script?</h3>
<p>A pure script would have to encode every special case in advance:</p>
<ul>
<li><p>Detecting that an axis is non-discriminative</p>
</li>
<li><p>Choosing which assertion to display as primary</p>
</li>
<li><p>Spotting that two criteria are inversely correlated</p>
</li>
<li><p>Writing prose explanations of why the ranking came out the way it did</p>
</li>
</ul>
<p>A skill handles these as judgment calls, informed by deterministic numbers but reasoning about them in context. The output is markdown, not JSON, so it is directly consumable by a human.</p>
<p>The skill is versioned as a markdown file in the repo and edited as we discover new failure modes. It is the analysis layer that makes the rest of the pipeline usable.</p>
<p>A second skill runs <em>across all prompt types</em> with a collated summary view, identifies suspicious top-model scores (where the eval is doing its job but the production prompt has issues), and surfaces concrete prompt-fix actions. Two skills, two zoom levels, same underlying data.</p>
<hr />
<h2>Findings</h2>
<p>Running the eval end-to-end gave us a few findings that are useful beyond our own pipeline:</p>
<ul>
<li><p><strong>Per-prompt model selection beats global model selection.</strong> The same model can be the right choice for a vulnerability verdict prompt and the wrong choice for a high-volume loop prompt.</p>
</li>
<li><p><strong>Call frequency is not enough.</strong> It is a good prior for cost sensitivity, but the prompt's failure mode decides how much quality should dominate.</p>
</li>
<li><p><strong>Some rubric axes should not be averaged.</strong> In one prompt family, polish and diversity had Pearson correlation <code>r ≈ -0.7</code>. A single composite hid the useful signal; dual rankings made the tradeoff visible.</p>
</li>
<li><p><strong>LLM judges need deterministic scaffolding.</strong> JSON checks, Pydantic validation, fixture comparators, min-dimension floors, and input-visible judging made the judge useful as one signal instead of the only source of truth.</p>
</li>
<li><p><strong>The output has to be analyzed at the right level.</strong> Raw promptfoo JSON is too low-level; a per-prompt report with top-3 rankings, cost deltas, latency deltas, and failure reasons is the interface people actually use.</p>
</li>
</ul>
<img src="https://cdn.hashnode.com/uploads/covers/631816f12cef8b07c2f1e630/0c3935a7-87d0-403a-bda5-f1deabc465f4.png" alt="" style="display:block;margin:0 auto" />

<p><em>Same rank-1 frequency view with model labels shown.</em></p>
<p>Operationally, this also gave us:</p>
<ul>
<li><p>A defended <code>current_model</code> choice per prompt type, with a one-line rationale traceable to a measured ranking.</p>
</li>
<li><p>Cost projections based on real calls/scan, which makes <code>$/scan</code> and <code>$/year</code> model swaps concrete.</p>
</li>
<li><p>An overnight refresh loop over 20 prompt types and 5,000 graded cells.</p>
</li>
<li><p>A diagnostic loop that flags cases like "all candidates pass &lt; 50%", "inverse correlation detected", and "cost axis dropped, relative spread 0.3%".</p>
</li>
</ul>
<hr />
<h2>The takeaway</h2>
<p>The main lesson is that model choice is not global.</p>
<p>The same model can be optimal for one prompt and wasteful for another. Once we measured prompts separately, many decisions became obvious: pay more where mistakes are expensive, save money where volume dominates, and keep the eval cheap enough to rerun.</p>
<p>The process is simple:</p>
<ol>
<li><p>Shortlist with a calibrated cost gate.</p>
</li>
<li><p>Profile the production load before scoring.</p>
</li>
<li><p>Weight axes per prompt type, not only by volume bucket.</p>
</li>
<li><p>Layer assertions: cheap deterministic checks first, expensive LLM judge last, with min-dimension floors.</p>
</li>
<li><p>Curate from real production traces, not synthetic examples.</p>
</li>
<li><p>Lock candidate sets in version-controlled config.</p>
</li>
<li><p>Use promptfoo as the harness; pin the judge through Codex for schema-enforced outputs.</p>
</li>
<li><p>Use a Claude Code skill to analyze the raw eval output and produce the ranking.</p>
</li>
</ol>
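<p>Step 4's layered assertions can be sketched as a small grading function. This is a hedged illustration, not the post's actual harness: <code>llm_judge</code> stands in for whatever schema-enforced grader you pin (the post pins one through Codex), and the thresholds map to the <code>τ_mean</code> / <code>τ_min</code> floors defined in the notation table.</p>

```python
# Sketch of layered assertions: cheap deterministic checks run first and
# short-circuit; the expensive LLM rubric judge runs last, gated by a
# pass-mean threshold (tau_mean) and a per-criterion floor (tau_min).
# `llm_judge` is a placeholder; it must return {criterion: score in [0, 1]}.

def grade(output, deterministic_checks, llm_judge,
          tau_mean=0.7, tau_min=0.4):
    # Layer 1: deterministic checks (regex, JSON-parses, contains-string, ...)
    for check in deterministic_checks:
        if not check(output):
            return {"pass": False, "reason": f"failed {check.__name__}"}
    # Layer 2: LLM rubric judge, only reached if every cheap check passed.
    scores = llm_judge(output)
    mean = sum(scores.values()) / len(scores)
    if mean < tau_mean:
        return {"pass": False, "reason": f"rubric mean {mean:.2f} < {tau_mean}"}
    worst = min(scores, key=scores.get)
    if scores[worst] < tau_min:
        return {"pass": False, "reason": f"criterion {worst} below floor"}
    return {"pass": True, "scores": scores}
```

<p>The ordering is the point: the judge call is the expensive part, so every output that fails a deterministic check never spends judge tokens at all.</p>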
<p>Maximize pwn. Minimize tokens. Pick the right model for the right job.</p>
<hr />
<h2>Notation summary</h2>
<table>
<thead>
<tr>
<th>Symbol</th>
<th>Meaning</th>
</tr>
</thead>
<tbody><tr>
<td><code>M</code></td>
<td>number of candidate models</td>
</tr>
<tr>
<td><code>N</code></td>
<td>number of prompt types</td>
</tr>
<tr>
<td><code>K</code></td>
<td>fixtures per (prompt type, model) cell</td>
</tr>
<tr>
<td><code>B_p</code></td>
<td>per-call cost of baseline model on prompt type <code>p</code></td>
</tr>
<tr>
<td><code>B_ref</code></td>
<td>fixed reference cost used to normalize costs before the power-law gate</td>
</tr>
<tr>
<td><code>b_p</code></td>
<td>normalized baseline cost, <code>B_p / B_ref</code></td>
</tr>
<tr>
<td><code>q_p(m)</code></td>
<td>normalized candidate cost, <code>cost_p(m) / B_ref</code></td>
</tr>
<tr>
<td><code>T^in_p, T^out_p</code></td>
<td>avg input/output tokens for prompt type <code>p</code></td>
</tr>
<tr>
<td><code>c^in_m, c^out_m</code></td>
<td>per-token cost of model <code>m</code></td>
</tr>
<tr>
<td><code>\bar s_m</code></td>
<td>rubric mean score of model <code>m</code> on the primary intelligence assertion</td>
</tr>
<tr>
<td><code>\rho_m</code></td>
<td>overall pass rate of model <code>m</code></td>
</tr>
<tr>
<td><code>\bar r_m, r_{m,c}</code></td>
<td>rubric mean and per-criterion score of model <code>m</code></td>
</tr>
<tr>
<td><code>τ_mean, τ_min</code></td>
<td>rubric pass-mean threshold and per-criterion minimum</td>
</tr>
<tr>
<td><code>κ_m, ℓ^p95_m</code></td>
<td>cost per scan and p95 latency for model <code>m</code></td>
</tr>
<tr>
<td><code>w_intel, w_cost, w_lat, w_thr</code></td>
<td>composite weights, summing to 1</td>
</tr>
<tr>
<td><code>Σ_m</code></td>
<td>composite score for model <code>m</code></td>
</tr>
<tr>
<td><code>ε_q, ε_c, ε_l</code></td>
<td>non-discriminative-axis ε thresholds</td>
</tr>
</tbody></table>
<hr />
<h2>Composite formula (single source of truth)</h2>
<p>For the active-axis set <code>A ⊆ {intel, cost, lat, thr}</code> (after applying throughput-drop and non-discriminative-drop rules) and normalized weights <code>w'</code> summing to 1:</p>
<p>$$\Sigma_m = \sum_{a \in A} w'_a \cdot \widehat{a}_m$$</p>
<p>where <code>\widehat{a}_m</code> is the min-max normalized value on axis <code>a</code> for model <code>m</code>.</p>
<p>Intelligence:</p>
<p>$$\widehat{\text{intel}}_m = 0.7 \cdot \frac{\bar s_m - \min_k \bar s_k}{\max_k \bar s_k - \min_k \bar s_k} + 0.3 \cdot \rho_m$$</p>
<p>Cost, where lower cost is better:</p>
<p>$$\widehat{\text{cost}}_m = \frac{\max_k \kappa_k - \kappa_m}{\max_k \kappa_k - \min_k \kappa_k}$$</p>
<p>Latency, where lower p95 latency is better:</p>
<p>$$\widehat{\text{lat}}_m = \frac{\max_k \ell^{p95}_k - \ell^{p95}_m}{\max_k \ell^{p95}_k - \min_k \ell^{p95}_k}$$</p>
<p>Throughput is normalized analogously on <code>time_per_scan_s</code>.</p>
<p>When <code>max_k = min_k</code> for an axis, the axis is degenerate. We drop it and redistribute its weight across the remaining active axes.</p>
<p><em>This evaluation process backs model selection across</em> <a href="https://copilot.bugbase.ai"><em><strong>Pentest Copilot</strong></em></a><em>'s 20 production prompt types.</em></p>
]]></content:encoded></item><item><title><![CDATA[React2Shell (CVE-2025-55182): A Real-World Lesson on Why Incident Response Speed Matters]]></title><description><![CDATA[Zero-day vulnerabilities with a CVSS score of 10 are not theoretical risks, they are production outages waiting to happen. The recent React2Shell vulnerability demonstrated this brutally: a simple payload led to remote code execution (RCE) across tho...]]></description><link>https://blog.ssitaraman.com/react2shell-cve-2025-55182-a-real-world-lesson-on-why-incident-response-speed-matters</link><guid isPermaLink="true">https://blog.ssitaraman.com/react2shell-cve-2025-55182-a-real-world-lesson-on-why-incident-response-speed-matters</guid><category><![CDATA[React]]></category><category><![CDATA[Next.js]]></category><category><![CDATA[CVE-2025-55182]]></category><category><![CDATA[React2Shell]]></category><dc:creator><![CDATA[Sitaraman Subramanian]]></dc:creator><pubDate>Mon, 15 Dec 2025 13:36:47 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1765805571902/4255a637-9185-41a9-bb5b-35977eb527ff.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Zero-day vulnerabilities with a <strong>CVSS score of 10</strong> are not theoretical risks, they are production outages waiting to happen. The recent <strong>React2Shell</strong> vulnerability demonstrated this brutally: a simple payload led to <strong>remote code execution (RCE)</strong> across thousands of applications using <strong>Next.js / React</strong>, impacting startups and large enterprises alike. This post walks through how I identified the issue in real time, validated the exploit, and mitigated it across production systems within hours.</p>
<h2 id="heading-so-what-is-this-react2shell">So what is this React2Shell???</h2>
<p>React2Shell (CVE-2025-55182) allows <strong>unauthenticated RCE</strong> in vulnerable Next.js / React setups due to unsafe server-side rendering behavior. Exploitation requires <strong>no authentication</strong> and no complex chaining: just a crafted payload.</p>
<ul>
<li><p>Official site: <a target="_blank" href="https://react2shell.com/">https://react2shell.com/</a></p>
</li>
<li><p>PoC used: <a target="_blank" href="https://github.com/lachlan2k/React2Shell-CVE-2025-55182-original-poc">https://github.com/lachlan2k/React2Shell-CVE-2025-55182-original-poc</a></p>
</li>
<li><p>Cloudflare impact analysis by <em>ThePrimeTime</em> (excellent breakdown): <a target="_blank" href="https://www.youtube.com/watch?v=7vw445i8gOI">https://www.youtube.com/watch?v=7vw445i8gOI</a></p>
</li>
<li><p>PortSwigger on detecting React2Shell: <a target="_blank" href="https://portswigger.net/blog/how-to-detect-react2shell-with-burp-suite">https://portswigger.net/blog/how-to-detect-react2shell-with-burp-suite</a></p>
</li>
</ul>
<p>This wasn’t just a bug. It was a <strong>global fire drill</strong>.</p>
<h2 id="heading-how-i-discovered-it">How I discovered it</h2>
<p>Like many engineers, I was casually scrolling Twitter (or should I call it X?) when I noticed chatter around a “new Next.js CVE.” At first glance, it looked like typical security banter: noisy, half-confirmed claims. But CVEs around frontend frameworks are rare, and when they show up, they deserve attention.</p>
<h3 id="heading-a-server-was-down">A SERVER WAS DOWN</h3>
<p>One of our subdomains was unresponsive, so I dug deeper into what could have happened.</p>
<p>Digging into <strong>Nginx Docker logs</strong>, I noticed something that immediately felt off:</p>
<blockquote>
<p>The <strong>frontend upstream</strong> was failing—not the backend.</p>
</blockquote>
<p>In most architectures, backend services are the usual failure point. Seeing the <strong>frontend container crash or hang under malformed requests</strong> was a huge red flag.</p>
<h3 id="heading-log-analysis-showed-something-phishy">Log analysis showed something phishy</h3>
<p>Reviewing request logs showed:</p>
<ul>
<li><p>Broken request paths</p>
</li>
<li><p>Non-standard payloads</p>
</li>
<li><p>Repeated patterns hitting SSR routes</p>
</li>
</ul>
<p>These were not scanners. These were <strong>exploit attempts</strong>.</p>
<p>This aligned suspiciously well with what I’d just seen online.</p>
<hr />
<h2 id="heading-confirming-the-exploit">Confirming the exploit</h2>
<p>Rather than guessing, I moved fast:</p>
<ol>
<li><p>Spun up a <strong>local dev environment</strong> with the same Next.js version</p>
</li>
<li><p>Pulled the original PoC:<br /> <a target="_blank" href="https://github.com/lachlan2k/React2Shell-CVE-2025-55182-original-poc/blob/main/01-submitted-poc.js">https://github.com/lachlan2k/React2Shell-CVE-2025-55182-original-poc/blob/main/01-submitted-poc.js</a></p>
</li>
<li><p>Ran the payload</p>
</li>
</ol>
<p><strong>Result:</strong><br />➡️ <strong>RCE on the first attempt. No tweaking. No edge cases. (OK maybe a lil bit tweaking but you get it)</strong></p>
<p>That moment was sobering. This wasn’t a “theoretical exploit.” This was <strong>weaponized, reliable, and already in the wild</strong>.</p>
<h3 id="heading-immediate-mitigation-steps-taken">Immediate Mitigation Steps Taken</h3>
<p>Once confirmed, there was no room for debate or staged rollouts.</p>
<h4 id="heading-1-emergency-dependency-upgrades">1. Emergency Dependency Upgrades</h4>
<ul>
<li><p>Identified all projects using vulnerable:</p>
<ul>
<li><p>Next.js</p>
</li>
<li><p>React</p>
</li>
</ul>
</li>
<li><p>Upgraded immediately to patched versions</p>
</li>
<li><p>Redeployed all affected services</p>
</li>
</ul>
<h4 id="heading-2-temporary-compensating-controls">2. Temporary Compensating Controls</h4>
<p>While patching:</p>
<ul>
<li><p>Tightened WAF rules - enabled for all subdomains</p>
</li>
<li><p>Increased logging verbosity for anomaly detection</p>
</li>
</ul>
<hr />
<h3 id="heading-why-speed-is-everything-in-times-like-these">Why Speed Is Everything in Times Like These</h3>
<p>This incident reinforced a hard truth:</p>
<blockquote>
<p><strong>There is no grace period for CVSS 10 vulnerabilities.</strong></p>
</blockquote>
<p>By the time a public PoC exists:</p>
<ul>
<li><p>Mass scanning is already happening</p>
</li>
<li><p>Bots don’t wait for your change advisory board</p>
</li>
<li><p>“We’ll patch next sprint” equals compromise</p>
</li>
</ul>
]]></content:encoded></item></channel></rss>