PROCESS · 26 April 2026 · 9 min read

We ran an external audit on our own system. It found five things.

Most "tested" software is tested against itself. We unit-test the functions our developers wrote, integration-test the flows we designed, regression-test against the bugs we already know about. None of those catch the bugs that are structurally invisible from the inside — the ones where the code is consistent with itself but inconsistent with reality. Last weekend we ran an external audit on the v5.8 line. It found five things. Here are all five.

What an external audit is, and why it's different

An external audit doesn't read your code. It picks an output your system produces, gathers the same information from sources outside your system, and compares. If your system says NVDA's max pain is $150, an external audit asks: what does ChartExchange say? OptionCharts? Barchart? If those numbers disagree, the audit doesn't tell you which is right; it tells you that something needs investigating.
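The comparison step itself is mechanical. A minimal sketch, assuming the external values were gathered by hand; the audit_metric helper, the 5% tolerance, and the specific per-source values are illustrative (the audit found "~$200" collectively, not these exact figures):

```python
def audit_metric(metric: str, system_value: float,
                 external_values: dict[str, float],
                 tolerance: float = 0.05) -> list[str]:
    """Flag external sources that disagree with our system by more
    than `tolerance` (relative). A flag doesn't say who is right,
    only that something needs investigating."""
    flagged = []
    for source, value in external_values.items():
        if abs(value - system_value) / max(abs(value), 1e-9) > tolerance:
            flagged.append(f"{metric}: {source} says {value}, we say {system_value}")
    return flagged

# The NVDA max pain discrepancy from the audit, as this check would see it:
print(audit_metric("NVDA max pain", 150.0,
                   {"ChartExchange": 200.0, "OptionCharts": 200.0}))
```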

Internal tests only catch bugs you can imagine. External audits catch bugs you can't.

The setup

We picked two contrasting tickers from a real portfolio: NVDA (where v5.8 made a differentiated bullish-with-conflict call: dark pool ACCUM but bearish max-pain pull) and TSLA (where v5.8 made a cautious "step aside, tape doesn't confirm framework" call). For each, we researched the ticker through public sources the system had never touched as inputs: ChartExchange, OptionCharts, FinancialContent, AltIndex, FlashAlpha, Motley Fool, HedgeFollow. Then we compared the system's narrations against what those sources said.


The five findings

Finding 1 · Weekend / off-hours data blindness · HIGH · fixed in v5.9.0

Hidden Tape Coach narrations on Sunday confidently described "current" institutional flow, but the live whale module, the put-sweep detector, and news sentiment all rely on real-time trading flow that doesn't exist over weekends. A "neutral" read could mean genuinely neutral, or it could mean no data wearing a confident mask. There was no staleness disclosure anywhere.
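A minimal sketch of the staleness disclosure this finding called for, assuming a naive weekend-only check; the function name and message are illustrative, and the shipped v5.9.0 guard would also need a holiday and after-hours calendar:

```python
from datetime import datetime, timezone

def staleness_banner(now: datetime) -> str | None:
    """Return a disclosure string when live-flow signals can't be live.
    Weekend-only detection; weekday() gives Saturday=5, Sunday=6."""
    if now.weekday() >= 5:
        return ("Markets closed: whale, put-sweep, and sentiment reads "
                "reflect Friday's close, not current flow.")
    return None

# 2026-04-26 is a Sunday, so the banner fires:
print(staleness_banner(datetime(2026, 4, 26, tzinfo=timezone.utc)))
```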
Finding 2 · Point 5 (whale_sentiment) over-weighted news · HIGH · fixed in v5.9.1

All five mega-caps in our test portfolio scored whale_sentiment 8/10 from MarketAux news. Independent verification showed actual institutional flow was neutral on four of the five; only NVDA had real dark-pool ACCUM. The "whale" bar was reading bullish news as if it were bullish institutional positioning, because news dominated flow signals 10:1 in the scoring formula.
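A sketch of the direction of the v5.9.1 fix: separate the two signals and stop letting news stand in for flow. The weights and the no-flow cap are illustrative, not the shipped formula:

```python
def whale_sentiment(news_score: float, flow_score: float | None) -> float:
    """Blend news and institutional-flow signals into a 0-10 score.
    v5.8's formula effectively weighted news ~10:1 over flow."""
    if flow_score is None:
        # No flow data at all: cap the score so bullish news alone
        # can't render as strong institutional positioning.
        return min(news_score, 5.0)
    # Flow leads, news seasons (illustrative 70/30 split).
    return 0.7 * flow_score + 0.3 * news_score
```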
Finding 3 · Point 11 fallback default rendered as "bearish" · MED · fixed in v5.9.0

When real options/IV/dark-pool confluence data wasn't available, Point 11 returned a fallback default of 2/10. The 11-point bar chart rendered that as a red bar, visually identical to a real bearish put-sweep read. There was no way to tell a "real bearish 2" from a "fallback 2", so most tickers showed the same red bar regardless of what their actual options flow looked like.
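The shape of the fix is a data-model change, not a rendering tweak: keep "no data" distinct from a real low score all the way to the chart. A sketch with illustrative names:

```python
from dataclasses import dataclass

@dataclass
class PointScore:
    value: int          # 0-10
    is_fallback: bool   # True when value is a default, not a measurement

def bar_color(score: PointScore) -> str:
    """Render fallback scores as 'no data', never as bearish."""
    if score.is_fallback:
        return "gray"
    return "red" if score.value <= 3 else "green"

print(bar_color(PointScore(2, is_fallback=True)))   # gray, not red
print(bar_color(PointScore(2, is_fallback=False)))  # red: a real bearish read
```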
Finding 4 · Max pain calculation was structurally wrong · HIGH · fixed in v5.9.2

Our system reported NVDA max pain at $150 (28% below spot: "structural bearish pressure $58 below price"). External sources for the corresponding expiration showed ~$200. On investigation, the algorithm was computing intrinsic value at the current price and attributing it to each strike, rather than iterating over candidate expiration prices as the standard formula requires. The output was an artifact, not a calculation, and every Hidden Tape narration referencing max pain in the v5.8 line was reasoning from hallucinated numbers. (This one is dissected below.)
Finding 5 · Single-source institutional flow with no confidence indicator · MED · fixed in v5.9.3

Hidden Tape Coach pulled from four sources (the live whale module, the dark-pool chip, the Point 5 score, and the Point 11 score), but the panel header gave no signal about whether one or all four were firing. A panel driven by a single news read looked identical to a panel with multi-source confluence. The most actionable read (NVDA: dark pool and Point 5 agreeing) was visually indistinguishable from the four mega-caps running on news alone.
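A sketch of the confluence badge v5.9.3 added to the panel header; the format and labels are illustrative:

```python
def confluence_badge(sources_firing: list[str]) -> str:
    """Summarize how many independent signal sources back a panel."""
    n = len(sources_firing)
    if n == 0:
        return "no signal"
    label = "single-source" if n == 1 else f"{n}-source confluence"
    return f"{label}: {', '.join(sources_firing)}"

print(confluence_badge(["Point 5"]))                    # single-source: Point 5
print(confluence_badge(["dark-pool chip", "Point 5"]))  # 2-source confluence: ...
```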

The big one

Finding #4 — the max pain bug — was the kind of thing internal testing physically could not catch. The function had a docstring saying "calculate max pain price." It returned a number with the right type. The number was internally consistent across calls. The narrations that consumed it read fluently. The math, looked at line by line, was even self-consistent: it computed an intrinsic value, multiplied by open interest, summed it, took the minimum. It just wasn't computing what the function name said.
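To make the structural difference concrete, here is the standard calculation next to a reconstruction of the bug's shape. The data structures are illustrative (strike-to-open-interest maps), and the buggy function is our reading of the audit's description, not the shipped v5.8 code:

```python
def max_pain(calls: dict[float, int], puts: dict[float, int]) -> float:
    """Standard max pain: for each candidate settlement price (every
    strike), total the intrinsic value paid out to option holders,
    weighted by open interest; return the price minimizing that payout."""
    strikes = sorted(set(calls) | set(puts))

    def payout(settle: float) -> float:
        call_pay = sum(oi * max(settle - k, 0) for k, oi in calls.items())
        put_pay = sum(oi * max(k - settle, 0) for k, oi in puts.items())
        return call_pay + put_pay

    return min(strikes, key=payout)

def max_pain_buggy(calls: dict[float, int], puts: dict[float, int],
                   spot: float) -> float:
    """The v5.8 shape of the bug: intrinsic value is computed at the
    current spot and attributed to each strike, so the minimum tracks
    spot-relative intrinsic value instead of candidate settlements.
    On real NVDA chains this returned $150 against a true ~$200."""
    strikes = sorted(set(calls) | set(puts))

    def pseudo_payout(k: float) -> float:
        return (calls.get(k, 0) * max(spot - k, 0)
                + puts.get(k, 0) * max(k - spot, 0))

    return min(strikes, key=pseudo_payout)
```

Both functions type-check, both return a strike, and both are internally consistent; only the first computes what the name promises. That is exactly the class of bug a unit test written against the function's own logic will bless.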

v5.8 NARRATION (BUGGY)

"Max-pain magnet sits at $150.0$58 below current price — signaling structural bearish pressure. Call gamma exceeds put gamma but total exposure is diffuse."

A confident, specific, actionable claim built on a number that was completely fabricated.

v5.9.2 NARRATION (FIXED)

"Max-pain magnet at $200.0 sits $8.27 below current price; gamma wall at $208.27 (current price) is tight — no concentrated dark-pool block or sweep detected."

Same ticker, same prompt structure; the corrected input produced an honest, proportionate read.

The LLM didn't change. The prompt didn't change. The prompt-builder logic didn't change. Only the math underneath was corrected. Every Hidden Tape narration in the v5.9 line is now sharper because it's reasoning from honest data instead of an artifact.

The pattern that worked: ship → audit → harden

All five findings were addressed in eight releases the same day (v5.9.0 through v5.9.7). The pattern that made that possible is worth naming, because we'll repeat it for every major feature line going forward.

Step 1 — Ship the feature. v5.8.8 added Hidden Tape Coach, the seventh per-ticker AI surface. Multi-source signal integration. Polished narrations. Internally consistent.

Step 2 — Run an external audit. Pick two contrasting outputs. Research each independently using public sources you've never touched. Compare. Do not let the system's confidence in its own outputs influence the comparison.

Step 3 — Document the hardening plan before writing fixes. For each finding: symptom, root cause, hardening proposal, effort estimate, priority. This step is critical: some "weaknesses" turn out to be correct behavior that needs explaining to the user, not bugs that need fixing. Writing it down forces honest prioritization. A sketch of one entry follows.
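The plan entries don't need tooling; a fixed record is enough. A sketch mirroring the fields above, with illustrative names and an assumed effort scale (not our actual tracker schema):

```python
from dataclasses import dataclass

@dataclass
class HardeningItem:
    symptom: str
    root_cause: str
    proposal: str
    effort: str      # e.g. "S" / "M" / "L"
    priority: str    # e.g. "HIGH" / "MED"
    is_bug: bool     # False: correct behavior that only needs explaining

finding_4 = HardeningItem(
    symptom="NVDA max pain reported at $150; external sources show ~$200",
    root_cause="intrinsic value computed at spot, not per candidate settlement",
    proposal="iterate candidate expiration prices per the standard formula",
    effort="M",
    priority="HIGH",
    is_bug=True,
)
```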

Step 4 — Ship the fixes in priority order. Each release small enough to verify end-to-end. Each commit message tells a complete story (symptom → root cause → fix → verification). Each push to origin/main makes the work durable.

What changed for the user

Eight v5.9.x releases shipped on top of the five audit findings: staleness disclosure for off-hours reads, a whale_sentiment rebalanced so news can't masquerade as institutional flow, honest rendering of fallback scores, a corrected max pain calculation, and a source-count confidence indicator on the Hidden Tape panel.

Why this matters for the kind of product we're building

Swing Deck is a discipline product. People trust it with sizing decisions on real money. The asymmetry of "we shipped a feature" versus "we shipped a feature and then audited it ourselves" isn't really about engineering pride — it's about what level of confidence we earn the right to project.

A trading app that confidently tells you NVDA's max pain is $58 below spot when the actual value is $8 below spot is doing the opposite of helping. It's manufacturing precision it doesn't have, and the user pays the cost. The v5.9 line caught one such error. The pattern that caught it will catch the next one.

Most apps don't ship retrospectives. Most apps don't run audits on their own outputs. Most apps don't write hardening plans before writing fixes. We do, because the work compounds — every bug caught at audit time is a bug not paid for at trade time.

The full audit hardening doc and v5.9 retrospective are public on GitHub. So is every commit. So is every fix.

Eight releases · five hardenings · one pattern worth keeping. v5.9 ships now.