We ran an external audit on our own system. It found five things.
Most "tested" software is tested against itself. We unit-test the functions our developers wrote, integration-test the flows we designed, regression-test against the bugs we already know about. None of those catch the bugs that are structurally invisible from the inside — the ones where the code is consistent with itself but inconsistent with reality. Last weekend we ran an external audit on the v5.8 line. It found five things. Here are all five.
What an external audit is, and why it's different
An external audit doesn't read your code. It picks an output your system produces, gathers the same information from sources outside your system, and compares. If your system says NVDA's max pain is at $150, an external audit asks: what does ChartExchange say? OptionCharts? Barchart? If those numbers disagree, the audit doesn't tell you which source is right; it tells you that something needs investigating.
Internal tests only catch bugs you can imagine. External audits catch bugs you can't.
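The core of such a check is small enough to sketch. A hypothetical version in Python (the source names, tolerance, and function name here are illustrative, not our actual implementation):

```python
def audit_metric(internal_value, external_values, tolerance=0.05):
    """Flag an internal metric for investigation when it deviates from
    independent external sources by more than `tolerance` (fractional).

    Returns a list of (source, external_value, deviation) tuples that
    disagree with the internal value. An empty list means no flag.
    """
    flags = []
    for source, value in external_values.items():
        deviation = abs(internal_value - value) / value
        if deviation > tolerance:
            flags.append((source, value, deviation))
    return flags

# NVDA max pain: our system said $150; external sources clustered near $200.
flags = audit_metric(150.0, {
    "ChartExchange": 200.0,
    "OptionCharts": 197.5,
    "Barchart": 202.5,
})
# All three sources flag a ~25% deviation: something needs investigating.
```

Note the check never decides which side is right; it only surfaces the disagreement, which is exactly the posture the audit takes.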
The setup
We picked two contrasting tickers from a real portfolio: NVDA (where v5.8 made a differentiated bullish-with-conflict call — dark pool ACCUM but bearish max-pain pull) and TSLA (where v5.8 made a cautious "step aside, tape doesn't confirm framework" call). For each, we researched the same signals using public sources the system had never touched as inputs: ChartExchange, OptionCharts, FinancialContent, AltIndex, FlashAlpha, Motley Fool, HedgeFollow. Then we compared the system's narrations against what those sources said.
The audit found five things.
The five findings
| # | Finding | What we missed | Severity | Fix |
|---|---|---|---|---|
| 1 | Weekend / off-hours data blindness | Hidden Tape Coach narrations on Sunday confidently described "current" institutional flow, but the live whale module + put-sweep detector + news sentiment all rely on real-time trading flow that doesn't exist over weekends. "Neutral" reads could mean genuinely neutral — or no data wearing a confident mask. No staleness disclosure anywhere. | HIGH | v5.9.0 |
| 2 | Point 5 (whale_sentiment) over-weighted news | All five mega-caps in our test portfolio scored whale_sentiment 8/10 from MarketAux news. Independent verification showed actual institutional flow was neutral on four of five — only NVDA had real dark-pool ACCUM. The "whale" bar was reading bullish news as if it were bullish institutional positioning. News dominated 10:1 over flow signals in the scoring formula. | HIGH | v5.9.1 |
| 3 | Point 11 fallback default rendered as "bearish" | When real options/IV/dark-pool confluence data wasn't available, Point 11 returned a fallback default of 2/10. The 11pt bar chart rendered that as a red bar — visually identical to a real bearish put-sweep read. No way to tell "real bearish 2" from "fallback 2." Most tickers showed the same red bar regardless of what their actual options flow looked like. | MED | v5.9.0 |
| 4 | Max pain calculation was structurally wrong | Our system reported NVDA max pain at $150 (28% below spot, "structural bearish pressure $58 below price"). External sources for the corresponding expiration showed ~$200. On investigation: the algorithm was computing intrinsic value at the current price and attributing it to each strike — not iterating over candidate expiration prices as the standard formula requires. The output was an artifact, not a calculation. Every Hidden Tape narration referencing max pain in the v5.8 line was reasoning from hallucinated numbers. | HIGH | v5.9.2 |
| 5 | Single-source institutional flow with no confidence indicator | Hidden Tape Coach pulled from four sources (live whale module, dark-pool chip, Point 5 score, Point 11 score), but the panel header gave no signal about whether one or four were firing. A panel revealed via single-source news read looked identical to a panel with multi-source confluence. The most actionable read (NVDA: dark pool + Point 5 agreeing) was visually indistinguishable from the four mega-caps with news only. | MED | v5.9.3 |
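Finding #3 is a classic sentinel-value trap: a fallback default that renders identically to a real reading. One way to make the distinction explicit, sketched in Python (the field and color names are illustrative, not our actual rendering code):

```python
from dataclasses import dataclass

@dataclass
class PointScore:
    value: int           # 0-10 bar value
    is_fallback: bool    # True when real confluence data was unavailable

    def bar_color(self) -> str:
        # A missing-data 2 must never look like a real bearish 2,
        # so fallback scores render gray regardless of value.
        if self.is_fallback:
            return "gray"
        if self.value <= 3:
            return "red"
        if self.value >= 7:
            return "green"
        return "yellow"

real_bearish = PointScore(value=2, is_fallback=False)  # genuine bearish read
no_data = PointScore(value=2, is_fallback=True)        # fallback default
# real_bearish.bar_color() == "red"; no_data.bar_color() == "gray"
```

Same numeric value, two visually distinct bars: the chart can no longer launder "we don't know" into "bearish."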
The big one
Finding #4 — the max pain bug — was the kind of thing internal testing physically could not catch. The function had a docstring saying "calculate max pain price." It returned a number with the right type. The number was internally consistent across calls. The narrations that consumed it read fluently. The math, looked at line by line, was even self-consistent: it computed an intrinsic value, multiplied by open interest, summed it, took the minimum. It just wasn't computing what the function name said.
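For reference, the standard max-pain calculation iterates over candidate expiration prices (conventionally the strikes themselves), sums the total intrinsic value option writers would pay out at each candidate, and picks the candidate that minimizes that payout. A minimal sketch, not our production code:

```python
def max_pain(strikes, call_oi, put_oi):
    """Strike that minimizes total option payout at expiration.

    For each candidate expiration price P, total payout is
        sum over strikes K of  call_oi[K] * max(0, P - K)
                             + put_oi[K]  * max(0, K - P).
    The buggy version computed intrinsic value at the *current* price
    only, so it never actually compared candidate expiration prices.
    """
    def total_payout(price):
        calls = sum(oi * max(0.0, price - k) for k, oi in zip(strikes, call_oi))
        puts = sum(oi * max(0.0, k - price) for k, oi in zip(strikes, put_oi))
        return calls + puts

    return min(strikes, key=total_payout)

# Toy chain: pinning at 100 minimizes what writers pay out.
print(max_pain([90, 100, 110], call_oi=[10, 50, 100], put_oi=[100, 50, 10]))  # → 100
```

Skipping the outer iteration over candidate prices collapses the whole minimization into a single evaluation at spot, which is precisely why the output was an artifact rather than a calculation.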
"Max-pain magnet sits at $150.0 — $58 below current price — signaling structural bearish pressure. Call gamma exceeds put gamma but total exposure is diffuse."
A confident, specific, actionable claim built on a number that was completely fabricated.
"Max-pain magnet at $200.0 sits $8.27 below current price; gamma wall at $208.27 (current price) is tight — no concentrated dark-pool block or sweep detected."
Same data, same prompt structure — different inputs produced an honest, proportionate read.
The LLM didn't change. The prompt didn't change. The data the prompt builder assembles didn't change. Only the math underneath was corrected. Every Hidden Tape narration in the v5.9 line is now sharper because it's reasoning from honest data instead of an artifact.
The pattern that worked: ship → audit → harden
All five findings were addressed in eight releases the same day (v5.9.0 through v5.9.7). The pattern that made that possible is worth naming, because we'll repeat it for every major feature line going forward.
Step 1 — Ship the feature. v5.8.8 added Hidden Tape Coach, the seventh per-ticker AI surface. Multi-source signal integration. Polished narrations. Internally consistent.
Step 2 — Run an external audit. Pick two contrasting outputs. Research each independently using public sources you've never touched. Compare. Do not let the system's confidence in its own outputs influence the comparison.
Step 3 — Document the hardening plan before writing fixes. For each finding: symptom, root cause, hardening proposal, effort estimate, priority. This step is critical. Some "weaknesses" turn out to be correct behavior that needs explaining to the user, not bugs that need fixing. Writing it down forces honest prioritization.
Step 4 — Ship the fixes in priority order. Each release small enough to verify end-to-end. Each commit message tells a complete story (symptom → root cause → fix → verification). Each push to origin/main makes the work durable.
What changed for the user
Eight v5.9.x releases shipped on top of the five audit findings:
- v5.9.0 — Markets-closed banner with last-refresh time. LLM narrations now acknowledge weekend staleness and recommend next-session verification. Point 11 fallback bars render gray with explanatory tooltip.
- v5.9.1 — Point 5 whale_sentiment caps at 7 for news-only reads. The 8-10 conviction band requires actual flow corroboration. Five mega-caps that previously scored 8 from news alone now correctly score 7 — the bar reflects the data depth.
- v5.9.2 — Max pain rewritten with the standard industry algorithm. NVDA goes from a hallucinated $150 / $58 below spot to an accurate $200 / $8 below. Every Hidden Tape narration referencing max pain became sharper for free. Confidence + strikes-analyzed + total-OI fields surface to the LLM prompt so it can disclose uncertainty when data is sparse.
- v5.9.3 — Hidden Tape Coach panel headers gain a confidence badge: ✓ N/4 signals (HIGH, multi-source confluence) · ● 1/4 signal (MEDIUM, single-source informational) · ○ paired (CYAN, confirmation context). At a glance, the user can tell which Hidden Tape reads carry real conviction.
- v5.9.4 — Auto-refresh defers when the user is engaged (any coach panel open or recent scroll). Up to 15 minutes of continuous engagement before forced refresh. Scroll position force-saved before each rebuild. No more jump-to-top mid-read.
- v5.9.5 — Dev mode auto-enables Coach Preview Mode. Every coach panel on every card is force-revealed for visual inspection.
- v5.9.6 — Closed out v5.8.7's card design system unification by removing legacy class aliases. Dashboard CSS is simpler, no broken paths.
- v5.9.7 — Settings UI adds Ollama as a first-class AI provider option, plus surfaces the Devil's Advocate auto-pair toggle that previously required console hacking. A new `storage="local"` schema convention is reusable for future browser-local prefs.
Why this matters for the kind of product we're building
Swing Deck is a discipline product. People trust it with sizing decisions on real money. The asymmetry of "we shipped a feature" versus "we shipped a feature and then audited it ourselves" isn't really about engineering pride — it's about what level of confidence we earn the right to project.
A trading app that confidently tells you NVDA's max pain is $58 below spot when the actual value is $8 below spot is doing the opposite of helping. It's manufacturing precision it doesn't have, and the user pays the cost. The v5.9 line caught one such error. The pattern that caught it will catch the next one.
Most apps don't ship retrospectives. Most apps don't run audits on their own outputs. Most apps don't write hardening plans before writing fixes. We do, because the work compounds — every bug caught at audit time is a bug not paid for at trade time.
The full audit hardening doc and v5.9 retrospective are public on GitHub. So is every commit. So is every fix.
Eight releases · five hardenings · one pattern worth keeping. v5.9 ships now.