Mission Trace // Internal Show And Tell

Not raw prompting.
Structured AI workflow.
Auditable end to end.

This walkthrough is meant for engineers who are cautious about “vibe coding.” The point is not that AI can type fast. The point is that newer tooling can put specs, review gates, memory, and adversarial checks around that speed.

The mission: Build a complete IPTV player engine for Android TV — ExoPlayer singleton, channel switching state machine, error recovery, token refresh, 27 tests. The workflow caught two integration bugs (one critical, one high) after tests were already green, challenged its own plan before implementation, and audited whether its own verification claims were honest.
Orientation

Why this is different from raw “vibe coding.”

The skeptical reaction is reasonable. If the model is just generating code from a prompt, quality is mostly luck. This workflow is different because it introduces structure before and after code generation.

Skills
Reusable, named workflows for jobs like planning, adversarial review, verification, and memory capture. Instead of hoping the agent remembers a ritual, the ritual is explicit and callable.
Hooks
Automatic checks triggered around agent actions. Think policy and automation: lint, review, reminders, constraints, or follow-up tools that run without relying on the human to remember every step.
Memory
Project-specific conventions captured in files like CLAUDE.md, so lessons from one session become defaults for the next instead of disappearing into chat history.
Evidence
Plans, decisions, reviews, findings, and blocked verifications become artifacts. That gives you something to inspect, challenge, and improve instead of treating the model as a black box.
Pre-Pipeline

Start with a spec, not a vibe.

Before the pipeline starts, the developer creates a structured requirements document. Then the spec itself gets stress-tested through multi-model review before any implementation begins.

/gsd:new-project Project bootstrap
Analyzed the existing Go backend, a failed previous Compose TV attempt, and decompiled competitor apps (TiviMate, Sparkle TV). Produced PROJECT.md, REQUIREMENTS.md, and a phased ROADMAP.md with success criteria for each phase.
/gsd:discuss-phase Spec refinement
Surfaced gray areas: "How should the channel switching state machine work?", "What happens when a token expires mid-stream?", "Should tests live alongside production code?" Each decision locked with rationale and rejected alternatives.
/dialogue Live multi-engine debate on the spec
The final spec was sent through /dialogue, where Codex, Gemini, and Claude reviewed the same proposal from different angles. Claude moderated the flow, claims were checked against local code context, and the session forced convergence on concrete design decisions before execution.
What survived the debate
43 locked decisions in CONTEXT.md. Every decision includes what was chosen, why, what was rejected, and what assumptions it rests on. This document becomes the contract all downstream agents follow.
Stage 1

Lock decisions. Scout the codebase.

The ship phase begins: it reads the validated spec, scans the codebase for reusable assets, and produces context that downstream agents can use without repeatedly rediscovering the same facts.

/gsd:discuss-phase --auto
Auto-mode picked defaults for remaining gray areas. The ROADMAP risk notes were so thorough that convergence happened in 1 round — no /chat or /challenge needed.
Codebase explorer agent
Mapped 72 files across 6 modules. Identified reusable assets: ConnectivityMonitor, FakeChannelRepository, DeviceEventSource. Documented every integration point.
Stage 2a

Research before planning. Discover the landmines.

/gsd:research-phase Technical investigation
A researcher agent checked official docs, issue trackers, and release notes. Three critical findings:

ExoPlayer is not mockable since Media3 1.9.0 (static initializer reads Build.DEVICE). This shaped the entire test architecture — PlayerCommands interface became the testability boundary.
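
That boundary can be sketched in a few lines of Kotlin. The PlayerCommands name comes from the trace; the exact method set here is an assumption:

```kotlin
// Assumed shape of the PlayerCommands boundary: the controller depends on
// this interface, so JVM unit tests never load a Media3 class (whose static
// initializer reads Build.DEVICE and fails off-device).
interface PlayerCommands {
    fun setMediaItemAndPrepare(url: String)
    fun stop()
}

// Test double used instead of trying to mock ExoPlayer itself.
class FakePlayerCommands : PlayerCommands {
    val prepared = mutableListOf<String>()
    var stopCount = 0
    override fun setMediaItemAndPrepare(url: String) { prepared += url }
    override fun stop() { stopCount++ }
}
```

Production binds the interface to the ExoPlayer singleton inside the service; unit tests exercise the controller against the fake.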

player.stop() causes a black screen on Android TV hardware (Media3 issue #2941).

SDK 35 requires FOREGROUND_SERVICE_MEDIA_PLAYBACK permission or the service crashes at runtime.
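
In manifest terms, that finding looks roughly like this (a hedged sketch; the permission name is real, but the service path and attributes are illustrative):

```xml
<!-- Typed foreground-service permission plus a matching service type;
     without both, the playback service crashes at runtime on newer SDKs. -->
<uses-permission android:name="android.permission.FOREGROUND_SERVICE_MEDIA_PLAYBACK" />

<service
    android:name=".PlaybackService"
    android:foregroundServiceType="mediaPlayback"
    android:exported="false" />
```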
Stage 2b

Create plans. Then challenge the plans.

/gsd:plan-phase Execution plans
4 plans across 4 waves with task-level <read_first>, <acceptance_criteria>, and <action> blocks. Every requirement (PLAY-01 through PLAY-06) mapped to at least one plan. A plan-checker agent verified coverage.
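
A single task in one of these plans looks roughly like the following; the three tag names appear in the trace, while the wrapper and content are hypothetical:

```xml
<task id="W2-T3" requirement="PLAY-02">
  <read_first>CONTEXT.md decisions on tune debounce; PlaybackController.kt</read_first>
  <action>Implement the 300ms debounce in the tune state machine</action>
  <acceptance_criteria>Rapid channelUp presses produce exactly one tune call</acceptance_criteria>
</task>
```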
Stage 2c

Stress-test the plans before writing code.

/challenge Adversarial stress-test
Wrapped the plans as a single claim and sent it through Gemini for critical reassessment. Verdict: partially_agree. The strongest counter: deferring all tests to Wave 4 means the complex state machine is developed "blind."
/consensus Blinded multi-model evaluation
Three models. Three stances. Same proposal. None see the others' output.
Gemini 2.5 Pro Arguing FOR
"The PlaybackController is not a simple CRUD component; it is a complex state machine with intricate, time-sensitive logic. Writing tests alongside the implementation will catch subtle bugs as they are introduced."
GPT-5.2 Arguing AGAINST
"The hardest logic produces the most brittle tests when written early. With 300ms debounce windows and exponential backoff, early tests tend to over-specify incidental scheduling details rather than invariants." But conceded: "Deferring all tests is risky for the PlaybackController."
GPT-5.4 NEUTRAL
"For PlaybackController, the evidence clearly favors restructuring. The ROADMAP calls out rapid D-pad switching, serialized tune behavior, and token refresh as core risks — those exact behaviors are in Wave 2 and only checked in Wave 4."
/chat + /plan-hardening
All 3 converged. Plans were restructured: tests moved into production waves, boundaries were clarified, and PLAN-HARDENING.md was produced. The important point is not the theatrics. The important point is that the workflow changed course before implementation.
Stage 3

Execution still moves fast, but it is gated.

12 commits
27 tests
22 new files
8 gates
Wave 1 — Scaffold
2 new Gradle modules, 6 interface contracts, Media3 1.9.2 deps, token refresh API
Wave 2 — Engine + Tests
PlaybackController (tune state machine, 300ms debounce, retry backoff), PlaybackService (ExoPlayer singleton), StreamTokenManager — plus 24 tests
Wave 3 — UI + Tests
PlayerActivity, ChannelOverlayView (4s auto-hide, TalkBack), EpgSyncCoordinator — plus 3 tests
Wave 4 — Verification
Full test suite pass, full app build, no regressions across 371 Gradle tasks
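
The Wave 2 tune debounce can be sketched in plain Kotlin. All names here are assumptions, and the shipped controller is a fuller coroutine-driven state machine; this only shows the core rule that a new D-pad press restarts the 300ms window:

```kotlin
// Minimal debounce sketch: the latest request wins, and tuning only
// happens once 300ms pass with no newer request.
class TuneDebouncer(private val windowMs: Long = 300) {
    private var pending: Int? = null
    private var lastRequestAt: Long = 0

    // Each D-pad press replaces the pending channel and restarts the window.
    fun request(channel: Int, nowMs: Long) {
        pending = channel
        lastRequestAt = nowMs
    }

    // Called on a timer tick; returns the channel to tune once the window
    // has elapsed with no newer request, else null.
    fun poll(nowMs: Long): Int? {
        val channel = pending ?: return null
        if (nowMs - lastRequestAt < windowMs) return null
        pending = null
        return channel
    }
}
```

Because request() replaces the pending channel, holding channel-up scrolls through the list without tuning every intermediate channel.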
/phase-execution-hardening 8-gate gauntlet
G1 Execute → G2 /tracer wiring audit → G3 decision hardening → G4 external code review → G4b /test-audit → G4c /rubber-ducky blind spots → G4d /code-health SOLID/KISS → G5–G8 gap closure, simplify, verify, reconcile. This is the core pitch: speed plus repeatable review pressure.
The Catch

The most convincing part: the workflow found what green tests missed.

After all code was written and all 27 tests passed, an external adversarial review still found bugs at an abstraction layer unit tests cannot reach. This is the part skeptical engineers usually care about most.

Critical: @Singleton missing on PlaybackController
Without @Singleton, Hilt creates separate instances for PlayerActivity and PlaybackService. The Activity's tune/channelUp calls would never reach the Service's player. Playback would never start. All 27 tests passed because tests use direct instantiation, not Hilt injection.
High: Token refresh on wrong thread
handleTokenExpired() launched on the IO dispatcher but then called playerCommands.setMediaItemAndPrepare(), which must run on the main thread. The result: intermittent crashes on real devices.
Fixed: Both resolved immediately
Fixes were applied, tests were re-run, and the changes were committed. These bugs were not caught by the original tests because the problem was integration and threading behavior, not isolated business logic. That is exactly why extra review layers matter.
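
The scoping bug reduces to provider semantics. A Hilt-free Kotlin sketch (illustrative names only):

```kotlin
class Controller

// Unscoped binding: each injection site gets a fresh instance, so an
// Activity and a Service end up driving two different controllers.
fun provideUnscoped(): Controller = Controller()

// Singleton-scoped binding (what adding @Singleton achieves): one shared
// instance for every injection site in the component.
val provideSingleton: Controller by lazy { Controller() }
```

Tests that instantiate the controller directly can never observe the difference; only Hilt-wired integration paths expose it, which is why all 27 unit tests stayed green.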
Stage 4 & 5

The workflow also audits its own confidence.

/gsd:verify-work Acceptance testing
Extracted 8 testable deliverables. 2 automated checks passed; 6 were blocked pending a physical Android TV device. Zero code issues found.
/implementation-review Process integrity audit
This gate does not review code first. It reviews the review process. 16 findings: proactive token refresh was still a stub while the requirement was claimed complete, mockk(relaxed = true) hid failure paths, and PlaybackService had zero unit tests. None are code bugs. They are integrity gaps between what the workflow said and what it truly proved.
/claude-md-improver Institutional memory
9 conventions captured in CLAUDE.md for future sessions: ExoPlayer not mockable, @Singleton for cross-component injection, MPEG-TS via ProgressiveMediaSource, foreground service permissions, PlaybackController lifecycle.
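
A captured convention might read like this (hypothetical wording; the trace names the topics but not the text):

```markdown
## Playback conventions
- ExoPlayer is not mockable (Media3 static init reads Build.DEVICE); test through PlayerCommands.
- Any component injected into both an Activity and a Service must be @Singleton-scoped.
- MPEG-TS streams play via ProgressiveMediaSource.
- SDK 35: declare FOREGROUND_SERVICE_MEDIA_PLAYBACK or PlaybackService crashes at start.
```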

The building blocks behind the workflow.

Each skill is a specialized capability with a narrow responsibility. That is useful because it replaces vague prompting with named, repeatable operations.

Mission Console // Skill Invocation Trace
/dialogue
Multi-model review of a shared proposal. Useful when you want competing critiques before locking a decision.
Pre-pipeline · spec validation
/gsd:discuss-phase
Surfaces gray areas and locks decisions. Interactive or auto-mode. Produces CONTEXT.md.
S1 · context gathering
/gsd:research-phase
Investigates libraries, APIs, pitfalls. Checks docs, issue trackers, release notes.
S2 · pre-planning
/gsd:plan-phase
Creates execution plans with acceptance criteria. Runs a plan-checker for coverage.
S2 · plan creation
/challenge
Adversarial stress-test. Wraps claims, sends through CLI engines with structured verdicts.
S2 · plan hardening
/consensus
Blinded multi-model evaluation. Three stances, same proposal, none see each other.
S2 · plan hardening
/chat
Collaborative deliberation. Up to 5 rounds to converge on concrete edits.
S2 · convergence
/plan-hardening
Orchestrates challenge → consensus → chat. Produces boundary declarations and plan edits.
S2 · post-planning
/gsd:execute-phase
Wave-based execution with checkpoints. Spawns agents, handles checkpoints, and validates wiring instead of treating implementation as a single opaque blob.
S3 · code execution
/phase-execution-hardening
8-gate quality gauntlet: wiring, code review, test audit, blind spots, health, gaps.
S3 · post-execution
/tracer
Static wiring analysis via LSP. Verifies call chains, interfaces, reachability.
S3 · gate G2
/gsd:verify-work
Conversational UAT. Extracts testable deliverables, records pass/fail/blocked.
S4 · acceptance
/implementation-review
Adversarial process audit. Reviews the reviewers. 7 dimensions of verification integrity.
S5 · process audit
/claude-md-improver
Captures conventions into CLAUDE.md. Prevents knowledge decay across sessions.
S5 · memory capture
/gsd:add-todo
Captures discovered work as structured todos. Prevents scope creep, preserves ideas.
S5 · work triage

What this workflow actually demonstrates.

Code Shipped

  • 2 new Gradle modules
  • ExoPlayer singleton in MediaSessionService
  • Tune state machine, 300ms debounce
  • Channel switching with wrap-around
  • Exponential backoff retry
  • Single-flighted token refresh
  • Network reconnection
  • Channel overlay + TalkBack
  • Player launch contract
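
The retry behavior in that list can be sketched as capped exponential delays; the base and cap values here are assumptions, not the shipped constants:

```kotlin
// Capped exponential backoff: delays double per attempt until a ceiling,
// e.g. 500ms, 1s, 2s, 4s, then held at 8s.
fun backoffDelaysMs(attempts: Int, baseMs: Long = 500, capMs: Long = 8_000): List<Long> =
    (0 until attempts).map { attempt ->
        (baseMs shl attempt).coerceAtMost(capMs)
    }
```

Adding jitter is a common refinement so that many devices retrying a dead stream do not reconnect in lockstep.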

Why It Matters

  • Specs become executable context
  • Plans are challenged before coding
  • Green tests are not treated as proof
  • External review catches integration bugs
  • Audit trails make claims inspectable
  • Lessons learned become team memory
  • Follow-up work is captured instead of lost
  • Human oversight moves to decisions and risk
  • Speed comes with more review pressure

What The Human Still Owns

  • Define requirements and constraints
  • Decide what “done” means
  • Inspect artifacts and findings
  • Approve or reject risky tradeoffs
  • Decide when reality on device contradicts theory

The strongest use of AI coding tools is not “let the model freestyle.” It is giving the model a system that makes planning, checking, challenging, and remembering much harder to skip.