Mission Trace // Internal Show And Tell

Not raw prompting.
Structured AI workflow.
Auditable end to end.

This walkthrough is meant for engineers who are cautious about “vibe coding.” The point is not that AI can type fast. The point is that newer tooling can put specs, review gates, memory, and adversarial checks around that speed.

The mission: Build a complete IPTV player engine for Android TV — ExoPlayer singleton, channel switching state machine, error recovery, token refresh, 27 tests. The workflow caught two integration bugs (one critical, one high) after tests were already green, challenged its own plan before implementation, and audited whether its own verification claims were honest.
Orientation

Why this is different from raw “vibe coding.”

The skeptical reaction is reasonable. If the model is just generating code from a prompt, quality is mostly luck. This workflow is different because it introduces structure before and after code generation.

Skills
Reusable, named workflows for jobs like planning, adversarial review, verification, and memory capture. Instead of hoping the agent remembers a ritual, the ritual is explicit and callable.
Hooks
Automatic checks triggered around agent actions. Think policy and automation: lint, review, reminders, constraints, or follow-up tools that run without relying on the human to remember every step.
Memory
Project-specific conventions captured in files like CLAUDE.md, so lessons from one session become defaults for the next instead of disappearing into chat history.
Evidence
Plans, decisions, reviews, findings, and blocked verifications become artifacts. That gives you something to inspect, challenge, and improve instead of treating the model as a black box.
Pre-Pipeline

Start with a spec, not a vibe.

Before the pipeline starts, the developer creates a structured requirements document. Then the spec itself gets stress-tested through multi-model review before any implementation begins.

/gsd:new-project Project bootstrap
Analyzed the existing Go backend, a failed previous Compose TV attempt, and decompiled competitor apps (TiviMate, Sparkle TV). Produced PROJECT.md, REQUIREMENTS.md, and a phased ROADMAP.md with success criteria for each phase.
/gsd:discuss-phase Spec refinement
Surfaced gray areas: "How should the channel switching state machine work?", "What happens when a token expires mid-stream?", "Should tests live alongside production code?" Each decision locked with rationale and rejected alternatives.
/dialogue Live multi-engine debate on the spec
The final spec was sent through /dialogue, where Codex, Gemini, and Claude reviewed the same proposal from different angles. Claude moderated the flow, claims were checked against local code context, and the session forced convergence on concrete design decisions before execution.
What survived the debate
43 locked decisions in CONTEXT.md. Every decision includes what was chosen, why, what was rejected, and what assumptions it rests on. This document becomes the contract all downstream agents follow.
Stage 1

Lock decisions. Scout the codebase.

The ship phase begins: it reads the validated spec, scans the codebase for reusable assets, and produces context that downstream agents can use without repeatedly rediscovering the same facts.

/gsd:discuss-phase --auto
Auto-mode picked defaults for remaining gray areas. The ROADMAP risk notes were so thorough that convergence happened in 1 round — no /chat or /challenge needed.
Codebase explorer agent
Mapped 72 files across 6 modules. Identified reusable assets: ConnectivityMonitor, FakeChannelRepository, DeviceEventSource. Documented every integration point.
Stage 2a

Research before planning. Discover the landmines.

/gsd:research-phase Technical investigation
A researcher agent checked official docs, issue trackers, and release notes. Three critical findings:

ExoPlayer is not mockable since Media3 1.9.0 (static initializer reads Build.DEVICE). This shaped the entire test architecture — PlayerCommands interface became the testability boundary.
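
That boundary can be sketched in a few lines of Kotlin. The PlayerCommands name comes from the trace; the exact method set here is an assumption:

```kotlin
// Assumed shape of the PlayerCommands boundary: the controller depends on
// this interface, so JVM unit tests never load a Media3 class (whose static
// initializer reads Build.DEVICE and fails off-device).
interface PlayerCommands {
    fun setMediaItemAndPrepare(url: String)
    fun stop()
}

// Test double used instead of trying to mock ExoPlayer itself.
class FakePlayerCommands : PlayerCommands {
    val prepared = mutableListOf<String>()
    var stopCount = 0
    override fun setMediaItemAndPrepare(url: String) { prepared += url }
    override fun stop() { stopCount++ }
}
```

Production binds the interface to the ExoPlayer singleton inside the service; unit tests exercise the controller against the fake.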

player.stop() causes a black screen on Android TV hardware (Media3 issue #2941).

SDK 35 requires FOREGROUND_SERVICE_MEDIA_PLAYBACK permission or the service crashes at runtime.
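
In manifest terms, that finding looks roughly like this (a hedged sketch; the permission name is real, but the service path and attributes are illustrative):

```xml
<!-- Typed foreground-service permission plus a matching service type;
     without both, the playback service crashes at runtime on newer SDKs. -->
<uses-permission android:name="android.permission.FOREGROUND_SERVICE_MEDIA_PLAYBACK" />

<service
    android:name=".PlaybackService"
    android:foregroundServiceType="mediaPlayback"
    android:exported="false" />
```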
Stage 2b

Create plans. Then challenge the plans.

/gsd:plan-phase Execution plans
4 plans across 4 waves with task-level <read_first>, <acceptance_criteria>, and <action> blocks. Every requirement (PLAY-01 through PLAY-06) mapped to at least one plan. A plan-checker agent verified coverage.
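
A single task in one of these plans looks roughly like the following; the three tag names appear in the trace, while the wrapper and content are hypothetical:

```xml
<task id="W2-T3" requirement="PLAY-02">
  <read_first>CONTEXT.md decisions on tune debounce; PlaybackController.kt</read_first>
  <action>Implement the 300ms debounce in the tune state machine</action>
  <acceptance_criteria>Rapid channelUp presses produce exactly one tune call</acceptance_criteria>
</task>
```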
Stage 2c

Stress-test the plans before writing code.

/challenge Adversarial stress-test
Wrapped the plans as a single claim and sent it through Gemini for critical reassessment. Verdict: partially_agree. The strongest counter: deferring all tests to Wave 4 means the complex state machine is developed "blind."
/consensus Blinded multi-model evaluation
Three models. Three stances. Same proposal. None see the others' output.
Gemini 2.5 Pro Arguing FOR
"The PlaybackController is not a simple CRUD component; it is a complex state machine with intricate, time-sensitive logic. Writing tests alongside the implementation will catch subtle bugs as they are introduced."
GPT-5.2 Arguing AGAINST
"The hardest logic produces the most brittle tests when written early. With 300ms debounce windows and exponential backoff, early tests tend to over-specify incidental scheduling details rather than invariants." But conceded: "Deferring all tests is risky for the PlaybackController."
GPT-5.4 NEUTRAL
"For PlaybackController, the evidence clearly favors restructuring. The ROADMAP calls out rapid D-pad switching, serialized tune behavior, and token refresh as core risks — those exact behaviors are in Wave 2 and only checked in Wave 4."
/chat + /plan-hardening
All 3 converged. Plans were restructured: tests moved into production waves, boundaries were clarified, and PLAN-HARDENING.md was produced. The important point is not the theatrics. The important point is that the workflow changed course before implementation.
Stage 3

Execution still moves fast, but it is gated.

12 commits
27 tests
22 new files
8 gates
Wave 1 — Scaffold
2 new Gradle modules, 6 interface contracts, Media3 1.9.2 deps, token refresh API
Wave 2 — Engine + Tests
PlaybackController (tune state machine, 300ms debounce, retry backoff), PlaybackService (ExoPlayer singleton), StreamTokenManager — plus 24 tests
Wave 3 — UI + Tests
PlayerActivity, ChannelOverlayView (4s auto-hide, TalkBack), EpgSyncCoordinator — plus 3 tests
Wave 4 — Verification
Full test suite pass, full app build, no regressions across 371 Gradle tasks
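
The Wave 2 tune debounce can be sketched in plain Kotlin. All names here are assumptions, and the shipped controller is a fuller coroutine-driven state machine; this only shows the core rule that a new D-pad press restarts the 300ms window:

```kotlin
// Minimal debounce sketch: the latest request wins, and tuning only
// happens once 300ms pass with no newer request.
class TuneDebouncer(private val windowMs: Long = 300) {
    private var pending: Int? = null
    private var lastRequestAt: Long = 0

    // Each D-pad press replaces the pending channel and restarts the window.
    fun request(channel: Int, nowMs: Long) {
        pending = channel
        lastRequestAt = nowMs
    }

    // Called on a timer tick; returns the channel to tune once the window
    // has elapsed with no newer request, else null.
    fun poll(nowMs: Long): Int? {
        val channel = pending ?: return null
        if (nowMs - lastRequestAt < windowMs) return null
        pending = null
        return channel
    }
}
```

Because request() replaces the pending channel, holding channel-up scrolls through the list without tuning every intermediate channel.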
/phase-execution-hardening 8-gate gauntlet
G1 Execute → G2 /tracer wiring audit → G3 decision hardening → G4 external code review → G4b /test-audit → G4c /rubber-ducky blind spots → G4d /code-health SOLID/KISS → G5–G8 gap closure, simplify, verify, reconcile. This is the core pitch: speed plus repeatable review pressure.
The Catch

The most convincing part: the workflow found what green tests missed.

After all code was written and all 27 tests passed, an external adversarial review still found bugs at an abstraction layer unit tests cannot reach. This is the part skeptical engineers usually care about most.

Critical: @Singleton missing on PlaybackController
Without @Singleton, Hilt creates separate instances for PlayerActivity and PlaybackService. The Activity's tune/channelUp calls would never reach the Service's player. Playback would never start. All 27 tests passed because tests use direct instantiation, not Hilt injection.
High: Token refresh on wrong thread
handleTokenExpired() launched on the IO dispatcher but then called playerCommands.setMediaItemAndPrepare(), which must run on the main thread. The result: intermittent crashes on real devices.
Fixed: Both resolved immediately
Fixes were applied, tests were re-run, and the changes were committed. These bugs were not caught by the original tests because the problem was integration and threading behavior, not isolated business logic. That is exactly why extra review layers matter.
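
The scoping bug reduces to provider semantics. A Hilt-free Kotlin sketch (illustrative names only):

```kotlin
class Controller

// Unscoped binding: each injection site gets a fresh instance, so an
// Activity and a Service end up driving two different controllers.
fun provideUnscoped(): Controller = Controller()

// Singleton-scoped binding (what adding @Singleton achieves): one shared
// instance for every injection site in the component.
val provideSingleton: Controller by lazy { Controller() }
```

Tests that instantiate the controller directly can never observe the difference; only Hilt-wired integration paths expose it, which is why all 27 unit tests stayed green.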
Stage 4 & 5

The workflow also audits its own confidence.

/gsd:verify-work Acceptance testing
Extracted 8 testable deliverables. 2 automated checks passed; 6 were blocked pending a physical Android TV device. Zero code issues found.
/implementation-review Process integrity audit
This gate does not review code first. It reviews the review process. 16 findings: proactive token refresh was still a stub while the requirement was claimed complete, mockk(relaxed = true) hid failure paths, and PlaybackService had zero unit tests. None are code bugs. They are integrity gaps between what the workflow said and what it truly proved.
/claude-md-improver Institutional memory
9 conventions captured in CLAUDE.md for future sessions: ExoPlayer not mockable, @Singleton for cross-component injection, MPEG-TS via ProgressiveMediaSource, foreground service permissions, PlaybackController lifecycle.
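
A captured convention might read like this (hypothetical wording; the trace names the topics but not the text):

```markdown
## Playback conventions
- ExoPlayer is not mockable (Media3 static init reads Build.DEVICE); test through PlayerCommands.
- Any component injected into both an Activity and a Service must be @Singleton-scoped.
- MPEG-TS streams play via ProgressiveMediaSource.
- SDK 35: declare FOREGROUND_SERVICE_MEDIA_PLAYBACK or PlaybackService crashes at start.
```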

The building blocks behind the workflow.

Each skill is a specialized capability with a narrow responsibility. That is useful because it replaces vague prompting with named, repeatable operations.

Mission Console // Skill Invocation Trace
/dialogue
Multi-model review of a shared proposal. Useful when you want competing critiques before locking a decision.
Pre-pipeline · spec validation
/gsd:discuss-phase
Surfaces gray areas and locks decisions. Interactive or auto-mode. Produces CONTEXT.md.
S1 · context gathering
/gsd:research-phase
Investigates libraries, APIs, pitfalls. Checks docs, issue trackers, release notes.
S2 · pre-planning
/gsd:plan-phase
Creates execution plans with acceptance criteria. Runs a plan-checker for coverage.
S2 · plan creation
/challenge
Adversarial stress-test. Wraps claims, sends through CLI engines with structured verdicts.
S2 · plan hardening
/consensus
Blinded multi-model evaluation. Three stances, same proposal, none see each other.
S2 · plan hardening
/chat
Collaborative deliberation. Up to 5 rounds to converge on concrete edits.
S2 · convergence
/plan-hardening
Orchestrates challenge → consensus → chat. Produces boundary declarations and plan edits.
S2 · post-planning
/gsd:execute-phase
Wave-based execution with checkpoints. Spawns agents, handles checkpoints, and validates wiring instead of treating implementation as a single opaque blob.
S3 · code execution
/phase-execution-hardening
8-gate quality gauntlet: wiring, code review, test audit, blind spots, health, gaps.
S3 · post-execution
/tracer
Static wiring analysis via LSP. Verifies call chains, interfaces, reachability.
S3 · gate G2
/gsd:verify-work
Conversational UAT. Extracts testable deliverables, records pass/fail/blocked.
S4 · acceptance
/implementation-review
Adversarial process audit. Reviews the reviewers. 7 dimensions of verification integrity.
S5 · process audit
/claude-md-improver
Captures conventions into CLAUDE.md. Prevents knowledge decay across sessions.
S5 · memory capture
/gsd:add-todo
Captures discovered work as structured todos. Prevents scope creep, preserves ideas.
S5 · work triage

What this workflow actually demonstrates.

Code Shipped

  • 2 new Gradle modules
  • ExoPlayer singleton in MediaSessionService
  • Tune state machine, 300ms debounce
  • Channel switching with wrap-around
  • Exponential backoff retry
  • Single-flighted token refresh
  • Network reconnection
  • Channel overlay + TalkBack
  • Player launch contract
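
The retry behavior in that list can be sketched as capped exponential delays; the base and cap values here are assumptions, not the shipped constants:

```kotlin
// Capped exponential backoff: delays double per attempt until a ceiling,
// e.g. 500ms, 1s, 2s, 4s, then held at 8s.
fun backoffDelaysMs(attempts: Int, baseMs: Long = 500, capMs: Long = 8_000): List<Long> =
    (0 until attempts).map { attempt ->
        (baseMs shl attempt).coerceAtMost(capMs)
    }
```

Adding jitter is a common refinement so that many devices retrying a dead stream do not reconnect in lockstep.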

Why It Matters

  • Specs become executable context
  • Plans are challenged before coding
  • Green tests are not treated as proof
  • External review catches integration bugs
  • Audit trails make claims inspectable
  • Lessons learned become team memory
  • Follow-up work is captured instead of lost
  • Human oversight moves to decisions and risk
  • Speed comes with more review pressure

What The Human Still Owns

  • Define requirements and constraints
  • Decide what “done” means
  • Inspect artifacts and findings
  • Approve or reject risky tradeoffs
  • Decide when reality on device contradicts theory

The strongest use of AI coding tools is not “let the model freestyle.” It is giving the model a system that makes planning, checking, challenging, and remembering much harder to skip.