EXAMPLE

CI Pipeline Triage

PROMPT

> Our Playwright + integration test pipeline went from 8 minutes to 35 minutes, and flaky failures are now around 7% of runs. Local runs usually pass. How would you triage root causes and redesign CI for both speed and reliability?

Opus Default produced a genuinely strong answer — diagnostic structure, clear categories of failure, a tiered plan for managing flaky tests. It treats the problem seriously and gives an engineer real direction.

Opus Deepthink found that the problem has a self-worsening quality Opus Default didn’t address. Slow pipelines lead developers to raise timeouts. Flaky tests lead teams to add retries. New features add tests but old ones are rarely removed. Each of these makes the next cycle worse. Single fixes don’t stick because the underlying loops keep pulling the pipeline back toward failure.

Opus Deepthink also challenged its own recommendations before presenting them. It found that standard fixes like sharding, caching, and quarantine each create their own problems when applied without understanding the broader structure. Retries hide real bugs. Quarantine lists grow without being resolved. Sharding multiplies startup costs when startup is the actual bottleneck. Opus Default explains how to fix the pipeline. Opus Deepthink asks whether you should be running the whole pipeline on every change at all.

“Fix the test pyramid first” · Pre-mortem after inversion · Too slow as primary intervention — Pareto distribution means easy conversions don’t help

“Standard playbook will work” · Pre-mortem after stress-testing · Found 5 specific failure modes where shard/cache/retry makes things worse

“Two-pipeline solves everything” · Pre-mortem after creative reframing · Caught critical risks: fast-gate inflation, ownership gaps, need for automated rollback

“We should start fixing things” · Hard constraint · Blocked premature solutions — forced diagnostic triage before prescriptions

“Local passes = CI is the problem” · Initial framing check · Reframed: maybe local is too forgiving, CI is closer to production truth

DEFAULT · Opus 4.6

This is a systemic CI reliability problem with two intertwined symptoms: a 4x slowdown and a 7% flake rate. The fact that local runs usually pass is the most important clue: it points to environment differences, not test logic errors. Before fixing anything, gather three inputs: CI timing history over the last 30–60 days (to see whether the slowdown was a step function or gradual creep), a structured flake log tagging every failed run with the specific test and failure mode, and resource metrics from the CI runners.

Sort every flake into one of these buckets: timing and race conditions, port and resource collision, state leakage, external dependency failures, and OOM/resource exhaustion. The distribution across these buckets determines your remediation priorities.
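A lightweight way to start that bucketing is a keyword classifier over CI failure messages. The sketch below is illustrative, not a complete taxonomy: the bucket names mirror the categories above, but the regex patterns are assumptions you would tune against your own flake log.

```typescript
// Sketch of a flake classifier; bucket names and regexes are illustrative.
type FlakeBucket =
  | "timing" // timeouts, race conditions
  | "collision" // port or resource conflicts
  | "state-leakage" // leftover data from a previous test
  | "external" // third-party API or network failures
  | "resource"; // OOM / exhausted runners

// Ordered rules: the first pattern that matches a failure message wins.
const rules: Array<[RegExp, FlakeBucket]> = [
  [/timed? ?out|race condition/i, "timing"],
  [/EADDRINUSE|address already in use/i, "collision"],
  [/duplicate key|unique constraint|already exists/i, "state-leakage"],
  [/ECONNREFUSED|ECONNRESET|50[23] /i, "external"],
  [/heap out of memory|ENOMEM/i, "resource"],
];

function classifyFlake(message: string): FlakeBucket | "unknown" {
  for (const [pattern, bucket] of rules) {
    if (pattern.test(message)) return bucket;
  }
  return "unknown";
}
```

Run over a few weeks of failure logs, even a crude classifier like this produces the bucket distribution that sets remediation priorities; messages that land in "unknown" are the ones worth reading by hand.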

The 8-to-35-minute jump has a finite number of causes: test count grew without parallelism scaling; serial execution where parallel is possible; CI runner specs degraded; expensive setup running per-test instead of per-suite; and no caching of dependencies or build artifacts — every run does a full npm install + build from scratch.

State leakage is typically the largest source of flakes in integration tests. Each test can get a transactional wrapper that rolls back, giving a clean slate for the next test. For API and service dependencies, replace live calls with contract-level mocking by intercepting external API calls at the network level.
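The transactional wrapper can be sketched as follows. The `Queryable` interface and function names here are hypothetical, standing in for whatever database client the suite actually uses:

```typescript
// Sketch: run a test body inside a transaction that always rolls back,
// so each test starts from a clean database state.
// `Queryable` is a hypothetical minimal interface, not a real library type.
interface Queryable {
  query(sql: string): Promise<void>;
}

async function withRollback(
  db: Queryable,
  testBody: (tx: Queryable) => Promise<void>,
): Promise<void> {
  await db.query("BEGIN");
  try {
    await testBody(db); // all reads/writes happen inside the transaction
  } finally {
    await db.query("ROLLBACK"); // undo every write, whether the test passed or failed
  }
}
```

Because the rollback sits in a `finally` block, even a throwing test leaves no residue for the next one. The main caveat: code under test that opens its own connection or commits explicitly escapes the wrapper, which is exactly the kind of thing an isolation audit should flag.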

Implement a three-tier system. Tier 1: auto-retry — a test that passes on retry is flagged as flaky but does not block merge. Tier 2: a flake dashboard tracking flake frequency per test over time. Tier 3: quarantine — tests that flake above a threshold get moved to a non-blocking suite until root cause is resolved.
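Tier 1 can be sketched as a retry wrapper that distinguishes three outcomes instead of silently absorbing retries. The names below are illustrative:

```typescript
type RunOutcome = "passed" | "flaky" | "failed";

// Sketch of Tier 1: one retry, with passes-on-retry surfaced as "flaky"
// rather than folded into "passed".
async function runWithRetry(
  test: () => Promise<boolean>,
  maxRetries = 1,
): Promise<RunOutcome> {
  if (await test()) return "passed";
  for (let i = 0; i < maxRetries; i++) {
    // Passing on retry should not block merge, but it must be recorded.
    if (await test()) return "flaky";
  }
  return "failed";
}
```

The point of the three-valued result is that every "flaky" outcome can feed the Tier 2 dashboard, and a test whose flake frequency crosses the threshold graduates to Tier 3 quarantine instead of being retried indefinitely.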

The key principle: treat flakiness as a bug with a tracking process, not as background noise to be tolerated. Every flake that gets retried should generate a tracking entry, and tests that flake repeatedly should be quarantined and fixed, not just retried indefinitely.

Done · 66 sec

RVRY DEEPTHINK · Opus 4.6
CONSTRAINTS · 6 total · 3 caught, 3 resolved

A hard constraint forced diagnostic triage before prescriptions. Cannot recommend fixes without knowing whether the bottleneck is setup/caching, test execution, or retry overhead.

Timeout inflation, retry tax, and test accumulation compound. Fixing one without the others allows the remaining loops to re-establish the degradation.

Each standard fix creates its own failure mode when applied without structural awareness. Retries hide real bugs. Quarantine without SLA becomes a graveyard. Sharding without isolation audit breaks state-dependent tests.

Pareto distribution means 20% of tests cause 80% of runtime. Converting the easy-to-convert tests doesn’t move the needle. Must be paired with immediate tactical wins.

Fast-gate inflation as teams lobby to add their tests, ownership gaps for async post-merge failures, and need for automated rollback when full gate fails on main.

CI is closer to production conditions. Some ‘flaky’ tests may be catching real race conditions that fast local hardware never surfaces. Verify before suppressing.

The system has three feedback loops that make the problem self-worsening: timeout inflation (CI slow → tests timeout → devs raise timeouts → pipeline slower), retry tax (flaky tests → add retries → retries mask root cause → more flakes accumulate → retry overhead grows), and test accumulation (features ship → E2E tests added → none deleted → suite grows → devs stop running locally → more CI-only failures). Fixing any one cause in isolation may not help because the loops reinforce each other.

Pre-mortem on the standard playbook of shard, cache, retry, quarantine revealed 5 ways it fails, among them: sharding without an isolation audit breaks state-dependent tests; retries without a flake classification pipeline actively hide real bugs — a genuine race condition in an auth flow can be retried away for months before causing a production incident; and quarantine without an SLA becomes a graveyard of untested critical paths. A second pre-mortem found that fixing the test pyramid first is too slow as a primary intervention: the Pareto distribution means 20% of tests cause 80% of runtime.

The key creative reframe: the question isn’t “how do we make 35 minutes faster?” It’s “why are we running 35 minutes of tests on every PR regardless of what changed?” A two-pipeline architecture with a fast blocking gate under 5 minutes plus an async full gate post-merge, paired with risk-proportional test selection, can make most PRs fast while maintaining full coverage.
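Risk-proportional selection can be as simple as a score over a few PR features mapped to a test depth. Everything in this sketch, the features, weights, and thresholds, is a placeholder assumption, not a tuned policy:

```typescript
// Sketch: score a PR's risk and choose a test depth proportionally.
// Weights and thresholds are placeholders, not tuned values.
interface PrStats {
  linesChanged: number;
  filesChanged: number;
  touchesAuthOrPayments: boolean; // critical-path flag
}

type TestDepth = "fast-gate" | "fast-gate+targeted" | "full-suite";

function testDepthFor(pr: PrStats): TestDepth {
  let risk = 0;
  risk += Math.min(pr.linesChanged / 100, 3); // size of change, capped
  risk += pr.filesChanged > 10 ? 1 : 0; // breadth of change
  risk += pr.touchesAuthOrPayments ? 3 : 0; // critical paths always escalate
  if (risk >= 4) return "full-suite";
  if (risk >= 2) return "fast-gate+targeted";
  return "fast-gate";
}
```

Under a scheme like this, a small doc-adjacent PR runs only the sub-5-minute blocking gate, while anything touching auth or payments still earns the full suite before merge, with the async post-merge gate covering everything regardless.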

The uncomfortable idea: “local passes” might mean local is too forgiving, not that CI is too strict. CI is closer to production conditions — constrained resources, network latency, cold starts. Some flaky tests might be catching real race conditions that deterministic local runs with faster hardware never surface. The framing of CI as the problem to be solved may itself be wrong.

Test Impact Analysis — only running tests affected by the changed code using coverage maps — could cut 35 minutes to 5–10 minutes for most PRs. Google, Meta, and Microsoft all use variants of this. The CI pipeline should function as a triage system, not a gauntlet — scoring each PR by risk and selecting test depth proportionally.
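The selection step of Test Impact Analysis reduces to a reverse lookup over a coverage map. This sketch assumes the map (test name to the set of source files it exercises) has already been produced by instrumented runs; building that map is the hard part and is not shown:

```typescript
// Sketch: select the tests whose recorded coverage overlaps the changed files.
// The coverage map is assumed given, e.g. from periodic instrumented runs.
function selectTests(
  coverage: Map<string, Set<string>>, // test name -> files it exercises
  changedFiles: string[],
): string[] {
  const changed = new Set(changedFiles);
  const selected: string[] = [];
  for (const [test, files] of coverage) {
    for (const file of files) {
      if (changed.has(file)) {
        selected.push(test); // any overlap pulls the test in
        break;
      }
    }
  }
  return selected;
}
```

In practice a safety net is needed for files with no coverage data (new files, config, build scripts), which should fall back to the full suite rather than selecting nothing.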

The thinking evolved from ‘optimize the pipeline’ to ‘question the pipeline’s scope.’ Initial mapping revealed three reinforcing feedback loops that mean single-cause fixes won’t stick. The inversion was the real turning point — pre-mortems on the standard playbook showed that each obvious fix creates its own failure mode when applied without structural awareness. Whether the two-pipeline reframe is genuine insight or dressed-up common sense depends on whether the team has already considered it — if they haven’t, this is the highest-leverage idea here.

9 constraints resolved · 4m 48s


Unlock your AI’s deep reasoning potential

Simple installation for Claude Code, Claude Desktop, Codex, Gemini, Anti-Gravity, Cursor, and Windsurf.
Works with any client that supports MCPs.