Needing a Hero Is a Warning Sign

Needing a hero in your organization is a warning sign, unless you’re Disney or Marvel, of course. At any other company, it’s a red flag, because it means there are layers of failures, many of them silent and undetected for a long time.

I say this having recently been that hero at Sidecar Health. We had a legacy ETL pipeline that had been silently failing for six months, and no one knew. A vendor drops provider data into our SFTP, and we are supposed to ingest it, validate it, and update internal models. This data is customer-facing and valued by our executives, which makes it even harder to reckon with how long the failure had gone unnoticed. Now I finally know the answer to the age-old question: when a tree falls in the forest and no one is around to hear it, it does not make a noise.

When I dug in, I found untyped Python, zero tests, no logs, no monitoring, no idempotency, and no admin tooling. A prototype that was never properly finished.

My team had already built a platform service to handle general ETLs and event-driven pipelines, with fan-out, full observability, admin APIs, and idempotent retry logic built in. Once I understood the original issue, absorbing the legacy pipeline was straightforward. I migrated it into my team’s domain in under a day, using Claude to plan the approach and move fast. Because I knew the platform inside out, I knew exactly what to ask.

The new pipeline processes the entire file in 30 seconds, providing full visibility into how many rows were processed, which passed validation, and which failed and why. Runs are fully idempotent and can be safely re-triggered at any time. The speed of the fix was not a product of heroics, but of having invested in the right foundation ahead of time.
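To make "full visibility" and "idempotent" concrete, here is a minimal sketch of the idea, not our actual platform code. Every name in it (RunReport, validate, run_pipeline, the provider_id and name columns, the in-memory store) is hypothetical and chosen only for illustration. It assumes the vendor file is CSV, keys each run on a hash of the file contents, and upserts rows by a natural key so that a re-trigger leaves the same state behind.

```python
import csv
import hashlib
import io
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RunReport:
    """Per-run counts plus the reason each failed row was rejected."""
    run_key: str
    processed: int = 0
    passed: int = 0
    failures: list = field(default_factory=list)  # (row number, reason) pairs


def validate(row: dict) -> Optional[str]:
    """Return a failure reason, or None if the row is valid (hypothetical rules)."""
    if not row.get("provider_id"):
        return "missing provider_id"
    if not row.get("name"):
        return "missing name"
    return None


def run_pipeline(raw: str, store: dict, completed: dict) -> RunReport:
    # Key the run on the file contents, so re-triggering the same file is a no-op.
    run_key = hashlib.sha256(raw.encode("utf-8")).hexdigest()
    if run_key in completed:
        return completed[run_key]

    report = RunReport(run_key=run_key)
    for row_no, row in enumerate(csv.DictReader(io.StringIO(raw)), start=1):
        report.processed += 1
        reason = validate(row)
        if reason is not None:
            report.failures.append((row_no, reason))
            continue
        # Upsert by natural key: writing the same row twice leaves the same state.
        store[row["provider_id"]] = row
        report.passed += 1

    completed[run_key] = report
    return report
```

Calling run_pipeline twice with the same file returns the same report and leaves the store unchanged, and the report answers the three questions that matter after any run: how many rows were seen, how many passed, and which ones failed and why.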

The more important conversation, though, is how we got here. This is not about blame, but about responsibilities across the organization so that we don’t need heroes in the first place.

The tech lead should have a clear picture of every system under their purview, knowing not just what was built, but the quality of how it was built. Gaps in testing, observability, and operational tooling should be caught at design or launch, and if deadlines force shortcuts, those gaps should be tracked and scheduled, not forgotten.

Product should treat feature health as an ongoing responsibility, not a launch-day checkbox. Exec-valued features need a lightweight but consistent review cadence. If there is no process to verify something is working after it ships, you are trusting silence to mean success.
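One lightweight way to stop trusting silence is a freshness check: record when the pipeline last succeeded and alert when that timestamp gets too old. The sketch below is an assumption about how such a check could look, not something we run as written; is_stale and the one-day threshold are illustrative names and values.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale(last_success: Optional[datetime], now: datetime, max_age: timedelta) -> bool:
    """True when there has been no successful run within max_age."""
    if last_success is None:
        # Never having succeeded counts as stale, not as "nothing to report".
        return True
    return now - last_success > max_age


# Example: a daily feed whose last success was three days ago should alert.
now = datetime(2024, 1, 10, tzinfo=timezone.utc)
last = datetime(2024, 1, 7, tzinfo=timezone.utc)
should_alert = is_stale(last, now, max_age=timedelta(days=1))
```

The important design choice is that a missing timestamp is treated as a failure. A check that stays quiet when it has no data just moves the silence one layer up.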

Engineering managers should know their full architecture, including the edges: what systems exist, who owns them, and critically, what is unowned or under-resourced. Unowned systems are where silent failures live. If a critical process does not have a clear home and an accountable team, that is a risk that needs to be surfaced and resolved before it becomes an incident.

Leadership should be building an organization where there is no such thing as an unreviewed corner. That means creating space in the roadmap for KTLO (keep-the-lights-on) work, incentivizing engineers to raise concerns rather than work around them, and treating architecture audits as a normal part of how the org operates, not a reaction to something going wrong.

Engineers should not need to be told to add logs, tests, and operational tooling. Those are not nice-to-haves; they are the baseline. If that is not the shared standard on your team, it needs to become one.

Silent failures do not announce themselves; they are stumbled upon, and heroes get called in to quickly resolve them.

The best organizations are not the ones with the best heroes; they are the ones that never need them. The only defense against silent failures is a culture where engineers raise concerns early, product makes space for them, and doing things correctly is the default, not the exception.

That culture doesn’t build itself, but that’s a conversation for another post.