Month: November 2025

What Caused Cloudflare’s Big Crash? It’s Not Rust

The Promise

Cloudflare’s outage did not just take down a fifth of the Internet. It exposed a truth we often avoid in engineering: complex systems rarely fail because of bad code. They fail because of the invisible assumptions we build into them.

This piece cuts past the memes, the Rust blame game and the instant hot takes to explain what actually broke, why the outrage misfired and what this incident really tells us about the fragility of Internet-scale systems.

If you are building distributed, AI-driven or mission-critical platforms, the key takeaways here will reset how you think about reliability and help you avoid walking away with exactly the wrong lesson from one of the year’s most revealing outages.

1. Setting the Stage: When a Fifth of the Internet Slowed to a Crawl

On 18 November, Cloudflare experienced one of its most significant incidents in recent years. Large parts of the world observed outages or degraded performance across services that underpin global traffic.
As always, the Internet reacted the way it knows best: outrage, memes and instant diagnoses delivered with absolute confidence.

Within minutes, social timelines flooded with:

  • “It must be DNS”
  • “Rust is unsafe after all”
  • “This is what happens when you rewrite everything”
  • “Even Downdetector is down because Cloudflare is down”
  • Screenshots of broken CSS on Cloudflare’s own status page
  • Accusations of over-engineering, under-engineering and everything in between

The world wanted a villain. Rust happened to be available. But the actual story is more nuanced and far more interesting. (For the record, I am still not convinced we should rewrite the Linux kernel in Rust!)

2. What Actually Happened: A Clear Summary of Cloudflare’s Report

Cloudflare’s own post-incident write-up is unusually thorough. If you have not read it, you should. In brief:

  • Cloudflare is in the middle of a major multi-year upgrade of its edge infrastructure, referred to internally as the 20 percent Internet upgrade.
  • The rollout included a new feature configuration file.
  • This file contained more than two hundred features for their FL2 component, crossing a size limit that had been assumed but never enforced through guardrails.
  • The oversized file triggered a panic in the Rust-based logic that validated these configurations.
  • That panic initiated a restart loop across a large portion of their global fleet.
  • Because the very nodes that needed to perform a rollback were themselves in a degraded state, Cloudflare could not recover the control plane easily.
  • This created a cascading, self-reinforcing failure.
  • Only isolated regions with lagged deployments remained unaffected.

The root cause was a logic-path issue interacting with operational constraints. It had nothing to do with memory safety and nothing to do with Rust’s guarantees.
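To make the failure mode concrete, here is a minimal sketch in Rust (used only because the component in question is written in Rust). Every name and number below is illustrative, not Cloudflare’s actual code: the first loader treats an oversized feature file as impossible, while the second turns the same condition into an error the caller can recover from.

```rust
// Illustrative sketch of the general pattern; none of this is Cloudflare's code.

const ASSUMED_MAX_FEATURES: usize = 200; // limit assumed upstream, never enforced as a guardrail

struct FeatureConfig {
    features: Vec<String>,
}

// Fragile path: exceeding the limit was treated as impossible, so the only
// "handling" is a panic that takes the whole process down. On a fleet that
// automatically restarts crashed processes, one bad file becomes a restart loop.
// (Not called below; calling it with the oversized file would abort the program.)
fn load_config_fragile(lines: Vec<String>) -> FeatureConfig {
    if lines.len() > ASSUMED_MAX_FEATURES {
        panic!("feature file larger than expected: {} entries", lines.len());
    }
    FeatureConfig { features: lines }
}

// Guarded path: the same condition becomes a recoverable error, so the caller
// can reject the bad file and keep serving the last known-good configuration.
fn load_config_guarded(lines: Vec<String>) -> Result<FeatureConfig, String> {
    if lines.len() > ASSUMED_MAX_FEATURES {
        return Err(format!(
            "rejecting feature file with {} entries (limit {})",
            lines.len(),
            ASSUMED_MAX_FEATURES
        ));
    }
    Ok(FeatureConfig { features: lines })
}

fn main() {
    let oversized: Vec<String> = (0..400).map(|i| format!("feature_{i}")).collect();
    match load_config_guarded(oversized) {
        Ok(cfg) => println!("loaded {} features", cfg.features.len()),
        Err(e) => eprintln!("{e}; keeping last known-good config"),
    }
}
```

The difference is not the language but where the failure is allowed to stop: at the input boundary, or deep inside the data plane.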

In other words: the failure was architectural, not linguistic.

3.2 The “unwrap() Is Evil” Argument (I remember writing a blog post titled “Eval() Is Not Evil” around 2012)

One of the most widely circulated tweets framed the presence of an unwrap() as a ticking time bomb, casting it as proof that Rust developers “trust themselves too much”. This is a caricature of the real issue.

The error did not arise because of an unwrap(), nor because Rust encourages poor error handling. It arose because:

  • an unexpected input crossed a limit,
  • guards were missing,
  • and the resulting failure propagated in a tightly coupled system.

The same failure would have occurred in Go, Java, C++, Zig, or Python.
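As a hedged illustration of that point (hypothetical code, not the line Cloudflare published), the presence of unwrap() is not the interesting part. What matters is whether an unexpected input is treated as fatal or as something the caller can recover from:

```rust
// Hypothetical illustration; not the code Cloudflare published.

fn parse_feature_count(raw: &str) -> Result<usize, std::num::ParseIntError> {
    raw.trim().parse::<usize>()
}

fn main() {
    let raw = "not-a-number";

    // Style 1: unwrap() turns an unexpected input into a panic. This is the
    // moral equivalent of an uncaught exception in Java or Python, or an
    // ignored error value in Go: the language is not the deciding factor.
    // let count = parse_feature_count(raw).unwrap(); // would abort here

    // Style 2: handle the error and degrade gracefully.
    match parse_feature_count(raw) {
        Ok(count) => println!("loaded {count} features"),
        Err(e) => eprintln!("rejecting config update ({e}); keeping last known-good state"),
    }
}
```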

3.3 Transparency Misinterpreted as Guilt

Cloudflare did something rare in our industry.
They published the exact code that failed. This was interpreted by some as:

“Here is the guilty line. Rust did it.”

In reality, Cloudflare’s openness is an example of mature engineering culture. More on that later.

4. The Internet Rage Cycle: Humour, Oversimplification and Absolute Certainty

The memes and tweets around this outage are not just entertainment. They reveal how the broader industry processes complex failure.

4.1 The ‘Everything Balances on Open Source’ Meme

Images circulated showing stacks of infrastructure teetering on boxes labelled DNS, Linux Foundation and unpaid open source developers, with Big Tech perched precariously on top.

This exaggeration contains a real truth. We live in a dependency monoculture. A few layers of open source and a handful of service providers hold up everything else.

The meme became shorthand for Internet fragility.

4.2 The ‘It Was DNS’ Routine

The classic:
“It is not DNS. It cannot be DNS. It was DNS.”

Except this time, it was not DNS.

Yet the joke resurfaces because DNS has become the folk villain for any outage. People default to the easiest mental shortcut.

4.3 The Rust Panic Narrative

Tweets claiming:

“Cloudflare rewrote in Rust, and half the Internet went down 53 days later.”

This inference is wrong, but emotionally satisfying.
People conflate correlation with causation because it creates a simple story: rewrites are dangerous.

4.4 The Irony of Downdetector Being Down

The screenshot of Downdetector depending on Cloudflare and therefore failing is both funny and revealing. This outage demonstrated how deeply intertwined modern platforms are. It is an ecosystem issue, not a Cloudflare issue.

4.5 But There Were Also Good Takes

Kelly Sommers’ observation that Cloudflare published source code is a reminder that not everyone jumped to outrage.

There were pockets of maturity. Unfortunately, they were quieter than the noise.

5. The Real Lessons for Engineering Leaders

This is the part worth reading slowly if you build distributed systems.

Lesson 1: Reliability Is an Architecture Choice, Not a Language Choice

You can build fragile systems in safe languages and robust systems in unsafe languages. Language is orthogonal to architectural resilience.

Lesson 2: Guardrails Matter More Than Guarantees

Rust gives memory safety.
It does not give correctness safety.
It does not give assumption safety.
It does not give rollout safety.

You cannot outsource judgment.
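A trivial sketch of the distinction, entirely made up for illustration: the function below is memory-safe, compiles cleanly, and satisfies every guarantee Rust offers, yet it encodes a wrong assumption that no compiler can catch.

```rust
// Memory-safe and still wrong: the caller believes the value is in seconds,
// while this function interprets it as milliseconds. Purely illustrative.

use std::time::Duration;

fn connection_timeout(timeout_value: u64) -> Duration {
    // Wrong assumption about units; nothing in the type system flags it.
    Duration::from_millis(timeout_value)
}

fn main() {
    // The caller thinks it asked for 30 seconds; it actually got 30 ms.
    let timeout = connection_timeout(30);
    println!("timeout = {timeout:?}");
}
```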

Lesson 3: Blast Radius Containment Is Everything

  • Uniform rollouts are dangerous.
  • Synchronous edge updates are dangerous.
  • Large global fleets need layered fault domains.

Cloudflare knows this. This incident will accelerate their work here.
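For illustration, here is a minimal sketch of one way to bound blast radius, assuming a simple hash-based cohort scheme. The node names and percentages are invented, and a real rollout would layer this with health checks and regional fault domains.

```rust
// Illustrative staged-rollout gate; cohort scheme and numbers are made up.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Deterministically place a node into a 0..100 bucket based on its ID.
fn rollout_bucket(node_id: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    node_id.hash(&mut hasher);
    hasher.finish() % 100
}

/// Enable the new config only for nodes inside the current rollout percentage,
/// so a bad release hits a bounded slice of the fleet instead of all of it.
fn should_receive_new_config(node_id: &str, rollout_percent: u64) -> bool {
    rollout_bucket(node_id) < rollout_percent
}

fn main() {
    // Stage the rollout: 1%, then 5%, 25%, 100% as health checks pass.
    for node in ["edge-ams-01", "edge-sin-07", "edge-iad-42"] {
        println!("{node}: new config = {}", should_receive_new_config(node, 5));
    }
}
```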

Lesson 4: Control Planes Must Be Resilient Under Their Worst Conditions

The control plane was unreachable when it was needed most. This is a classic distributed systems trap: the emergency mechanism relies on the unhealthy components.

Always test:

  • rollback unavailability
  • degraded network conditions
  • inconsistent state recovery
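As a sketch of the idea, assuming a hypothetical fetch_latest_config call standing in for a real control-plane API: the data path should keep serving from a last known-good snapshot whenever the control plane cannot answer, rather than depending on it in order to recover.

```rust
// Hypothetical sketch: fall back to a cached snapshot when the control plane
// is unreachable. All names are illustrative.

struct Config {
    version: u64,
}

enum ControlPlaneError {
    Unreachable,
}

// Stand-in for a real RPC; here it simply simulates an outage.
fn fetch_latest_config() -> Result<Config, ControlPlaneError> {
    Err(ControlPlaneError::Unreachable)
}

fn refresh_config(last_known_good: &mut Config) {
    match fetch_latest_config() {
        // Promote the new config only after it is fetched (and, in a real
        // system, validated locally).
        Ok(new_config) => *last_known_good = new_config,
        // Degraded mode: keep the cached snapshot instead of failing hard.
        Err(ControlPlaneError::Unreachable) => eprintln!(
            "control plane unreachable; continuing with config v{}",
            last_known_good.version
        ),
    }
}

fn main() {
    let mut config = Config { version: 41 };
    refresh_config(&mut config);
    println!("serving with config v{}", config.version);
}
```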

Lesson 5: Complexity Fails in Complex Ways

The system behaved exactly as designed. That is the problem.
Emergent behaviour in large networks cannot be reasoned about purely through local correctness.

This is where most teams misjudge their risk.

6. Additional Lesson: Accountability and Transparency Are Strategic Advantages

This incident highlighted something deeper about Cloudflare’s culture.

They did not hide behind ambiguity.
They did not release a PR-approved statement with vague phrasing.

They published:

  • the timeline
  • the diagnosis
  • the exact code
  • the root cause
  • the systemic contributors
  • the ongoing mitigation plan

This level of transparency is uncomfortable. It puts the organisation under a microscope.
Yet it builds trust in a way no marketing claim can.

Transparency after failure is not just ethical. It is good engineering. Very few people highlighted this; my man Gergely Orosz was one of them.

Most companies will never reach this level of accountability.
Cloudflare raised the bar.

7. What This Outage Tells Us About the State of the Internet

This was not a Cloudflare problem. It was a reminder of our shared dependency.

  • Too much global traffic flows through too few choke points.
  • Too many systems assume perfect availability from upstream.
  • Too many platforms synchronise their rollouts.
  • Too many companies run on infrastructure they did not build and cannot control.

The memes were not wrong.
They were simply incomplete.

8. Final Thoughts: Rust Did Not Fail. Our Assumptions Did.

Outages like this shape the future of engineering. The worst thing the industry can do is learn the wrong lesson.

This was not:

  • a Rust failure
  • a rewrite failure
  • an open source failure
  • a Cloudflare hubris story

This was a systems-thinking failure.
A reminder that assumptions are the most fragile part of any distributed system.
A demonstration of how tightly coupled global infrastructure has become.
A case study in why architecture always wins over language debates.

Cloudflare’s transparency deserves respect.
Their engineering culture deserves attention.
And the outrage cycle deserves better scepticism.

Because the Internet did not go down because of Rust.
It went down because the modern Internet is held together by coordination, trust, and layered assumptions that occasionally collide in surprising ways.

If we want a more resilient future, we need less blame and more understanding.
Less certainty and more curiosity.
Less language tribalism and more systems design thinking.

The Internet will fail again.
The question is whether we learn or react.

Cloudflare learned. The rest of us should too!

What Is the Truth About the Viral MIT Study Claiming 95% of AI Deployments Are a Failure – A Critical Analysis

Introduction: The Statistic That Shook the AI World

When headlines screamed that 95% of AI projects fail, the Internet erupted. Boards panicked, investors questioned their bets, and LinkedIn filled with hot takes. The claim, sourced from an MIT NANDA report, became a viral talking point across Fortune, Axios, Forbes, and Tom’s Hardware (some of my regular reads), as well as more than ten Substack newsletters that landed in my inbox dissecting the story. But how much truth lies behind this statistic? Is the AI revolution really built on a 95% failure rate, or is the reality far more nuanced?

This article takes a grounded look at what MIT actually studied, where the media diverged from the facts, and what business leaders can learn to transform AI pilots into measurable success stories.

The Viral 95% Claim: Context and Origins

What the MIT Study Really Said

The report, released in mid‑2025, examined over 300 enterprise Generative AI pilot projects and found that only 5% achieved measurable profit and loss impact. Its focus was narrow, centred on enterprise Generative AI (GenAI) deployments that automated content creation, analytics, or decision‑making processes.

Narrow Scope, Broad Misinterpretation

The viral figure sounds alarming, yet it represents a limited dataset. The study assessed short‑term pilots and defined success purely in financial terms within twelve months. Many publications mistakenly generalised this to mean all AI initiatives fail, converting a specific cautionary finding into a sweeping headline.

Methodology and Media Amplification

What the Report Actually Measured

The MIT researchers used surveys, interviews, and case studies across industries. Their central finding was simple: technical success rarely equals business success. Many AI pilots met functional requirements but failed to integrate into core operations. Common causes included:

  • Poor integration with existing systems
  • Lack of process redesign and staff engagement
  • Weak governance and measurement frameworks

The few successes focused on back‑office automation, supply chain optimisation, and compliance efficiency rather than high‑visibility customer applications.

How the Media Oversimplified It

Writers such as Forbes contributor Andrea Hill and the Marketing AI Institute noted that the report defined “failure” narrowly, linking it to short‑term financial metrics. Yet outlets like Axios and TechRadar amplified the “95% fail” headline without context, feeding a viral narrative that misrepresented the nuance of the original findings.

Case Study 1: Retail Personalisation Gone Wrong

One of the case studies cited in Fortune and Tom’s Hardware involved a retail conglomerate that launched a GenAI‑driven personalisation engine across its e‑commerce sites. The goal was to revolutionise product recommendations using behavioural data and generative content. After six months, however, the project was halted due to three key issues:

  1. Data Fragmentation: Customer information was inconsistent across regions and product categories.
  2. Governance Oversight: The model generated content that breached brand guidelines and attracted regulatory scrutiny under UK GDPR.
  3. Cultural Resistance: Marketing teams were sceptical of AI‑generated messaging and lacked confidence in its transparency.

MIT categorised the case as a “technical success but organisational failure”. The system worked, yet the surrounding structure did not evolve to support it. This demonstrates a classic case of AI readiness mismatch: advanced technology operating within an unprepared organisation.

Case Study 2: Financial Services Success through Governance

Conversely, the report highlighted a financial services firm that used GenAI to summarise compliance reports and automate elements of regulatory submissions. Initially regarded as a modest internal trial, it delivered measurable impact:

  • 45% reduction in report generation time
  • 30% fewer manual review errors
  • Faster auditor sign‑off and improved compliance accuracy

Unlike the retail example, this organisation approached AI as a governed augmentation tool, not a replacement for human judgement. They embedded explainability and traceability from the outset, incorporating human checkpoints at every stage. The project became one of the few examples of steady, measurable ROI—demonstrating that AI governance and cultural alignment are decisive success factors.

What the Press Got Right (and What It Missed)

Where the Media Was Accurate

Despite the sensational tone, several publications identified genuine lessons:

  • Integration and readiness are greater obstacles than algorithms.
  • Governance and process change remain undervalued.
  • KPIs for AI success are often poorly defined or absent.

What They Overlooked

Most commentary ignored the longer adoption cycle required for enterprise transformation:

  • Time Horizons: Real returns often appear after eighteen to twenty‑four months.
  • Hidden Gains: Productivity, compliance, and efficiency improvements often remain off the books.
  • Sector Differences: Regulated industries must balance caution with innovation.

The Real Takeaway: Why Most AI Pilots Struggle

It Is Not the AI, It Is the Organisation

The high failure rate underscores organisational weakness rather than technical flaws. Common pitfalls include:

  • Hype‑driven use cases disconnected from business outcomes
  • Weak change management and poor adoption
  • Fragmented data pipelines and ownership
  • Undefined accountability for AI outputs

The Governance Advantage

Success correlates directly with AI governance maturity. Frameworks such as ISO 42001, NIST AI RMF, and the EU AI Act provide the structure and accountability that bridge the gap between experimentation and operational success. Firms adopting these frameworks experience faster scaling and more reliable ROI.

Turning Pilots into Profit: The Enterprise Playbook

  1. Start with Measurable Impact: Target compliance automation or internal processes where results are visible.
  2. Design for Integration Early: Align AI outputs with established data and workflow systems.
  3. Balance Build and Buy: Work with trusted partners for scalability while retaining data control.
  4. Define Success Before You Deploy: Create clear metrics for success, review cycles, and responsible ownership.
  5. Govern from Day One: Build explainability, ethics, and traceability into every layer of the system.

What This Means for Boards and Investors

The greatest risk for boards is not AI failure but premature scaling. Decision‑makers should:

  • Insist on readiness assessments before funding.
  • Link AI investments to ROI‑based performance indicators.
  • Recognise that AI maturity reflects governance maturity.

By treating governance as the foundation for value creation, organisations can avoid the pitfalls that led to the 95% failure headline.

Conclusion: Separating Signal from Noise

The viral MIT “95% failure” claim is only partly accurate. It exposes an uncomfortable truth: AI does not fail; organisations do. The underlying issue is not faulty algorithms but weak governance, unclear measurement, and over‑inflated expectations.

True AI success emerges when technology, people, and governance work together under measurable outcomes. Those who build responsibly and focus on integration, ethics, and transparency will ultimately rewrite the 95% narrative.


References and Further Reading

  1. Axios. (2025, August 21). AI on Wall Street: Big Tech’s Reality Check.
  2. Tom’s Hardware. (2025, August 21). 95 Percent of Generative AI Implementations in Enterprise Have No Measurable Impact on P&L.
  3. Fortune. (2025, August 21). Why 95% of AI Pilots Fail and What This Means for Business Leaders.
  4. Marketing AI Institute. (2025, August 22). Why the MIT Study on AI Pilots Should Be Read with Caution.
  5. Forbes – Andrea Hill. (2025, August 21). Why 95% of AI Pilots Fail and What Business Leaders Should Do Instead.
  6. TechRadar. (2025, August 21). American Companies Have Invested Billions in AI Initiatives but Have Basically Nothing to Show for It.
  7. Investors.com. (2025, August 22). Why the MIT Study on Enterprise AI Is Pressuring AI Stocks.