Category: Cloud

Why One AWS Spot Still Crashes Sites In 2025?

It started innocently enough. Morning coffee, post-workout calm, a quick “Computer, drop in on my son.”

Instead of his sleepy grin, I got the polite but dreaded:

“There is an error. Please try again later.”

- Alexa (I call it “Computer” as a wannabe captain of the NCC-1701-E)

Moments later, I realised it wasn’t my internet or device. It was AWS again.

A Familiar Failure in a Familiar Region

If the cloud has a heartbeat, it beats somewhere beneath Northern Virginia.

That is the home of US-EAST-1, Amazon Web Services’ oldest and busiest region, and the digital crossroad through which a large share of the internet’s authentication, routing, and replication flows. It is also the same region that keeps reminding the world that redundancy and resilience are not the same thing.

In December 2022, a cascading power failure at US-EAST-1 set off a chain of interruptions that took down significant parts of the internet, including internal AWS management consoles. Engineers left that incident speaking of stronger isolation and better regional independence.

Three years later, the lesson has returned. The cause may differ, but the pattern feels the same.

The Current Outage

As of this afternoon, AWS continues to battle a widespread disruption in US-EAST-1. The issue began early on 20 October 2025, with elevated error rates across DynamoDB, Route 53, and related control-plane components.

The impact has spread globally.

  • Snapchat, Ring, and Duolingo have reported downtime.
  • Lloyds Bank and several UK financial platforms are seeing degraded service.
  • Even Alexa devices have stopped responding, producing the same polite message: “There is an error. Please try again later.”

For anyone who remembers 2022, it feels uncomfortably familiar. The more digital life concentrates in a handful of hyperscale regions, the more we all share the consequences when one of them fails.

The Pattern Beneath the Problem

Both the 2022 and 2025 US-EAST-1 events reveal the same architectural weakness: control-plane coupling.

Workloads may be distributed across regions, yet many still rely on US-EAST-1 for:

  • IAM token validation
  • DynamoDB global tables metadata
  • Route 53 DNS propagation
  • S3 replication management

When that single region falters, systems elsewhere cannot authenticate, replicate, or even resolve DNS. The problem is not the hardware; it is that so many systems rely on a single control layer.
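This coupling can be made visible with a quick audit. The sketch below is a hypothetical helper, not an AWS API: the config shape and service names are illustrative assumptions, but it shows how a handful of regionally distributed services can still share one control-plane region.

```python
# Hypothetical audit sketch: count how many services depend on each
# control-plane region, even when their data planes are spread out.
from collections import Counter

def control_plane_exposure(services):
    """Tally control-plane regions across all service dependencies."""
    exposure = Counter()
    for svc in services:
        for dep in svc["control_plane_deps"]:
            exposure[dep["region"]] += 1
    return exposure

services = [
    {"name": "auth",    "control_plane_deps": [{"api": "iam",             "region": "us-east-1"}]},
    {"name": "catalog", "control_plane_deps": [{"api": "dynamodb-global", "region": "us-east-1"}]},
    {"name": "dns",     "control_plane_deps": [{"api": "route53",         "region": "us-east-1"}]},
    {"name": "reports", "control_plane_deps": [{"api": "s3-replication",  "region": "eu-west-1"}]},
]

exposure = control_plane_exposure(services)
# Three of four "multi-region" services funnel through one region.
print(exposure.most_common(1))  # → [('us-east-1', 3)]
```

Run against a real dependency inventory, a tally like this tends to surface exactly the hidden single-plane exposure described above.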

What makes today’s event more concerning is how little has changed since the last one. The fragility is known, yet few businesses have redesigned their architectures to reduce the dependency.

How Zerberus Responded to the Lesson

When we began building Zerberus, we decided that no single region or provider should ever be critical to our uptime. That choice was not born from scepticism but from experience building two other platforms with millions of users across four continents.

Our products, Trace-AI, ComplAI™, and ZSBOM, deliver compliance and security automation for organisations that cannot simply wait for the cloud to recover. We chose to design for failure as a permanent condition rather than a rare event.

Inside the Zerberus Architecture

Our production environment operates across five regions: London, Ireland, Frankfurt, Oregon, and Ohio. The setup follows an active-passive pattern with automatic failover.

Two additional warm standby sites receive limited live traffic through Cloudflare load balancers. When one of these approaches a defined load threshold, it scales up and joins the active pool without manual intervention.
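The promotion rule described above can be sketched in a few lines. This is a simplified illustration, not our production code; the threshold value and pool names are assumptions for the example.

```python
# Minimal sketch of threshold-based standby promotion: a warm standby
# joins the active pool once its live-traffic load crosses a threshold.
def promote_standbys(pools, threshold=0.7):
    """Return pools with standbys promoted to active when load >= threshold."""
    promoted = []
    for pool in pools:
        role = pool["role"]
        if role == "standby" and pool["load"] >= threshold:
            role = "active"  # joins the active pool without manual intervention
        promoted.append({**pool, "role": role})
    return promoted

pools = [
    {"name": "eu-west-2",    "role": "active",  "load": 0.55},
    {"name": "us-west-2",    "role": "standby", "load": 0.72},  # over threshold
    {"name": "eu-central-1", "role": "standby", "load": 0.10},
]

active = [p["name"] for p in promote_standbys(pools) if p["role"] == "active"]
print(active)  # → ['eu-west-2', 'us-west-2']
```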

Multi-Cloud Distribution

  • AWS runs the primary compute and SBOM scanning workloads.
  • Azure carries the secondary inference pipelines and compliance automation modules.
  • Digital Ocean maintains an independent warm standby, ensuring continuity even if both AWS and Azure suffer regional difficulties.

This diversity is not a marketing exercise. It separates operational risk, contractual dependence, and control-plane exposure across multiple vendors.

Network Edge and Traffic Management

At the edge, Cloudflare provides:

  • Global DNS resolution and traffic steering
  • Web application firewalling and DDoS protection
  • Health-based routing with zero-trust enforcement

By externalising DNS and routing logic from AWS, we avoid the single-plane dependency that is now affecting thousands of services.
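Health-based routing itself is vendor-neutral in principle. The sketch below is a generic illustration of the idea, deliberately independent of Cloudflare's actual API; origin names and the probe are invented for the example.

```python
# Sketch of health-based traffic steering: a probe checks each origin,
# and only healthy origins remain in the routing table.
def steer(origins, probe):
    """Return origins that pass the health probe, in priority order."""
    return [o for o in origins if probe(o)]

origins = ["aws-eu-west-2", "azure-uk-south", "do-lon1"]

# Simulate the AWS origin failing its health check during an outage.
down = {"aws-eu-west-2"}
healthy = steer(origins, probe=lambda o: o not in down)
print(healthy)  # → ['azure-uk-south', 'do-lon1']
```

The key design point is that the probe and the routing decision live outside the cloud being probed, so the check still works when that provider's own control plane is the thing that failed.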

Data Sovereignty and Isolation

All client data remains within each client’s own VPC. Zerberus only collects aggregated pass/fail summaries and compliance evidence metadata.

Databases replicate across multiple Availability Zones, and storage is separated by jurisdiction. UK data remains in the UK; EU data remains in the EU. This satisfies regulatory boundaries and limits any failure to its own region.
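The jurisdiction pinning described above can be reduced to a simple routing rule. The bucket names and mapping below are illustrative assumptions, not our real configuration.

```python
# Sketch of jurisdiction-pinned storage: a client's data is only ever
# written to a bucket in that client's own jurisdiction.
JURISDICTION_BUCKETS = {
    "UK": "zb-evidence-uk-london",
    "EU": "zb-evidence-eu-frankfurt",
}

def bucket_for(client):
    """Resolve a client's storage bucket, refusing cross-border writes."""
    try:
        return JURISDICTION_BUCKETS[client["jurisdiction"]]
    except KeyError:
        raise ValueError(f"no storage region for {client['jurisdiction']!r}")

print(bucket_for({"name": "acme", "jurisdiction": "UK"}))  # → zb-evidence-uk-london
```

Failing closed on an unknown jurisdiction, rather than falling back to a default region, is what turns the regulatory boundary into an enforced invariant.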

Observability and Auto-Recovery

Telemetry is centralised in Grafana, while Cloudflare health checks trigger regional routing changes automatically.
If a scanning backend becomes unavailable, queued SBOM analysis tasks shift to a healthy region within seconds.
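The dispatch step can be sketched as a preference-ordered failover. The region order and health set below are assumptions for illustration, not our actual topology.

```python
# Sketch of queued-task failover: each SBOM scan task is dispatched to
# the first healthy region in a fixed preference order.
FAILOVER_ORDER = ["eu-west-2", "eu-west-1", "eu-central-1", "us-west-2", "us-east-2"]

def dispatch(task, healthy_regions):
    """Pick the first healthy region for a task, or park it for retry."""
    for region in FAILOVER_ORDER:
        if region in healthy_regions:
            return {**task, "region": region, "status": "dispatched"}
    return {**task, "region": None, "status": "parked"}

task = {"id": "sbom-scan-42"}
# London backend unavailable; Ireland picks up the work.
result = dispatch(task, healthy_regions={"eu-west-1", "us-west-2"})
print(result["region"])  # → eu-west-1
```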

Even during an event such as the present AWS disruption, Zerberus continues to operate, perhaps with reduced throughput, but never completely offline.

Learning from 2022

The 2022 outage made clear that availability zones do not guarantee availability. The 2025 incident reinforces that message.

At Zerberus, we treat resilience as a practice, not a promise. We simulate network blackouts, DNS failures, and database unavailability. We measure recovery time not in theory but in behaviour. These tests are themselves automated and monitored, because the cost of complacency is always greater than the cost of preparation.

Regulation and Responsibility

Europe’s Cyber Resilience Act and NIS2 Directive are closing the gap between regulatory theory and engineering reality. Resilience is no longer an optional control; it is a legal expectation.

A multi-region, multi-cloud, data-sovereign architecture is now both a technical and regulatory necessity. If a hyperscaler outage can lead to non-compliance, the responsibility lies in design, not in the service-level agreement.

Designing for the Next Outage

US-EAST-1 will recover; it always does. The question is how many services will redesign themselves before the next event.

Every builder now faces a decision: continue to optimise for convenience or begin engineering for continuity.

The 2022 failure served as a warning. The 2025 outage confirms the lesson. By the next one, any excuse will sound outdated.

Final Thoughts

The cloud remains one of the greatest enablers of our age, but its weaknesses are equally shared. Each outage offers another chance to refine, distribute, and fortify what we build.

At Zerberus, we accept that the cloud will falter from time to time. Our task is to ensure that our systems, and those of our clients, do not falter with it.

🟩 Author: Ramkumar Sundarakalatharan
Founder & Chief Architect, Zerberus Technologies Ltd

(This article reflects an ongoing incident. For live updates, refer to the AWS Status Page and technology news outlets such as BBC Tech and The Independent.)

References:

https://www.bbc.co.uk/news/live/c5y8k7k6v1rt

https://www.independent.co.uk/tech/aws-amazon-internet-outage-latest-updates-b2848345.html

https://www.dailystar.co.uk/news/world-news/amazon-breaks-silence-outage-reason-36096705

Why Startups Need To Architect Cloud Agnostic Products

Nobody plans to leave AWS in the startup world, but as they say, “sh** happens.”

As engineers, when we write software, we are taught to keep it elegant by never depending directly on external systems. We write wrappers for external resources, we encapsulate data and behaviour, and we standardise functions with libraries.

But when it comes to the cloud… “eerie silence”.

Companies have died because they needed to move off AWS or GCP but couldn’t do it in a reasonable and cost-effective timeline.

We (at Itilite) had a close call with GCP that served as our brush with the fire. Google had arguably one of the best distance-matrix capabilities out there, and it was used in one of our core pieces of logic and in our ML models. Then, one fine Monday afternoon, I had to set up a meeting with my CEO to communicate that we would have to spend roughly 250% more on our cloud service bill in about 60 days.

In fact, Google increased the pricing by 1,400% and gave us 60 days to rewrite, migrate, move out, or perish!

The closest competitor in terms of capability was DistanceMatrix, and the reliable “large” player was Bing, but both left a lot to be desired in accuracy. So, for us, the business decision was simple: make the entire product work in a “reduced functionality” mode for everyone, or start differential pricing for better accuracy. In either case, those APIs had to be rewritten behind a new adapter.

It is not an enigma why we do this. It is simple: there seem to be no alternatives and no time before go-to-market. But maybe there are. I will explain why you should take cloud-agnostic architecture seriously and then show you what I do to keep my projects cloud-agnostic.
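The wrapper discipline mentioned earlier applies directly to the distance-matrix story. A minimal sketch, with stub providers standing in for the real Google, Bing, and DistanceMatrix SDKs (the class names and returned values are invented for illustration):

```python
# Adapter sketch: core logic depends on a small interface, and each
# provider is one adapter behind it, so a forced migration is one swap.
from abc import ABC, abstractmethod

class DistanceProvider(ABC):
    @abstractmethod
    def distance_km(self, origin, destination): ...

class StubGoogleMatrix(DistanceProvider):
    def distance_km(self, origin, destination):
        return 42.0  # a real adapter would call the Google API here

class StubBingMatrix(DistanceProvider):
    def distance_km(self, origin, destination):
        return 43.5  # different accuracy profile, same interface

def trip_cost(provider: DistanceProvider, origin, destination, rate=0.5):
    """Core logic never names a vendor; swapping providers is one line."""
    return provider.distance_km(origin, destination) * rate

print(trip_cost(StubGoogleMatrix(), "BLR", "MAA"))  # → 21.0
```

Had the core logic and ML features consumed `DistanceProvider` instead of the Google SDK directly, the 60-day deadline would have been an adapter rewrite, not a product rewrite.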

Cloud Service Rationalisation

The prime reason you should build the ability to switch clouds and cloud services is so you can choose the service that is price- and performance-optimised for your use case.

When I first got into serverless, we wrote a transformation API on Oracle Cloud (because we were part of their accelerator programme and had a huge credit), but it fed part of the data that the customer-facing API relied on.

No prizes for guessing what happened.

It was a horrible mistake. Our API had an insane latency problem: cold starts added at least two seconds per request. The AWS team has worked hard to build a service that can do things other FaaS platforms, such as GCP’s Cloud Functions, simply cannot, specifically around cold starts and latency.

I had to move my infrastructure to a different service and a revised network topology.

You would guess we had learned the lesson by now, but as we will find out, we had not.

This time it was the combination of Kafka and AWS Lambda that created an issue. We had relied on Confluent’s connectors for much of the workload interfaces and had to shell out almost $1,000 per month per connector!

Avoiding the Cloud Provider Killswitch

Protect Your Business from Unexpected Termination

As a CXO, you may not be aware that cloud providers like AWS, GCP, and Azure reserve the right to terminate your account and destroy your infrastructure at any time, effectively shutting down your business operations. While this may seem like an extreme measure, it’s important to understand that cloud providers have strict terms of service that can lead to account termination for a variety of reasons, even if you’re not engaged in illegal or harmful activities.

A Chilling Example

I recently spoke with a friend who is the founder of a fintech platform. He shared a chilling incident that highlights the risks of relying on cloud providers. His team was using GCP’s Cloud Run, a container service, to host their API. They had a unique use case that required them to call back to their own API to trigger additional work and keep the service active. Unfortunately, GCP monitors this type of behaviour and flags it as potential crypto-mining activity.

On an ordinary Sunday, their infrastructure vanished, and their account was locked. It took them six days of nonstop effort to migrate to AWS.

Protect Your Business

This incident serves as a stark reminder that any business operating on cloud infrastructure is vulnerable to unexpected termination. While you may not be intentionally engaging in activities that violate cloud provider terms of service, it’s crucial to build your infrastructure with the possibility of termination in mind.

Here are some key steps you can take to protect your business from the cloud provider killswitch:

  1. Read and understand the terms of service for each cloud provider you use.
  2. Choose a cloud provider that aligns with your industry and business model.
  3. Avoid relying on a single cloud provider.
  4. Have a backup plan in place.
  5. Regularly review your cloud usage and ensure compliance with cloud provider terms of service.

By taking these proactive measures, you can significantly reduce the risk of your business being disrupted by cloud provider termination and ensure the continuity of your operations.

Unleash the Power of Free Cloud Credits

For early-stage startups operating on a shoestring budget, free cloud credits can be a lifeline, shielding your runway from the scorching heat of cloud infrastructure costs. Acquiring these credits is a breeze, but the way most startups build their infrastructure – akin to an unbreakable blood oath with their cloud provider – restricts them to the credits granted by that single provider.

Why limit yourself to the generosity of one cloud provider when you could seamlessly switch between them to optimize your resource allocation? Imagine the possibilities:

  • AWS to GCP: Upon depleting your AWS credits, you could effortlessly migrate your infrastructure to GCP, taking advantage of their generous $200,000 credit offer.
  • Y Combinator: As a Y Combinator startup, you’re entitled to a staggering $150,000 in AWS credits and a mind-boggling $200,000 on GCP.
  • AI-Powered Startups: If you’re developing AI solutions, Azure welcomes you with open arms, offering $300,000 in free credits to fuel your AI models on their cloud.

By embracing cloud-agnostic architecture, you unlock the freedom to switch between cloud providers, potentially saving you a significant $200,000 upfront. Why constrain yourself to a single cloud provider when cloud-agnosticism empowers you to navigate the cloud landscape with flexibility and cost-efficiency?

Building Resilience: The Importance of Cloud Redundancy

In the ever-evolving world of technology, no system is immune to failure. Even industry giants like Silicon Valley Bank can disappear outright over a weekend, or AWS’s main data centre can go offline due to a power fluctuation, highlighting the importance of proactively safeguarding your business operations.

Imagine the potential financial impact of a 12-hour outage on AWS for your company. The costs could be staggering, not only in lost revenue but also in reputational damage and customer dissatisfaction or even potential churn.

This is where cloud redundancy comes into play. By running parallel segments of your platform on multiple cloud providers, such as AWS and GCP, you’re essentially creating a fail-safe mechanism.

In the event of an outage on one cloud platform, the other can seamlessly pick up the slack, ensuring uninterrupted service for your customers and minimizing the impact on your business. Cloud redundancy is not just about disaster preparedness; it’s also about optimizing performance and scalability. By distributing your workload across multiple cloud providers, you can tap into the unique strengths and resources of each platform, maximizing efficiency and responsiveness.

In our case, we run the OCR packages, SAML, and accounts service on Azure, and our core recommendation engine and booking engine on AWS. Yes, going multi-cloud involves initial costs that might seem prohibitive, but in the long run the benefits will far outweigh them.

Cloud Cost Negotiation: A Matter of Leverage

In the realm of business negotiations, the ultimate power lies in the ability to walk away. If the other party senses your lack of alternatives, they gain a significant advantage, effectively holding you hostage. Cloud cost negotiations are no exception.

Imagine you’ve built a substantial $10 million infrastructure on AWS, heavily reliant on their proprietary APIs like S3, Cognito, and SQS. In such a scenario, walking away from AWS becomes an unrealistic option. You’re essentially at their mercy, accepting whatever cloud costs they dictate.

While negotiating cloud costs may seem insignificant to a small company, for an organization with $10 million of AWS infrastructure, even a 3% discount translates into substantial savings.

To gain leverage in cloud cost negotiations, you need to establish a credible threat of walking away. This requires careful planning and strategic implementation of cloud-agnostic architecture, enabling you to seamlessly switch between cloud providers without disrupting your operations.

Cloud Agnosticism: Your Negotiating Edge

Cloud-agnostic architecture empowers you to:

  1. Diversify your infrastructure: Run your applications on multiple cloud platforms, reducing reliance on a single provider.
  2. Reduce switching costs: Design your infrastructure to minimize the effort and cost of migrating to a new cloud provider.
  3. Strengthen your negotiating position: Demonstrate to cloud providers that you have alternative options, giving you more bargaining power.

By embracing cloud-agnosticism, you transform from a captive customer to a savvy negotiator, capable of securing favorable cloud cost terms.
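Point 2 above, reducing switching costs, usually comes down to thin interfaces over provider services. A minimal sketch, where an in-memory backend stands in for S3, GCS, or Azure Blob adapters (the class names are illustrative; real adapters would implement the same methods against their vendor SDKs):

```python
# Storage-abstraction sketch: application code depends on ObjectStore,
# so changing clouds means changing one adapter, not every call site.
from abc import ABC, abstractmethod

class ObjectStore(ABC):
    @abstractmethod
    def put(self, key, data): ...
    @abstractmethod
    def get(self, key): ...
    @abstractmethod
    def delete(self, key): ...

class InMemoryStore(ObjectStore):
    """Test double; an S3Store or AzureBlobStore would mirror this shape."""
    def __init__(self):
        self._blobs = {}
    def put(self, key, data):
        self._blobs[key] = data
    def get(self, key):
        return self._blobs[key]
    def delete(self, key):
        self._blobs.pop(key, None)

store: ObjectStore = InMemoryStore()  # the only line that names a backend
store.put("invoices/2024.csv", b"id,amount\n1,99")
print(store.get("invoices/2024.csv"))  # → b'id,amount\n1,99'
```

The in-memory implementation doubles as a fast test fixture, which is a second payoff of the abstraction beyond negotiating leverage.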

Unforeseen Challenges: The Importance of Cloud Agnosticism

In the dynamic world of business, unforeseen challenges (and opportunities) can arise at any moment. We often operate with limited visibility, unable to predict every possible scenario that could impact our success. Here’s an actual scenario that highlights the importance of cloud-agnostic architecture:

Acquisition Deal Goes Through

This happened at one of my previous organisations. We tirelessly built the company from the ground up, and our hard work and dedication paid off when a large SaaS unicorn approached us with an acquisition proposal.

However, during due diligence, a critical issue emerged: our company’s infrastructure was entirely reliant on AWS. The acquiring company had a multi-year, multi-million-dollar deal with Azure, and the M&A team made it clear that unless our platform could operate on Azure, the deal was off the table!

Our team faced the daunting task of migrating the entire infrastructure to Azure within a limited timeframe and budget. Unfortunately, the complexities of the migration proved time-consuming: the merger took five months to complete, and the offer was reduced by $2 million!

The Power of Cloud Agnosticism

This story serves as a stark reminder of the risks associated with a single-cloud strategy. Had our company embraced cloud-agnostic architecture, we would have possessed the flexibility to seamlessly switch between cloud providers, potentially leading to a bigger exit for all of us!

Cloud-agnostic architecture offers several benefits:

  • Reduced Vendor Lock-in: Avoids dependence on a single cloud provider, empowering you to switch to more favourable options based on your needs.
  • Improved Negotiation Power: Gains leverage in cloud cost negotiations by demonstrating the ability to switch providers.
  • Increased Resilience: Protects your business from disruptions caused by cloud provider outages or policy changes.
  • Enhanced Scalability: Enables seamless expansion of your infrastructure across multiple cloud platforms as your business grows.

Embrace Cloud Agnosticism for Business Continuity

In today’s ever-changing technological landscape, cloud-agnostic architecture is not just a benefit; it’s a necessity for businesses seeking long-term success and resilience. By adopting a cloud-agnostic approach, you empower your company to navigate the complexities of the cloud landscape with agility, adaptability, and cost-efficiency, ensuring that unforeseen challenges don’t derail your journey.

My Solution

Here is what I do about it now, after the lessons learnt: I use Multy. Multy is an open-source tool that simplifies cloud infrastructure management by providing a cloud-agnostic API. Developers define their infrastructure configuration once and deploy it to any cloud provider without worrying about the specific syntax or nuances of each platform. While Multy provides an abstraction layer for deploying cross-cloud environments, you will also need cloud-agnostic libraries in your application code to really make a difference.

References & Further Reading: 

  1. https://kobedigital.com/google-maps-api-changes/
  2. https://www.reddit.com/r/geoguessr/comments/cslpja/causes_of_google_api_price_increase_suggestion/ 
  3. https://multy.dev/
  4. https://github.com/multycloud/multy
  5. https://github.com/serverless/multicloud 
  6. https://aws.amazon.com/startups/credits