Table of Contents

Reach SOC 2 Compliance in 6 Weeks or Less.

  / ,

  / When the Cloud Goes Dark: Regional Outages and What They Mean for SOC 2 and ISO 27001 Compliance

When the Cloud Goes Dark: Regional Outages and What They Mean for SOC 2 and ISO 27001 Compliance

In March 2026, a regional conflict in the Middle East did something that stress tests and tabletop exercises rarely manage to do: it took down cloud infrastructure across multiple availability zones at the same time, in the same region, without warning.

AWS data centers in the UAE and Bahrain were impacted. Banking apps went offline. Payments failed. Delivery platforms stopped. And a significant portion of the affected organizations had done everything “right” by conventional standards — multi-AZ deployments, redundancy within the region, documented continuity plans.

It wasn’t enough.

This article breaks down what happened, what it revealed about how most organizations think about availability, and what a more resilient architecture actually looks like. If your systems run on cloud infrastructure — in any region — this case is worth understanding closely.

What Happened: The March 2026 Incident

Regional conflict in the Middle East caused physical and infrastructural disruption to AWS facilities across the UAE and Bahrain. Based on publicly reported information, the incident involved power outages affecting data center operations, physical damage to infrastructure facilities, connectivity loss across affected environments, and service degradation spanning multiple availability zones within the same region — simultaneously.

That last point is the one that matters most. AWS designs its availability zones to be isolated from one another — separate power, cooling, and networking — so that a failure in one zone doesn’t cascade into another. Under normal failure conditions, that isolation holds. But this wasn’t a normal failure condition. It was a regional-scale disruption. The “rooms” were fine. The “building” was the problem.

“Availability zones are designed to handle localized failures, not regional ones. This incident sits firmly in the second category.”

The result was that organizations with multi-AZ architectures — which many rightly considered robust — still went down. There was no in-region fallback left to use.

Business Impact: What Actually Went Offline

The impact was not subtle. Banking platforms experienced downtime that prevented customers from accessing accounts or completing transactions. Payment processors were unable to process transactions. Mobility and delivery platforms halted operations entirely. Customer-facing applications became unavailable across the board.

This wasn’t degraded performance or slower load times. It was a full loss of availability for any system that lived entirely within the affected region. The AWS Well-Architected Framework acknowledges that regional failures, while rare, are a defined risk category — and designing for them requires a fundamentally different approach than designing for AZ failures.

Organizations with multi-region architectures kept operating. Everything else stopped. That single architectural decision — single-region versus multi-region — was the difference between availability and a complete outage.

What Risks Actually Materialised

This incident didn’t create new risks. It exposed ones that were already there, quietly embedded in architectural choices and compliance assumptions that had never been stress-tested at this scale.

Regional Single Point of Failure

The most common pattern among affected organizations: applications, databases, and backups all deployed within a single region. When that region became unavailable, there was no secondary environment to take over. No warm standby, no traffic rerouting, no automated failover. Just downtime.

This is the architectural equivalent of backing up your data to a drive sitting next to your laptop. It works until it doesn’t.

The Limits of Availability Zone Redundancy

Availability zones are a powerful tool — but they’re a tool designed for a specific class of failure, and understanding that class matters. Think of an availability zone as a separate floor in a building. If one floor has a problem, you move to another floor. But if the entire building loses power — or becomes inaccessible — floor redundancy doesn’t help. You needed another building entirely. That’s what a region is. And this incident took down the building.

Pro tip: When mapping your architecture against a business continuity plan, explicitly define your regional failure scenario. “What happens if this entire region becomes inaccessible for 24 hours?” is a question that exposes gaps that AZ-level planning will never catch.

Infrastructure-Level Disruption Is Not Solvable at the Application Layer

Power outages. Connectivity loss. Physical damage. These are not conditions that clever application architecture can work around if your infrastructure is entirely contained within the affected geography. No amount of microservices design, caching strategy, or auto-scaling helps when there’s no power reaching the data center.

This is an important framing shift for engineering teams who own availability: some failure modes require infrastructure-layer responses, not code-layer ones.

The Compliance Gap: Controls on Paper vs. Controls in Practice

Perhaps the most uncomfortable implication of this incident. In many environments — particularly those undergoing ISO/IEC 27001:2022 certification or SOC 2 audits — availability controls are documented but don’t reflect the actual system architecture. Redundancy is listed as a control. It’s just redundancy within a single region, which, as this event demonstrated, is insufficient for regional-scale disruptions. The control passes an audit. It fails a real incident.

This is the exact gap that compliance frameworks are designed to close — and that audit processes sometimes fail to catch.

Reach SOC 2 Compliance in 6 Weeks or Less

Schedule Your Free SOC 2 Assessment Today

Cloud Hosting and SOC 2 Compliance Requirements

Choosing AWS or Azure doesn’t hand you a SOC 2 compliance. It hands you a shared responsibility model, which means your provider secures the physical infrastructure and you secure everything running on top of it — including whether your architecture can actually deliver on your availability commitments.

Auditors know this distinction well. When they evaluate your Availability criteria, they’re looking at your controls, not your provider’s SOC 2 report.

What that means in practice: your recovery objectives need to be real numbers tied to a real architecture, not placeholders in a policy document. Your failover plan needs test records behind it. And your cloud provider should appear in your vendor risk register with an annual review of their own audit reports.

A single-region deployment with no tested failover isn’t compliant in any meaningful sense. It’s a documentation exercise waiting to be disproved.

The March 2026 incident made this concrete. Organizations that had documented availability controls but confined their entire infrastructure to one region found those controls counted for nothing when the region went down. The control passed the audit. It failed the incident.

That gap is exactly what a SOC 2 audit is supposed to catch. Sometimes it doesn’t. 

What Mitigating Controls Could Have Reduced the Impact

The following aren’t theoretical best practices. They’re the specific capabilities that separated organizations that stayed online from those that didn’t.

Multi-region deployment is the foundational requirement. Deploying systems across independent geographic regions — not just independent availability zones — means a regional disruption in one location doesn’t take everything down. Google Cloud’s documentation on multi-region architectures provides useful reference material on how this is structured in practice.

Cross-region data replication ensures that when failover happens, the secondary region has current data to work with. Replication lag is a design variable — it can be tuned based on acceptable recovery point objectives. What can’t be tuned is the existence of the replication relationship itself. If it isn’t there before the incident, it can’t help during one.

Automated failover removes the human response time variable from the equation. If traffic rerouting to a secondary region requires manual intervention, you are adding minutes or hours to your outage window during the exact moment when your team is most overwhelmed. Route 53 failover routing, Azure Traffic Manager, and equivalent tools in other clouds exist specifically for this scenario.

Regional outage testing is the practice that most organizations skip. Simulating a full regional failure — not just a single AZ — validates whether recovery strategies actually work, not just whether they exist. The NIST SP 800-34 guide on contingency planning recommends testing at the scenario level, not just the control level.

Dependency resilience is the one that catches teams off guard. If your identity provider, monitoring stack, or secrets management system lives in the same region as your primary workload, your failover may not actually work — because the systems your application depends on to function are also offline.

Insider note: A common failure in multi-region DR testing is discovering that the authentication service doesn’t fail over cleanly, even when the application does. Audit your dependency chain before you test — not during.

Compliance Perspective: What the Frameworks Actually Require

This incident maps cleanly onto requirements that many organizations are already accountable for.

ISO/IEC 27001:2022 addresses this directly across several controls. A.8.14 covers redundancy of information processing facilities — and the intent is effective redundancy, not documented redundancy. A.8.13 covers backup, with an expectation that backup data is accessible when primary systems are not. A.5.30 addresses ICT readiness for business continuity, which includes planning for scenarios beyond localized failure. The standard is explicit that controls must be implemented in a way that is proportionate to the risk — and a single-region deployment for a mission-critical application is a risk the standard expects to be addressed.

Unsure whether your current architecture actually satisfies these controls? An ISO 27001 gap analysis is usually the fastest way to find out, and an internal audit against your documented controls will surface the delta between what’s on paper and what’s in production.

SOC 2 Availability Criteria requires that systems are available in line with commitments and expectations. If your service-level commitments assume high availability, and your architecture cannot deliver that when a region goes offline, you have a gap between your commitments and your design. The AICPA’s Trust Services Criteria are clear on this point: availability controls must reflect real-world capability, not aspirational architecture.

The common thread across both frameworks: compliance asks whether controls are effective, not just whether they’re present. This incident is a clear case study in what ineffective-but-documented redundancy looks like under real conditions.

What This Means for Your Organization

The March 2026 incident is not a cautionary tale about a distant edge case. It’s a practical reference point for evaluating your own architecture — right now, before you need it.

The questions worth asking are direct ones. Can your systems operate if an entire cloud region becomes unavailable — not for five minutes, but for hours? Does your failover extend beyond a single region, or does it just move traffic between availability zones? Have you tested a full regional failure scenario, or only component-level failures? Do your compliance controls reflect actual system architecture, or how it was originally designed two years ago?

If the answer to any of those is uncertain, that uncertainty is the finding.

NIST‘s Cybersecurity Framework is also worth revisiting in this context — specifically the “Recover” function, which provides a structured way to think about resilience planning at the organizational level, not just the infrastructure level.

Conclusion

The March 2026 incident made one thing concrete: availability is not defined by the presence of redundancy within a region — it’s defined by the ability to operate beyond it.

Multi-AZ architecture is good design. It protects against the failures it’s designed to protect against. But it was never intended to be a substitute for multi-region resilience, and organizations that treated it as one found out the hard way. For most organizations, closing this gap doesn’t require rebuilding from scratch. It requires an honest assessment of where your architecture actually stands versus where you assumed it did.

Axipro works with scaling software companies to assess availability architecture, close compliance gaps, and ensure that continuity controls hold up under real-world conditions — not just audit conditions. If the questions raised in this article surfaced something worth investigating in your own environment, reach out to our team to schedule a technical review. Or if you’d prefer to start with a self-assessment, learn more about how we approach availability and compliance readiness.

Reach SOC 2 Compliance in 6 Weeks or Less

Schedule Your Free SOC 2 Assessment Today

Axipro Author

Picture of Abeera Zainab

Abeera Zainab

Blog Highlights

Explore More Articles

Researchers who buy second-hand drives off online marketplaces keep finding the same thing: live data.  A widely cited study by Blancco Technology Group found that 42% of used drives sold on eBay still held recoverable information, including financial records and personal data the previous owners assumed was long gone. The drives were not hacked; they were thrown away by organizations that treated deleting a file as the same thing as destroying it. Secure data disposal is where many compliance programs fail. ISO 27001, SOC 2, and GDPR all demand it, but they describe it in different languages, enforce it through different mechanisms, and punish failure in very different ways.  This article sets out what each framework requires, where the requirements overlap, and how to run a single disposal program that satisfies all three at once. Why Secure Data Disposal Matters Across Compliance Frameworks Disposal is the last link in the data lifecycle, and the easiest one to skip. An organization can run flawless access controls, encryption, and monitoring for years and still cause a reportable breach the moment one unwiped laptop leaves the building. A recoverable drive in a recycling skip is functionally identical to an open database on the internet, and auditors and regulators know it. Most disposal failures are unforced errors: a control that was already written into policy but never carried through to the actual hardware. The gap between having a disposal policy and proving this specific drive was destroyed is exactly where audits and breach investigations live. Defining Secure Data Disposal: Key Terms and Concepts What Is Secure Data Disposal? Secure data disposal is the end-to-end process of removing data and the equipment that holds it from active use, in a way that prevents its recovery. It covers the full lifecycle end: deletion of data while a system is still live, sanitisation of media that will be reused, physical destruction of media that will not, and the safe handling of equipment that is recycled, returned to a lessor, or sold. Disposal is the goal. The methods are how you get there. What Is Secure Data Destruction? Secure data destruction is the subset of disposal that renders media permanently unusable or its contents mathematically irretrievable. Shredding a drive, pulverising it, incinerating it, or destroying the encryption keys that make an encrypted disk readable are all forms of destruction. Destruction is one route to disposal, and it is the right route when the data is highly sensitive, or the media will never be reused. Secure Data Disposal vs. Secure Data Destruction: What Is the Difference? The distinction matters more than it looks. Disposal is the outcome you owe to every framework: data gone, unrecoverable, equipment handled appropriately. Destruction is just one of the methods. You can dispose of data without destroying the hardware by sanitising a drive thoroughly enough to reuse it. Confusing the two leads to two classic mistakes: destroying assets that could have been securely wiped and reused, and assuming a quick deletion counts as disposal when it does not. Important: Emptying the recycle bin, formatting a drive, or hitting delete does not dispose of data under any of these frameworks. Standard deletion only removes the pointer to the data; the bits remain until they are overwritten. Every framework discussed here expects the data to be unrecoverable, which is a far higher bar than not visible. What ISO 27001 Requires for Secure Data Disposal ISO/IEC 27001 handles disposal through a small cluster of Annex A controls that auditors read as a single process rather than in isolation. The two controls that do most of the work are 7.14 and 8.10. For a deeper look at how these controls fit into a broader compliance program, see our ISO 27001 implementation guide. ISO 27001 Annex A 7.14: Secure Disposal or Re-Use of Equipment Annex A 7.14 is a physical control. Before any equipment is disposed of or reused, the organisation must check whether it holds information assets or licensed software and ensure those are permanently erased or the media physically destroyed. It applies to servers, laptops, desktops, mobile devices, printers, network gear, and any storage media: if it ever processed information, it is in scope. The control replaces the older 2013 clause 11.2.7 and adds explicit expectations around removing identifying markings and handling end-of-occupancy scenarios. ISO 27001 Control 8.10: Information Deletion Annex A 8.10 is a technological control, and it focuses on the data rather than the box. It requires information stored in systems, devices, or media to be deleted when it is no longer required, and rendered unrecoverable. The cleanest way to keep these straight: 8.10 governs the data while it is in use or reaches its retention limit; 7.14 governs the hardware at end of life. Most retention-driven deletion sits under 8.10; most decommissioning sits under 7.14. ISO 27001 Control 8.12: Data Leakage Prevention and Its Role in Disposal Control 8.12 is rarely filed under disposal, but improperly discarded media is one of the oldest data leakage channels there is. A drive that leaves your control with recoverable data on it is a leak, regardless of how it left. Treating disposal as part of your leakage prevention posture forces the right question at the right time: what could walk out the door on this device, and has it actually been removed? Physical Destruction and Irretrievable Erasure Under ISO 27001 ISO 27001 offers two broad routes: physically destroy media that holds information, or erase and overwrite it so retrieval by a malicious party is precluded. The standard cross-references ISO/IEC 27040 for detailed sanitisation methods. The unifying requirement is that recovery should be impractical, not merely inconvenient. Deletion alone never satisfies this. Overwriting, Full-Disk Encryption, and Other Approved Methods Overwriting user-accessible storage with multiple passes is acceptable for many sensitivity levels. Full-disk encryption changes the economics of disposal entirely: if a device is encrypted from day one and the keys are properly managed, secure disposal can be as simple as destroying the keys, a technique known as

A business continuity plan that has never been tested is, to a SOC 2 auditor, a document and nothing more. The Availability criteria do not award credit for a polished plan sitting in a shared drive. They ask for evidence that you ran the plan, watched it work or fail, recorded what happened, and fixed what broke. That gap — between having a plan and proving it works — is where most availability findings originate. Business continuity plan testing for SOC 2 is the exercise that turns your plan into auditable evidence. It maps directly to Availability criterion A1.3, one of the few SOC 2 controls that explicitly requires you to test something rather than merely document it. This guide covers what counts as a valid test, the test types auditors accept, a step-by-step process, the exact evidence you need, and the mistakes that turn a routine review into a finding. What Is Business Continuity Plan Testing in the Context of SOC 2? Business continuity plan (BCP) testing is the structured validation of whether your organization can keep critical operations running — and restore them within defined targets — during a disruption. In a SOC 2 context, the testing is not freeform. It must produce dated, traceable evidence that the recovery procedures in your plan actually work, that the people involved know their roles, and that systems and data come back within your stated recovery objectives.   Why SOC 2 Requires Business Continuity Plan Testing SOC 2 is an attestation against the AICPA’s Trust Services Criteria, and the Availability category exists specifically for organizations that make uptime or resilience commitments to customers. A plan you never exercise cannot demonstrate operating effectiveness over the audit period — which is the entire point of a Type 2 examination. Testing is the control that converts a static plan into a recurring, observable activity an auditor can sample. SOC 2 Trust Services Criteria and BCP Testing Requirements Availability is one of the five Trust Services Criteria, and it is optional, included only when your service commitments warrant it. When in scope, it is built around three sub-criteria: A1.1 addresses capacity management. A1.2 addresses recovery infrastructure and backup processes. A1.3 addresses the testing of recovery procedures. BCP testing lives squarely in A1.3, with A1.2 supplying the backups and infrastructure that the test validates. Availability Criteria A1.2 and A1.3 Explained Per the AICPA’s Trust Services Criteria, A1.2 requires the entity to design, implement, operate, and monitor environmental protections, recovery infrastructure, and data backup processes that meet its availability objectives. In plain terms: you need real backups, stored away from production, with recovery infrastructure ready to use. A1.3 then requires the entity to test recovery plan procedures supporting system recovery to meet its objectives. The two work as a pair: A1.2 builds the capability, A1.3 proves it functions. Important: The most common A1.3 gap is not a missing test. It is a test that never validated the recovery objectives. Teams run a tabletop, write “no issues found,” and move on — but the plan claims a 4-hour RTO that no one ever measured against an actual restore. If your plan states recovery targets, your test evidence must show whether you met them. A test that does not measure against your RTO and RPO leaves the most important question unanswered.   What Auditors Look for During a BCP Test Review Auditors want proof that the test happened, proof that it was meaningful, and proof that it led somewhere. Concretely, that means a test plan with a defined scenario, a dated record of execution with participants, results measured against your recovery objectives, a list of gaps or issues found, and evidence that those issues were remediated. A test that finds nothing and changes nothing is treated with suspicion — because real tests almost always surface something.   Types of Business Continuity Plan Tests Accepted for SOC 2 SOC 2 does not mandate a specific test type. It expects the rigor of the test to match the criticality of what you are protecting. The four common approaches sit on a spectrum from low-effort, low-disruption to high-effort, high-assurance. Tabletop Exercises A tabletop exercise is a facilitated discussion where key personnel talk through a disruption scenario and their responses. It is cheap, fast, and excellent for confirming that people understand their roles and that the plan reads coherently. Its limit is obvious: nobody actually recovers anything. For many organizations a tabletop is a legitimate annual test, especially in the first audit cycle, but auditors expect more rigor as a program matures. Walkthrough and Simulation Tests A simulation applies a specific scenario and asks the team to perform recovery actions, not just describe them. It is more involved than a tabletop and far better at exposing the gaps that only appear when people touch the tools. Simulations are where teams discover that a runbook references a system that was decommissioned, or that the on-call engineer lacks the access the plan assumes. Full Interruption Tests A full interruption test shuts down primary systems and shifts operations entirely to the recovery environment. It is the most comprehensive validation available and the only one that proves your failover genuinely works end to end. It also carries real operational risk, so it demands thorough planning and is usually reserved for mature programs and the most critical systems. Parallel Testing Parallel testing activates recovery systems alongside production without taking the primary offline, then compares the two to confirm the recovery environment performs as expected. It delivers much of the assurance of a full interruption test while sparing the business the disruption. For most SaaS and cloud-hosted services, parallel testing of failover and restore is the sweet spot between confidence and risk. How to Test Your Business Continuity Plan for SOC 2 Compliance The sequence below aligns with the contingency planning process in NIST’s Contingency Planning Guide, SP 800-34, which auditors widely treat as authoritative for resilience practices. Each step produces an artifact, and the artifacts together form

A SOC 2 auditor will not ask whether you have an incident reporting policy. They will ask you to pull a specific incident from the last twelve months and walk them through it: when it was detected, who classified it, when it was escalated, who was notified, and how it was closed. The policy is the easy part. The part that fails audits is the gap between what the document says and what the timestamps actually show. Incident reporting sits at the center of the SOC 2 System Operations criteria, and it is one of the most frequently exception-flagged areas in Type 2 reports. The reason is consistent: teams treat reporting as paperwork generated after the fire is out, rather than as a controlled process that produces evidence at every step. This guide breaks down how to build a reporting process that an auditor can test, sample, and sign off on without a finding. What Is the Incident Reporting Process in SOC 2? The incident reporting process is the documented, repeatable sequence your organization follows from the moment a security event is detected to the moment the incident is formally closed and archived. It governs how events are logged, classified, escalated, communicated, and recorded. Reporting is not a single notification email. It is the connective tissue that links detection, response, and post-incident review into an auditable chain. How SOC 2 Defines a Security Incident SOC 2 does not hand you a rigid statutory definition. It works through the AICPA’s Trust Services Criteria, which frame an incident around a failure, or potential failure, of the system to meet the organization’s service commitments and security objectives. In practice, a security incident is any event that compromises, or could compromise, the confidentiality, integrity, or availability of systems or data. The criteria expect you to define this threshold yourself and apply it consistently, which is precisely what auditors test against. What Qualifies as a Reportable Security Incident Under SOC 2? An event becomes reportable when it crosses the threshold your own policy sets. The distinction matters. A blocked phishing email is a security event. A user who clicked the link and entered credentials is a reportable incident. SOC 2 rewards organizations that draw this line explicitly, because a clear definition is what makes consistent triage possible. Vague language like “significant events will be reported” invites the auditor to ask who decides what counts as significant, and on what basis. Examples of Security Incidents Relevant to SOC 2 Common reportable incidents include unauthorized access to production systems, credential compromise, malware or ransomware infection, data exfiltration or accidental disclosure, denial-of-service events affecting availability, lost or stolen devices holding company data, and misconfigurations that expose data to the public. Vendor and subprocessor breaches that touch your data belong on this list, too, since the criteria extend your responsibility into the supply chain. How Incident Severity Levels Are Established and Classified Severity classification drives everything downstream: how fast you respond, who gets pulled in, and which notification clocks start ticking. Most mature programs use a tiered scheme tied to business impact rather than technical noise. The point is not the labels you choose but the fact that the labels map to defined response times and escalation paths, and that the mapping is documented before an incident occurs, not invented during one. Auditors quietly judge your maturity by how few P1s you declare and how consistently you apply the tiers. A program that labels everything critical looks panicked; one that never escalates looks asleep. The strongest signal is a severity matrix with response-time SLAs next to each tier, and ticket history showing the tiers were actually applied as written. SOC 2 Incident Reporting Requirements There is no single “incident reporting requirement” in SOC 2. The obligation is distributed across several Common Criteria, and the auditor assembles a picture from all of them. Understanding which criteria govern reporting tells you exactly what evidence to keep. Which SOC 2 Trust Services Criteria Govern Incident Reporting? Incident reporting lives mainly in the CC7 (System Operations) series. CC7.2 covers monitoring system components to detect anomalies that may signal an incident. CC7.3 requires you to evaluate detected events to determine whether they are incidents and to take action. CC7.4 governs the response itself, including containment, eradication, and communication. CC7.5 addresses recovery and remediation. Communication obligations also reach into CC2.2 and CC2.3, which deal with internal and external information flow, and third-party incidents implicate CC9.2 on vendor risk. These are points of focus, not a checklist, but auditors use them to frame their testing. For a deeper look at how these criteria map to your broader compliance program, see our SOC 2 compliance guide. What Evidence Do Auditors Expect From Your Incident Reporting Process? Auditors want artifacts with time references, not assertions. That means incident tickets showing detection and closure timestamps, severity classifications with the name of who assigned them, escalation records, communication logs, and post-incident review notes. In a Type 2 examination they will trace one real incident end to end. Evidence pulled from a staging environment, or any artifact with no clear date, gets challenged immediately. Who Is Responsible for Reporting Security Incidents? Everyone reports; a defined role decides. SOC 2 expects that all staff know how to raise a suspected incident, and that a named function, often a security lead or incident commander, owns the determination of severity and the decision to escalate. The auditor will look for evidence that this ownership is real: a RACI chart is fine, but ticket history showing the right person actually classified and closed incidents is better. Step-by-Step SOC 2 Incident Reporting Process The following sequence maps cleanly to the lifecycle in NIST’s Computer Security Incident Handling Guide (SP 800-61), which auditors widely recognize as authoritative. NIST withdrew Revision 2 in April 2025 and released Revision 3, which reorganizes the lifecycle around the six functions of the Cybersecurity Framework 2.0. The underlying steps below remain the same; the framing simply shifts toward continuous risk management.