A postmortem of our May 7, 2026 outage

By Coinbase, Jun 1, 2026

On the evening of May 7, 2026, Coinbase experienced severe service disruptions. Trading, deposits, withdrawals, and most of the surfaces our customers depend on were unavailable or degraded for roughly eight hours, with full recovery of all systems taking another twelve.

The duration of the outage is unacceptable to us, and we owe our customers, our partners, and the broader engineering community a detailed explanation. This post explains what happened, why recovery took as long as it did, and the steps we are taking to durably improve our resilience and infrastructure.

What happened?

At 7:20 PM ET, multiple chiller units failed simultaneously in a single data hall inside AWS's us-east-1 region (availability zone use1-az4). The cooling loss triggered a thermal-safety shutdown of the affected racks, taking EC2 instances and EBS volumes in that building offline. The same event affected other major services on the public internet that night, including several of our peers and partners.

This is the kind of failure mode a hyperscale cloud provider is designed to absorb at the zone boundary. The architectural principle we and most AWS customers operate against is straightforward: a single availability zone can fail in its entirety, and a properly designed system should remain available out of the remaining zones.

What our customers experienced

At 7:48 PM ET on May 7, nearly all trading on Coinbase halted. Retail customers were unable to trade or complete related account actions, such as buying, selling, sending, receiving, depositing, and withdrawing, across affected Coinbase products. Institutional customers on Prime experienced broad degradation in order routing as Coinbase Exchange markets became unavailable during the incident.

Recovery was uneven. The matching engine was restored in cancel-only mode at 2:25 AM ET on May 8, allowing customers to begin canceling open orders. Full trading resumed across all books at 3:49 AM ET. Trading on coinbase.com and the mobile app was restored in a degraded state at 5:30 AM ET and fully recovered by 9:53 AM ET. The remaining backlog of event-streaming topics cleared by 2:00 PM ET.

We have notified the relevant regulatory bodies within their required windows and are completing the formal impact assessments on that timeline.

Why recovery took as long as it did

Two failure modes extended a localized provider event into a multi-hour platform outage. Each one was independently recoverable. Together they compounded.

1. Our matching engine was pinned to a single building. The Coinbase Exchange matching engine runs as a Raft-based replicated cluster inside an AWS Cluster Placement Group. We make this choice deliberately. A matching engine that meets the latency and throughput demands of a serious market cannot tolerate inter-zone network hops between voting cluster members. The physics of distributed consensus and the economics of running a fair, liquid order book point to the same answer, which is co-location.

The architectural issue was that we had no automated way to fail over to another availability zone. When AWS terminated EC2 instances inside our placement group at 9:29 PM ET, three of five matching-engine nodes went down. With only two of five voting members left, the cluster fell below the three-member majority that Raft requires, and we lost quorum. Recovery required three things: an emergency code change, shipped during the incident, to remove a startup assumption that all five cluster nodes were resolvable; the creation of a new node group outside the impaired placement group; and a careful sequence to restore a 3-of-5 quorum. This allowed us to reopen markets: first cancel-only, then auction mode, and finally full trading.
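The quorum arithmetic and the startup assumption described above can be sketched in a few lines. This is a minimal illustration, not our matching-engine code: the function names, the host names, and the static_resolver helper are all hypothetical, and the real startup path may look quite different. It only shows why a strict "all peers must resolve" check blocks a restart that a majority check would allow.

```python
import socket
from typing import Callable, Dict, Iterable, List


def quorum_size(voters: int) -> int:
    """Smallest majority of a voting cluster (3 for a 5-node cluster)."""
    return voters // 2 + 1


def has_quorum(voters: int, healthy: int) -> bool:
    """True when enough voters are healthy to form a majority."""
    return healthy >= quorum_size(voters)


def static_resolver(known: Dict[str, str]) -> Callable[[str], str]:
    """Hypothetical stand-in for DNS: unknown hosts fail like a real lookup."""
    def resolve(host: str) -> str:
        if host not in known:
            raise socket.gaierror(f"cannot resolve {host}")
        return known[host]
    return resolve


def resolvable_peers(
    peers: Iterable[str],
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> List[str]:
    """Return the peers whose host names currently resolve."""
    found = []
    for host in peers:
        try:
            resolve(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            continue
        found.append(host)
    return found


def can_start_strict(peers: List[str], resolve: Callable[[str], str]) -> bool:
    """Assumed pre-incident behaviour: every configured peer must resolve."""
    return len(resolvable_peers(peers, resolve)) == len(peers)


def can_start_with_quorum(peers: List[str], resolve: Callable[[str], str]) -> bool:
    """Relaxed check: start as long as a majority of peers resolve."""
    return has_quorum(len(peers), len(resolvable_peers(peers, resolve)))
```

With five configured peers of which only three resolve, can_start_strict returns False while can_start_with_quorum returns True, which is the difference between a cluster that cannot boot and one that can rebuild a 3-of-5 quorum.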

We restored quorum at 12:06 AM ET. We did not reopen markets until 3:49 AM ET. That delay was largely due to a second major issue, this one with AWS's managed Kafka service.

2. AWS's managed Kafka service failed silently. A significant portion of our event-streaming infrastructure runs on MSK, AWS's managed Kafka offering. The architectural promise of a managed Kafka service is that when individual brokers go down, the service automatically reelects partition leaders and continues to serve traffic out of the remaining brokers. The loss of an entire zone should result in reduced capacity, not unavailability.
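As a rough model of the behaviour described above, leader reelection for a single partition can be sketched as picking the first assigned replica that is both in sync and still alive. This is a simplified illustration of the general Kafka approach, not MSK's control-plane code; the function name and broker IDs are hypothetical.

```python
from typing import List, Optional, Set


def elect_leader(replicas: List[int], in_sync: Set[int], alive: Set[int]) -> Optional[int]:
    """Pick a new partition leader.

    replicas: assigned replica list in preference order.
    in_sync:  brokers that were in the in-sync replica set.
    alive:    brokers that are currently reachable.
    Returns the first replica that is both in sync and alive, or None
    when no eligible replica is left (the partition goes offline).
    """
    for broker in replicas:
        if broker in in_sync and broker in alive:
            return broker
    return None


# Broker 1 sits in the failed zone; brokers 2 and 3 are in other zones.
new_leader = elect_leader(replicas=[1, 2, 3], in_sync={1, 2, 3}, alive={2, 3})
```

In this example new_leader is 2: the partition keeps serving traffic from a surviving broker, with less capacity but no unavailability.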

That is not what happened, and this extended the outage.

A defect in the AWS MSK control plane prevented automatic partition-leader reelection. Two of our MSK clusters became stuck in a "healing" state, with producers unable to write. The cascading effect blocked our fee service, which blocked quoting, which is why most customers experienced this incident as broken trades and quotes rather than as a Kafka outage. Adjacent systems, including portions of our ledger pipeline, payments, and several data pipelines, were affected the same way. Additionally, one of our Kafka clusters was set up in a 2-AZ configuration that increased the blast radius and recovery time, but the MSK control-plane defect impacted 2-AZ and 3-AZ Kafka clusters similarly.
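To illustrate why a 2-AZ layout widens the blast radius independently of the control-plane defect, the sketch below counts the replicas that survive a zone loss and compares that count with Kafka's min.insync.replicas setting. The replication factor of 3, the min.insync.replicas value of 2, and the zone names are assumptions chosen for illustration; they are not our production settings.

```python
from typing import List


def surviving_replicas(replica_zones: List[str], failed_zone: str) -> int:
    """Count the replicas of one partition that sit outside the failed zone."""
    return sum(1 for zone in replica_zones if zone != failed_zone)


def writable_after_zone_loss(replica_zones: List[str], failed_zone: str, min_isr: int) -> bool:
    """With acks=all, producers can write only while the number of in-sync
    replicas is at least min.insync.replicas (min_isr here)."""
    return surviving_replicas(replica_zones, failed_zone) >= min_isr


# Assumed layouts for a partition with replication factor 3.
two_az = ["az-a", "az-b", "az-a"]    # one zone necessarily holds two replicas
three_az = ["az-a", "az-b", "az-c"]  # one replica per zone

two_az_ok = writable_after_zone_loss(two_az, "az-a", min_isr=2)
three_az_ok = writable_after_zone_loss(three_az, "az-a", min_isr=2)
```

Under these assumptions two_az_ok is False and three_az_ok is True: losing the zone that holds two of three replicas leaves a single replica, which is below the assumed minimum, while the 3-AZ layout keeps two replicas in sync.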

We worked the recovery in real time with AWS engineering, ultimately performing manual partition reassignments at 3:00 AM ET to migrate topics off the impaired brokers. Priority-zero and priority-one topics were back to full availability by 9:30 AM ET. The remainder cleared by 2:00 PM ET.
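A manual reassignment of this kind is typically expressed as a JSON plan in the format read by Kafka's kafka-reassign-partitions tool. The sketch below builds such a plan by replacing impaired brokers in each affected partition's replica list with healthy ones. It is a hypothetical illustration rather than the procedure we ran with AWS: the topic name, broker IDs, and the plan_reassignment helper are invented for the example.

```python
import json
from itertools import cycle
from typing import Dict, List, Set, Tuple


def plan_reassignment(
    current: Dict[Tuple[str, int], List[int]],
    impaired: Set[int],
    healthy: Set[int],
) -> dict:
    """Build a reassignment plan that moves replicas off impaired brokers.

    current maps (topic, partition) to its replica list. Partitions that
    do not touch an impaired broker are left out of the plan.
    """
    ordered_healthy = sorted(healthy)
    spare = cycle(ordered_healthy)
    partitions = []
    for (topic, partition), replicas in sorted(current.items()):
        if not impaired.intersection(replicas):
            continue
        # Keep healthy replicas first so an in-sync broker stays preferred leader.
        new_replicas = [b for b in replicas if b not in impaired]
        misses = 0
        # Stop when the replica count is restored or no unused healthy broker remains.
        while len(new_replicas) < len(replicas) and misses < len(ordered_healthy):
            candidate = next(spare)
            if candidate in new_replicas:
                misses += 1
            else:
                new_replicas.append(candidate)
                misses = 0
        partitions.append(
            {"topic": topic, "partition": partition, "replicas": new_replicas}
        )
    return {"version": 1, "partitions": partitions}


plan = plan_reassignment(
    current={("fees", 0): [1, 2, 3], ("fees", 1): [4, 5, 6]},
    impaired={1, 2},
    healthy={3, 4, 5, 6},
)
plan_json = json.dumps(plan, indent=2)
```

Here only partition 0 of the hypothetical fees topic appears in the plan, with replicas [3, 4, 5]; partition 1 never touched an impaired broker and is left alone. Rotating through the healthy brokers spreads replacement replicas rather than piling them onto one broker, though a real plan would also weigh rack placement and broker load.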

What we are doing about it

We are committed to improving our platform and providing the best possible experience for our users. The following are commitments we are accountable for, not aspirations.

Matching engine resiliency: We are improving the warm cross-zone standby design for our matching engines on Coinbase Exchange. We are moving to production failover exercises on a regular cadence, with our largest venue exercised during maintenance windows and the remaining venues on a sandbox cadence.

Kafka: We are working with AWS to root-cause the control-plane defect and confirm their remediation plan. We are developing tooling, runbooks, and tests to handle the failure modes seen during this incident, and to provide Kafka controls that are not available to us through MSK. Equally important, we are migrating the 2-AZ Kafka cluster to a 3-AZ deployment.

Closing

We want to extend our gratitude to the partners who helped us resolve this issue. A special thank you to the AWS engineers who worked through the night alongside us, and to the Coinbase engineers who managed this incident from the initial alert through to full recovery.

Ultimately, Coinbase has an obligation to be available for our customers, and in this instance, we fell short. Our standard does not depend on whether AWS experiences an outage; it rests entirely on whether we have built systems resilient enough to withstand one.

Our customers deserve a seamless, uninterrupted experience every time they use Coinbase, and we are fully committed to delivering that.