
Breaking: Cloudflare outage, November 18, 2025

Pavels Gurskis
November 18, 2025 · 9 min read

Cloudflare Status Page

Cloudflare Incident Brief: The Control Plane Spike (Nov 18, 2025)

Disclaimer: This is a consolidated brief of the Cloudflare global network incident on November 18, 2025, synthesized from the status updates and public statements, focusing on the technical narrative.

The Cloudflare global network suffered a major internal service degradation starting around 11:48 UTC, impacting customer-facing services and causing widespread 500 Internal Server Errors across the internet. The incident was officially declared resolved at 14:42 UTC, though full service stabilization continued until past 15:40 UTC.

1. The Definitive Root Cause: Data Size Limit

The core cause of the outage was a single, non-malicious failure within the Control Plane (the centralized management brain):

  • The Culprit: A configuration file, automatically generated by Cloudflare’s internal systems to manage threat traffic (bot mitigation), grew beyond the size limits expected by the processing software.
  • The Cascade: When the oversized file was deployed, the software designed to handle that configuration crashed globally, leading to the collapse of services relying on it. This explains the initial confusion over an “unusual traffic spike,” which was likely the cascade of services crashing and aggressively retrying API calls.
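The failure mode described above can be sketched in a few lines. This is a hypothetical illustration, not Cloudflare's actual code: the limit, field names, and error type are all invented to show how an auto-generated config outgrowing a hardcoded threshold turns into a hard crash in the consuming software.

```python
# Hypothetical sketch of the failure mode: a consumer that enforces a
# hardcoded limit on entries in an auto-generated bot-mitigation config
# and hard-fails when the file outgrows it. All names/limits invented.
import json

MAX_FEATURES = 200  # invented threshold; the real limit is not public

class ConfigTooLargeError(RuntimeError):
    pass

def load_bot_config(raw: str) -> dict:
    """Parse the generated config; raise if it exceeds the size limit."""
    config = json.loads(raw)
    if len(config.get("features", [])) > MAX_FEATURES:
        # In the incident this manifested as a process crash rather than
        # a handled error, taking dependent services down with it.
        raise ConfigTooLargeError(
            f"{len(config['features'])} features exceeds limit {MAX_FEATURES}"
        )
    return config

# A benign growth event pushes the generated file past the threshold:
oversized = json.dumps({"features": [{"id": i} for i in range(201)]})
try:
    load_bot_config(oversized)
except ConfigTooLargeError as e:
    print("crashed:", e)
```

The point of the sketch is that the crash is deterministic: every node that receives the oversized file fails the same way, which is exactly what makes a global, simultaneous outage look like a traffic spike.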

2. The Blast Radius: Configuration vs. Traffic

The incident highlighted a critical vulnerability in the Control Plane’s capacity to handle its own configuration data:

  • Control Plane Failure: The crash immediately rendered the Cloudflare Dashboard and API unusable, as they rely directly on the configuration service.
  • Data Plane Confusion: The core Application Services (like WARP, Access, and the WAF) began returning widespread 500 errors because the centralized “brain” could not supply them with the rules needed to process traffic. This eventually forced the targeted disabling of WARP access in London at 13:04 UTC to isolate the problem.
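One defensive pattern that limits this blast radius is for the data plane to keep serving traffic with the last known-good ruleset when a control-plane push fails validation, rather than erroring out. The sketch below is illustrative only; the class, validation rule, and request model are invented, not Cloudflare's internals:

```python
# Sketch of a "last known-good" fallback: the data plane adopts a new
# ruleset only if it passes validation, and otherwise keeps serving
# traffic with the previous (stale but working) rules.
class RuleEngine:
    def __init__(self, initial_rules):
        self._active = initial_rules          # last known-good rules

    def apply_update(self, candidate_rules, max_rules=100):
        """Adopt new rules only if they validate; else keep the old ones."""
        if candidate_rules and len(candidate_rules) <= max_rules:
            self._active = candidate_rules
            return True
        return False  # bad push rejected: serve traffic with stale rules

    def evaluate(self, request):
        # Even with a stale ruleset, traffic keeps flowing (no 500s).
        return any(rule(request) for rule in self._active)

engine = RuleEngine([lambda req: "bot" in req])
ok = engine.apply_update([])                  # empty/broken push rejected
print(ok, engine.evaluate("bot-scanner"))     # → False True
```

The trade-off is serving slightly stale security rules for a while, which is usually far preferable to failing closed with 500s across the network.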

3. The Resolution and Residual Chaos

Resolution began at 13:09 UTC with the identification of the failing configuration file. The fix involved mitigating the data size issue and deploying a corrected config, leading to:

  • Phased Recovery: Access and WARP recovered quickly (13:13 UTC) once the core crashing issue was resolved. The Dashboard lagged significantly, requiring a separate fix and taking until 14:34 UTC to restore initial functionality.
  • Post-Deployment Mitigation: The team continued working until 15:40 UTC to address “several issues that remain post-deployment.” This cleanup phase is typical after a configuration crash, involving flushing stale caches and ensuring the newly deployed config is correctly registered across every node in the global network.
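The cleanup phase described above can be sketched as a loop over nodes: flush each node's stale cache, push the corrected config, and report any node that still fails to register it. The node structure, fields, and failure cause here are invented for illustration:

```python
# Illustrative post-deployment cleanup: flush stale caches, register the
# corrected config on every node, and surface nodes that need follow-up
# (e.g. temporarily unreachable ones). All fields are invented.
def redeploy(nodes, new_config):
    still_broken = []
    for node in nodes:
        node["cache"].clear()                  # flush stale entries
        if node["reachable"]:
            node["config"] = new_config        # register corrected config
        if node["config"].get("version") != new_config["version"]:
            still_broken.append(node["name"])  # lingering post-deploy issue
    return still_broken

nodes = [
    {"name": "lhr", "reachable": True,  "cache": {"old": 1}, "config": {}},
    {"name": "fra", "reachable": False, "cache": {"old": 1}, "config": {}},
]
print(redeploy(nodes, {"version": "fixed-1"}))  # → ['fra']
```

This is why recovery is phased rather than instant: the fix itself lands quickly, but verifying it took hold on every node in a global network, and remediating the stragglers, is what fills the window between "fix deployed" and "all clear."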

This incident strongly mirrors the theme of the 1.1.1.1 outage: a latent error in configuration data integrity (an oversized file in this case, a misplaced prefix in that one) that an automated deployment system, lacking robust validation checks, allowed to propagate globally.

Incident Timeline

2025-11-18 15:40 UTC update

Update - The team is continuing to focus on restoring service post-fix. We are mitigating several issues that remain post-deployment.

Explanation: The main crisis is over, but we’re now playing whack-a-mole with residual errors. We’ve deployed the big fix (the one that stopped the “unusual traffic spike”), but when a massive system fails and is brought back quickly, things don’t go back to 100% instantly. We are finding several minor, lingering issues (the “post-deployment” problems) that we need to actively address and clear up to ensure a complete, smooth recovery.

2025-11-18 15:23 UTC update

Update - We are continuing to monitor for any further issues.

Explanation: We’re done fixing it, but we’re still watching. All the major changes and fixes have been deployed and we believe the entire system is stable again. We are keeping a close eye on our metrics (like error rates and traffic flow) to catch any potential aftershocks and ensure a full and lasting recovery.

2025-11-18 14:57 UTC update

Update - Some customers may be still experiencing issues logging into or using the Cloudflare dashboard. We are working on a fix to resolve this, and continuing to monitor for any further issues.

Explanation: We’re almost done, but our admin tools are still wobbly. We know the big websites are mostly back online, but a few customers are still having trouble logging into their Cloudflare accounts or using their management tools (the dashboard). This suggests a specific internal service that powers the login or configuration settings is still struggling, and we’re working on that last piece of the puzzle.

2025-11-18 14:42 UTC update

Monitoring - A fix has been implemented and we believe the incident is now resolved. We are continuing to monitor for errors to ensure all services are back to normal.

Explanation: We are calling this one fixed. We are confident that the final solution has been deployed and the system is back to normal operations. We’ll stick around to watch the metrics for a while, just to be certain everything stays stable and no hidden problems pop up.

2025-11-18 14:34 UTC update

Update - We’ve deployed a change which has restored dashboard services. We are still working to remediate broad application services impact

Explanation: We got the admin tools working, but the core services are still wobbly. We successfully fixed the part of the system that lets our customers log in and manage their settings (the dashboard). This is great, but the underlying system that actually processes and secures website traffic (broad application services) is still struggling. This suggests the fix was targeted at the management layer, which then needs time to feed into the traffic processing layer.

2025-11-18 14:22 UTC update

Update - We are continuing to work on a fix for this issue.

Explanation: Still rolling out the fix. The remedy identified at 13:09 has already brought Access and WARP back, but the dashboard and the broader application services (like advanced security and web application firewalls) are still having problems while the fix propagates.

2025-11-18 13:58 UTC update

Update - We are continuing working on restoring service for application services customers.

Explanation: Past the worst, but not done. Access and WARP are back, and the focus has shifted to fully restoring the remaining application services for business customers, with the dashboard still to follow.

2025-11-18 13:13 UTC update

Update - We have made changes that have allowed Cloudflare Access and WARP to recover. Error levels for Access and WARP users have returned to pre-incident rates. We have re-enabled WARP access in London. We are continuing to work towards restoring other services.

Explanation: The specific fix worked for some services. Two of our key security/access services (Cloudflare Access and WARP) are now back to normal, and we were able to turn WARP back on in London. We are now working on restoring all the other services that were affected.

2025-11-18 13:09 UTC update

Identified - The issue has been identified and a fix is being implemented.

Explanation: We found the smoking gun! We know exactly what caused the problem and our engineers are now putting the solution in place. This is a crucial pivot point, signaling the immediate crisis is over and the cleanup has begun.

2025-11-18 13:04 UTC update

Update - During our attempts to remediate, we have disabled WARP access in London. Users in London trying to access the Internet via WARP will see a failure to connect.

Explanation: We had to do an emergency shut-off in one area. To stop the problem from spreading or to help fix it, we temporarily turned off one of our specific consumer services (WARP, which is a mobile security app) only for users connected through London. This was a targeted measure to control the situation.

2025-11-18 12:53 UTC update

Update - We are continuing to investigate this issue.

Explanation: We’re still searching. This is the technical equivalent of saying, “We haven’t found the root of the problem yet, but we have a lot of engineers looking at it.”

2025-11-18 12:21 UTC update

Update - We are seeing services recover, but customers may continue to observe higher-than-normal error rates as we continue remediation efforts.

Explanation: Things are starting to stabilize on their own or because of changes we made. We are seeing signs of life, but not everyone is fully better yet. Websites and apps may still show error messages more often than usual while we try to clean up the underlying issue.

2025-11-18 12:03 UTC update

Update - We are continuing to investigate this issue.

Explanation: We’re still searching. This is the technical equivalent of saying, “We haven’t found the root of the problem yet, but we have a lot of engineers looking at it.”

2025-11-18 11:48 UTC incident started

Investigating - Cloudflare is experiencing an internal service degradation. Some services may be intermittently impacted. We are focused on restoring service.

Explanation: Something is definitely broken inside our system. We know our services are running slowly or failing sometimes, which is causing problems for many websites and apps that use us. Our top priority is getting everything back up and running, and we’re hunting for the exact cause right now.

Post-Mortem Speculation: The Magnificent Fragility of Data

This incident is an absolute gold mine for infrastructure engineers. Forget blame; this is a pure-engineering case study demonstrating the magnificent fragility hidden within complex, distributed systems.

The fact that the root cause was a configuration file that “grew beyond an expected size” doesn’t point to a simple operational slip; it points to a profound and fascinating latent bug in the way the Control Plane manages its own foundational data.

Our curiosity, therefore, shifts from “what happened?” to the much deeper “how was the systemic failure encoded into a single file?”:

  1. The Recursive Loop of Complexity: What was the nature of the threat traffic that triggered the config file’s growth? Was it a benign event that simply hit a poorly designed growth algorithm, causing it to exponentially bloat the file? We are looking for the exact mechanism where legitimate input (traffic) translates into catastrophic output (oversized configuration data).
  2. The Phantom Capacity Limit: The downstream software crashed because it couldn’t handle the size. This implies a hardcoded or implicit limit (a fixed buffer size, a timeout on a parsing loop, or a sudden spike in memory/CPU usage from a poorly bounded regex engine) that only revealed itself when pushed past its design threshold. The curiosity here is finding the specific line of code or environment variable that defined this invisible wall.
  3. The Meta-Configuration Paradox: Cloudflare’s whole purpose is to validate and control traffic for everyone else. Yet, their own failure came from an invalid config being deployed internally. The true engineering puzzle is how to build a self-validating deployment mechanism for the very rules that govern the entire network. This will involve details on the pre-deployment testing harnesses—the silent heroes that failed to catch the oversized file before it was globally deployed.

The upcoming official post-mortem will serve as an essential artifact for the entire industry. It’s not just about fixing Cloudflare; it’s about mapping the unexpected boundaries of state, configuration, and scale that all modern cloud providers unknowingly operate within. Until then, the “oversized config file” remains a beautiful, terrifying symbol of how a single byte too many can take down a fifth of the internet.
