First Sketch, First Cracks
Four minutes. That is all it took for 1.1.1.1 traffic to fall off a cliff on 14 July 2025. Cloudflare’s status page blinked from “Investigating” to “Identified,” Twitter filled with traceroute screenshots, and phones on every help desk lit up. I opened the official post‑mortem expecting a neat plot twist, yet the story read like a thriller with whole scenes missing. So I did what any curious engineer does on a slow evening: I made coffee, grabbed a pad, and started drawing boxes and arrows until the gaps yelled back at me. And before we go further: this is my attempt to reconstruct missing technical details from public crumbs, and my conclusions might be off by a healthy margin.
My first sketch was embarrassingly tidy.
- Prefixes → Services → PoPs. I wrote that across the top, convinced this was Cloudflare’s universal chain of custody. Resolver prefixes live in the “Resolver” service; services choose the data centers (PoPs) that will announce them. Simple, elegant, and fully declarative.
- Below the chain I added a note: “No direct prefix→PoP map exists.” Why maintain a second mapping when the service layer already expresses intent?
That comfortable picture lasted about ten paragraphs into the blog post. Buried under the heading “How our addressing system evolved” was the first crack: legacy tooling still distributes an explicit prefix→PoP list to routers. The strategic controller may generate that list, but the file exists, and it is the file that routers obey.
My marker squeaked across the page. I drew a second lane under the first: Legacy prefix list (hand‑edited in the old days, now machine‑generated).
Suddenly the system was not one chain but two layers: a high‑level graph that produces a low‑level stencil. If that sounds trivial, remember I had just told myself the stencil was gone. Worse, the post hinted that both layers were live at the same time. I circled the point twice:
“The legacy address‑list repository remains authoritative for the physical push until migration is complete.”
With that sentence my third realisation clicked: the difference between old and new is not the data itself but how the data is built. The strategic layer derives the prefix→PoP list indirectly; the legacy layer used to be edited by hand. Same output format, different source of truth.
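Here is how I pictured the two layers at that point - a minimal sketch with made‑up names (service, prefixes, PoP codes), not Cloudflare’s actual schema: a declarative service graph on top, flattened into the explicit prefix→PoP stencil that routers obey.

```python
# Hypothetical sketch of the two layers -- all names are mine, not Cloudflare's schema.
# Top layer: declarative intent. A service owns prefixes and picks the PoPs that announce them.
services = {
    "resolver": {
        "prefixes": ["1.1.1.0/24", "1.0.0.0/24"],
        "pops": ["AMS", "FRA", "SIN"],   # made-up PoP codes
    },
}

def flatten(services):
    """Emit the legacy stencil: an explicit prefix -> PoP list, the file routers actually obey."""
    legacy = {}
    for svc in services.values():
        for prefix in svc["prefixes"]:
            legacy.setdefault(prefix, []).extend(svc["pops"])
    return legacy

# Same output format whether a human typed it years ago or a compiler derived it today.
print(flatten(services))
# {'1.1.1.0/24': ['AMS', 'FRA', 'SIN'], '1.0.0.0/24': ['AMS', 'FRA', 'SIN']}
```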
Boxes, arrows, and corrections now covered half the page. My neat chain was rubble, but the investigation finally had momentum. If a generator can overwrite a stencil that thousands of routers trust, one rogue reference could rewrite the Internet in seconds. I marked a fresh question in red: “How often does the generator run, and what triggers it?” That breadcrumb would lead to the next wrong turn.
First lessons, already.
- Assume your first diagram is wrong. Draw it anyway, then hunt for cracks with the stubbornness of a bug in prod.
- Beware of silent layers. A file you thought was retired may still be the last mile to hardware, quietly accepting whatever the new system feeds it.
- Focus on the join. When two representations of the same truth coexist, the join between them is a minefield.
If you have never mapped your own routing or firewall pipeline, pause here. Take one critical object (a public subnet, a VIP, an ACL) and sketch every transform it passes through before hardware acts. Expect the first sketch to crumble, and let the crumbling guide your next questions.
The DLS Detour
Ten minutes into refactoring my sketch I stumbled across three letters that seemed to solve everything: D L S. The post‑mortem described a Data Localization Suite (a name that sounds strategic enough to steer data paths by location or PoP) still waiting for launch. It had its own service object, its own topology, and it appeared in the timeline only twice - once on 6 June, again on 14 July. Perfect culprit. A hidden new platform goes live, grabs the resolver prefixes by mistake, the world breaks. Done. I wrote “DLS = strategic controller” in big letters, circled it twice, and pushed the paper away feeling smug for a whole thirty seconds.
The very next paragraph torpedoed that theory.
Cloudflare said the 14 July change was nothing more dramatic than:
“attaching a test location (PoP) to a non‑production service.”
Attaching one offline PoP is hardly a grand launch. More telling, the report insisted that DLS remained pre‑production. If the suite never flipped to prod, it could not be the all‑powerful controller feeding routers. My shiny synonym link (DLS equals strategic) fell apart.
I rewound. Maybe the strategic controller itself was still the future and 14 July was its first real outing. That would explain the sudden blast radius. If the tooling had just come online, perhaps no one had switched on canary rollouts or diff‑size brakes yet. The version felt plausible, but I had already learned to distrust neat explanations.
Time to hunt for dates. Cloudflare’s blog was coy, so I widened the net. Buried in a late‑night Reddit Q&A was a comment from a username wearing the corporate badge: the strategic topology compiler had been “feeding the legacy address‑list” since late 2024. A ThousandEyes blog mirrored that timeline. Another suspect collapsed.
The puzzle now looked like this:
- The strategic compiler is old news, quietly authoritative for months.
- DLS is just another service object inside its graph, flagged non‑production.
- A single offline PoP addition to that service coincides with the outage.
I crossed out the old headline and wrote a new one: “One graph, many services”. Every service, dev or prod, lives in the same graph, so any change can trigger a full recompile. The controller does not ask whether the service carries live traffic until after it has rebuilt the prefix matrix. That realisation shifted the ground again. The next breadcrumb was obvious - why did the June typo not blow up if every edit recomputed everything? Well, read on!
Second set of lessons:
- Beware seductive synonyms. A new acronym in a post‑mortem is not automatically the new control plane. Verify with timelines, not intuition.
- Chase exact dates. Real infrastructure pivots rarely happen the day an incident occurs. If a tool has been live for months, its failure modes are baked into daily ops.
- One graph means shared fate. When dev and prod objects coexist in the same data model, you gain consistency but also risk cross‑contamination.
Action item: pick one of your own staging‑only objects, trace whether changing it triggers any production pipeline you did not expect. The answer tends to surprise.
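If you would rather script that trace than eyeball it, a crude first pass is to search your production pipeline definitions for any mention of the staging object. The paths and object name below are placeholders for your own repo layout.

```python
# Rough first pass: find production pipelines that mention a staging-only object.
# STAGING_OBJECT and PIPELINE_DIR are placeholders -- point them at your own repo.
from pathlib import Path

STAGING_OBJECT = "dls-test-topology"       # the staging-only object you picked
PIPELINE_DIR = Path("ci/pipelines/prod")   # wherever prod pipeline definitions live

hits = [
    path
    for path in PIPELINE_DIR.rglob("*.y*ml")
    if STAGING_OBJECT in path.read_text(errors="ignore")
]

if hits:
    print(f"'{STAGING_OBJECT}' is referenced by production pipelines:")
    for path in hits:
        print(f"  - {path}")
else:
    print("No direct references found -- but watch for pipelines that glob whole directories.")
```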
One Graph, Many Services
At midnight my notebook looked like a roadmap after three rounds of edits: arrows rerouted, boxes deleted, new layers stapled on. Still, one breadcrumb remained: if the strategic controller had been live since last year, why did a change to a dormant service topple the resolver fleet only in July? I stared at the sentence I had just written - “One graph, many services” - and felt the penny drop. What if every service, whether shipping to customers or living behind a feature flag, really did sit inside the same compile step, sharing the same gravity?
I reread the blog’s configuration section with fresh eyes. Words that had felt like fluff earlier now glowed like debug logs:
“All service topologies reside in a single source‑of‑truth database. The compiler evaluates the entire graph on any change, then emits a prefix to PoP table and an advertise or suppress flag.”
I grabbed my pen:
No PoP = Empty change scope
That single paragraph flipped my last assumption on its head. I had believed non‑production objects were ignored until promoted. Reality: the compiler does a blanket recompute every time someone nudges any node. The non‑prod flag matters only after the prefixes are already mapped to PoPs. It decides whether routers should advertise the resulting rows, but it does not stop those rows from existing in the output. Lack of PoPs does!
I added two arrows to the drawing:
Service edit → Compile everything → Generate full matrix
Next, I hunted for precedence rules. Reddit yielded a clue: “newer row wins unless priority is set.” That implied a duplicate prefix in two services would not raise an error; the later edit would silently override. Our dormant DLS typo, with resolver prefixes copied inside, suddenly looked like a powder keg waiting for its first PoP.
Timeline in hand, the chain now read:
- Late 2024: strategic compiler starts feeding legacy list.
- 6 Jun 2025: resolver prefixes mistakenly referenced inside DLS. Compiler runs, produces duplicate rows. DLS has zero PoPs, so its rows are marked suppress; diff is empty, routers stay calm.
- 14 Jul 2025: engineer adds one offline PoP to DLS. Compiler runs again, duplicate rows now marked advertise, they override the genuine Resolver rows. Legacy diff explodes, routers withdraw prefixes worldwide.
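To convince myself that chain really produces a silent June and an explosive July, I threw together a toy version of the compile step. This is a sketch of my reconstruction, not Cloudflare’s code: every name is invented, and the rules (full recompute on any edit, rows from a PoP‑less service never reach the shipped table, newer row wins on duplicate prefixes) come from my reading of the post‑mortem and the Reddit hint.

```python
# Toy reconstruction of the compile step as I read it -- every name here is invented.
# Rules modelled: full recompute on any edit, rows from a PoP-less service are suppressed
# (they never reach the shipped table), and a newer row silently wins on duplicate prefixes.

def compile_graph(services):
    """Return the shipped table: prefix -> set of PoPs that should announce it."""
    table = {}
    for svc in services:              # iteration order = edit order, so the newest row wins
        if not svc["pops"]:           # zero PoPs -> rows suppressed -> empty change scope
            continue
        for prefix in svc["prefixes"]:
            table[prefix] = set(svc["pops"])
    return table

def diff(old, new):
    """Per-PoP announce/withdraw actions the legacy pusher would ship."""
    actions = []
    for prefix in old.keys() | new.keys():
        actions += [("withdraw", prefix, pop) for pop in old.get(prefix, set()) - new.get(prefix, set())]
        actions += [("announce", prefix, pop) for pop in new.get(prefix, set()) - old.get(prefix, set())]
    return actions

resolver = {"name": "resolver", "prefixes": ["1.1.1.0/24"],
            "pops": [f"POP-{i:02d}" for i in range(1, 11)]}
dls_jun  = {"name": "dls", "prefixes": ["1.1.1.0/24"], "pops": []}          # 6 Jun: typo, zero PoPs
dls_jul  = {"name": "dls", "prefixes": ["1.1.1.0/24"], "pops": ["LAB-01"]}  # 14 Jul: one offline PoP

baseline = compile_graph([resolver])
print(len(diff(baseline, compile_graph([resolver, dls_jun]))))  # 0  -> routers stay calm
print(len(diff(baseline, compile_graph([resolver, dls_jul]))))  # 11 -> 10 withdrawals + 1 announce
```

Run it and the June edit produces a zero‑length diff, while the July edit withdraws the prefix from every production PoP and announces it only at the lab - exactly the shape of the outage.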
The elegance of a single graph had morphed into a shared fate problem. A change that looked safe - add a test location to a lab service - had full authority to rewrite global routing because the compiler could not distinguish critical from experimental once duplicates existed.
Pinned in the margin: “Next puzzle: why did the huge diff ship with no guardrail?” That question would steer the investigation toward deployment mechanics and my own experience with diff‑size brakes.
IaC is a double‑edged sword
While consistency improves, the blast radius widens because every node shares the same policy. Ever since early tools like CFEngine, building a robust ‘one‑push’ rollback workflow has been treated as best practice - and not without reason. No surprise, then, that Cloudflare had those controls in place and was able to recover 77% of its traffic promptly.
Exercise: run a simple query against your own config store and list any object that references the same IP block in more than one service. If the list is non‑empty, you might be walking around with a timebomb too.
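Here is a minimal version of that query, assuming each service is described by a YAML file with a top‑level `prefixes` list - adjust the path and key names to whatever your config store actually looks like.

```python
# Sketch: flag IP blocks referenced by more than one service.
# Assumes one YAML file per service with a top-level "prefixes" list -- adapt to your schema.
from collections import defaultdict
from pathlib import Path

import yaml  # pip install pyyaml

owners = defaultdict(set)
for path in Path("services").glob("*.yaml"):
    doc = yaml.safe_load(path.read_text()) or {}
    for prefix in doc.get("prefixes", []):
        owners[prefix].add(path.stem)

for prefix, services in sorted(owners.items()):
    if len(services) > 1:
        print(f"TIMEBOMB? {prefix} is referenced by: {', '.join(sorted(services))}")
```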
When the Diff Hit the Fan
At this point the timeline was crystal clear, except for one nagging detail. The dormant typo slept for five weeks, the compiler ran countless times, yet nothing bad happened until a single offline PoP joined the party. Why? I scrolled back to the log lines Cloudflare shared and noticed the word “diff.” On 6 June the difference between old and new router tables was zero, so the legacy deploy engine sent nothing. On 14 July the diff ballooned, and that same engine shipped it everywhere in one gulp. My monitoring‑forged brain immediately rang the alarm - a giant diff is always a bad sign.
At my last job we enforced a hard gate: if more than 5% of the firewall config changed, the CI pipeline halted and waited for a senior review. The rule saved us once, when an intern almost wiped out corporate access to the internet.
From what I’ve researched, it seems that Cloudflare’s pipeline had no such threshold. The strategic compiler produced an enormous diff after the DLS edit, thousands of prefix‑to‑PoP rows flipped from advertise to suppress, and the legacy pusher obeyed blindly. Four minutes later resolver traffic flat‑lined. In my notebook I wrote:
“Zero PoPs in June → suppress rows, diff size zero. One PoP in July → advertise rows, diff huge, deployed globally.”
The beauty and peril of automated pipelines is that they treat every change as intentional. The legacy engine had done its job perfectly for years, so no one thought to add a brake. That gap turned an ordinary lab tweak into a global withdrawal.
Here the investigation became personal.
I dug out the YAML snippet we had used for our diff gate:
```yaml
diff_threshold: 0.05   # 5 percent of total lines
on_exceed:
  - notify: "network-architect"
  - require_approval: true
```
That simple rule could delay some big roll-outs, but it forced human eyes on any sweeping change. Would it have saved 1.1.1.1? Possibly. It would certainly have slowed the blast radius while people asked why Resolver prefixes were leaving every production PoP.
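If your CI system cannot express that policy natively, the gate itself fits in a short script. Below is a generic sketch of the idea rather than the exact tool we ran; the file paths are illustrative.

```python
# Generic diff-size gate: compare the newly compiled config against what is live,
# and refuse to ship without sign-off when the change ratio exceeds the threshold.
import difflib
import sys
from pathlib import Path

THRESHOLD = 0.05  # mirrors diff_threshold in the YAML policy above

def change_ratio(old: str, new: str) -> float:
    """Fraction of lines added or removed relative to the current config."""
    old_lines, new_lines = old.splitlines(), new.splitlines()
    changed = sum(
        1
        for line in difflib.unified_diff(old_lines, new_lines, lineterm="")
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    )
    return changed / max(len(old_lines), 1)

ratio = change_ratio(
    Path("current/prefix-map.txt").read_text(),    # what routers obey today (path is illustrative)
    Path("candidate/prefix-map.txt").read_text(),  # what the compiler wants to ship
)

if ratio > THRESHOLD:
    print(f"Diff touches {ratio:.0%} of the config -- paging network-architect for approval.")
    sys.exit(1)  # non-zero exit halts the pipeline until a human signs off
print(f"Diff is {ratio:.0%}, within the {THRESHOLD:.0%} gate -- shipping.")
```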
Practical takeaway:
- Add a diff‑size gate tomorrow. Even a crude percent threshold catches many foot‑guns and buys review time.
- Instrument the gate with context. Tag critical objects so the pipeline can flag specific changes, not just raw line count.
- Remember nuance. Large diffs sometimes reflect planned work; build an override path that documents intent rather than bypasses the rule.
If you want to try my old policy, start with 5%, page one senior engineer, and log every override reason for a month. The data will tell you whether to tighten or relax. Better a false page than four minutes of global silence.
From Puzzle to Guardrails
By dawn the timeline felt complete. What struck me most was how ordinary the trigger looked: an engineer linked a test PoP to a pre‑production service, exactly the sort of BAU step every big network does a dozen times a day. In most shops the move would pass unnoticed, because lab services usually hold only lab prefixes. Yet one stray reference and a silent precedence rule turned that everyday gesture into a global outage. The question hanging over my desk was no longer “How did this happen?” but “How many different ways can we trip the same fuse, and how do we padlock each one?”
Like any well‑written post‑mortem, Cloudflare’s blog post ends with “next steps”.
What Cloudflare has done / is doing:
- Rolled back the faulty topology within minutes, restoring BGP announcements.
- Accelerated retirement of the legacy push path in favor of a staged, health‑gated deploy pipeline.
- Building stronger compile‑time validation so critical prefixes cannot vanish from multiple PoPs.
- Committing to canary rollouts and automated rollback triggers for every future topology change.
Those are solid moves. Still, wearing my threshold‑loving, monitoring‑scarred hat, I would bolt on two extra padlocks - both cheap, both immediately actionable for almost any team. Personally, I never had a chance to run network operations at Cloudflare’s scale - my biggest deployment topped out at ~40k IP addresses spread over a few hundred subnets. So if your scale is not as big as Cloudflare’s, the list below could be a perfect start:
Quick Fixes:
- Stop prod‑prefix leaks into dev. Tag and enforce: introduce a `production: true` flag on critical objects, and fail CI if an object’s tag does not match its environment (i.e. `prod` -> `dev`). A sketch of the check follows below.
- Blind diff‑size threshold (quick fix everyone should steal). It costs five lines of YAML to refuse any roll-out that rewrites, say, more than 5% of the config without human sign‑off. Perfect? No. But the next time a single lab tweak wants to suppress thousands of rows, you wait for approval instead of a blackout.
With those two guardrails, a rogue reference must clear at least a couple of sanity checks. Defence‑in‑depth by inches, not giant rewrites.
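For the first padlock, the CI check itself is almost trivial. Here is a sketch; the `production` and `environment` fields are placeholders for whatever your own schema calls them, and in real life you would load the objects from your config store instead of hard‑coding them.

```python
# CI sketch: fail the build if a production-tagged object lives outside the prod environment.
# "production" and "environment" are placeholder field names -- map them to your own schema.
import sys

objects = [  # in real life, load these from your config store
    {"name": "resolver-prefixes", "production": True,  "environment": "prod"},
    {"name": "dls-test-prefixes", "production": True,  "environment": "dev"},   # the leak
    {"name": "lab-prefixes",      "production": False, "environment": "dev"},
]

violations = [
    obj["name"]
    for obj in objects
    if obj["production"] and obj["environment"] != "prod"
]

if violations:
    print("Production-tagged objects found outside prod:", ", ".join(violations))
    sys.exit(1)  # fail CI, force a human to look at the leak
print("No prod-prefix leaks detected.")
```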
Bullet‑Proof Your Network:
- Week 1: Add `production` tags to your objects; fail CI on environment mismatch.
- Week 2: Copy‑paste the diff‑size gate I’ve listed above. Start permissive (10%), tighten as you learn.
- Week 3: Wire a single‑device canary on your smallest site; abort if latency or reachability blips (a rough sketch follows this list).
- Week 4: Review overrides and false positives, tweak thresholds, document lessons.
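The week‑3 canary does not need to be fancy. Here is a rough sketch that probes one canary device after a push and aborts on a reachability or latency blip; the target address, probe count and thresholds are all illustrative.

```python
# Week-3 sketch: probe a single canary device after pushing a change; abort the rollout
# if reachability or latency degrades. Target, probe count and thresholds are illustrative.
import statistics
import subprocess
import sys

CANARY = "192.0.2.1"      # the one device at your smallest site
PROBES = 10
MAX_LOSS = 0.2            # abort if more than 20% of probes fail
MAX_LATENCY_MS = 50.0     # abort if median latency exceeds this

latencies = []
for _ in range(PROBES):
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", CANARY],
        capture_output=True, text=True,
    )
    if result.returncode == 0 and "time=" in result.stdout:
        latencies.append(float(result.stdout.split("time=")[1].split()[0]))

loss = 1 - len(latencies) / PROBES
median = statistics.median(latencies) if latencies else float("inf")
if loss > MAX_LOSS or median > MAX_LATENCY_MS:
    print(f"Canary unhappy (loss={loss:.0%}, median={median:.1f} ms) -- aborting rollout.")
    sys.exit(1)
print(f"Canary healthy (loss={loss:.0%}, median={median:.1f} ms) -- continue the rollout.")
```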
Tiny padlocks, layered together, turn a global blast radius into a harmless lab blip - and cost less than a single post‑mortem meeting.