Il postmortem del nuovo disservizio di Cloudflare, durato 25 minuti: la causa è di nuovo una configurazione distribuita globalmente senza rollout progressivo:

This second change of turning off our WAF testing tool was implemented using our global configuration system. This system does not perform gradual rollouts, but rather propagates changes within seconds to the entire fleet of servers in our network and is under review following the outage we experienced on November 18.

Unfortunately, in our FL1 version of our proxy, under certain circumstances, the second change of turning off our WAF rule testing tool caused an error state that resulted in 500 HTTP error codes to be served from our network.

Almeno stanno lavorando a una soluzione definitiva che non tiri giù tutto con un click:

Before the end of next week we will publish a detailed breakdown of all the resiliency projects underway, including the ones listed above. While that work is underway, we are locking down all changes to our network in order to ensure we have better mitigation and rollback systems before we begin again.