Bloom filter in the wild!

Bloom filters, a data structure for determining efficiently but probabilistically whether an element is present in or absent from a collection, initially seemed confined to university algorithms courses, but at Internet scale they have started to show up in real implementations.

A recent example, from How we made global routing faster with Bloom filters (Vercel):

When you make a request to a Vercel deployment, our routing service first checks whether the requested path exists before attempting to serve it. [...] We do this by generating a JSON file on build that contains a tree of every path in your project's build outputs, including static assets, pages, API routes, webpack chunks, and Next.js route segments.

In some cases this JSON file had become a bottleneck, especially because the routing service is single-threaded (Node.js, I presume):

These sites can create 1.5+ megabyte lookup files that take dramatically longer to parse. At the 99th percentile, parsing this JSON file takes about 100 milliseconds; at the 99.9th percentile, it takes about 250 milliseconds.

This is where Bloom filters come in:

A Bloom filter is a probabilistic data structure that can be used to test whether an element, or key, is a member of a set. [...] Bloom filters can return false positives, but never false negatives. For path lookups, this property is valuable. If the Bloom filter says a path does not exist, we can safely return a 404; if it says a path might exist, we fall back to checking the build outputs.
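To make the "never a false negative" property concrete, here is a minimal sketch in Python. It is not Vercel's implementation: the class `BloomFilter`, the helper `route`, the SHA-256 double-hashing scheme, and the 1% target false-positive rate are all assumptions chosen for illustration.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter: k hash positions over a bit array of m bits."""

    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        # Standard sizing formulas: m = -n ln p / (ln 2)^2 and k = (m / n) ln 2.
        n = max(1, expected_items)
        self.m = max(8, math.ceil(-n * math.log(false_positive_rate) / (math.log(2) ** 2)))
        self.k = max(1, round(self.m / n * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key: str):
        # Double hashing: derive k positions from two 64-bit slices of one digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd, so the step is never zero
        for i in range(self.k):
            yield (h1 + i * h2) % self.m

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


def route(path: str, bloom: BloomFilter, build_outputs: set) -> int:
    """Hypothetical router: answer 404 from the filter alone when possible."""
    if not bloom.might_contain(path):
        return 404  # safe: a Bloom filter never reports a false negative
    # Possible false positive: fall back to the authoritative (slower) lookup.
    return 200 if path in build_outputs else 404


# Example usage with made-up paths.
build_outputs = {"/index.html", "/api/hello", "/static/app.js"}
bloom = BloomFilter(expected_items=len(build_outputs))
for p in build_outputs:
    bloom.add(p)

print(route("/index.html", bloom, build_outputs))
print(route("/does-not-exist", bloom, build_outputs))
```

The filter only ever short-circuits the negative case. A positive answer still goes through the exact lookup, so a false positive costs time but never returns a wrong response.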

The result: lookups up to 200 times faster. The details are in the article.

Note di Matteo
