Postmortem di Railway, la creazione di un indice PostgreSQL ha tirato giù tutto:

A routine change to this Postgres database introduced a new column with an index to a table containing approximately 1 billion records. This table is critical in our backend API’s infrastructure, used by nearly all API operations.

The index creation did not use Postgres’ CONCURRENTLY option, causing an exclusive lock on the entire table. During the lock period, all queries against the database were queued behind the index operation. [...] Manual intervention attempts to terminate the index creation failed.

Le misure:

We’re going to introduce several changes to prevent errors of this class from happening again:

  • In CI, we will enforce CONCURRENTLY usage for all index creation operations, blocking non-compliant pull requests before merge.
  • PgBouncer connection pool limits will be adjusted to prevent overwhelming the underlying Postgres instance's capacity.
  • Database user connection limits will be configured to guarantee administrative access during incidents, ensuring maintenance operations remain possible under all conditions.