Root Cause Analysis: PostgreSQL MultiXact member exhaustion incidents (May 2025)

May 20, 2025
Cosmo Wolfe
Head of Technology

As of May 19, 2025, 00:29 UTC, the incident has been resolved for all impacted customers.

Executive summary

Between May 10, 2025 and May 17, 2025, Metronome experienced four distinct outage events, each lasting over an hour, that severely impacted our ability to process write operations across our platform (notably, our API and UI). During these outages, our clients were unable to update pricing and packaging or save changes through the UI. Importantly, event processing and alert notifications remained operational throughout all incidents. This was one of the most significant periods of API downtime in Metronome's history, and our incident response team worked throughout the duration of the incident to understand and then mitigate the issues.

The root cause was identified as PostgreSQL's protective mechanism against MultiXact member space exhaustion. As part of a planned database architecture improvement, we have been migrating from a monolithic table representing customer invoices to a more scalable partitioned structure to spread the data and enable parallel autovacuuming. Under sustained write pressure from this migration's associated backfill processes and other normal run-rate processes, we crossed a previously unknown and difficult to monitor global limit on MultiXact members, triggering emergency vacuums. This led to PostgreSQL's defensive lockout of write operations until these vacuums could complete—a recovery process that takes several hours for our larger tables.

The situation was complicated by multiple factors:

  1. Our metrics indicated we were far below the configured MultiXact ID threshold (less than 50% utilization), suggesting we had ample capacity remaining. However, PostgreSQL was actually hitting a different, related limit - the MultiXact member space capacity. This distinction is significant because PostgreSQL does not expose metrics for member space consumption, making it extremely difficult to detect preemptively. Even with our comprehensive database monitoring, this separate limit was not visible until we began experiencing the actual failures. This is a subtle aspect of PostgreSQL internals that is not widely documented.
  2. Scheduled mid-month invoice creation processes ("bookkeeper") began running concurrently with an ongoing backfill, creating additional database write volume that compounded the issue.
  3. When attempting to pause the backfill, we initially stopped the job that generated backfill tasks but critically failed to pause the task queue consumers. This meant that while no new backfill tasks were being created, a large backlog of previously queued tasks continued to be processed by workers, maintaining high database write volume.
  4. Our API alarm configuration required multiple consecutive failing data points before triggering, resulting in delayed alerts for write operation outages. While this mainly affected our awareness timing rather than resolution in most cases (as PostgreSQL vacuums needed to complete regardless), it did reduce our ability to proactively communicate with customers during outages.
  5. Initial recovery attempts were hindered by our incomplete understanding of the root cause, leading to extended downtime as we worked to identify and address the actual issue.

Impact

  • During the impacted periods, all API endpoints and UI operations that perform write operations experienced failures or inconsistent behavior. This specifically affected:
    • Customer creation and management
    • Pricing and packaging updates
    • Configuration changes
    • Any UI or API operation that required saving changes
  • API errors created data integrity challenges during outages. Operations would sometimes partially complete before failing with 5xx errors, particularly affecting multi-step client workflows and leaving data in inconsistent states. These partial completions required manual cleanup processes with affected clients to restore consistency.
  • For customers using threshold billing or auto-recharge features, payments from end-customers were successfully collected, but the system couldn't immediately grant the purchased credits due to write operation failures. These credits were eventually delivered once systems recovered, but end-customers experienced delays in receiving their purchased credits.
  • Each individual outage lasted over an hour, with the cumulative impact spanning multiple business days.
  • The repeated nature of the outages (four occurrences within a week) significantly compounded customer frustration and impact.
  • Event processing and alert notifications remained operational throughout all incidents, allowing customers to continue receiving notifications, for example about customers hitting spend thresholds.

Timeline

All timestamps are in UTC.

  • April 7, 2025 15:13: Migration to partition invoice tables begins with dual writes enabled
  • April 11, 2025 16:10: Began backfilling invoices into new partitioned tables
  • May 7, 2025 16:20: Deployed refactored backfill job with higher throughput
  • May 9, 2025 17:00: Primary PostgreSQL cluster conducts routine aggressive vacuums without any downtime.
  • May 10, 2025
    • 00:02: Backfill job begins hitting errors
    • 02:50: MultiXact member space runs out and a large number of tables are enqueued for aggressive vacuum. This marks the beginning of the first occurrence.
    • 03:09: API Alarm indicated that several customer endpoints began failing. Internal incident response team begins investigating API error rates.
    • 03:09 - 04:33: Incident response team attempts tuning and workload changes to allow emergency vacuums to complete more quickly.
    • 04:33: In an attempt to mitigate what is now extremely prolonged downtime, incident response team triggers failover to hot standby, unintentionally restarting vacuum processes.
    • 05:15: Vacuums complete and systems recover. This marks the resolution of the first occurrence.
  • May 11, 2025: Resumed backfill at lower concurrency, which initially appears stable.
  • May 15, 2025 15:00: Our scheduled, mid-month “bookkeeper” process begins to insert future invoice rows. This write load, combined with the still-running migration, begins to consume MultiXact member space faster than it can be reclaimed.
  • May 16, 2025
    • 20:31: Write operations begin failing across API and UI as MultiXact IDs reach critical threshold again. This marks the beginning of the second occurrence.
    • 20:50: API Alarm indicated that several customer endpoints began failing. Incident response team re-assembles.
    • 21:01: Status page is updated.
    • 21:07: Backfill jobs are paused, but the incident team doesn't realize the task queue consumers must also be paused; by this point the backfill jobs have already enqueued a large backlog of tasks.
  • May 17, 2025
    • 00:05: Incident response team identifies and implements an approach to replace in-flight vacuums with data-files-only (non-index) vacuums, which complete much faster.
    • 00:25: Data-only vacuums complete, resolving incident impact, but backfill task consumers continue to work through their large backlog of changes to apply to the database. This marks the resolution of the second occurrence.
    • 08:05: Due to backfill consumers burning through their backlog, we once again exhaust MultiXact member space and writes start to fail. This marks the beginning of the third occurrence.
    • 08:18: API alert triggered, but due to the high volume of ongoing alerts from earlier incident recovery work and the timing (middle of the night after extended firefighting), the alert wasn't immediately actioned, highlighting a process gap in our incident response.
    • 10:35: System recovers on its own without Metronome intervention. This marks the resolution of the third occurrence.
    • 14:20: Write operations fail again despite belief that backfill jobs were paused. This marks the beginning of the fourth occurrence.
    • 15:31: On-call engineer sees a spike of MultiXact member exhaustion errors.
    • 15:51: Incident team re-assembles, status page updated.
    • 16:05: Incident team implements the "non-index, data files only" vacuum strategy.
    • 16:35: Service was restored. Incident response team works to debug what was causing high MultiXact ID consumption, realizing workers were still processing a backlog of database changes. Additionally, the team works to understand why the system began emergency vacuums despite being roughly 50% below the configured threshold. This leads to the realization that PostgreSQL was actually hitting the separate MultiXact member space limit rather than the MultiXact ID threshold. This marks the resolution of the fourth occurrence.
  • May 19, 2025, 00:29: Armed with a significantly better understanding of the underlying constraints and monitoring to match, the incident team re-enables the background processes without issue. The team immediately implemented enhanced monitoring for MultiXact member space and began work on architectural improvements to prevent similar incidents, including adjusting our backfill strategies and tuning database parameters.

Root Cause Analysis

Architectural Context

The incidents we experienced were deeply rooted in the internal mechanisms of PostgreSQL's concurrency control system. Before diving into the specifics, it's important to understand two key contextual elements:

  1. Our database scale: A majority of Metronome's online infrastructure depends on an AWS Aurora PostgreSQL cluster exceeding 30TB in size, running version 13.18. This scale is critical to understanding why recovery operations took hours to complete and why this issue had such significant impact.
  2. Ongoing infrastructure improvement: To manage the challenges posed by this massive database size, for the last several weeks we’ve been conducting a planned migration from a single monolithic table (nearly 10TB) to a structure partitioned across dozens of smaller tables (sketched below). This architectural change was specifically designed to alleviate issues caused by our database scale by enabling faster processing and more efficient database maintenance through parallel autovacuuming. The migration involved dual writes to both the new and old table structures and a large-scale backfill process that, as we'll see, played a central role in these incidents.
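
For illustration, here is a minimal sketch of the kind of partitioned layout this migration targets. The table name, columns, partition key, and partition count are placeholders rather than our actual schema:

-- Hypothetical target layout: one logical table, physically split into
-- dozens of partitions that autovacuum can process independently.
CREATE TABLE invoices_partitioned (
    id          bigint      NOT NULL,
    customer_id bigint      NOT NULL,
    created_at  timestamptz NOT NULL,
    payload     jsonb,
    PRIMARY KEY (id, customer_id)
) PARTITION BY HASH (customer_id);

CREATE TABLE invoices_p00 PARTITION OF invoices_partitioned
    FOR VALUES WITH (MODULUS 32, REMAINDER 0);
CREATE TABLE invoices_p01 PARTITION OF invoices_partitioned
    FOR VALUES WITH (MODULUS 32, REMAINDER 1);
-- ...one CREATE TABLE per remaining remainder value.

Because each partition is a separate physical table, autovacuum workers can operate on several partitions in parallel instead of serializing on a single ~10TB heap.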

PostgreSQL MultiXact IDs: Technical foundation and our misunderstanding

In PostgreSQL, individual table rows may have 1-N underlying tuples representing different versions of the data. This is the foundation of how Postgres manages concurrent access to data via MVCC. Each tuple has a header with space for a single transaction ID and flags. For shared locks across multiple transactions, PostgreSQL uses a separate MultiXact structure where the transaction ID in the tuple header is replaced by a MultiXact ID, and the list of locking transaction IDs is maintained in a secondary structure.
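
To make this concrete, here is a small, hedged example of a MultiXact being created by two overlapping shared locks. It assumes the pgrowlocks contrib extension can be installed (managed services may restrict it), and some_table and id are placeholder names:

-- One-time setup; requires the pgrowlocks contrib extension.
CREATE EXTENSION IF NOT EXISTS pgrowlocks;

-- Session 1:
BEGIN;
SELECT * FROM some_table WHERE id = 1 FOR SHARE;  -- xmax holds session 1's transaction ID

-- Session 2, while session 1 is still open:
BEGIN;
SELECT * FROM some_table WHERE id = 1 FOR SHARE;  -- xmax is replaced by a MultiXact ID

-- Session 3: inspect row-level locks; "multi" is true once the lock is shared.
SELECT locked_row, multi, xids, modes FROM pgrowlocks('some_table');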

Key components of the MultiXact system:

  1. MultiXact IDs: 32-bit identifiers (maximum ~4.2 billion) assigned to each unique group of locking transactions. These IDs appear in the tuple headers of rows with shared locks.
  2. MultiXact Member Space: A separate data store that maintains the actual transaction IDs participating in each MultiXact. This space has a hard limit of approximately 4 billion total members across all MultiXacts.
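
Both counters can be read, as of the last checkpoint, through the built-in pg_control_checkpoint() function; note that on a managed service such as Aurora the reported values depend on its checkpointing behavior:

-- next_multixact_id tracks the 32-bit MultiXact ID space;
-- next_multi_offset tracks the separate ~4 billion entry member space.
SELECT next_multixact_id,
       next_multi_offset
FROM pg_control_checkpoint();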

Critical technical details that were not immediately evident:

1. Immutable MultiXact structure: MultiXacts in PostgreSQL are immutable collections of transaction IDs (members). When a transaction locks a row already managed by a MultiXact, PostgreSQL must create a completely new MultiXact containing all previous members plus the new transaction (Postgres heapam.c).

2. Quadratic member space growth: The immutable nature of MultiXacts leads to quadratic growth in member space consumption when multiple transactions concurrently lock the same row. As described by PostgreSQL core developer Thomas Munro in a mailing list post:

when n backends share lock a row we make O(n) multixacts and O(n^2) members.  First we make a multixact with 2 members, then a new one with 3 members, etc... so that's n - 1 multixacts and (n * (n + 1)) / 2 - 1 members.

For example, locking a row with 5 transactions creates 4 MultiXacts containing a total of 14 members until a vacuum can clean up the old MultiXacts that are now orphaned.
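
The arithmetic is easy to reproduce directly in SQL:

-- n - 1 MultiXacts and n*(n+1)/2 - 1 members are created while n transactions
-- successively share-lock the same row (before vacuum reclaims the orphans).
SELECT n,
       n - 1               AS multixacts_created,
       n * (n + 1) / 2 - 1 AS members_consumed
FROM generate_series(2, 10) AS n;
-- n = 5 yields 4 MultiXacts and 14 members, matching the example above.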

3. Foreign Key Impact on MultiXacts: Foreign keys compound this issue. When multiple transactions insert rows referencing the same parent rows, PostgreSQL creates MultiXacts for each referenced row. For instance, two transactions inserting into a table with foreign keys referencing the same rows in related tables will create separate MultiXacts for each foreign key reference, quickly inflating member space usage (AWS Database Blog, pganalyze).

For example:

Parent_Table
   -> foreign key to Child_One
   -> foreign key to Child_Two

Tx 1 inserts referencing Child_One row R1 and Child_Two row R2
Tx 2 inserts referencing Child_One row R1 and Child_Two row R2

Resulting MultiXacts:
- MultiXact for Child_One R1: members {Tx 1, Tx 2}
- MultiXact for Child_Two R2: members {Tx 1, Tx 2}
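
A hedged SQL version of the same scenario (illustrative table and column names, not our schema); the row locks are taken implicitly by the foreign key checks:

CREATE TABLE child_one (id bigint PRIMARY KEY);
CREATE TABLE child_two (id bigint PRIMARY KEY);
CREATE TABLE parent_table (
    id           bigserial PRIMARY KEY,
    child_one_id bigint REFERENCES child_one (id),
    child_two_id bigint REFERENCES child_two (id)
);
INSERT INTO child_one VALUES (1);
INSERT INTO child_two VALUES (2);

-- Tx 1:
BEGIN;
INSERT INTO parent_table (child_one_id, child_two_id) VALUES (1, 2);
-- The FK checks take FOR KEY SHARE locks on child_one id = 1 and child_two id = 2.

-- Tx 2, before Tx 1 commits:
BEGIN;
INSERT INTO parent_table (child_one_id, child_two_id) VALUES (1, 2);
-- The same referenced rows are now key-share locked by two transactions,
-- so each of them gets a MultiXact whose members are {Tx 1, Tx 2}.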

4. Contiguous member space constraint: PostgreSQL manages MultiXact members as a contiguous, sequentially allocated space, both globally and within each individual MultiXact. This creates a critical limitation: PostgreSQL's vacuum process can only reclaim member space in contiguous segments. Consequently, even a single long-running transaction holding an old MultiXact alive prevents the vacuum from reclaiming newer, unused member spaces. This limitation significantly exacerbates the impact of mixing long-running and short-lived transactions—precisely what occurred during our backfill operations.

5. Member space exhaustion and error handling: If the contiguous member space becomes exhausted, PostgreSQL will abort transactions attempting to create new MultiXacts with the error: This command would create a multixact with %u members, but the remaining space is only enough for %u member (Postgres multixact.c). Additionally, PostgreSQL initiates an emergency vacuum process to attempt recovery (Postgres multixact.c).
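
Both of the last two points can be watched operationally with standard views (a hedged sketch, assuming at least PostgreSQL 13):

-- Oldest open transactions; a single long-lived one can block member-space reclamation.
SELECT pid, usename, state, xact_start, now() - xact_start AS xact_age, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 10;

-- Progress of running vacuums, including the emergency ones PostgreSQL launches itself.
SELECT pid, datname, relid::regclass AS table_name, phase,
       heap_blks_scanned, heap_blks_total
FROM pg_stat_progress_vacuum;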

Our monitoring misunderstanding and discovery process:

Our monitoring was focused on the MultiXact ID count and thresholds (configured with autovacuum_multixact_freeze_max_age at 400 million), but the actual limiting factor was the MultiXact member space, which is not directly exposed by PostgreSQL metrics. This created a critical blind spot in our observability that prevented us from seeing the true state of our system until it was too late.

This misunderstanding was particularly insidious because:

  1. Lack of direct visibility: PostgreSQL does not provide standard metrics or views to monitor MultiXact member space directly. While we had extensive monitoring for transaction IDs and MultiXact IDs, we had no visibility into member space consumption.
  2. False sense of safety: Our dashboards showed we were using less than 50% of our configured MultiXact ID threshold, leading us to believe we had ample headroom. The member space exhaustion hit us unexpectedly despite this perceived safety margin.
  3. Difficult error messages: When member space exhaustion began occurring, the error messages Postgres returns (This command would create a multixact with X members, but the remaining space is only enough for Y member) didn't immediately connect to our understanding of PostgreSQL resource limitations.
  4. Limited documentation: This aspect of PostgreSQL is significantly less documented than transaction ID (XID) wraparound, which is covered extensively in PostgreSQL documentation and operational guides.

The fundamental check for MultiXact member space is performed in the PostgreSQL function GetNewMultiXactId (Postgres multixact.c), which is called every time a new MultiXact ID needs to be created. If the member space is exhausted, transactions fail and PostgreSQL triggers emergency vacuums to free space.
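
While PostgreSQL exposes no single "member space used" metric, a rough proxy can be assembled from standard functions. The sketch below assumes sufficient privileges (pg_ls_dir is restricted by default) and direct access to the data directory, which a managed service like Aurora may not allow; this is why the follow-up actions below also track member and offset file storage at the infrastructure level:

-- MultiXact ID age per database, comparable against autovacuum_multixact_freeze_max_age.
SELECT datname, mxid_age(datminmxid) AS multixact_age
FROM pg_database
ORDER BY multixact_age DESC;

-- Approximate member-space usage by counting member segment files; each segment
-- holds a fixed number of member entries, so the file count tracks consumption.
SELECT count(*) AS member_segment_files
FROM pg_ls_dir('pg_multixact/members');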

Through post-incident investigation of the source code and behavior, we finally understood this hidden limit and how it impacted our system. This insight has fundamentally changed how we approach PostgreSQL monitoring, capacity planning, and schema design.

What happened during the incidents

During this series of outages, PostgreSQL's protective mechanisms against MultiXact member space exhaustion were triggered by our increased write operations. The following cascade of events unfolded:

  1. As our backfill process generated high volumes of inserts with foreign key relationships, we rapidly consumed MultiXact member space, a resource we weren't effectively monitoring
  2. When member space became critically low, PostgreSQL began rejecting transactions and triggered aggressive vacuums across multiple tables to reclaim space
  3. These emergency vacuums, particularly on our large tables, required several hours to complete and exceeded our available autovacuum worker threads
  4. Until these vacuums completed, PostgreSQL rejected all write operations requiring new MultiXacts, causing a database-wide write outage
  5. Our scale significantly compounded recovery time, as vacuuming tables with billions of rows is extremely time-consuming

Why this happened four times

It's important to address why we experienced four separate outages within a week. Each incident provided new insights, but our complete understanding was not achieved until the final occurrence:

  • Incident of May 10, 02:50 UTC: We identified high MultiXact usage but lacked understanding of the separate member space limit. We resumed backfill at lower concurrency, believing our MultiXact ID monitoring showed sufficient headroom.
  • Incident of May 16, 20:31 UTC: Our mid-month "bookkeeper" process combined with the ongoing backfill created write load that exceeded PostgreSQL's ability to reclaim member space.
  • Incident of May 17, 08:05 UTC: When attempting to mitigate the last occurrence, we stopped new backfill tasks but failed to pause workers processing the existing backlog. Additionally, this occurrence went initially undetected by the oncall engineer due to alert fatigue, and by the time it was noticed the system had recovered without intervention.
  • Incident of May 17, 14:20 UTC: This final incident revealed the critical insight - we were hitting the separate MultiXact member space limit rather than the MultiXact ID threshold, explaining why emergency vacuums triggered at around 200M MultiXact IDs despite our threshold being set at 400M.

Each occurrence provided a piece of the puzzle, ultimately leading to comprehensive understanding and effective mitigation.

Resolution and follow-up actions

Immediate mitigations implemented

  1. Updated internal runbooks to incorporate our newly proven faster vacuum strategy of data-only (non-index) vacuums, enabling faster recovery if we enter this state again (see the sketch after this list).
  2. Increased the number of worker threads available for autovacuum tasks and significantly tuned other vacuum parameters (autovacuum_vacuum_cost_delay, maintenance_work_mem, etc) to enable more efficient vacuums, for both normal and emergency vacuums.
  3. Added better monitoring, alerting, and runbooks reflecting our new understanding of the MultiXact member space limits. Specifically, we ensured that we had monitoring and alerting across MultiXact age, as well as MultiXact member file and offset file storage.
  4. The above enabled us to create monitoring that accounts for the separate MultiXact member space limit, rather than relying solely on the MultiXact ID count.
  5. We fully paused the backfill process and are rethinking how to execute it without thrash to our online datastores.
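
To illustrate the first two items, here is a hedged sketch with placeholder table names and parameter values (not the exact settings we chose); on Aurora, parameters are changed through the DB parameter group rather than ALTER SYSTEM:

-- Item 1: a data-only (non-index) vacuum skips index cleanup, so the heap, and the
-- MultiXact references in it, are cleaned much faster than a vacuum that also scans every index.
VACUUM (INDEX_CLEANUP OFF, TRUNCATE OFF, VERBOSE) invoices;

-- Item 2: the kind of vacuum parameters tuned (values here are illustrative only).
ALTER SYSTEM SET autovacuum_max_workers = 8;            -- requires a restart
ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '1ms';
ALTER SYSTEM SET maintenance_work_mem = '2GB';
SELECT pg_reload_conf();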

In progress mitigations

  1. Implementing new API alerting that reduces the number of consecutive failing data points required to trigger alerts, ensuring faster detection of write operation failures. Additionally, we'll track error rates separately for read and write operations.
  2. Conducting an audit of our usage of foreign keys, especially on low cardinality tables (such as enum tables).
  3. Evaluating if more recent versions of PostgreSQL have better behavior characteristics that would justify pulling forward our upgrade timeline.
  4. Evaluating whether certain high-write workloads should be moved off PostgreSQL entirely to avoid this class of issue.
  5. Implementing better operational controls to ensure that when we "pause" processes, all related workers are also effectively paused.
  6. Adding monitoring and alerting specifically around the autovacuum worker pool saturation.
  7. Adding alerting based on the MultiXact member space limit, which was causing emergency vacuums at around 200M MultiXact IDs versus our configured MultiXact ID threshold of 400M.
  8. Restarting the partitioning backfill and the bookkeeper processes while closely monitoring the above metrics, ensuring we can remain stable.

Conclusion

From May 10 to May 17, 2025, Metronome experienced four distinct outages of write operations, due to an edge case in PostgreSQL's protective mechanism for the MultiXact member space. This has since been addressed, and all systems are once again operational. We take these types of incidents extremely seriously, and are committed to preventing similar issues in the future through the mitigations outlined above. We sincerely apologize for the disruption these outages caused our clients, and appreciate your patience with us throughout this time.
