
Database Configuration Change and Service Disruption - December 13, 2024

Published: December 20, 2024
Last updated: December 20, 2024
Details from Gadget CTO Harry Brundage on the service disruption that impacted the platform Friday, December 13, 2024.

Summary

On December 13th, 2024, Gadget experienced a service disruption during what was intended to be a brief emergency maintenance window. The incident began at 11:25 AM EST and full service restoration was completed by 12:10 PM EST, resulting in approximately 45 minutes of degraded service. The maintenance was itself a response to earlier, very brief service disruptions caused by connection pool pressure from rapid platform growth. A related outage occurred on December 17th from 5:21 PM to 5:24 PM EST.

We understand that our platform is critical to your business operations, and any downtime is unacceptable. This incident fell short of our standards and your expectations. As CTO of Gadget, I want to personally apologize for this disruption.

Timeline

  • Dec 13th 11:00 AM EST: Decision made to implement emergency configuration change to address connection limits
  • Dec 13th 11:25 AM EST: Planned maintenance began with database restart
  • Dec 13th 11:26 AM EST: Database failed to recover as expected
  • Dec 13th 11:45 AM EST: Root cause identified as memory fragmentation
  • Dec 13th 11:50 AM EST: Database successfully restarted following hardware reset
  • Dec 13th 12:10 PM EST: Full service restoration completed
  • Dec 17th 5:21 PM EST: Main database instance failed; platform went down
  • Dec 17th 5:24 PM EST: Main database HA standby promoted; platform recovered

What happened

Our database architecture has several key components and configurations that led to this incident:

  1. Central PostgreSQL database
    • We use one main Postgres database for all application data, which allows Gadget to offer serverless cost structures and responsive, elastic scaling
    • Postgres’ `max_connections` configuration variable limits how many connections can be established, and cannot be adjusted without a full server restart
  2. User separation for security
    We maintain distinct PostgreSQL users for different operations:
    • `gadget_api`: A restricted user, used for transaction processing
    • `gadget_utility`: An administrative user with permission to make schema changes and perform maintenance
  3. Connection pooling configuration
    • Real Postgres database connections are remarkably expensive. We use PgBouncer for connection pooling across API server Node.js processes, which is key for horizontal scaling
    • PgBouncer, when configured to act as several different users, establishes an independent connection pool for each user. To open a connection for a user, PgBouncer must retrieve that user’s password at connection time, and it does so every time using a subsystem called the `auth_query` (see the sketch after this list). Gadget’s database vendor only supports this mode of PgBouncer authentication
  4. Connection multiplication effect
    • Issuing the `auth_query` to retrieve the password and establish a new connection requires its own connection. In practice, this means that creating one new `gadget_api` connection actually briefly creates 2 new Postgres connections under the hood:
      1. One for authentication verification (as superuser)
      2. One for the actual connection in the shared pool
    • When a large spike of incoming traffic arrived, this doubling of new connections meant that we rapidly hit the `max_connections` limit, which caused Postgres to return errors to PgBouncer
    • When PgBouncer encounters an error trying to establish a new connection (either when running the `auth_query` or when establishing the connection in the shared pool), it retries, but very slowly, resulting in frontend API processes timing out on queries
  5. Frontend API server health checks
    • Our API servers, which serve GraphQL requests for reads, writes, and webhook reception, issue a `SELECT 1` SQL query when health-checked. If the backend database is out of connections, or PgBouncer can’t establish and run this query fast enough, the API server is marked unhealthy, as it appears it cannot contact the database
    • When unhealthy, API servers are removed from the load balancer and no traffic is sent to them
    • During an incident where the database is persistently unreachable, every API server fails its health check, so all API servers get erroneously removed from rotation at once
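
To make the connection-doubling and health-check interplay above concrete, here is a hedged sketch of the two queries involved. The `auth_query` shown is PgBouncer’s documented default, included purely for illustration; the exact query our vendor configures may differ.

```sql
-- PgBouncer's documented default auth_query, run as a privileged user to
-- fetch the target user's password before a new server connection can be
-- opened for that user ($1 is the username PgBouncer substitutes in).
-- This lookup needs a connection of its own, which is why one new
-- gadget_api connection briefly costs two Postgres connections.
SELECT usename, passwd FROM pg_shadow WHERE usename = $1;

-- The health-check query our API servers issue. If PgBouncer cannot get it
-- answered quickly (for example, because max_connections is exhausted), the
-- API server is marked unhealthy and pulled out of the load balancer.
SELECT 1;
```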

Ideally, we could simply have raised `max_connections` to account for the platform growth, but Postgres does not support changing this configuration dynamically – instead, it requires a full process restart, which involves a short amount of downtime. Enduring downtime just to reconfigure something goes against our values: we did not want to take it, and being forced to take it is a failure on our part. However, rather than suffering the micro-outages caused by hitting the limit over a longer period of time, and knowing we had to take at least one momentary downtime to raise this limit much higher, we elected to do an emergency restart to roll out the increase as soon as possible.
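
For illustration, this is roughly what such a change looks like at the SQL level (the value below is hypothetical, not our production setting). Because `max_connections` is a postmaster-level parameter, Postgres records the new value but reports it as pending until the server is fully restarted.

```sql
-- Write the new limit to postgresql.auto.conf; it does not take effect yet.
ALTER SYSTEM SET max_connections = 2000;  -- hypothetical value

-- Reload and confirm the change is waiting on a restart: pending_restart
-- remains true until the postmaster is stopped and started again.
SELECT pg_reload_conf();
SELECT name, setting, pending_restart
FROM pg_settings
WHERE name = 'max_connections';
```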

Gadget’s database vendor supports two main approaches to restarting:

  • Restart the Postgres process in place: SIGTERM the process, wait for it to shut down cleanly, and then start it again on the same host
  • Create a new replica of the existing process with the new configuration, allow the replica to catch up to the leader, and then fail over to the new replica once it is ready. The failover process also involves downtime, as well as a DNS propagation delay for the rest of the system to discover the new leader

The in-place process restart does not require waiting for a replica rebuild and is generally expected to involve less overall downtime. Gadget’s database stores many terabytes of data today, which makes replica rebuilds take quite a while before they are ready for failover. To avoid this delay and minimize downtime, we elected to take the in-place process restart approach.

Linux Memory Fragmentation

We triggered the in-place restart at 11:25 AM. The Postgres server shut down but did not start up again. The incident escalated at this time due to a subtle interaction between PostgreSQL's memory allocation and Linux system memory management.

We run PostgreSQL on Linux servers using the `huge_pages=on` setting, which improves performance by forcing Postgres to use much larger, contiguous chunks of memory from the OS. Postgres allocates many of these chunks at startup for its `shared_buffers` pool, and for a large database server like Gadget’s, the amount of memory allocated is many gigabytes. However, after the prior months of continuous operation on this same database server, the kernel’s memory had become quite fragmented. When the Postgres process was started again, the total amount of free memory was sufficient for it to start, but the amount of contiguous free memory needed to satisfy the `huge_pages` allocations was not. So, the kernel could not allocate the memory Postgres requested, and the new Postgres process failed to start. Identifying this issue took us and our database vendor some time to debug, as neither party was aware of this possibility.
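
For reference, these are the settings involved (the interpretation below is based on the Postgres documentation, not on our exact production values): with `huge_pages = on`, the server refuses to start if the kernel cannot supply enough contiguous huge pages to back `shared_buffers`, whereas `huge_pages = try` would fall back to regular pages instead of failing.

```sql
-- Inspect the relevant settings on a running server. Both are
-- postmaster-context parameters, so changing them also requires a restart.
-- The kernel-side view lives in /proc/meminfo on the database host
-- (HugePages_Total, HugePages_Free).
SELECT name, setting, unit, context
FROM pg_settings
WHERE name IN ('huge_pages', 'shared_buffers');
```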

To recover, we restarted the database server itself. This has the effect of clearing the memory entirely, leaving the kernel with a very wide and unfragmented memory space to allocate from. After a restart, the Postgres process was able to start up and service was largely restored.

You can read more about Linux kernel memory fragmentation here.

Compute recovery

The incident was further complicated by our compute auto-scaling behavior. During the database outage, our auto-scalers detected reduced load, which triggered a scale-down of compute resources. When service was restored, we needed to provision almost 4x the compute we had at that point, which strained the systems we use for provisioning. The scale-up completed successfully but added another 20 minutes of degraded service, during which some applications’ requests were processed much more slowly than they could have been.

December 17th failover

In response to the above issues, we made a change to the PgBouncer configuration to preserve server-side connections for much longer by raising the `server_lifetime` and `server_idle_timeout` parameters. We hypothesized that the high rate of connecting and reconnecting caused by surges in traffic to the platform could occasionally exhaust database CPU, and that lowering the rate at which we reconnected would improve resiliency. We successfully reduced the amount of connection churn but, due to a combination of an excessively high `work_mem` setting and the much longer connection lifetimes, we started experiencing much more memory pressure within Postgres than before these changes. Increased traffic eventually caused the database to be marked as unhealthy, and a failover to the standby instance occurred. We reverted the changes to the PgBouncer configuration immediately after.
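
As a rough, back-of-the-envelope sketch of why this combination hurts (all numbers below are hypothetical): `work_mem` is allocated per sort or hash operation rather than per connection, so many long-lived backends running queries with a generous `work_mem` can push peak memory usage well beyond `shared_buffers`.

```sql
-- Hypothetical worst case: 500 live backends x 64MB work_mem x 2 sort/hash
-- nodes per query = 64,000 MB (~62 GB) of query memory on top of
-- shared_buffers.
SHOW work_mem;

-- How many server connections are live, and what they are doing.
SELECT state, count(*)
FROM pg_stat_activity
GROUP BY state
ORDER BY count(*) DESC;
```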

Impact

These service disruptions affected all requests to Gadget applications. The system was fully unavailable for the roughly 25 minutes between the initial restart attempt and the successful database restart, which included serving frontend requests, serving GraphQL API requests, webhook processing, and syncs. After the database service was successfully restarted at 11:50 AM EST on Dec 13, requests began processing with elevated latency until 12:10 PM EST, by which point all metrics had recovered. Similarly, for the roughly 3-minute period on December 17th while the main database was failing over, all requests returned a 500 status code.

For those using the Shopify connection with Gadget, webhook data will have been delayed, but not lost. Shopify retries webhook delivery when it encounters errors, so webhooks were re-delivered to Gadget once the platform was back online. Gadget’s nightly reconciliation for missed webhooks also scanned for any missing deliveries and has re-run model actions for any affected models.

Root cause analysis

We’d like to share several deeper factors that led to this outage:

  • First, we made an error in judgment in electing to use the in-place server restart option to roll out the new configuration to our database. By doing this restart, we put the system in a state we weren’t confident in and hadn’t seen before, which was a risk we shouldn’t have taken. It is impossible to predict every failure state ahead of time, but it is often possible to avoid exposure to new ones.
  • Second, we made an error in capacity planning for Black Friday / Cyber Monday. We planned for a doubling of traffic but saw an almost 5x increase in load, so many of our systems’ configurations, such as the original `max_connections` value, stopped being appropriate. Had we correctly planned for this increase in capacity, we could have scheduled a maintenance window far in advance and practiced it in a staging environment to ensure it went smoothly.
  • Third, we do not have a robust way of testing new database configurations in a production-like environment without impacting users, and Gadget’s existing testing infrastructure didn’t catch this issue before it went live. Gadget runs comprehensive integration tests for every change we make to the system, but these tests run against synthetic Postgres databases and Kubernetes infrastructure. Those databases never experience the months of production load necessary to create the memory fragmentation or memory usage issues, so they would likely allow a Postgres database to start without issue after a shutdown.

What We're Doing to Prevent This

There are two main categories of improvements we will make to avoid incidents like this in the future.

Process

First, we’ll ensure we only ever use the replica-failover approach for disruptive database operations like this. We aspire to never need it in the first place – we don’t want any systems that require downtime for maintenance operations – but we’re not there yet. So, if a restart is required before we finish the infrastructure work outlined below, we’ll use the approach that isn’t vulnerable to memory fragmentation issues. Starting a second replica also allows us to verify its functionality before cutting over to it, letting us test in situ within the production infrastructure.

Second, we’ll establish a much more aggressive pre-BFCM capacity planning process. We will prepare for a 6x increase in load, and we will execute comprehensive load tests to prove that we’re ready for this well in advance. This will be a challenge but is something our team has done before and can do again.

Third, we’ll commit to communicating any planned maintenance well in advance. We understand that both the duration and the unexpected nature of this incident were unacceptable. Both incidents and (rare) maintenance events will always be communicated at https://status.gadget.dev, and any planned downtime will be communicated ahead of time via the status page and community Discord. We also communicated the emergency nature of this change poorly, so we’ll refine our standard operating procedures to clearly distinguish planned maintenance from service disruptions.

Infrastructure

First, we will complete the database sharding project we’ve been planning in order to allow different Gadget apps to use different databases under the hood. This will allow us to use many more risk-management techniques for our database configuration, like canary deploys and blue/green setups, so we can extensively test configuration changes like this well before they ever reach customer applications. This will also allow us to limit the failure domains for any unexpected issues.

Second, we will complete the work we’ve already been doing to build a zero-downtime database failover system. We’re dissatisfied with our vendor’s ability to non-disruptively resize or reconfigure our database clusters, so we’ve been investing in a new system that will allow us to make these changes without any downtime for any app. It is a large infrastructure project and is not yet complete, so we were unable to use it for this change. However, several subsystems that will power it are already in use in production, so we’re confident we can migrate applications between databases without downtime in the future.

Third, we’ve adjusted our autoscalers to have much higher scaling floors. In any kind of future incident, recovery should be much faster as our compute allocations from our underlying providers will not shrink nearly as much.

Our Commitment

While this incident revealed areas where we need to improve, it has also reinforced our commitment to building a more resilient platform. We're investing significant resources in developing zero-downtime maintenance procedures and improving our database change management process to eliminate vulnerabilities to these types of issues. Our goal is not just to avoid similar incidents, but to prevent planned downtime completely during maintenance procedures.

We believe in transparency and will continue to share our progress on these improvements. We appreciate your trust in Gadget and are committed to maintaining the highest standards of reliability and performance.

Questions or Concerns?

If you have any questions about this incident or would like to discuss it further, please don't hesitate to reach out to me directly, or to our support team. We're here to help and value your feedback as we continue to improve our platform.

And as always, you can find up-to-the-minute information on Gadget’s platform health at https://status.gadget.dev.

Harry Brundage
CTO, Gadget
