Sharding our core Postgres database (without any downtime)

Published: September 17, 2025
Last updated: September 17, 2025
A deep dive into horizontal scaling: how we sharded our core db without any downtime or dropped requests.

For years, all of Gadget’s data lived in a single Postgres database that did everything. It stored lists of users, app environments, domains, app source code, as well as our users’ application data: every Gadget app’s tables, indexes, and ad hoc queries.

A single db worked well. We could vertically scale up resources with simple turns of knobs in a dashboard, as needed, which enabled Gadget to power thousands of ecommerce apps installed on 100K+ live stores.

That said, the monster that is Black Friday, Cyber Monday (BFCM) 2025 was coming up fast, and one database was no longer enough to handle the 400% (yes!) increase in app traffic over that weekend. At the same time our Postgres 13 database was reaching end-of-life and needed to be upgraded. And, as a wonderful bonus, we wanted to offer our largest users their own isolated database for guaranteed resource availability and scale.

We had taken vertical scaling as far as we could. We knew this day was coming, and it finally arrived: we needed to scale horizontally so the increased load could be spread across multiple database instances. It was time to shard.

But we had a hard requirement: it was time to shard without any downtime or dropped requests.

Gadget runs many mission critical apps with many terabytes of production data that has to be available. Our devs lose money when their apps are down. We’re not willing to schedule downtime for routine maintenance of this nature – this is what people pay us to avoid. The whole point of Gadget is to give devs their time back to work on the parts that are unique or interesting to them, not to deal with endless notification emails about service interruptions.

So, we needed our own strategies to scale horizontally and to complete this major version bump. To break the problem down, we decided to treat our control plane and data plane differently. The control plane is Gadget’s internal data that powers the platform itself, like the list of users, apps, and domains. The data plane is where each individual app’s data is stored and what serves reads and writes for an application; it is many orders of magnitude bigger than the control plane. Before we started, the data plane and control plane lived in the same Postgres instance, and we split the work into two phases:

Phase 1: shard the data plane off into its own set of Postgres instances, so that the control plane would be much smaller and (relatively) easy to upgrade.

Phase 2: execute a zero-downtime, major version upgrade of the now-smaller control plane database, which you can read more about here.

Scaling: horizontally

I’m going to dive into phase 1 and share how we sharded our user data from our core database to a series of Postgres instances running in GCP.

You can’t spell shard without hard

The workloads between our control plane and data plane were never the same. Control plane query volume is low and predictable – developers typing can only generate so many changes at once to their apps! However, the data plane is huge and unpredictable, storing data for thousands of apps, each with wildly different schemas, query shapes, and throughput characteristics. The data plane accounts for orders of magnitude more rows, indexes, and IO. That asymmetry gave us a natural split: keep the control plane centralized and small, and shard out only the data plane.

Sharding is generally a very scary thing to do – it’s a really fundamental change to the data access patterns, and to keep consistency guarantees throughout the process, you can’t do it slowly, one row at a time. You need all of a tenant’s data in one spot so you can transact against all of it together, so sharding tends to happen in one big bang moment. Beforehand, every system participant points at the one big database, and after, every system participant looks up the right shard to query against, and goes to that one. When I’ve done this in the past at Shopify, we succeeded with this terrifying big-bang cutover moment, and I never want to have to press a button like that again. It worked, but my blood pressure is high enough as is.

We try to avoid major cutovers.

To add to the fun, we were on a tight calendar: our previous vendor’s support for our Postgres version was ending and we had to be fully sharded well before BFCM so we could complete the upgrade and safely handle the projected increase in traffic.

Our plan of attack

Instead of a big bang, we prefer small, incremental changes that we can validate as we go. For fundamental questions like “where do I send every SQL query?”, that approach is tricky, but not impossible, to pull off. Small, incremental changes also give us a reliable way to validate in production (real production) that the process is going to work as expected without breaking everything. Put differently, with changes of this nature you must accept the inevitability of failure and make the cost of that failure as low as possible.

So, we elected to shard app-by-app, instead of all at once. This would allow us to test our process on small, throwaway staff apps first, refine it, and then move progressively bigger subsets of apps out until we were done.

With these constraints, we came up with this general strategy for sharding:

  1. Stand up the new Postgres databases alongside the existing core database, and set up all of the production monitoring and goodness we use for observability and load management.
  2. For each app, copy its schema, and then its data, into the new database behind the scenes using Postgres logical replication.
  3. When the new database has replicated all the data, atomically cut over to it, making it the source of truth. And, don’t drop any writes. And, don’t serve any stale reads from the old database once the cutover is complete.
  4. Remove defunct data in the old database once we have validated that it is no longer needed.

Maintenance mode as an engineering primitive

Stopping the world for a long period of time wasn’t an option because of the downtime. But we could pause DB traffic for a very short period of time, without creating any perceptible downtime. We would love to remove any and all pausing, but it just isn’t possible when atomic cutovers are required, as we must wait for all transactions in the source to complete before starting any new ones in the destination.

That cutover time can be very small, especially if we only wait for one particular tenant’s transactions to finish. If you squint, this is a gazillion tiny maintenance windows, none of which are noticeable, instead of one giant, high risk maintenance window that everyone will hate.

We needed a tool to pause all traffic to one app in the data plane so we could perform otherwise disruptive maintenance to the control plane. The requirements:

  • Pausing must be non-disruptive. It is ok to create a small, temporary latency spike, but it cannot drop any requests or throw errors.
  • It must allow us to do weird, deep changes to the control plane, like switch which database an app resides in, or migrate some other bit of data to a new system.
    • This means it must guarantee exclusive access to the data under the hood, ensuring no other participants in the system can make writes while paused.
  • It must not add any latency when not in use.
  • It must be rock solid and super trustworthy. If it broke, it could easily cause split brain (where database cluster nodes lose communication with each other and potentially end up in a conflicting state) or data corruption.

We built just this and called it maintenance mode! Maintenance mode allows us to temporarily pause traffic for an app for up to 5 seconds, giving us a window of time to do something intense under the hood, then resume traffic and continue to process requests like nothing happened. Crucially, we don’t error during maintenance, we just have requests block on lock for a hot second, do what we need to do, and then let them proceed as if nothing ever happened.

We’ve made use of it for sharding, as well as a few other under-the-hood maintenance operations. Earlier this year, we used it to cut over to a new background action storage system, and we’ve also used it to change the layout of our data on disk in Postgres to improve performance.

How the maintenance primitive works

We pause one environment at a time, as one transaction can touch anything within an environment, but never cross environments. Here’s the sequence of a maintenance window (a code sketch follows below):

  • We track an “is this environment near a maintenance window” (it’s a working title) boolean on every environment that is almost always `false`. If `false`, we don’t do anything abnormal, which means no latency hit for acquiring locks during normal operation.
  • We also have a maintenance lock that indicates if an environment is actually in a maintenance window or not. We use Postgres advisory locks for this because they are robust and convenient, and allow us to transactionally commit changes and release them.
  • When we want to do maintenance on an environment to do a shard cutover or whatever, we set our “is this environment near a maintenance window” (still a working title) boolean to `true` (because, it is near a maintenance window), and then all participants in the system start cooperating to acquire the shared maintenance lock for that environment.
  • Because some units of work have already started running in that environment, or have loaded up and cached an environment’s state in memory, we set the boolean to `true`, and then wait for a good long while. If we don’t wait, running units of work may not know the environment is near a maintenance window, may not do the lock acquisition they need to do, and may run amok. Amok. The length of the wait is determined by how long our caches live. (“Fun” fact: It took us a long time to hunt down all stale in-memory usages of an environment to get this wait time down to something reasonable.)
  • “Normal” data plane units of work acquire the maintenance lock in a shared mode. Many requests in the data plane can be in flight at once, and they all hold this lock in shared mode until they are done.
    • We have a max transaction duration of 8 seconds, so the longest any data plane lock holder will hold is, you guessed it, 8 seconds.
    • Actions in Gadget can be longer than this, but they can’t run transactions longer than this, so they are effectively multiple database transactions and multiple lock holds under the hood.
  • The maintenance unit of work that wants exclusive access to the environment acquires the lock in exclusive mode such that it can be the only one holding it.
    • This corresponds directly to the lock modes that Postgres advisory locks support – very handy Postgres, thank you! 
  • Once the maintenance unit of work acquires the lock, data plane requests are enqueued and waiting to acquire the lock, which stops them from progressing further into their actual work and pauses any writes.
  • To minimize the number of lock holders / open connections, we acquire locks within a central, per-process lock broker object, instead of having each unit of work open a connection and occupy it blocked on a lock.
  • When we’ve made whatever deep change we want to make to the environment and the critical section is done, we release the exclusive lock and all the blocked units of work can proceed. Again, this matches how PG locks work quite well, where shared-mode acquirers happily progress in parallel as soon as the exclusive holder releases it.
The workflow showing how units of work interact with the maintenance lock.
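
To make the lock choreography concrete, here’s a minimal sketch in TypeScript using Postgres advisory locks through the node-postgres (pg) library. The function names, the lock key derivation, and which database hosts the lock are illustrative assumptions rather than Gadget’s actual internals, and the real system brokers lock holds through the central per-process object mentioned above instead of acquiring one per request as shown here.

```typescript
// Sketch only: names, key scheme, and flag storage are assumptions, not Gadget's real code.
import { Pool, PoolClient } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });
const CACHE_TTL_MS = 30_000; // assumption: upper bound on how long environment state stays cached
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Derive a stable integer advisory lock key from the environment id (illustrative scheme).
function lockKey(environmentId: string): number {
  let h = 0;
  for (const ch of environmentId) h = (h * 31 + ch.charCodeAt(0)) | 0;
  return h;
}

// Placeholder for the "is this environment near a maintenance window" flag.
async function markNearMaintenance(environmentId: string, near: boolean): Promise<void> {
  // write the flag to the control plane
}

// Normal data plane work: hold the lock in shared mode for the duration of the
// transaction. pg_advisory_xact_lock_shared releases at commit/rollback, so the
// hold time is bounded by the 8 second max transaction duration.
export async function runDataPlaneTransaction<T>(
  environmentId: string,
  work: (client: PoolClient) => Promise<T>
): Promise<T> {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    await client.query("SELECT pg_advisory_xact_lock_shared($1)", [lockKey(environmentId)]);
    const result = await work(client);
    await client.query("COMMIT");
    return result;
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}

// Maintenance work: flag the environment, wait out caches, then take the same
// key exclusively. Shared holders drain, new ones queue behind us, and the
// critical section runs with exclusive access to the environment.
export async function withMaintenanceWindow<T>(
  environmentId: string,
  criticalSection: (client: PoolClient) => Promise<T>
): Promise<T> {
  await markNearMaintenance(environmentId, true);
  await sleep(CACHE_TTL_MS);
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    await client.query("SELECT pg_advisory_xact_lock($1)", [lockKey(environmentId)]);
    const result = await criticalSection(client);
    await client.query("COMMIT"); // releases the lock, letting queued shared holders proceed
    return result;
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
    await markNearMaintenance(environmentId, false);
  }
}
```

Transaction-level advisory locks (pg_advisory_xact_lock and its shared variant) release automatically on commit or rollback, which is what makes it possible to release the lock transactionally along with whatever changes the critical section commits.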

For maintenance mode to be trustworthy, we need assurances that all requests actually go through the code paths that check the maintenance lock. Fortunately, we’d known this was coming for some time, and chose an internal architecture that would make this robust and reliable (and possible).

Within Gadget’s codebase, we broker access to an environment’s database exclusively through an internal object called an `AppWorkUnit`. This object acts as a central context object for every unit of work, holding the current unit of work’s timeout, actor, and abort signal. We “hid” the normal Postgres library that actually makes connections behind this interface and then systematically eliminated all direct references to the connection, to give us confidence that there are no violations. (At Shopify we used to call this shitlist driven development, and boy oh boy is it easier with a type system.)

With `AppWorkUnit` being the only way to get a db connection from the data plane databases, we can use it as a choke point to ensure the locking semantics apply to every single callsite that might want to do database work, and have a high degree of confidence every participant will respect the locking approach.
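
Continuing the hypothetical sketch above, the choke point looks roughly like this: the raw connection pool stays private, and the only public path to data plane SQL is an AppWorkUnit method that applies the locking semantics. The exact shape of the class is an assumption for illustration; the real object carries more context than this.

```typescript
// Sketch only, building on runDataPlaneTransaction from the previous example.
import type { PoolClient } from "pg";

export class AppWorkUnit {
  constructor(
    readonly environmentId: string,
    readonly actorId: string,
    readonly timeoutMs: number,
    readonly signal: AbortSignal
  ) {}

  // Every callsite that wants to do data plane database work goes through here,
  // so the maintenance lock semantics apply everywhere by construction.
  run<T>(work: (client: PoolClient) => Promise<T>): Promise<T> {
    return runDataPlaneTransaction(this.environmentId, work);
  }
}
```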

So we can temporarily pause an environment, what now?

Now we can actually shard the database. The maintenance mode primitive allows us to atomically cut an environment over to a different database and point everything at the new one, while ensuring that all participants in the system happily wait while the cutover is happening.

But copying all data from our data plane is a challenge in itself!

We wanted to build as little custom tooling as possible to handle this kind of super-sensitive operation, so we elected to use Postgres logical replication as much as possible. Logical replication is a super robust and battle tested solution for copying data between Postgres databases, and, unlike binary replication, it even supports copying data across major versions. (This was foundational to our zero-downtime Postgres upgrade too.)
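
For reference, the core of that plumbing is just two statements: a publication on the source and a subscription on the destination, which first copies the existing rows and then streams changes. Here’s a rough sketch using node-postgres; the publication/subscription naming scheme and the connection string environment variables are placeholders, not our real setup.

```typescript
// Sketch only: a publication on the source and a subscription on the destination.
import { Client } from "pg";

async function startReplication(envId: string, tables: string[]): Promise<void> {
  const source = new Client({ connectionString: process.env.SOURCE_DATABASE_URL });
  const destination = new Client({ connectionString: process.env.DESTINATION_DATABASE_URL });
  await source.connect();
  await destination.connect();
  try {
    // On the source: publish only this environment's tables.
    await source.query(
      `CREATE PUBLICATION env_${envId}_pub FOR TABLE ${tables.map((t) => `"${t}"`).join(", ")}`
    );
    // On the destination: subscribe, which copies existing rows, then streams changes.
    await destination.query(
      `CREATE SUBSCRIPTION env_${envId}_sub
         CONNECTION '${process.env.SOURCE_DATABASE_URL}'
         PUBLICATION env_${envId}_pub`
    );
  } finally {
    await source.end();
    await destination.end();
  }
}
```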

The downside to logical replication: you need to manage the database schema on both the source and destination databases yourself. Thankfully, we’d already automated the living daylights out of schema management for Gadget apps, so we were in a good position to keep the database schemas in sync.

Here’s the algorithm we used to actually go about sharding our data plane (a condensed code sketch follows the list):

  • An operator or a background bulk maintenance workflow initiates a shard move.
  • Any crufty old stuff from previous or failed moves is cleaned up.
  • The destination is prepared by converging the schema to exactly match the source db.
  • A Postgres logical replication stream is created between source and destination db.
  • The logical replication stream is monitored by the maintenance workflow to wait for the copy to finish (this takes seconds for small apps but hours for the biggest ones).
  • Once the stream is caught up, it will keep replicating changes indefinitely. It's time to cut over.
  • We start the maintenance mode window and wait again for the data plane to (definitely) know about it.
  • We take the maintenance exclusive lock, pausing all traffic to the environment.
  • We wait for the Postgres logical replication stream to fully catch up (it’s typically only a few megabytes behind at this point).
  • Once the stream is caught up, we update the control plane to point to the new source of truth for the environment, and release the maintenance lock. We’ve now passed the point of no return.
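
Condensed into code, that workflow looks something like the sketch below, reusing withMaintenanceWindow and startReplication from the earlier sketches. Every other helper is a named placeholder for a real step; the actual workflow also handles dry runs, failures, and monitoring.

```typescript
// Sketch only: the helpers below are placeholders standing in for real workflow steps.
async function moveEnvironmentToNewShard(environmentId: string): Promise<void> {
  await cleanUpPreviousAttempts(environmentId);     // drop leftover publications, subscriptions, and slots
  await convergeDestinationSchema(environmentId);   // make the destination schema exactly match the source
  await startReplication(environmentId, await tablesForEnvironment(environmentId));

  // Initial copy plus streaming: seconds for small apps, hours for the biggest ones.
  await waitUntilCaughtUp(environmentId);

  // Pause the environment, let the last few megabytes drain, then flip the pointer.
  await withMaintenanceWindow(environmentId, async () => {
    await waitUntilCaughtUp(environmentId);
    await updateControlPlanePointer(environmentId); // the point of no return
  });
}

async function cleanUpPreviousAttempts(environmentId: string): Promise<void> {}
async function convergeDestinationSchema(environmentId: string): Promise<void> {}
async function tablesForEnvironment(environmentId: string): Promise<string[]> { return []; }
async function waitUntilCaughtUp(environmentId: string): Promise<void> {}
async function updateControlPlanePointer(environmentId: string): Promise<void> {}
```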

To gain confidence in our process, we were able to dry run everything up to the final cutover step. This was quite nice, and made me quite happy, because we were able to catch issues before running the final sharding process and cutover.

Task failed… successfully

In addition to the dry run-ability of the process, we have a whole bucketload of staff apps that are “safe to fail” on in production. To test, we just “ping-ponged” the same set of applications back and forth between databases to flush out all the issues, which allowed us to fail (a bunch) in our real production environment. 

We wandered through the many subtleties of determining whether a logical replication stream is actually caught up to the source database. Many edge cases to handle. Many (arcane) system table queries to get right.
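
For the curious, here’s one way to approach the “is it caught up?” question: check that every table in the subscription has finished its initial sync, and that the source’s replication slot has confirmed flushing up to (or very near) the current WAL position. This is a simplified sketch of the idea rather than our exact queries; it relies on Postgres’s default behavior of naming the replication slot after the subscription.

```typescript
// Sketch only: a simplified catch-up check, not our production queries.
import { Client } from "pg";

async function isCaughtUp(
  source: Client,
  destination: Client,
  subscriptionName: string,
  maxLagBytes = 0
): Promise<boolean> {
  // On the destination: every table in the subscription must be in state 'r' (ready).
  const { rows: pending } = await destination.query(
    `SELECT count(*)::int AS n
       FROM pg_subscription_rel sr
       JOIN pg_subscription s ON s.oid = sr.srsubid
      WHERE s.subname = $1 AND sr.srsubstate <> 'r'`,
    [subscriptionName]
  );
  if (pending[0].n > 0) return false;

  // On the source: how far behind the current WAL position is the slot's confirmed flush?
  const { rows: lag } = await source.query(
    `SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)::bigint AS lag_bytes
       FROM pg_replication_slots
      WHERE slot_name = $1`,
    [subscriptionName] // by default, the slot is named after the subscription
  );
  return lag.length > 0 && Number(lag[0].lag_bytes) <= maxLagBytes;
}
```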

Our core database also had its max logical replication workers setting configured so low that we couldn’t migrate many environments in parallel. Updating this setting would’ve required a disruptive server restart, so we settled for a much slower process than we intended.

Onwards and upwards with horizontal scalability!

Once we were confident that we had a robust process in place, we migrated every single environment of every single app successfully.

The longest pause window: 4 seconds.

The p95 pause window: 250ms.

Hot dog!

Our new database hardware performs better and has been significantly more reliable than our previous provider’s.

Tackling this migration environment by environment, app by app, allowed us to avoid a big bang cutover, and helped me maintain normal blood pressure throughout.

You can read all about phase 2 of our database upgrade process, our zero-downtime Postgres upgrade, in our blog.

If you have any questions about maintenance mode or our sharding process, you can get in touch with us in our developer Discord.
Author: Harry Brundage
Reviewer: Riley Draward