Building apps for the Shopify platform is exciting because of the millions of merchants who can install them. But with this enormous scale comes enormous responsibility. Shopify apps must meet huge scaling requirements so that the huge sales merchants run over the Black Friday/Cyber Monday (BFCM) weekend become huge successes instead of huge embarrassments.
Over the course of BFCM, your app can generally expect up to 10 times as much traffic as it might see the weekend before. Customers place a LOT of orders over BFCM, and annoyingly, this traffic tends to be quite write-heavy: more cart changes, more checkouts, and more orders. Many merchants run flash sales, which create huge spikes of writes that happen all at once. Most apps also see a spike in merchant installs and configuration changes in the weeks leading up to BFCM, as merchants prepare for the busy holiday season. All of this is good news for your app business, but it is scary as heck for the engineering team, which needs to be ready for 10x traffic increases at any moment.
The good news is that with preparation and careful analysis, you can be sure your app will scale to meet even the most demanding traffic over BFCM.
The main load generated on apps over BFCM comes in the form of storefront traffic. This means that any scripts or pixels loaded on merchant storefronts will see huge traffic increases. As customers check out, you’ll also see large increases in the number of cart, checkout, and order webhooks.
If your app needs to do any expensive work to process these webhooks it’s easy to get overwhelmed. Gadget recommends processing any and all webhook traffic in a background job where possible. This adds a queue between webhook reception and webhook processing, so you can buffer and absorb big spikes in traffic instead of dropping it or 500ing. Gadget’s Shopify Connection has robust background webhook processing built right in if you’re looking for an off-the-shelf, scalable webhook processor.
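To make the idea of a queue between webhook reception and webhook processing concrete, here is a minimal sketch in TypeScript. It is not how Gadget's Shopify Connection is implemented. The `WebhookJob` shape, the `WebhookQueue` class, and its method names are all hypothetical, and the queue lives in memory, so jobs are lost if the process restarts. A production setup would normally put a durable queue in this position.

```typescript
// Hypothetical shape of a received webhook; the field names are illustrative.
interface WebhookJob {
  topic: string;
  shopDomain: string;
  payload: unknown;
}

type WebhookHandler = (job: WebhookJob) => Promise<void>;

// In-memory buffer that decouples receiving a webhook from processing it.
// Assumption: a real deployment would swap this for a durable queue.
class WebhookQueue {
  private pending: WebhookJob[] = [];
  private active = 0;
  private readonly handler: WebhookHandler;
  private readonly concurrency: number;

  constructor(handler: WebhookHandler, concurrency: number) {
    this.handler = handler;
    this.concurrency = concurrency;
  }

  // Call this from the HTTP handler, then respond 200 right away.
  enqueue(job: WebhookJob): void {
    this.pending.push(job);
    this.drain();
  }

  // Number of jobs waiting that have not started processing yet.
  get backlog(): number {
    return this.pending.length;
  }

  // Start jobs until the concurrency limit is reached or the buffer is empty.
  private drain(): void {
    while (this.active < this.concurrency && this.pending.length > 0) {
      const job = this.pending.shift()!;
      this.active++;
      Promise.resolve()
        .then(() => this.handler(job))
        .catch((err) => console.error("webhook job failed", job.topic, err))
        .finally(() => {
          this.active--;
          this.drain();
        });
    }
  }

  // Resolves once nothing is pending or in flight (handy for tests and shutdown).
  async onIdle(): Promise<void> {
    while (this.active > 0 || this.pending.length > 0) {
      await new Promise((resolve) => setTimeout(resolve, 5));
    }
  }
}
```

The concurrency limit is the important knob here: a flash sale can enqueue thousands of jobs in a few seconds, but the database only ever sees `concurrency` of them at once, and the backlog drains as capacity allows.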
Once you’re processing Shopify webhooks scalably in the background, the bottleneck for most Shopify apps becomes the database. Drastically increased traffic means many more queries are sent to the database, which often ends up overwhelming it. Just about all apps really do need the database available to do useful work, so if it is overwhelmed, requests start timing out or erroring, and merchants and customers get frustrated. Scaling your database to avoid this is a rich and complicated topic, but we have a few general recommendations:
Unlike the database, the app tier tends to be much easier to scale by adding more instances. As long as your app is structured to have the main business logic be stateless, and uses a network-attached database for storage, you can just turn up the knob on how much app you have. Serverless deployment tools like Gadget or AWS Lambda shift the burden of turning the knob up and down onto the platform, so you can be sure you’re not overpaying for instances when traffic subsides.
For scaling the app tier, Gadget also recommends profiling to identify the slowest pieces of your workload. Counterintuitively, it can actually be more scalable to pull some business logic, like data transformation or client-side JOINs, out of the database and into your app code. This is often slower in terms of total CPU time, but by shifting this processing out of the database, you buy back its very scarce CPU time to process the transactions that only it can process.
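As a rough illustration of a client-side JOIN, the sketch below assumes the app has already fetched orders and customers with two simple single-table queries, and then combines them in app code with a hash lookup. The `Order`, `Customer`, and `OrderWithCustomer` types and the function name are hypothetical, not part of any Shopify or Gadget API.

```typescript
// Hypothetical row shapes returned by two separate single-table queries.
interface Order {
  id: string;
  customerId: string;
  totalCents: number;
}

interface Customer {
  id: string;
  email: string;
}

interface OrderWithCustomer extends Order {
  customerEmail: string | null;
}

// Join orders to customers in app memory instead of in the database.
// Building the Map is O(customers); each lookup is O(1) on average.
function joinOrdersToCustomers(orders: Order[], customers: Customer[]): OrderWithCustomer[] {
  const customersById = new Map<string, Customer>();
  for (const customer of customers) {
    customersById.set(customer.id, customer);
  }
  return orders.map((order) => ({
    ...order,
    // Behaves like a LEFT JOIN: orders with no matching customer are kept.
    customerEmail: customersById.get(order.customerId)?.email ?? null,
  }));
}
```

Whether this actually helps depends on your workload, which is why profiling comes first: the join moves CPU work onto app instances you can add more of, at the cost of sending more rows over the network.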
Understanding which part of your app is the weakest scaling link can be difficult, especially since under normal operating conditions, your app probably works great! BFCM is a rare traffic spike of the kind that normal traffic doesn’t prepare you for. So, Gadget recommends two solutions for preparing:
Synthetic load testing means running a fleet of scripts to generate whatever (usually large) amount of traffic you can against yourself. It allows you to simulate BFCM ahead of time to prove that you can scale to your targets or to help identify the weak links. Shopify and Gadget both use synthetic load testing of the production environment frequently to ensure that the scalability is really there. Gadget recommends k6 for an easy-to-use, scalable load tester.
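To show what a load test measures, here is a deliberately tiny load generator in TypeScript. It is not k6 and is no substitute for it. It only sketches the core loop: a fixed number of concurrent workers issuing requests, recording latencies, and reporting failures and a 95th percentile. The `runLoad` and `percentile` names and the `RequestFn` type are assumptions made for this example.

```typescript
// Any async operation to put under load; it should throw (reject) on failure.
type RequestFn = () => Promise<void>;

interface LoadResult {
  completed: number;
  failed: number;
  p95Ms: number;
}

// Nearest-rank percentile over an ascending-sorted array of latencies.
function percentile(sortedMs: number[], p: number): number {
  if (sortedMs.length === 0) return 0;
  const rank = Math.ceil((p / 100) * sortedMs.length) - 1;
  return sortedMs[Math.min(sortedMs.length - 1, Math.max(0, rank))];
}

// Issue `totalRequests` calls using `concurrency` parallel workers.
async function runLoad(
  request: RequestFn,
  totalRequests: number,
  concurrency: number,
): Promise<LoadResult> {
  const latencies: number[] = [];
  let failed = 0;
  let started = 0;

  async function worker(): Promise<void> {
    // The check and increment run without an await in between, so workers
    // never claim more than totalRequests requests in total.
    while (started < totalRequests) {
      started++;
      const begin = Date.now();
      try {
        await request();
        latencies.push(Date.now() - begin);
      } catch {
        failed++;
      }
    }
  }

  await Promise.all(Array.from({ length: concurrency }, () => worker()));
  latencies.sort((a, b) => a - b);
  return { completed: latencies.length, failed, p95Ms: percentile(latencies, 95) };
}
```

In practice, `request` would wrap an HTTP call to a representative endpoint of your app and throw on a non-2xx response. Watching how the failure count and p95 latency change as you raise `concurrency` is what points you at the weakest link.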
During both normal conditions and load tests, it is key to get deep insight into how the various systems that make up your app are performing. You need to prioritize your development efforts against the biggest slowdowns or bottlenecks in the system, and to do that, you need evidence as to where the problems actually are. There’s a multitude of observability tools for capturing and analyzing this data out there today.
For logs, most cloud providers (Gadget included) have decent loggers out of the box, though you often need to carefully tune what logs you emit. Gadget recommends adding structured logging to your application and using hosted logging tools from your cloud provider.
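Structured logging just means emitting each log entry as machine-parseable data, typically one JSON object per line, so your hosted logging tool can filter and aggregate on fields. The sketch below is one minimal way to do that in TypeScript. `createLogger`, `formatLog`, and the field names are hypothetical, and in a real app you would more likely reach for an established logging library or your platform's built-in logger.

```typescript
type LogLevel = "debug" | "info" | "warn" | "error";

// Numeric ranks let us compare levels when filtering.
const LEVEL_ORDER: Record<LogLevel, number> = { debug: 10, info: 20, warn: 30, error: 40 };

// Render one log entry as a single line of JSON.
// Core keys are written last so caller-supplied fields cannot overwrite them.
function formatLog(
  level: LogLevel,
  message: string,
  fields: Record<string, unknown> = {},
  now: Date = new Date(),
): string {
  return JSON.stringify({ ...fields, time: now.toISOString(), level, message });
}

// Build a logger that drops entries below `minLevel`.
// `write` defaults to stdout but can be swapped out, e.g. in tests.
function createLogger(
  minLevel: LogLevel,
  write: (line: string) => void = (line) => console.log(line),
) {
  return (level: LogLevel, message: string, fields: Record<string, unknown> = {}): void => {
    if (LEVEL_ORDER[level] < LEVEL_ORDER[minLevel]) return;
    write(formatLog(level, message, fields));
  };
}
```

Raising `minLevel` ahead of BFCM is one simple way to tune what you emit: at 10x traffic, debug logs that were harmless in October can become a real cost and a source of noise.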
Gadget also recommends production error tracking. In addition to finding bugs, error tracking will give you some insight into failing requests and alert you when key infrastructure components begin to fail with timeout errors - very important things to watch for during load spikes. We recommend Sentry for error tracking, and Gadget apps can be connected to Sentry easily with the Sentry Connection.
And finally, if you have the time, we also recommend adopting a tracing tool like Honeycomb or Lightstep. Traces are the most powerful and robust way to capture and analyze the behavior of your application, because they give super detailed information on each individual request to your system. Gadget uses tracing tooling extensively internally, and we find it invaluable for ensuring we can meet the demands of BFCM.
BFCM is an exciting time for merchants and a demanding time for app developers, so Gadget’s final recommendation is to not take the operational burden of staying up over BFCM lightly. It’s important that the humans behind the keyboards are well-rested and ready to respond to incidents, and that they come out of the busy period able to keep building afterward. To all the operators out there, we hope that your systems run incredibly smoothly and that you get a good night’s sleep! If something breaks, we’re rooting for you!
If you want to discuss any of these scaling approaches in more detail, or nerd out about a thorny operations problem, the Gadget staff is always here to help in the Gadget Discord server.