+++ Posted on Aug 17, 2018 at 9:30 EDT +++
Between 21:00 UTC on August 9th, 2018 and 02:24 UTC on August 10th, Shopify was operating at 35-80% availability for all merchants. All merchants and customers experienced intermittent errors when browsing, checking out, or managing their stores. This is the most significant incident of this nature that Shopify has experienced and we apologize for the stress and concern this caused. We’d like to share more context around this issue and the measures we are taking to prevent it from happening in the future.
Besides reliability and performance, one of our core tenets is to deliver new features and products to you as quickly as possible. To move quickly, we employ a ‘continuous delivery’ model that allows us to release up to 40 new versions of Shopify daily without scheduling any downtime. This means instead of introducing large, complex changes at once, we gradually release and expand them over time. We’ve also transitioned to the Cloud to allow us to further scale internationally and bring you performance and speed globally.
Description and Root Cause
- On August 9, 2018 at 21:00 UTC several on-call engineers were notified by automatic monitoring systems that Shopify was experiencing higher than normal error rates. At this point, Shopify’s availability was 90-99%. All on-call and available engineers convened and our Cloud provider was notified of the issue to begin investigating on their end.
- By 21:20 UTC we were back at 100% availability.
- At 21:40 UTC error rates increased again with 15%-40% of traffic being impacted. We reviewed Shopify’s changelog to see if any changes that went out in the latest version of Shopify could be the cause. The changes looked unrelated, and we decided that rolling back to a previous version on already unstable infrastructure was more likely to make matters worse than keeping the current version in production. Furthermore, the version had been in production for more than an hour, and change-related issues typically surface immediately.
- At 21:50 UTC we suspected that this may be an issue with our Cloud provider after seeing errors related to disks in the logs. Cloud provider incidents usually span a single region, so to protect Shopify against this type of incident we run from multiple Cloud regions for redundancy. The problem appeared to be contained to the central region, and we decided to evacuate central. We knew east was affected as well, but not to the extent of central.
- Between 21:50 UTC and 22:00 UTC we evacuated central. Moving to east momentarily improved availability to 75%, but over the next half hour the health of east began to degrade and availability fell to 35% by 22:20 UTC. Until 00:50 UTC we hovered between 35% and 50% availability.
- At 22:15 UTC the incident looked like a rare cross-region issue and we decided to balance merchants among east and central again while continuing our root cause investigation.
- At 22:30 UTC engineers looked further into why the operating systems on our servers were rapidly locking up. Engineers were rebooting and spinning up new servers to regain capacity from the servers that were hung. Others were investigating what could have changed with the operating system to cause it to lock up.
- At 23:50 UTC we prioritized requests to the online store and checkout over the management of stores to stabilize the customer experience.
- Between 00:10 UTC and 01:00 UTC the investigating engineers discovered that one of the unrelated-looking changes in the changelog (in this case, an optimization to our product recommendation system) caused Shopify to crash abruptly, but only on certain requests. When this type of failure happens, we write the state of the software at the time of the failure (a ‘core dump’) to a disk so that we can investigate the root cause and fix the bug later. In this case, the system failed so often that the disks became overwhelmed, causing the operating system to enter an unrecoverable, locked-up state.
- Between 01:00 UTC and 01:18 UTC as the fix went out, availability recovered to nearly 100%. Simultaneously, we worked to clean up all the broken servers, after which all systems returned to normal.
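To illustrate how frequent crashes can overwhelm a disk, here is a minimal back-of-the-envelope sketch. It is not taken from Shopify's systems: the function name and every number in it are hypothetical, chosen only to show that the write load from core dumps grows linearly with the crash rate and can exceed what a disk is able to absorb.

```python
def dump_write_utilization(crashes_per_sec: float,
                           dump_size_mb: float,
                           disk_write_mb_per_sec: float) -> float:
    """Return the fraction of disk write bandwidth consumed by core dumps.

    A result above 1.0 means dumps are produced faster than the disk can
    write them, so a backlog of pending writes keeps growing.
    All inputs are illustrative assumptions, not measured values.
    """
    if disk_write_mb_per_sec <= 0:
        raise ValueError("disk_write_mb_per_sec must be positive")
    return crashes_per_sec * dump_size_mb / disk_write_mb_per_sec


# Hypothetical server: 400 MB per dump, 200 MB/s of sustained disk writes.
occasional = dump_write_utilization(0.25, 400.0, 200.0)  # one crash every 4 s
crash_loop = dump_write_utilization(4.0, 400.0, 200.0)   # four crashes per second

print(f"occasional crashes use {occasional:.0%} of write bandwidth")
print(f"a crash loop needs {crash_loop:.0%} of write bandwidth")
```

Under these assumed numbers, occasional crashes leave the disk with headroom, while a crash loop asks for several times more bandwidth than the disk has.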
Remediation and Prevention
We have conducted several Root Cause Analysis sessions to enhance the integrity of our platform and have already taken steps to prevent similar incidents in the future. Going forward, we will:
- Improve our continuous delivery infrastructure. Before this incident, new changes were released to all merchants in less than 10 minutes. While this has served us well, we will continue work on a new delivery model to more gradually release changes to our merchants and customers. Improvements have already been implemented and we will be dedicating further resources as we believe this will have the greatest impact on improving our reliability.
- Address the technical problems that led to the incident. For this incident, there are several technical action items such as investigating the operating system bug that caused the system to hang when disks were busy, limiting the number of error dumps on each server, getting to the bottom of why an automated test didn’t fail, surfacing error dumps through our standard error reporting tools, and investigating further isolation for disk issues that would’ve limited the impact of this type of issue. Engineers are already working on all these issues.
- Review all past incidents and monitoring on the previous infrastructure. On our previous datacenter infrastructure, we had monitoring for quick detection of this type of failure. As we discovered, we did not port this specific monitoring rule to our new Cloud infrastructure along with the others. We’ll be conducting a systematic review of all past incidents and monitoring rules to ensure all previous action items are correctly applied to the Cloud.
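One of the action items above, limiting the number of error dumps on each server, could be approached with a simple per-server budget. The sketch below is only an assumption about how such a limit might look, not a description of the fix Shopify shipped; the class name `DumpBudget` and its parameters are hypothetical.

```python
import time
from collections import deque
from typing import Callable, Deque


class DumpBudget:
    """Allow at most `max_dumps` error dumps per sliding time window.

    Hypothetical helper: a crash handler would call `allow_dump()` and
    skip writing the dump to disk whenever it returns False.
    """

    def __init__(self, max_dumps: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_dumps = max_dumps
        self._window = window_seconds
        self._clock = clock
        self._stamps: Deque[float] = deque()  # times of recently allowed dumps

    def allow_dump(self) -> bool:
        now = self._clock()
        # Forget dumps that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self._window:
            self._stamps.popleft()
        if len(self._stamps) < self._max_dumps:
            self._stamps.append(now)
            return True
        return False


# Example: keep at most 2 dumps per minute on this server.
budget = DumpBudget(max_dumps=2, window_seconds=60.0)
for crash in range(4):
    action = "write dump" if budget.allow_dump() else "skip dump"
    print(f"crash {crash}: {action}")
```

With a budget like this, the first few crashes still leave dumps behind for debugging, while a crash loop can no longer generate unbounded disk writes.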
We know that you place a high level of trust in Shopify to support your business. We apologize for the outage and want to assure you that we are implementing the lessons that we have learned from this incident across the organization to prevent similar issues in the future.
+++ Posted on Aug 10, 2018 at 14:33 EDT +++
Yesterday, Shopify was down for many of you for an extended period of time. We know that you rely on Shopify for a stable, secure and transparent platform. The team is working on the post-mortem as we speak and we will share it in full detail as soon as it is ready. In the meantime, we’d like to share a letter from our head of engineering, which discusses yesterday’s outage.
Hi everyone, my name is Jean-Michel Lemieux and I lead engineering at Shopify.
I want to recognize and apologize for the stress, concern and impact the recent outages, including yesterday, have had on you and your businesses. Platform integrity and reliability is the number one priority of Shopify, and the reason a lot of you selected us as your platform.
The last several weeks have been challenging as we had some issues transitioning to our new cloud infrastructure. This move is fundamental in allowing us to scale internationally and bring you performance and speed globally. These challenges were compounded by the incident last night which was caused by a problem with the rollout of a new storefront product recommendation system. It took us too long to find the root cause and restore service. We know that commerce is changing quickly and we are continuously investing in bringing you the best and most advanced tools possible. None of these are excuses, just additional context.
As many of you plan for back to school and Black Friday, we want to assure you that Shopify is ready. We have put in additional procedures to ensure we minimize any potential disruptions to you and your business in the future. We know this doesn’t change the impact an outage has on your business, but hopefully our proven track record gives you confidence in our ability to come back stronger. We believe the businesses who have been with us for a while can attest to our dedication to platform integrity. But we have to, and will do, better. We are confident, and want you to be confident, that you have made the right choice in Shopify and that we will work very hard to ensure the reliability of the platform every day.
Again, I sincerely apologize for the recent issues and thank you for your patience and continued support.