When things go bad: How an online outage broke the (offline) app SkyTemple
Hey everyone!
This is a short announcement about yesterday's outage: why SkyTemple or SkyTemple Randomizer may not have worked for you, and the steps we are taking to prevent this from happening again.
SkyTemple may not have worked for users with internet access between around 9pm UTC yesterday and around 7am UTC today.
How the downloads page works
SkyTemple's download page is hosted by a web service, which has its files (i.e. the downloadable executables, but also the file containing all the metadata, such as which versions exist) hosted on online cloud storage provided by Wasabi. The cloud storage is replicated across three regions: Europe, North America and Japan.
The download page service also provides the banner you sometimes see when you start the app, and the notice about a new version being available, if you are running an old version.
Why everything broke
This dependency on the download page service is exactly why you could not start SkyTemple or the Randomizer from around 9pm UTC yesterday to around 7am UTC today.
SkyTemple and the Randomizer check for a banner and a potential update every time you start one of the apps. They wait for a response from the server, even if the server takes minutes or even longer to respond. Yesterday, exactly this happened: the download page service took anywhere from multiple minutes to hours to respond to requests. But why?
Our cloud storage provider had an outage in the EU-CENTRAL-1 region. This outage caused all requests that the service made to the storage to fail after around 30 seconds, including the request to list all available versions of SkyTemple and the Randomizer. Every time a request came in from SkyTemple, the Randomizer or a user browsing the downloads page, the service queued the request, tried to refresh its data from the cloud storage in region EU-CENTRAL-1, waited 30 seconds for a response, and then continued with the next request. This not only caused all requests to fail, but also created a huge backlog, since requests were processed strictly one after the other. So it could potentially take ages for the server to respond.
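To get a feel for how quickly such a backlog grows, here is a rough back-of-the-envelope sketch in Python. Only the 30 seconds come from the outage itself; the queue positions are made-up numbers and the function name is purely illustrative, not taken from the actual service.

```python
# Only the ~30 second failure time is taken from the outage described above.
# Everything else (names, queue positions) is a hypothetical illustration.
FAIL_AFTER_SECONDS = 30


def wait_until_answered(position_in_queue: int,
                        seconds_per_request: float = FAIL_AFTER_SECONDS) -> float:
    """Seconds until the request at the given 1-based queue position is
    answered, assuming requests are handled strictly one at a time and
    each one blocks for `seconds_per_request` before failing."""
    return position_in_queue * seconds_per_request


if __name__ == "__main__":
    for position in (1, 10, 120):
        minutes = wait_until_answered(position) / 60
        print(f"request #{position}: answered after {minutes:.1f} minutes")
```

With these assumed numbers, a request that is 120th in line would already wait a full hour, even though every single request "only" takes 30 seconds to fail.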
But why did this prevent SkyTemple from starting? Simple! SkyTemple and the Randomizer have no limit on how long they wait for the download page service to respond, and they only continue to open after it responds. However, this does not mean that SkyTemple or the Randomizer require internet access to function in general: if you disable your internet connection, SkyTemple realizes that the server is not reachable at all and starts immediately.
How can we fix this?
So what now? We will take some steps to prevent this from happening again.
- Yesterday Frostbyte already contributed code that adds a maximum wait time of 4 seconds to the requests SkyTemple and the Randomizer make to the downloads page service. This will be included in the next releases.
- I will change the downloads page service so that it no longer uses EU-CENTRAL-1 as the only mirror when loading metadata. As I wrote initially, we have the data available in three regions. If one region fails, this shouldn't cause the entire service to stop working. Code will be added to make sure data is loaded from the NA or JP mirrors if the EU one does not function.
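To make the two steps above a bit more concrete, here is a minimal sketch of both ideas in Python. This is not the actual code from SkyTemple or the downloads page service: all function names and URLs are hypothetical, and only the 4 second limit and the idea of falling back to other mirrors come from this post.

```python
import urllib.request
from typing import Callable, Optional, Sequence

# The 4 second maximum wait is the value mentioned above.
REQUEST_TIMEOUT_SECONDS = 4


def fetch_bytes(url: str, timeout: float = REQUEST_TIMEOUT_SECONDS) -> bytes:
    """Download a URL with a bounded wait.

    Note: urlopen's timeout applies to each blocking socket operation,
    so it is not a strict limit on the total download time.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def check_for_banner(url: str,
                     fetch: Callable[[str], bytes] = fetch_bytes) -> Optional[bytes]:
    """Client side (hypothetical): return the banner data, or None if the
    server is unreachable or too slow, so the app can start anyway."""
    try:
        return fetch(url)
    except OSError:  # covers URLError, connection errors and timeouts
        return None


def load_metadata(mirrors: Sequence[str],
                  fetch: Callable[[str], bytes] = fetch_bytes) -> bytes:
    """Service side (hypothetical): try each mirror in order and return
    the first successful response."""
    last_error: Optional[Exception] = None
    for url in mirrors:
        try:
            return fetch(url)
        except OSError as error:
            last_error = error  # remember the failure, try the next mirror
    raise RuntimeError("no mirror could be reached") from last_error
```

A caller could pass the mirrors in order of preference, for example `load_metadata([eu_url, na_url, jp_url])`, where the three URLs are placeholders for the regional storage endpoints. The `fetch` parameter only exists so the logic can be tried out without a real network connection.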
That's all for today! Thanks for reading and sorry for any inconvenience this caused!
Capy