This morning, between 05:00 and 08:30 CET, Bitvavo services were partially unavailable. Due to technical issues, the scheduled maintenance was not completed within the planned 30 minutes. As a result, users were unable to log in and view their balances during this period.
The scheduled upgrade this morning at 05:00 consisted of upgrades to our database cluster, message broker (for communication between services) and matching engine. The purpose of this upgrade was to improve performance, security, and availability (especially in volatile market conditions). We performed the upgrade on a non-production environment first, so that we could verify the runbook and estimate how long each action would take. The upgrade took roughly 3 hours longer than expected because of an unusual combination of events: an engine upgrade combined with a failed recovery and a failed failover during that upgrade.
No customer data or funds were lost, and all functionality of the Bitvavo platform has been restored. Availability and security are of critical importance for Bitvavo, and we would like to apologize for the unexpected extension of the planned maintenance time and the uncertainty and limited availability our users experienced during this downtime.
For any questions regarding this downtime or the security of your Bitvavo account, feel free to contact firstname.lastname@example.org.
04:55: Deposits and withdrawals were disabled.
05:00: Trading halted.
05:02: Pre-upgrade checks were verified, and we initiated the upgrades (database cluster, message broker, matching engine).
05:08: Matching engine update completed successfully.
05:11: Message broker update completed successfully.
05:14: The database cluster rebooted and attempted to perform post-migration checks. At this point it was still unknown whether the upgrade had succeeded.
05:30: The underlying machine was deemed unhealthy (status checks failing), and the instance was removed from service. It appeared that the upgrade had failed. In this case an attempt is made to recover by performing a failover to a healthy instance.
05:32: A failover to one of the read replicas failed. This was supposed to promote one of the read replicas to a writer, and route traffic away from the unhealthy instance to healthy instances.
05:36: Both writers and readers became stuck in a reboot loop, with the database engine failing to load. Together with engineers from AWS, we started troubleshooting.
06:01: We could not immediately identify what the issue was or how it could be resolved. We decided to start recreating the database from a snapshot while continuing to troubleshoot.
07:10: AWS engineers discovered the issue and started to implement a fix.
07:38: The new cluster created from a snapshot was ready.
07:43: We started to perform checks on this new cluster. We verified that the new cluster was running the new engine version.
07:54: We finished changing the hostnames, so that our applications pointed to the new cluster. We slowly started to re-allow traffic to this cluster and monitored the impact.
08:18: A fix was applied to our original cluster. We continued with the new cluster, as we had already verified that it was running the new engine version and that it passed other checks.
08:20: Most services were restored.
08:28: Trading on AAVE-EUR was enabled.
08:30: Trading on all other markets was enabled.